Thibault Reuille, security researcher at OpenDNS

The average reading speed for adults is 300 words a minute. To put this in context, it would take an average reader about 11,866 minutes to read War and Peace from cover to cover.

Why is this relevant to security?

When it comes to security research, manual monitoring and analysis of the increasingly large and fast-changing data sets involved is no longer humanly possible. An average organisation would need banks of people reading network logs to monitor its growing volume of log data comprehensively. Even assuming they could efficiently interpret what they were reading, they could not be expected to cross-reference their findings against what their colleagues have learned in order to see the full picture.

Joining the dots – visually

Smart data visualisation, combined with intelligent data mining, removes the need to read data sets line by line. Connections can be drawn between data points, and visualisation can surface observations that are not obvious in text.

The security field offers an endless number of uses for the visualisation of loosely related data. Alerts from firewalls, intrusion detection and prevention systems (IDS/IPS), and malware infection monitoring could, for instance, be visualised to expose a malicious actor's previously unrecognised activity patterns. Data visualisation can present the current state of a complex IT system in a simple, accurate and elegant fashion.

Getting visual

Also called frame networks, semantic networks can represent any desired relationship between defined concepts or entities in the form of a visualisation. Such networks consist of nodes, which represent the entities being examined, and edges (the connections between the nodes), which describe the relationships between those entities. A semantic network representing a company's IT environment might consist of nodes that represent various types of server and environment (HTTP, Mail, NTP, SSH ...), and edges that specify relationships and their attributes (channels, ports, traffic, bandwidth, etc.).

When creating any semantic network, however, it is up to the user to define the entities and relationships. The nodes and edges of a semantic network, taken together, are called its domain and represent the model of the underlying information.
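As a rough illustration of nodes and edges, the sketch below models a tiny IT environment in Python. The class names, server names, and the choice of `port` and `bandwidth_mbps` as edge attributes are all assumptions made for the example; they are not taken from OpenGraphiti or any other tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str   # hypothetical server name
    kind: str   # type of server, e.g. "HTTP", "Mail", "NTP", "SSH"


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    port: int               # illustrative relationship attribute
    bandwidth_mbps: float   # illustrative relationship attribute


@dataclass
class SemanticNetwork:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def add_edge(self, source, target, port, bandwidth_mbps):
        # An edge only makes sense between entities the user has defined.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be defined as nodes first")
        self.edges.append(Edge(source, target, port, bandwidth_mbps))

    def neighbours(self, name):
        # Collect every node that shares an edge with `name`.
        found = set()
        for edge in self.edges:
            if edge.source == name:
                found.add(edge.target)
            elif edge.target == name:
                found.add(edge.source)
        return found


# Made-up environment: one web server talking to a mail and a time server.
network = SemanticNetwork()
network.add_node("web-01", "HTTP")
network.add_node("mail-01", "Mail")
network.add_node("ntp-01", "NTP")
network.add_edge("web-01", "mail-01", port=25, bandwidth_mbps=1.5)
network.add_edge("web-01", "ntp-01", port=123, bandwidth_mbps=0.01)
```

Calling `network.neighbours("web-01")` returns the set of servers that node is linked to, which is the kind of relationship a visualisation would draw as lines between points.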

There is more than one way to model any given problem, but it is always best to approach the problem with the available data in mind. When a model has been decided upon, the source data should be parsed so as to populate a relational data set that follows the model.
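The parsing step might look something like the following sketch. It assumes a made-up, whitespace-separated log format (timestamp, source, destination, port, bytes); real firewall or traffic logs will differ, and the sample lines are invented purely for illustration.

```python
from collections import Counter

# Invented sample data in a hypothetical five-column format.
RAW_LOG = """\
2015-03-01T09:00:01 10.0.0.5 10.0.0.20 25 1840
2015-03-01T09:00:02 10.0.0.5 10.0.0.30 123 76
2015-03-01T09:00:07 10.0.0.5 10.0.0.20 25 2210
malformed line
"""


def parse_log(text):
    """Turn raw log lines into an edge table: (source, target, port) -> bytes."""
    edges = Counter()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip lines that do not fit the assumed format
        _timestamp, source, target, port, size = parts
        try:
            edges[(source, target, int(port))] += int(size)
        except ValueError:
            continue  # skip lines with non-numeric port or size
    return edges


edge_table = parse_log(RAW_LOG)
for (source, target, port), total_bytes in sorted(edge_table.items()):
    print(source, "->", target, "port", port, "bytes", total_bytes)
```

Each key in `edge_table` corresponds to one edge in the model, and the accumulated byte count becomes an attribute of that edge.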

Driven by data

Having designed the model and extracted the data, the next logical step is to derive insights from the shape of the resulting semantic network. Consider the following example:

(Image courtesy of OpenDNS, via OpenGraphiti)

This image represents email traffic inside a company. Each node represents an employee and the connections signify emails sent between them. Three things are immediately apparent. Firstly, there are three main central clusters, suggesting that the company is spread across three offices. Secondly, 'data dust' is present throughout the image, suggesting messaging involving inactive addresses, potentially spam. Lastly, certain nodes are connected in a group, displaying some sort of hierarchy in the communication (for example: managers, help desks, or mailing lists).
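The same kind of reading can be approximated in code. The sketch below is a simplification rather than the method used to produce the image: it treats each connected component of a small, invented email graph as a group, and labels components at or below an arbitrary size threshold (`DUST_MAX_SIZE`, an assumption) as 'data dust'. In a real data set the main clusters are usually linked to one another, so a proper community detection method would be needed to separate them.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group nodes that can reach each other through any chain of emails."""
    adjacency = defaultdict(set)
    for sender, recipient in edges:
        adjacency[sender].add(recipient)
        adjacency[recipient].add(sender)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from each unvisited node.
        seen.add(start)
        queue = deque([start])
        group = set()
        while queue:
            node = queue.popleft()
            group.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(group)
    return components


# Invented traffic: two small "offices" plus two stray exchanges.
emails = [
    ("ann", "bob"), ("bob", "cat"), ("cat", "ann"),
    ("dan", "eve"), ("eve", "fay"), ("fay", "dan"),
    ("gus", "spam-1"), ("hal", "spam-2"),
]

DUST_MAX_SIZE = 2  # assumed threshold: groups this small count as dust

components = connected_components(emails)
clusters = [group for group in components if len(group) > DUST_MAX_SIZE]
dust = [group for group in components if len(group) <= DUST_MAX_SIZE]

print("clusters:", len(clusters), "dust groups:", len(dust))
```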

The bigger picture via a smaller window

In this era of big data, databases with billions of entries from security devices are increasingly common. Segmenting data in three ways can break the problem up into smaller, manageable chunks:

Entity grouping: Create nodes that represent groups of entities rather than individual entities, such as team nodes instead of employee nodes. Researchers can then vary the level of detail according to the type of security data being used – firewall logs, traffic files, etc. This allows the security team to see the whole model, without having to load all its constituent information upfront.

Sampling: Another way to limit the size of the data set is by selecting either a random or focused subset of the data.

Parallelisation: Rather than showing all the traffic, interpreting a random fraction of employee email will typically create a very similar image, on which designs can be developed while paying heed to the sampling used.
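Entity grouping and random sampling can both be sketched in a few lines. The example below assumes a hypothetical mapping from employees to teams (`TEAM_OF`) and an invented list of emails; the function names are illustrative only.

```python
import random
from collections import Counter

# Hypothetical mapping of employees to teams.
TEAM_OF = {"ann": "sales", "bob": "sales", "cat": "support", "dan": "support"}

# Invented email traffic as (sender, recipient) pairs.
emails = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "dan"),
    ("cat", "dan"), ("ann", "dan"),
]


def group_edges(edges, group_of):
    """Entity grouping: collapse employee-level edges into team-level edges."""
    grouped = Counter()
    for sender, recipient in edges:
        pair = tuple(sorted((group_of[sender], group_of[recipient])))
        grouped[pair] += 1
    return grouped


def sample_edges(edges, fraction, seed=0):
    """Sampling: keep a random fraction of the edges (at least one)."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    count = max(1, round(len(edges) * fraction))
    return rng.sample(edges, count)


team_graph = group_edges(emails, TEAM_OF)
subset = sample_edges(emails, fraction=0.4)

print(dict(team_graph))
print(len(subset), "of", len(emails), "emails kept")
```

The team-level graph has far fewer nodes than the employee-level one, and the sampled subset can be drawn first and refined later, as long as the sampling fraction is kept in mind when interpreting the result.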

Within every interconnected organisation there is a rich vein of intelligence in the data. The challenge is understanding what it is telling you, and the reward for doing so is the key to solving complex security problems. Assuming you don't have any super readers in your organisation, you'll need smart data visualisation, combined with intelligent data mining, to create, or even prevent, your own fireworks.

Contributed by Thibault Reuille, a security researcher at OpenDNS and creator of OpenGraphiti, an open-source 3D data visualisation engine.