Visualizing combined DNS and NetFlow data for Threat Hunting

By Alex Sangers

For readers who are interested in threat hunting, DNS data, NetFlow data or data visualization in cyber security applications.

As the threat landscape becomes more diverse, with an increasing number of unknown unknowns, threat hunting is becoming increasingly important as a complement to real-time monitoring and (anomaly) detection. Threat hunters proactively and iteratively search through network data and logs to detect cyber threats that evade existing security solutions, taking the adversary's perspective while doing so. The main objective of threat hunting is to reduce the time needed to find traces of attacks that have already compromised the IT environment. We worked with a multidisciplinary TNO team on innovative visualization techniques that improve the process of threat hunting. This post aims to inspire you with data visualization possibilities for more effective threat hunting.

A hunt starts with the combination of available data and an educated guess, based on threat intelligence, security developments and previous experience, about some type of malicious activity that might have gone unnoticed in the IT environment of an organization. Based on this educated guess, a hunter defines a hypothesis that outlines a possible attack or some of its steps. Next, he or she gathers and correlates relevant data sources that may help to detect (parts of) an attack. The hunter can then explore the data with various analysis and business intelligence tools. The combination of the hypothesis and the data exploration might uncover data patterns that reveal parts of an attack path, which can then be used to extend the detection capability.

Threat hunting is a labor-intensive task that requires a lot of experience as well as cyber security and data expertise. This makes the process of threat hunting time-consuming and expensive. Data visualization can help threat hunters to explore relevant data sets faster. The human visual system is very powerful and can interpret information in images very quickly, especially in an interactive human-machine setting. There are countless aspects of data exploration in threat hunting that visualizations might facilitate, such as getting to know the distributions of the data, time dependencies, filtering of relevant data and correlations between data features.

To experiment with data visualizations for threat hunting, we collected anonymized operational DNS and NetFlow network data of an organization. Despite the anonymization, we could combine the DNS data with the NetFlow data using IP addresses and timestamps, while ensuring that the internal host names are not traceable to the original machine names or to any individuals. The connectivity behavior of the anonymized internal hosts towards external domains can be derived from the combined dataset. It also includes fields such as packet counts, bytes, protocol and port number. In addition to these standard NetFlow and DNS fields, the data is enriched with derived features, such as whether a domain seems to be randomly generated (DGA), and with public data, such as the Autonomous System Numbers (ASNs) and countries corresponding to the external IP addresses.
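To make the idea of combining the two sources concrete, here is a minimal Python sketch of a join on IP addresses and timestamps. It is not the pipeline we used; the record layouts, the field names and the 300-second matching window are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class DnsRecord:            # hypothetical layout of one DNS answer
    ts: float               # time of the answer, in seconds
    client: str             # anonymized internal host
    domain: str             # queried name
    answer_ip: str          # external IP address in the answer

@dataclass
class FlowRecord:           # hypothetical layout of one NetFlow record
    ts: float
    src: str                # anonymized internal host
    dst_ip: str             # external IP address
    n_bytes: int
    n_packets: int

def attach_domains(flows, dns_records, window=300.0):
    """Pair each flow with the most recent DNS answer that the same host
    received for the destination IP, at most `window` seconds earlier."""
    index = {}
    for rec in dns_records:
        index.setdefault((rec.client, rec.answer_ip), []).append(rec)
    for recs in index.values():
        recs.sort(key=lambda r: r.ts)

    labelled = []
    for flow in flows:
        domain = None
        # Walk candidate answers from newest to oldest.
        for rec in reversed(index.get((flow.src, flow.dst_ip), [])):
            if rec.ts <= flow.ts and flow.ts - rec.ts <= window:
                domain = rec.domain
                break
        labelled.append((flow, domain))
    return labelled
```

Flows without a matching DNS answer keep a domain of None, so they remain visible for separate inspection rather than being dropped silently.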

Hypothesis: information theft with keyword exchange via generated subdomains.

Based on the available data and a discussion with an experienced threat hunter from an insurance company, we developed the threat hunting hypothesis that an adversary has infected a victim's endpoint with malware. The adversary is limited in the amount of resources and time it can use to scan for information on the network. Periodically, the adversary posts a new set of search keywords on a website with a frequently changing subdomain that is produced by a domain generation algorithm (DGA), e.g. niesrheiuacvni.example.org. The malware regularly retrieves the new set of keywords from this website and follows instructions such as scanning the organization's network for sensitive information.
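As an illustration of what a "seems randomly generated" feature could look like, the sketch below scores the leftmost label of a domain by its length and character entropy. This is a crude heuristic and not necessarily the feature used in our dataset; the function names and both thresholds are assumptions, and a real implementation would also need to separate the subdomain from the registered domain properly.

```python
import math
from collections import Counter

def label_entropy(label: str) -> float:
    """Shannon entropy, in bits per character, of a single DNS label."""
    if not label:
        return 0.0
    n = len(label)
    return sum((c / n) * math.log2(n / c) for c in Counter(label).values())

def looks_generated(domain: str, min_length: int = 10,
                    min_entropy: float = 3.0) -> bool:
    """Flag a domain whose leftmost label is long and has a high
    character entropy. Both thresholds are illustrative assumptions."""
    label = domain.lower().split(".")[0]
    return len(label) >= min_length and label_entropy(label) >= min_entropy

print(looks_generated("niesrheiuacvni.example.org"))
print(looks_generated("www.example.org"))
```

A heuristic like this will produce false positives, for example on long hash-like labels used by content delivery networks, so it is better treated as one feature among many than as a verdict.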

We will explore the combined NetFlow and DNS dataset to investigate this hypothesis. With the hunting hypothesis in mind, we are interested in the following characteristics in the data.

a) External domains with randomly generated subdomains that are frequently visited by only a limited number of internal hosts. The connections to these domains are frequent, but low in volume.

b) Once one or more suspicious external domains have been identified in step a), the connectivity behavior of internal hosts towards those domains can be analyzed. We are interested in the internal hosts that connect to one or more of these domains frequently, possibly periodically, and with low volume.
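Characteristics a) and b) can also be written down as simple filters, which helps to make the terms "frequent", "limited number of hosts" and "low in volume" explicit. The Python sketch below is an illustration only: the event tuple layout, the function names and all thresholds are assumptions, and the regularity measure (the coefficient of variation of the gaps between connections) is just one possible way to quantify "periodically".

```python
from collections import defaultdict
from statistics import mean, pstdev

def summarize_domains(events):
    """events: iterable of (ts, host, domain, n_bytes) tuples, where
    `domain` is the external domain without its generated subdomain."""
    stats = defaultdict(lambda: {"hosts": set(), "connections": 0, "bytes": 0})
    for ts, host, domain, n_bytes in events:
        entry = stats[domain]
        entry["hosts"].add(host)
        entry["connections"] += 1
        entry["bytes"] += n_bytes
    return stats

def suspicious_domains(events, max_hosts=2, min_connections=20,
                       max_mean_bytes=2000):
    """Step a): many connections, few internal hosts, low volume."""
    selected = []
    for domain, s in summarize_domains(events).items():
        mean_bytes = s["bytes"] / s["connections"]
        if (len(s["hosts"]) <= max_hosts
                and s["connections"] >= min_connections
                and mean_bytes <= max_mean_bytes):
            selected.append(domain)
    return sorted(selected)

def interval_cv(timestamps):
    """Step b): coefficient of variation of the gaps between connections
    of one host to one domain. Values near 0 suggest a regular period."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # too few connections to say anything
    return pstdev(gaps) / mean(gaps)
```

In practice, such thresholds are exactly what a hunter would want to tune interactively rather than fix in advance, which is where the visualizations come in.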

Because it is hard to visualize many dimensions in a single visualization, a common approach is to use different types of visualizations side by side, where each provides a different view on the data, typically showing a different subset of dimensions. Together, the views aim to provide a complete picture of the data. These views can be linked: selections, filtering or zooming done in one visualization are reflected in all the others, so the views stay consistent with one another.
 
We developed a visualization dashboard to facilitate the evaluation of a threat hunting hypothesis such as the one described above. The dashboard consists of a so-called Parallel Coordinates Plot linked with a Treemap, followed by a histogram linked with a time series plot. Because the combined and enriched DNS and NetFlow data have many features, the Parallel Coordinates Plot is a well-suited visualization technique, as it shows multiple dimensions in one view. This visualization allows for easy filtering and for experimenting with multiple dimensions in the data. If you find a suspicious combination of values in the Parallel Coordinates Plot, you could add those ranges to your SIEM and automate the detection.
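To illustrate how ranges selected in a Parallel Coordinates Plot could be turned into a reusable filter, here is a small Python sketch. It is not the dashboard code; the function name, the record layout and the feature names are hypothetical, and the translation into an actual SIEM rule depends on the product used.

```python
def apply_brushes(records, brushes):
    """Keep the records whose values fall inside every selected range.

    records: list of dicts, one per combined DNS/NetFlow record
    brushes: dict mapping a feature name to an inclusive (low, high) range
    """
    return [
        record for record in records
        if all(low <= record[feature] <= high
               for feature, (low, high) in brushes.items())
    ]

# Hypothetical example: low volume combined with many connections.
records = [
    {"host": "host-a", "bytes": 300, "connections": 40},
    {"host": "host-b", "bytes": 90000, "connections": 3},
]
brushes = {"bytes": (0, 1000), "connections": (20, 100)}
print(apply_brushes(records, brushes))
```

The same dictionary of ranges can be stored alongside the hypothesis, so that a finding from an interactive session is reproducible on new data.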

Have a look at the video to see how the visualization dashboard helps a threat hunter to quickly explore and filter the data when investigating the hypothesis of information theft with keyword exchange via generated subdomains.

Visualizations are essential for exploring and interpreting huge amounts of data: they help to get to know the distributions of the data, to filter the relevant data and to find correlations between data features. An interactive graphical dashboard can facilitate the evaluation of a threat hunting hypothesis, using visualization techniques that have hardly been used in security monitoring practice so far. In the end, data visualizations are very useful for finding traces of attacks in the IT environment. Easier and faster!

This work was done in a TNO research project in the context of the Vraaggestuurd Programma HTSM/Cyber Risk Management & System Resilience 2018.

Inspired by this blog and our research activities? We appreciate your feedback! For suggestions or to team up with our cyber security research, please contact alex.sangers@tno.nl for further information.