ExtraHop open sources 16 million rows of threat domain data

Cloud-native network detection and response (NDR) specialist ExtraHop hopes to give security researchers and defenders a little extra help when it comes to defending against malware and botnet operations, by making its entire 16-million-row domain generation algorithm (DGA) dataset publicly available on GitHub.

So-called DGAs are programs that use algorithmic generation to come up with large numbers of new domain names that threat actors can use to deliver malware quickly and efficiently without having their servers deny-listed by counter-security measures, or taken down by defenders. Essentially, they give threat actors the ability to rapidly switch domains on the fly while an attack is in progress.
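To illustrate the idea, the following is a minimal Python sketch of how a date-seeded generator might work. It is a toy example and does not reproduce any real malware family: the function name generate_domains, the seed string and the reserved .example suffix are all assumptions made for illustration. Because the malware and its operator can both run the same calculation, each side arrives at the same list of candidate domains for a given day without ever exchanging it.

```python
import hashlib
from datetime import date


def generate_domains(seed: str, day: date, count: int = 5, tld: str = ".example") -> list[str]:
    """Derive a repeatable list of pseudo-random domain names from a seed and a date.

    Hypothetical helper for illustration only; real DGAs vary widely.
    """
    domains = []
    for i in range(count):
        # Mix the shared seed, the date and a counter so each name differs.
        material = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits (0-15) onto the letters a-p.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + tld)
    return domains


if __name__ == "__main__":
    for name in generate_domains("demo-seed", date(2023, 1, 1)):
        print(name)
```

Changing the date or the seed produces an entirely different list, which is why blocking individual domains after the fact tends to be ineffective.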

DGAs have been around for some time. One of their first large-scale use cases was the distribution of the Conficker worm almost 15 years ago. This incident, which originated in Ukraine, famously disrupted public sector systems in the UK, including those of the House of Commons, the Ministry of Defence (MoD), Manchester City Council and Sheffield Teaching Hospitals NHS Foundation Trust.

Tried and tested though DGAs may be, their utility to cyber criminals has not diminished in the past decade-and-a-half, according to ExtraHop director of data science, Todd Kemmerling.

“Giving threat actors the ability to operate undetected and an uptick in these types of attacks, DGAs are increasingly considered a major threat to businesses today,” he said.

Up to now, detecting and mitigating DGAs has been a challenging task for defenders – the usual process involves analysing the algorithm used by the malware and monitoring DNS requests, before implementing techniques to identify and block malicious domain names.

ExtraHop says it will be able to make this process much less painful by making its dataset – which was originally compiled as an element of its enterprise Reveal(x) NDR platform – available for anybody to look at and, if they wish, to incorporate into their own machine learning (ML) classifier models to more quickly identify DGAs and stop attacks enabled by them.

The organisation claims that the Reveal(x) product can already do this with more than 98% accuracy, and as such it hopes that in making its data more widely available under an open source licence, security teams will be able to identify malicious activity on their networks before it becomes a serious problem.

The firm initially began compiling the dataset after becoming dissatisfied with the performance of previous DGA identification models. Its teams tried several different ways to improve these models before hitting on a simpler-than-expected method to conduct feature engineering – the process of extracting features from raw data – that vastly improved the accuracy of the output. This process is outlined in more technical detail on ExtraHop’s blog.
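The article does not describe which features ExtraHop settled on, so the sketch below should not be read as its method. It simply shows, in Python, what feature engineering on a domain name can look like, using a few commonly cited signals (label length, character entropy, and digit and vowel ratios). The function names shannon_entropy and domain_features are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of the characters in text, in bits."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain: str) -> dict[str, float]:
    """Turn a domain name into simple numeric features (illustrative choice only)."""
    # Use the leftmost label, for example "abc1" in "abc1.example".
    label = domain.lower().split(".")[0]
    length = len(label)
    if length == 0:
        return {"length": 0.0, "entropy": 0.0, "digit_ratio": 0.0, "vowel_ratio": 0.0}
    return {
        "length": float(length),
        "entropy": shannon_entropy(label),
        "digit_ratio": sum(ch.isdigit() for ch in label) / length,
        "vowel_ratio": sum(ch in "aeiou" for ch in label) / length,
    }


if __name__ == "__main__":
    print(domain_features("abc1.example"))
```

Feature rows of this kind, computed over labelled benign and algorithmically generated domains, are the sort of input a classifier could be trained on.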

“As we began developing a model for detecting DGAs, it became apparent there was a lack of public datasets accessible to security teams with a wide-ranging set of resources. With this dataset, we are filling that gap, giving any security team access to the pivotal data needed to detect DGAs swiftly,” said Kemmerling.