Enhancing data quality in real-time threat intelligence systems using machine learning

Ariel Rodriguez, Koji Okamura

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)


In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Existing research-based and production cyber threat intelligence systems often use keyword filtering either to obtain training data for a classification model or as a classifier in itself. This method is known to produce false positives that lower data quality and can therefore cause downstream issues for the security analysts who use these systems. We propose a method to classify open-source intelligence data into a cybersecurity-related information stream and then increase the quality of that stream using an unsupervised clustering method. Our method expands on keyword filtering techniques by introducing a word2vec-generated associated-words list, which assists in the classification of ambiguous posts to reduce false positives while still retrieving a broad scope of data. We then use k-means clustering on positively classified entries to identify and remove clusters that are not relevant to threats. We further explore this method by investigating the effects of segmenting the data by its characteristics to achieve better classification. Together, these methods create a higher-quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that use keyword filtering.
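The first stage described above, keyword filtering refined by an associated-words list, might look roughly like the following minimal sketch. It is one plausible reading of the abstract, not the authors' implementation: the `KEYWORDS` and `ASSOCIATED` sets, the `is_security_related` function, and the `min_associated` threshold are all hypothetical, and the associated words are hard-coded here rather than generated by word2vec. The k-means cleaning stage is not shown.

```python
import re

# Hypothetical keyword list; a real system would use a curated security vocabulary.
KEYWORDS = {"virus", "worm", "trojan", "exploit", "breach"}

# Hypothetical associated-words list. The abstract says this list is generated
# with word2vec; it is hard-coded here purely for illustration.
ASSOCIATED = {"malware", "infected", "patch", "server", "payload",
              "vulnerability", "attackers"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_security_related(text, min_associated=1):
    """Keep a post only if it matches a keyword AND enough associated words.

    The second condition is the assumed disambiguation step: a keyword such
    as "virus" alone is ambiguous, so supporting context is required.
    """
    tokens = set(tokenize(text))
    if not tokens & KEYWORDS:
        return False
    return len(tokens & ASSOCIATED) >= min_associated


posts = [
    "New worm spreads through unpatched server software",
    "Flu virus season is starting early this year",
    "Great recipe for dinner tonight",
    "Trojan payload delivered via infected email attachment",
]

for post in posts:
    print(is_security_related(post), "|", post)
```

In this sketch the flu post matches the keyword "virus" but has no associated security terms, so it is rejected, which is the kind of false positive a plain keyword filter would let through. Raising `min_associated` makes the filter stricter at the cost of recall.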

Original language: English
Article number: 91
Journal: Social Network Analysis and Mining
Issue number: 1
Publication status: Published - Dec 2020

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Communication
  • Media Technology
  • Human-Computer Interaction
  • Computer Science Applications

