Enhancing data quality in real-time threat intelligence systems using machine learning

Ariel Rodriguez, Koji Okamura

Research output: Contribution to journal, Article, peer-reviewed

4 citations (Scopus)


In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Both research and production cyber threat intelligence systems often use keyword filtering, either to obtain training data for a classification model or as a classifier in itself. This method is known to produce false positives, which degrade data quality and can cause downstream issues for the security analysts who use these systems. We propose a method to classify open-source intelligence data into a cybersecurity-related information stream and subsequently increase the quality of that stream using an unsupervised clustering method. Our method expands on keyword filtering techniques by introducing a word2vec-generated list of associated words, which assists in classifying ambiguous posts to reduce false positives while still retrieving broad-scope data. We then use k-means clustering on positively classified entries to identify and remove clusters that are not relevant to threats. We further explore this method by investigating whether segmenting the data by its characteristics yields better classification. Together, these methods create a higher-quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that use keyword filtering.
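
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword and associated-word lists are hypothetical and hard-coded here (the paper generates the associated words with word2vec), the rule "keyword hit plus at least one associated word" is an assumed way of resolving ambiguous posts, and the k-means routine is a toy pure-Python version over bag-of-words counts with a deterministic initialization.

```python
import re

# Hypothetical lists for illustration only. In the paper the associated
# words come from a word2vec model; here they are hard-coded.
KEYWORDS = {"exploit", "malware", "virus", "breach"}
ASSOCIATED = {"cve", "patch", "payload", "ransomware", "server", "vulnerability"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def is_security_related(text, min_associated=1):
    """Assumed rule: a keyword hit counts only if enough associated words co-occur."""
    tokens = set(tokenize(text))
    if not tokens & KEYWORDS:
        return False
    return len(tokens & ASSOCIATED) >= min_associated

def vectorize(texts):
    """Turn each text into a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({t for text in texts for t in tokenize(text)})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for text in texts:
        vec = [0.0] * len(vocab)
        for t in tokenize(text):
            vec[index[t]] += 1.0
        vectors.append(vec)
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Toy k-means; the first k points serve as initial centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def remove_clusters(items, labels, irrelevant):
    """Drop items whose cluster label was judged not relevant to threats."""
    return [item for item, lab in zip(items, labels) if lab not in irrelevant]

posts = [
    "New malware payload exploits unpatched server vulnerability",
    "I caught a virus and stayed home from school",
    "Ransomware breach: patch your server now",
    "Great pasta recipe for dinner",
]

# Stage 1: keyword filter refined by associated words.
stream = [p for p in posts if is_security_related(p)]

# Stage 2: cluster the positively classified posts for later inspection.
labels = kmeans(vectorize(stream), k=2)
```

In this sketch the second post contains the keyword "virus" but no associated word, so it is filtered out as a likely false positive. Deciding which clusters to pass to `remove_clusters` is left to an analyst, since the abstract does not state how irrelevant clusters are identified.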

Journal: Social Network Analysis and Mining
Publication status: Published - December 2020

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Communication
  • Media Technology
  • Human-Computer Interaction
  • Computer Science Applications

