Recent cyber attacks have rapidly evolved in both technical sophistication and tactical depth, exposing the limitations of existing countermeasures. According to Microsoft’s Digital Defense Report 2024[1], attack groups have significantly increased their use of multi-stage attacks, combining multiple TTPs sequentially instead of relying on a single technique. Notably, high-velocity intrusions, in which attackers achieve privilege escalation and lateral movement within minutes of initial penetration, have surged. Major security reports likewise highlight that key attack groups flexibly alter their attack chains and technique combinations, employing techniques that tactically and strategically outpace existing defenses, a trend repeatedly flagged as demanding renewed vigilance[2,3].
Accordingly, our research team has been studying an enhanced cyber threat prediction system that comprehensively utilizes diverse attack information, overcoming the limitations of existing pattern-based detection and prediction methods that rely on limited information. As part of this research, we will introduce a new cyber threat prediction study combining Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) in a blog series. This article first examines existing cyber threat prediction methods and the concept of RAG, then explains the cyber threat prediction approach built on this foundation.
Existing Research on Cyber Threat Prediction
Early cyber threat prediction primarily relied on estimating future attacks from past attack information. The paper “Predictive Blacklisting as an Implicit Recommendation System”[4], presented at IEEE INFOCOM 2010, moved beyond simply blocking IPs that had attacked in the past and instead predicted IPs with a high likelihood of attacking in the future. This research marked a significant turning point by reframing attack logs as a recommendation-system problem: it analyzed correlations between attackers and victim organizations and actively predicted future attacks rather than merely blocking past ones. However, the approach relied excessively on a single metric, the IP address. Attackers can easily change IPs, and IPs alone cannot capture the technical context or strategic intent of an attack, making it difficult to derive meaningful results. While this method could capture superficial patterns, it was insufficient for predicting attackers’ actual strategic behavior or tactics.

Fig 1. Architecture of DeepLog[5]
Afterward, cyber threat prediction came to be understood as a continuous sequence of attack events, leading to sequence-model-based prediction research. Among these, DeepLog[5], presented at ACM CCS 2017, trains a Long Short-Term Memory (LSTM) model on system logs to predict subsequent events. While DeepLog broke away from traditional IP-based prediction by modeling attacks as sequences, enabling early detection of abnormal behavior, it had critical limitations: the long-term dependency problem inherent to LSTMs and an inability to capture the structural relationships between techniques in MITRE ATT&CK. With the advancement of Transformers, cyber threat prediction research adopted them as well, alleviating the long-term dependency problem while modeling the structure of MITRE ATT&CK to predict subsequent attacks. Recently, a method was introduced that tokenizes MITRE ATT&CK TTPs (Tactics, Techniques, Procedures) and uses Graph Neural Networks (GNNs) to model their structure when predicting the next technique[6].
Despite this, all such methods have limitations in effectively predicting attacks that continuously diversify and employ tactical and strategic approaches. The need for cyber threat prediction that moves beyond relying solely on fragmented information has grown steadily, necessitating the comprehensive consideration of vast attack analysis data such as the TTP relationships within MITRE ATT&CK and integrated security reports.
LLM with RAG
Before introducing LLM and RAG for predicting cyber threats, let’s briefly look at what RAG is, as it is widely applied in recent LLM research.
LLM and Hallucination
LLMs have made remarkable progress in generating natural sentences based on vast amounts of data, but a structural issue known as hallucination has been consistently pointed out. Hallucination refers to the phenomenon where LLMs create information that is not factually verified as if it were true, stemming from their inherent nature of performing probability-based language prediction. This hallucination has caused numerous problems in real-world scenarios. LLMs have frequently created non-existent papers to cite in response to user queries or fabricated false product reviews, confusing users. There have even been cases where LLMs submitted fabricated legal precedents, leading to legal issues, and provided customers with false discount information, causing actual financial losses to businesses[7].
Retrieval-Augmented Generation (RAG)

Fig 2. Answer generation process of Retrieval-Augmented Generation[8]
To overcome the hallucination problem in LLMs, Facebook AI Research proposed RAG in its 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”[9]. RAG is based on the idea that a model should retrieve knowledge it does not possess from external sources. It first searches external datasets for information relevant to the user’s question and then generates an answer based on the retrieved information. This lets LLMs provide responses grounded in actual source documents, rather than relying solely on parameters learned during training. The core of RAG, as shown in Figure 2, is similarity-based retrieval: finding the documents semantically closest to the user’s query. When a user inputs a question, RAG uses cosine similarity or dot product to find the most similar documents within a database where documents are stored as vector embeddings. The LLM then references these retrieved documents to generate the final response. This structure enables RAG to provide significantly more accurate, evidence-based responses than a purely generative model.
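The similarity-based retrieval step described above can be sketched in a few lines of Python. The tiny 3-dimensional embeddings and document texts below are illustrative assumptions; a real system would use a sentence-encoder model and a vector database:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, doc_store, k=2):
    """Rank stored documents by similarity to the query embedding."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in doc_store.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Toy 3-dimensional embeddings; real embeddings have hundreds of dimensions.
doc_store = {
    "T1059 Command and Scripting Interpreter": [0.9, 0.1, 0.0],
    "T1021 Remote Services (lateral movement)": [0.1, 0.8, 0.2],
    "T1566 Phishing": [0.0, 0.2, 0.9],
}
query = [0.85, 0.2, 0.05]  # embedding of the user's question
print(retrieve_top_k(query, doc_store, k=1))
# → ['T1059 Command and Scripting Interpreter']
```

Swapping cosine similarity for a plain dot product only requires replacing the scoring function; the dot product also rewards vector magnitude, while cosine similarity compares direction alone.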
The effectiveness of RAG has been demonstrated in numerous studies. The 2021 paper “Retrieval Augmentation Reduces Hallucination in Conversation”[10] reported that BART, one of the existing LLMs, recorded a hallucination rate of 68.2%; when RAG was applied, the rate dropped to 9.6%, an approximately 86% reduction. Furthermore, RAG proved effective even when combined with relatively small LLMs. According to Meta’s “Atlas: Few-shot Learning with Retrieval Augmented Language Models”[11], combining ATLAS-11B (11 billion model parameters) with RAG outperformed GPT-3 175B (175 billion model parameters) by at least 3% and up to 13% on key benchmarks. Through such research, RAG has established itself as a method that mitigates hallucination by giving LLMs the ability to look up and cite evidence, while also delivering very high accuracy on knowledge-intensive tasks.
LLM and RAG for Cyber Threat Prediction
How, then, are LLM and RAG structured for cyber threat prediction? As noted earlier, recent cyber attacks have diversified and increasingly rely on multi-stage methods rather than single techniques. The purpose of employing LLM and RAG, accordingly, is to evaluate this diverse attack information comprehensively.

Fig 3. Overview of LLM and RAG for Cyber Threat Prediction
LLM and RAG for predicting cyber threats can be broadly divided into data pre-processing, RAG dataset construction, and LLM querying, as shown in Figure 3. Let’s examine how data is processed and future attacks are anticipated in each step.
Data Pre-processing
The data pre-processing stage begins by restructuring the input “*.pcap” files into meaningful network units suitable for analysis—such as Sessions, Flows, and Time-windows—enabling effective analysis of attack techniques. This restructuring process not only allows direct interpretation of vast packet data but also facilitates efficient retrieval when extracting similar information from RAG. Subsequently, the “*.pcap” data, now divided into smaller units, undergoes analysis for elements such as IP information, packet length, transmission intervals, and the presence of abnormal scans. This processed data is then utilized in LLM queries and RAG information extraction.
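As a rough sketch of this restructuring step, the snippet below groups already-parsed packets into flows and derives simple per-flow features. The packet tuples and field layout are assumptions for illustration; in practice a pcap parser such as scapy or dpkt would supply these fields:

```python
from collections import defaultdict
from statistics import mean

# Each packet is assumed to be already parsed from a "*.pcap" file into
# (timestamp, src_ip, dst_ip, dst_port, length).
packets = [
    (0.00, "10.0.0.5", "192.168.1.9", 445, 120),
    (0.02, "10.0.0.5", "192.168.1.9", 445, 1500),
    (0.05, "10.0.0.5", "192.168.1.9", 445, 300),
    (0.10, "10.0.0.7", "192.168.1.9", 22, 90),
]

def build_flows(packets):
    """Group packets into flows keyed by (src_ip, dst_ip, dst_port)."""
    flows = defaultdict(list)
    for ts, src, dst, dport, length in packets:
        flows[(src, dst, dport)].append((ts, length))
    return flows

def flow_features(flow):
    """Summarize one flow into simple features usable in an LLM prompt."""
    times = [ts for ts, _ in flow]
    lengths = [ln for _, ln in flow]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "packet_count": len(flow),
        "mean_length": mean(lengths),
        "mean_interval": mean(gaps) if gaps else 0.0,
    }

for key, flow in build_flows(packets).items():
    print(key, flow_features(flow))
```

The same grouping idea extends to Sessions (adding TCP state) or Time-windows (keying on fixed time buckets instead of the address tuple).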
RAG Dataset Construction and Extraction

Fig 4. Example of RAG Dataset Construction and Use[12]
During the RAG dataset construction phase, various security materials are collected and organized into a search-optimized database format, similar to Figure 4, to provide the LLM with accurate and structured knowledge necessary for threat analysis. First, to build the foundational dataset for RAG, we collect MITRE ATT&CK TTPs data defining each attack. This includes Tactics, Techniques, and Sub-Techniques information, along with descriptions of each technique, attack rationale, and campaign examples. This enables the RAG dataset to rapidly search for the technique most closely matching characteristics observed in a specific network session.
Additionally, to cover diverse attack cases and attack flows, we collect security reports, breach-incident analysis articles, and similar materials at scale. These documents give the LLM sufficient context for the current packet, including the attacker’s sequence of actions, the combination of techniques used, and the rationale behind complex attacks. However, the original texts of security reports may use inconsistent technical terminology or lack mappings to MITRE ATT&CK TTPs, so normalization and structuring are essential. Key tasks include mapping attack techniques described in reports to MITRE ATT&CK TTPs, structuring the techniques used by each attack group, and reconstructing them by attack phase.
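A minimal sketch of such normalization might look like the following; the alias table and report sentence are hypothetical, and a production pipeline would cover far more terminology, possibly with embedding-based fuzzy matching:

```python
import re

# Illustrative alias table mapping report wording to ATT&CK technique IDs.
TTP_ALIASES = {
    "powershell": "T1059.001",
    "spearphishing": "T1566.001",
    "pass the hash": "T1550.002",
}

def normalize_report(text):
    """Map technique mentions in a report sentence to ATT&CK IDs."""
    found = set()
    # Keep technique IDs already written in the report (e.g. "T1059").
    found.update(re.findall(r"T\d{4}(?:\.\d{3})?", text))
    lowered = text.lower().replace("-", " ")  # "pass-the-hash" -> "pass the hash"
    for alias, ttp_id in TTP_ALIASES.items():
        if alias in lowered:
            found.add(ttp_id)
    return sorted(found)

report = ("The group delivered a spearphishing attachment, then used "
          "PowerShell (T1059) and pass-the-hash for lateral movement.")
print(normalize_report(report))
# → ['T1059', 'T1059.001', 'T1550.002', 'T1566.001']
```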
The natural language data constructed in this manner is represented as vectors through the embedding process and stored in vector databases such as FAISS and Chroma. These stored vectors provide the LLM with the most relevant documents through similarity-based search, based on the user’s query. The LLM then acquires various pieces of information—such as which observed network behaviors resemble past attack cases, or which MITRE ATT&CK techniques the input session connects to—and uses this as a basis for inferring attacks that are highly likely to unfold next.
LLM Query
Finally, in the LLM query stage, the analysis results from the preceding preprocessing phase and the relevant attack knowledge retrieved via RAG are combined to form a single prompt. This prompt includes information about the current session, along with attacks similar to the current session retrieved from RAG and similar attack cases from past attack groups, which are then delivered to the LLM. Afterward, the LLM comprehensively infers, based on the provided information, which attack and techniques the current packet flow resembles, what attack flows it led to in the past, and what possible attacks could follow. It then predicts future possible attacks and presents them as answers to the user.
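The prompt-assembly step can be sketched as below; the template wording, session features, and retrieved documents are illustrative assumptions, not a fixed format:

```python
def build_prompt(session_features, retrieved_docs):
    """Combine pre-processing output and RAG results into one prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "You are a cyber threat analyst.\n"
        f"Current session features: {session_features}\n"
        "Related knowledge retrieved from the RAG database:\n"
        f"{context}\n"
        "Based on the above, identify which ATT&CK techniques this "
        "session resembles and predict the attacks likely to follow."
    )

prompt = build_prompt(
    {"packet_count": 3, "mean_length": 640, "dst_port": 445},
    ["T1021.002 SMB/Windows Admin Shares: lateral movement over port 445",
     "Report: group X followed SMB access with credential dumping (T1003)"],
)
print(prompt)
```

The resulting string is what gets sent to the LLM; the model's answer then combines the session evidence with the retrieved background knowledge.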
Conclusion
Existing cyber threat prediction techniques rely on single-indicator or single-event analysis, limiting their ability to capture complex attack patterns. In contrast, an approach combining LLM and RAG integrates diverse attack information into a structured form and extends it through evidence-based reasoning, effectively identifying attack context and tactical connections that conventional methods overlook. Specifically, by combining real-time network session analysis of “*.pcap” files with threat intelligence retrieved via RAG, and letting an LLM reason over the combined data, a new level of threat prediction becomes possible: not only semantic analysis of the current attack, but also prediction of the subsequent attack stages most likely to unfold.
In the next article, building upon the concepts introduced here, we will cover the overall architecture and specific implementation process of a cyber threat prediction framework utilizing LLMs and RAG. We will also introduce a cyber threat prediction method leveraging XAI-based reasoning, previously presented in our detection research. We look forward to your continued interest.
References
[1] Microsoft, “Microsoft Digital Defense Report 2024”, 2024
[2] CrowdStrike, “Global Threat Report 2024”, 2024
[3] Mandiant, “M-Trends 2024”, 2024
[4] Soldo, F., Le, A., Markopoulou, A., “Predictive Blacklisting as an Implicit Recommendation System”, IEEE INFOCOM, 2010
[5] Du, M., Li, F., Zheng, G., Srikumar, V., “DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning”, In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS ’17), 2017
[6] Singh, C., Dhanraj, M., Huang, K., “KillChainGraph: ML Framework for Predicting and Mapping ATT&CK Techniques”, arXiv, 2025
[7] MBN, “Billions Lost After Decisions Made Trusting AI… an ‘AI Hallucination’ Alert”, 2025
[8] K2view, “RAG architecture: The generative AI enabler”, 2025
[9] Lewis, P., et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, NeurIPS, 2020
[10] Shuster K., et al., “Retrieval Augmentation Reduces Hallucination in Conversation”, Findings of ACL & EMNLP, 2021
[11] Izacard, G., et al., “Atlas: Few-shot Learning with Retrieval Augmented Language Models”, arXiv, 2021
[12] Naver Cloud Platform Forum, “(Part 1) What is RAG?”, 2024

The author is a researcher on the AI Technology Security Team at the KAIST Cyber Security Research Center, conducting security research using AI and LLMs.