First steps in Project T9 (Building an AI training dataset based on modelling the latest cyberattacks)

Earlier this year, we began our journey with the T9 Project, as outlined in our post “Beginning of Journey to T9 Project”. Now, the first release of T9 Data is almost here. In this post, we briefly introduce the background and purpose of the T9 Project, explain why we selected the specific attacks and data for release, and describe how we built the environment.

Why T9?

The T9 Project was initiated for three reasons. First, there is a lack of high-quality training datasets for cyberattack detection AI models (security AI models). Second, existing training datasets do not reflect the latest cyberattack trends. Third, creating cyberattack training datasets requires significant time and effort. To address these limitations, we regularly replicate the latest cyberattacks (such as Apache Log4Shell and CryptoWire Ransomware) in an automated environment that executes the attacks and collects logs (packets, system activity logs, and so on). These logs are then used to develop and improve security AI models.

Table 1. T9 Project 2024 Attack List (2024-01)

| No. | T9 Attack ID | Domain | Name / Method |
|-----|--------------|--------|---------------|
| 1 | T1-24-01-S-N-CL | Network | Apache Log4Shell |
| 2 | T2-24-01-S-N-CL | Network | SMBGhost |
| 3 | T3-24-01-S-N-CL | Network | Apache ActiveMQ Deserialization |
| 4 | T4-24-01-S-E-M | End Point | CryptoWire Ransomware |
| 5 | T5-24-01-S-E-LM | End Point | XMRing Miner |
| 6 | T6-24-01-S-E-FH | End Point | Su Brute-Force |
| 7 | T7-24-01-M-NE-CLM | Network, End Point | Apache Log4Shell + XMRing Miner |
| 8 | T8-24-01-M-NE-CFHL | Network, End Point | Apache ActiveMQ + Su Brute-Force |
| 9 | T9-24-01-M-NE-CLM | Network, End Point | SMBGhost + CryptoWire Ransomware |

The attacks listed in Table 1 either caused significant social disruption or occurred within the last two years. Some, like Apache Log4Shell and SMBGhost, are older but were selected for their substantial social impact and consequences.

The attack domains in Table 1 are categorized based on where the attacks can be detected: “Network” refers to attacks detectable through network packet analysis, while “End Point” refers to attacks detectable through system logs collected from the host. Not every attack can be detected from network or host information alone, and some can be detected from both. For example, Apache Log4Shell attacks can be detected on the host where the command is executed but are more effectively detected using network packet data. We therefore assigned each attack to the domain where it is detected most effectively.
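The attack IDs in Table 1 appear to follow a regular pattern, so a small parser can be handy when sorting downloaded data. The sketch below is only an illustration: the field meanings are assumptions inferred from Table 1 (`24-01` as the 2024-01 release, `S`/`M` as single or combination attack, `N`/`E`/`NE` as the domain), and the trailing letters are kept as an opaque suffix because this post does not define them. The names `AttackId` and `parse_attack_id` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed layout, inferred from Table 1 only:
#   T<number>-<yy>-<release>-<S|M>-<N|E|NE>-<suffix>
_ID_PATTERN = re.compile(
    r"^T(?P<number>\d+)-(?P<year>\d{2})-(?P<release>\d{2})"
    r"-(?P<kind>[SM])-(?P<domain>NE|N|E)-(?P<suffix>[A-Z]+)$"
)

# Domain codes mapped to the domain names used in Table 1.
_DOMAINS = {
    "N": ("Network",),
    "E": ("End Point",),
    "NE": ("Network", "End Point"),
}


@dataclass(frozen=True)
class AttackId:
    number: int      # position in the attack list
    release: str     # e.g. "2024-01"
    combined: bool   # True when the ID carries "M" (assumed: combination attack)
    domains: tuple   # detection domains, as named in Table 1
    suffix: str      # trailing letters, meaning not defined in this post


def parse_attack_id(text: str) -> AttackId:
    """Split a T9 attack ID into its (assumed) fields."""
    match = _ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised T9 attack ID: {text!r}")
    return AttackId(
        number=int(match["number"]),
        release=f"20{match['year']}-{match['release']}",
        combined=match["kind"] == "M",
        domains=_DOMAINS[match["domain"]],
        suffix=match["suffix"],
    )
```

For example, `parse_attack_id("T7-24-01-M-NE-CLM")` yields a combination attack from the 2024-01 release that spans both the Network and End Point domains.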

Now, let’s focus on a crucial aspect of the T9 Project: the automatic collection of attack logs. The collection process depends on the configured attack environment: we use Docker for network attacks and VirtualBox for endpoint or combination attacks (a single attack is one individual T9 attack, while a combination attack involves two or more single attacks). For log collection, we used packet dump applications such as tcpdump and pktmon to capture packets for network attacks, and Microsoft's Sysmon to collect system logs for endpoint attacks.

Figure 1. Example of T9 Project attack data collection

Figure 1 illustrates the process of collecting attack data. While the details may vary depending on the deployment, the overall process remains similar. When ‘run.py’ is executed to perform an attack, the virtual environment is launched first, followed by the initiation of the log collector. After the attack is executed, the log collection stops, and the logs are sent to the host. Details about the configuration of the environment and log collection for each attack (2024-01 attack list) can be found on our T9 website (https://t9project.dev/). You may also download the minimally collected raw attack data from the website.
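The sequence described for Figure 1 (launch the environment, start the log collector, run the attack, stop the collector, send the logs to the host) can be sketched in a few lines of Python. This is not the project's actual `run.py`; it is a minimal sketch in which `log_collector`, `run_attack`, and the three callables are hypothetical names, and the collector is assumed to be any command-line tool that runs until it is terminated.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def log_collector(command):
    """Run `command` in the background and stop it when the block exits."""
    process = subprocess.Popen(command)
    try:
        yield process
    finally:
        # Ask the collector to stop, then force it if it does not exit in time.
        process.terminate()
        try:
            process.wait(timeout=10)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()


def run_attack(start_environment, collector_command, attack, ship_logs):
    """Follow the order described for Figure 1.

    start_environment, attack, and ship_logs are caller-supplied callables
    (hypothetical hooks); collector_command is an argument list for Popen.
    """
    start_environment()                    # 1. launch the virtual environment
    with log_collector(collector_command):  # 2. start the log collector
        attack()                            # 3. execute the attack
    # 4. the collector has been stopped by the context manager
    ship_logs()                             # 5. send the logs to the host
```

For a network attack, `collector_command` might look like `["tcpdump", "-i", "eth0", "-w", "attack.pcap"]`, where the interface and file names are placeholders. Using a context manager means the collector is stopped even if the attack step raises an exception, so a failed run does not leave a capture process behind.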

Welcome to the T9 website!

The T9 Project website comprises four main sections: Home, Attack, Dataset, and Contact Us. The Home section provides the background and purpose of the T9 Project and an introduction to our overall research.

Figure 2. Front page of the T9 Project website

The Attack section contains a detailed description of each attack, instructions for building and running the environment, MITRE ATT&CK tactic correlations, and the collected attack data (packets, logs, etc.).

Figure 3. Attack page of the T9 Project website

The Dataset section lists attack data available for download (pcap for network logs, evtx for endpoint logs, and log for Windows). The Contact Us section provides contact information for requesting additional information such as deployment environment and attack source.
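After downloading files from the Dataset section, it can be useful to confirm that each file really is in the format its extension suggests. The sketch below checks the leading magic bytes of a file against the well-known signatures for pcap, pcapng, and evtx. It is an illustrative helper, not part of the T9 tooling; the function names are hypothetical, and the plain-text `log` files are not covered because they have no fixed signature.

```python
def sniff_capture_format(header: bytes) -> str:
    """Guess a file format from its first bytes (at least 8 recommended)."""
    magic = header[:4]
    # Classic pcap: magic 0xA1B2C3D4, stored in either byte order.
    if magic in (b"\xd4\xc3\xb2\xa1", b"\xa1\xb2\xc3\xd4"):
        return "pcap"
    # pcap with nanosecond timestamps: magic 0xA1B23C4D.
    if magic in (b"\x4d\x3c\xb2\xa1", b"\xa1\xb2\x3c\x4d"):
        return "pcap (nanosecond)"
    # pcapng starts with a Section Header Block of type 0x0A0D0D0A.
    if magic == b"\x0a\x0d\x0d\x0a":
        return "pcapng"
    # Windows event log (evtx) files start with the ASCII tag "ElfFile\0".
    if header[:8] == b"ElfFile\x00":
        return "evtx"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file at `path` to identify its format."""
    with open(path, "rb") as handle:
        return sniff_capture_format(handle.read(8))
```

Calling `sniff_file` on each downloaded file and comparing the result with the file extension is a quick way to catch truncated or mislabelled downloads before feeding them into an analysis pipeline.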

Figure 4. Dataset page of the T9 Project website

Conclusion

In this blog post, we briefly described how the T9 Project implements its attacks and collects logs, and introduced the T9 website where you can obtain T9 Data. We will continue to analyze the latest cyberattacks and update T9 Data periodically. In 2025, we also plan to release a benign dataset and a cyberattack detection AI model built with it, intended for practical use in security AI work. T9 Data 2024-02 will be released on December 17, so stay tuned for our latest news and updates.
