During a recent talk titled “Hunters ATT&CKing with the Right Data” that I gave with my brother Jose Luis Rodriguez @Cyb3rPandaH at ATT&CKcon, we talked about the importance of documenting and modeling security event logs before developing any data analytics while preparing for a threat hunting engagement. Defining relationships among Windows security event logs such as Sysmon, for example, helped us to appreciate the extra context that two or more events together can provide for a hunt. Therefore, I was wondering if there was anything that I could do with my project HELK to apply some of the relationships presented in our talk, and enrich the data collected from my endpoints in real-time.
This post is part of a three-part series. In this first one, I will introduce the initial integration of a new application named KSQL into the HELK ecosystem in order to enable a SQL interface for stream processing on top of the Kafka platform already provided by HELK. In the other two posts, I will go over a basic example of a JOIN statement with Sysmon Event ID 1 (Process Creation) and Sysmon Event ID 3 (Network Connection), and show you how useful it could be during a hunting engagement. The other two parts can be found in the following links:
- Real-Time Sysmon Processing via KSQL and HELK — Part 2: Sysmon-Join KSQL Recipe 📖
- Real-Time Sysmon Processing via KSQL and HELK — Part 3: Basic Use Case 🏹
What is KSQL?
KSQL is the open source streaming SQL engine for Apache Kafka®. It provides an easy-to-use yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic, fault-tolerant, and real-time. It supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.
KSQL is implemented on top of the Kafka Streams API, which means that KSQL queries are compiled into Kafka Streams applications.
What is “Kafka Streams”?
Kafka Streams is a JVM client library for developing stream processing applications that use data stored in Kafka clusters. Remember that stream processing applications do not run inside Kafka nodes. Instead, Kafka Streams applications read from topics available in Kafka nodes, process the data in their own internal stream processors, and write the results back to the same topic or a new topic in the Kafka cluster.
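To make the read, process, and write-back flow concrete, here is a minimal Python sketch that simulates it entirely in memory. It is not Kafka Streams code: the `cluster` dictionary, the `run_processor` function, the record fields, and the `process-creations` topic name are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# In-memory stand-in for a Kafka cluster: topic name -> list of (key, value) records.
# Hypothetical structure for illustration only; a real deployment uses a Kafka client.
cluster = defaultdict(list)
cluster["winlogbeat"] = [
    ("host-1", {"event_id": 1, "image": "cmd.exe"}),
    ("host-1", {"event_id": 3, "destination_ip": "10.0.0.5"}),
    ("host-2", {"event_id": 1, "image": "whoami.exe"}),
]


def run_processor(cluster, source_topic, sink_topic, predicate):
    """Read every record from source_topic, keep those matching predicate,
    and write them to sink_topic. The work happens here, outside the 'cluster'."""
    written = 0
    for key, value in cluster[source_topic]:
        if predicate(value):
            cluster[sink_topic].append((key, value))
            written += 1
    return written


# Keep only process creation events and publish them to a new (made-up) topic.
count = run_processor(
    cluster, "winlogbeat", "process-creations", lambda v: v["event_id"] == 1
)
print(count, "records written to process-creations")
```

The source topic is left untouched, which mirrors the idea that the application only reads from Kafka and writes its results to a topic of its choice.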
This concept is what makes KSQL flexible and easy to use with current Kafka cluster deployments.
What is a Stream?
A stream is the most important abstraction provided by Kafka Streams: it represents an unbounded, continuously updating data set, where unbounded means “of unknown or of unlimited size”.
Think of a stream as a sequence of data records ordered by time in a key-value format that can be queried for further analysis. One example could be messages from a Kafka topic that stores information about process creations of specific endpoints in your network as shown below.
KSQL: SQL interface for stream processing?
Essentially, KSQL allows you to easily execute SQL-like queries on top of streams flowing from Kafka topics. KSQL queries are translated into Kafka Streams applications, which spares you from writing several lines of Java code for real-time stream processing. Our basic design then would look like the following:
KSQL and SQL JOIN?
Now that we have gained some understanding of KSQL, let’s define one of the SQL capabilities provided by KSQL that will be helpful for this post, a SQL JOIN. KSQL join operations merge streams and/or tables on common data key values producing new streams or tables in real-time.
Streams vs Tables
Once again, streams are never-ending sequences of data records ordered by time that represent the past and present state of data ingested into a Kafka topic. One can access a stream from its beginning all the way to the most recently recorded values. Tables, on the other hand, represent only the up-to-date state of data records. For example, if DHCP logs are being collected, you can have a table that keeps the most up-to-date mapping between an IP address and a domain computer in your environment. Meanwhile, you can query the DHCP logs stream and access past IP addresses assigned to workstations in your network.
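The DHCP example can be sketched in a few lines of Python. This is only a conceptual illustration of the stream and table duality, not KSQL itself; the record layout, host names, and IP addresses are made up.

```python
# Hypothetical DHCP lease records, ordered by time: (timestamp, hostname, ip_address).
dhcp_stream = [
    (1, "WKS-01", "10.0.0.11"),
    (2, "WKS-02", "10.0.0.12"),
    (3, "WKS-01", "10.0.0.25"),  # WKS-01 receives a new lease
]


def table_from_stream(stream):
    """Fold the stream into a table that keeps only the latest IP per host."""
    table = {}
    for _timestamp, hostname, ip_address in stream:
        table[hostname] = ip_address  # later records overwrite earlier ones
    return table


def history_for(stream, hostname):
    """Query the stream for every IP a host has held, oldest first."""
    return [ip for _ts, host, ip in stream if host == hostname]


dhcp_table = table_from_stream(dhcp_stream)
print(dhcp_table["WKS-01"])                # current state only
print(history_for(dhcp_stream, "WKS-01"))  # full history from the stream
```

The table answers "what is true right now", while the stream still holds every record that led to that state.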
According to the Confluent Developers Guide, you can join streams and tables in the following way:
INNER, LEFT OUTER & FULL OUTER JOINS?
Inner Join: It returns data records that have matching values in both sources
Left Outer Join: It returns data records from the left source and the matched data records from the right source
Full Outer Join: It returns data records when there is a match in either left or right source
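The difference between the three join types is easier to see with a tiny example. The Python sketch below joins two keyed dictionaries and is meant only to show which keys each join type keeps. It ignores timing entirely, and the function names and sample values are hypothetical.

```python
def inner_join(left, right):
    """Keep only keys present in both sources."""
    return {k: (left[k], right[k]) for k in left if k in right}


def left_outer_join(left, right):
    """Keep every key from the left source; unmatched right values become None."""
    return {k: (left[k], right.get(k)) for k in left}


def full_outer_join(left, right):
    """Keep every key from either source; the missing side becomes None."""
    keys = list(left) + [k for k in right if k not in left]
    return {k: (left.get(k), right.get(k)) for k in keys}


left = {"a": "L1", "b": "L2"}
right = {"b": "R2", "c": "R3"}

print(inner_join(left, right))       # only "b" matches on both sides
print(left_outer_join(left, right))  # "a" and "b", with None for the missing right value
print(full_outer_join(left, right))  # "a", "b", and "c"
```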
I hope this small review of KSQL and the concepts around Kafka Streams was helpful to get you familiar with the technology being added to the HELK.
Why KSQL and HELK?
As I mentioned at the beginning of this post, I wanted to find a way to enrich Windows Sysmon event logs by applying the relationships identified within the information it provides. From an infrastructure perspective, I already collect Sysmon event logs from my Windows endpoints and publish them directly to a Kafka topic named winlogbeat in HELK. Therefore, after what we just learned about KSQL, it will be very easy to use it with the current HELK Kafka deployment and apply a Sysmon data model via join operations in real-time.
What is a data model?
A data model in general determines the structure of data objects present in a data set and the relationships identified among them. From a security events perspective, data objects can be entities provided in event logs such as a “User”, a “Host”, a “Process”, a “File” or even an “IP address”. Like any other data object, they also have properties such as “user_name”, “host_name”, “process_name” or “file_name”, and depending on the information provided by each event log, relationships can be defined among those data objects as shown below:
Modeling data objects identified in security event logs helps security analysts to identify the right data sources and correlations that can be used for the development of data analytics.
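As a rough illustration of the idea, data objects and their relationships could be expressed in Python as follows. This is only a sketch: the class layout, the relationship verbs ("created", "ran_on"), and the sample event are assumptions made for the example, not a schema defined by HELK or Sysmon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_name: str


@dataclass(frozen=True)
class Host:
    host_name: str


@dataclass(frozen=True)
class Process:
    process_name: str


@dataclass(frozen=True)
class Relationship:
    source: object
    verb: str  # hypothetical verb chosen for this example
    target: object


def relationships_from_event(event):
    """Extract data objects from one event and link them together."""
    user = User(event["user_name"])
    host = Host(event["host_name"])
    process = Process(event["process_name"])
    return [
        Relationship(user, "created", process),
        Relationship(process, "ran_on", host),
    ]


# Made-up event with the properties mentioned in the text.
event = {"user_name": "alice", "host_name": "WKS-01", "process_name": "cmd.exe"}
for rel in relationships_from_event(event):
    print(rel.source, rel.verb, rel.target)
```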
What is the “Sysmon Data Model”?
Windows Sysmon event logs provide information about several data objects such as “Processes”, “IP Addresses”, “Files”, “Registry Keys”, and even “Named Pipes”. In addition, most of their data objects have a common property named ProcessGUID that defines direct relationships among specific Sysmon events. According to my teammates Matt Graeber and Lee Christensen, in their recent white paper “Subverting Sysmon”, the ProcessGUID is a unique value derived from the machine GUID, process start time, and process token ID that can be used to correlate other related events. After documenting the relationships among Sysmon events and data objects based on their ProcessGUID property, the following data model is possible:
As we already know, KSQL join operations happen on common unique data key values. Therefore, the ProcessGUID property can be used to join Sysmon events. For the purpose of this post, we will join ProcessCreate (Event ID 1) and NetworkConnect (Event ID 3) events.
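Before doing this in KSQL, the join logic itself can be sketched in plain Python. The events below are made up, the lowercase field names are simplified placeholders rather than the exact fields shipped by Winlogbeat, and `join_on_process_guid` is a hypothetical helper that performs an inner join on the shared GUID.

```python
# Made-up Sysmon Event ID 1 (ProcessCreate) records.
process_create_events = [
    {"process_guid": "{A-1}", "image": "cmd.exe", "user": "LAB\\alice"},
    {"process_guid": "{A-2}", "image": "notepad.exe", "user": "LAB\\bob"},
]

# Made-up Sysmon Event ID 3 (NetworkConnect) records.
network_connect_events = [
    {"process_guid": "{A-1}", "destination_ip": "10.0.0.5", "destination_port": 445},
    {"process_guid": "{A-1}", "destination_ip": "10.0.0.9", "destination_port": 3389},
]


def join_on_process_guid(creations, connections):
    """Inner join: enrich each network connection with its process creation context."""
    by_guid = {e["process_guid"]: e for e in creations}
    joined = []
    for conn in connections:
        creation = by_guid.get(conn["process_guid"])
        if creation is not None:  # skip connections with no matching process
            joined.append({**creation, **conn})
    return joined


for record in join_on_process_guid(process_create_events, network_connect_events):
    print(record["user"], record["image"], "->", record["destination_ip"])
```

Each joined record carries both who started the process and where it connected, which is the extra context a single event cannot provide on its own.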
HELK and KSQL Integration
KSQL was developed as part of the Confluent platform, and it is distributed via Docker images available on Docker Hub. HELK is deployed via Docker images as a proof of concept, so having Docker images for KSQL works perfectly. The two Docker images that I added to the HELK ecosystem are the following ones:
- cp-ksql-server image: It includes the ksql-server package, which runs the engine that executes KSQL queries.
- cp-ksql-cli image: It includes the ksql-cli package, which acts as a client to the KSQL server and allows researchers to interactively pass KSQL queries to it. This one is added to the HELK just for testing purposes. The KSQL server can also run independently with predefined SQL query files.
For this blog post, we will need the following:
- An Ubuntu box hosting the latest HELK build
- A Windows 10 System with Sysmon installed
- Winlogbeat installed on the Windows 10 and shipping logs to HELK
Deploying KSQL via HELK
Clone the HELK to your Ubuntu box, and change your directory to docker as shown below:
git clone https://github.com/Cyb3rWard0g/HELK.git
cd HELK/docker
Run the helk_install.sh script to install and run the HELK docker images. You can just go with all the default options and run the basic HELK deployment which comes with both KSQL server and KSQL CLI containers.
If you want to monitor your HELK installation, you can open another console and run the following command:
tail -f /var/log/helk-install.log
Once the installation finishes, you should see the following on your main screen:
Run the following command to see if your containers are running:
sudo docker ps
Launch KSQL CLI interface
You are now ready to start using KSQL. We will use KSQL via its command line interface (CLI) to connect to the helk-ksql-server container and send KSQL queries to it. Run the following command to access the helk-ksql-cli container and establish a connection to the helk-ksql-server container:
sudo docker exec -ti helk-ksql-cli ksql http://helk-ksql-server:8088
A KSQL CLI banner will show up, and you will be able to use the KSQL CLI.
Inspect the KSQL Server Properties
You can now start by checking the properties assigned to the KSQL server by running the following command:
ksql> SHOW PROPERTIES;
The information above confirms that the helk-kafka-broker is part of the ksql.streams.bootstrap.servers. Therefore, we will be able to execute KSQL queries on the topics available in the Kafka broker.
Check Available Kafka Topics
We can check the metadata of topics available on our helk-kafka-broker with the SHOW TOPICS command.
ksql> SHOW TOPICS;
What the HELK is going on so far?
Up to this point, we have all we need to start using KSQL on top of the HELK project. The following is happening:
- HELK’s Kafka broker with topic winlogbeat running
- KSQL Server is running and configured to read from the Kafka topic named winlogbeat
- KSQL CLI is running and configured to talk to KSQL Server and send interactive queries
- Your Ubuntu box hosting the HELK has an interactive connection to the helk-ksql-cli container
- HELK is waiting for Windows Sysmon logs to be published
Get Sysmon Data to HELK
You can now install Sysmon and Winlogbeat following the initial instructions in this post. The following binary versions and configurations are recommended:
- Sysmon Config: start_logging.xml
- Winlogbeat Config: ksql_demo.yml
Remember to start the winlogbeat service to start sending logs to the HELK Kafka broker as shown in the image below:
Check Winlogbeat Shipping Logs
You can check if the logs being collected by the Winlogbeat Shipper are being published to your HELK Kafka broker by running the following command on your Windows endpoint:
Check Logs Published to Kafka Topics
You can also inspect messages making it to the Kafka topic winlogbeat with the PRINT command as shown below:
ksql> PRINT 'winlogbeat';
We can confirm that data is flowing from our Windows system to our HELK Kafka broker, and through our KSQL Server.
I hope this first post helped you get familiar with the basic concepts of KSQL, and showed you how easy it is to use with the latest version of HELK. In the next post, I will show you how to use KSQL to start joining Sysmon events 1 and 3 in real-time. The Sysmon-Join KSQL Recipe will be shared so you can try it.