ChatGPT for Forensic Analysis

Jan Sunavec
7 min read · Jun 3, 2023

Decompiling assembly code for uncommon CPUs such as RISC-V or ARM is quite challenging. There are only a few available tools like IDA Pro. However, can ChatGPT be of any help in this situation? Allow me to share my entire story regarding a particularly difficult CTF challenge.

Weird USB

This is an actual example from the CTF game cybergame.sk. One of the CTF flags revolved around a peculiar USB key: when the user plugged it in, the PC started exhibiting strange behavior. The only artifact is a single 8 KB binary file downloaded from the device. The file has a .hex extension, but the format is unfamiliar to me. Every line has the same length of 43 characters: a leading “:” followed by 42 hexadecimal digits.

:100000008BC0AFC0C0C3ADC048C3ABC0AAC0A9C0FD
:10001000A8C0A7C0A6C0A5C0A4C0A3C0A2C02C5E93
:100020007460616264346667656E362D3738271EEA
:100030001F202122232425267333762E77785F44D0
:1000400045464748494A4B4C4D4E4F5051525354E8

Let’s analyze.

Using ChatGPT to analyze unknown files

I used only version 3.5 in all of my examples. Text this long has to be shared through Pastebin. I then asked ChatGPT:

Q:
“You are a forensic expert. Analyze the following (pastebin link) hex format and tell more about the format.”

The first sentence nudges ChatGPT toward a stricter, less “artistic” mode that is less prone to hallucination. This is of utmost importance when using ChatGPT for analysis, so remember to state the role at the start of the session.

A: It’s a binary format. Bla bla bla…

That answer is not satisfactory; it’s a generic response suggesting that the file could be anything in binary format. To obtain a more accurate response, it is essential to provide more specific information. Share as much as you know, because otherwise ChatGPT tends to produce vague or commonplace answers. My assumption is that the file is firmware extracted from a USB key.

Q:
“Analyze the following (pastebin link) hex format and tell more about the format. It’s firmware from USB”

A: “It’s a binary format. Bla bla bla…”

If the response remains identical to the previous one, it indicates that the underlying mathematical model of ChatGPT hasn’t quite grasped the specific context. In such cases, it is advisable to provide additional keywords or details. Based on my further assumption, I deduce that the file could be code extracted from a microcontroller. Microcontrollers are compact CPUs with limited pins and relatively simple architectures.

Q: “Analyze the following (pastebin link) hex format and tell more about the format. It’s firmware from a microcontroller.”

A: “It’s a memory dump from the microcontroller in Intel HEX format.”

This time the extra keyword worked: ChatGPT identified the file as a microcontroller memory dump in Intel HEX format, which matches my assumption that the file contains code extracted from a microcontroller.
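The Intel HEX identification can be checked by hand. In that format each record is a colon followed by a byte count, a 16-bit address, a record type, the data bytes and a checksum byte chosen so that all bytes of the record sum to a multiple of 256. The following is a minimal sketch of such a check, not part of the original analysis; `HexRecord` and `parse_record` are names made up for this illustration, and the sample lines are the first two lines of the dump shown earlier.

```python
from typing import NamedTuple


class HexRecord(NamedTuple):
    address: int      # 16-bit load address of the first data byte
    record_type: int  # 0x00 = data, 0x01 = end of file
    data: bytes


def parse_record(line: str) -> HexRecord:
    """Parse one Intel HEX record and verify its length and checksum."""
    line = line.strip()
    if not line.startswith(":"):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:])
    # Layout: count (1) + address (2) + type (1) + data (count) + checksum (1)
    if len(raw) < 5 or len(raw) != raw[0] + 5:
        raise ValueError("length field does not match record size")
    # The checksum byte makes the sum of all bytes a multiple of 256.
    if sum(raw) % 256 != 0:
        raise ValueError("checksum mismatch")
    return HexRecord(
        address=int.from_bytes(raw[1:3], "big"),
        record_type=raw[3],
        data=raw[4:-1],
    )


# The first two lines of the dump shown earlier in the article.
SAMPLE = [
    ":100000008BC0AFC0C0C3ADC048C3ABC0AAC0A9C0FD",
    ":10001000A8C0A7C0A6C0A5C0A4C0A3C0A2C02C5E93",
]

for record in map(parse_record, SAMPLE):
    print(f"{record.address:04x}: {record.data.hex(' ')}")
```

Both sample lines pass the checksum test and decode to 16 data bytes each, which is consistent with ChatGPT's answer.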

http://digistump.com/products/1

Based on the website specifications, I know that the CPU is an ATtiny85, and a corresponding schematic is available. With this information I can use a simulator that supports this CPU, such as simavr (available at https://github.com/buserror/simavr). Using the simulator, it becomes possible to dump the file as an assembly listing. Note that the assembly is for the AVR architecture, not x86. This introduces a slight challenge: distinguishing data from instructions is tricky. Unlike an ELF binary, which has a clearly marked data section, a raw memory dump has no such markers, so the distinction can only be determined during execution.
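A raw dump carries no section markers, but a crude heuristic can still hint at where data might live. The sketch below measures the share of printable ASCII bytes in each 16-byte row of the dump excerpt shown earlier. It is an illustration only, not a step from the original analysis: `printable_ratio` is a made-up helper, the 0.8 threshold is an arbitrary assumption, and a high ratio is only a hint that a row may be a lookup table or string rather than instructions.

```python
def printable_ratio(chunk: bytes) -> float:
    """Return the fraction of bytes in the printable ASCII range 0x20..0x7E."""
    if not chunk:
        return 0.0
    printable = sum(1 for byte in chunk if 0x20 <= byte <= 0x7E)
    return printable / len(chunk)


# Data bytes of the five records shown earlier, keyed by load address.
ROWS = {
    0x00: "8BC0AFC0C0C3ADC048C3ABC0AAC0A9C0",
    0x10: "A8C0A7C0A6C0A5C0A4C0A3C0A2C02C5E",
    0x20: "7460616264346667656E362D3738271E",
    0x30: "1F202122232425267333762E77785F44",
    0x40: "45464748494A4B4C4D4E4F5051525354",
}

THRESHOLD = 0.8  # arbitrary cut-off, chosen for this illustration only

for address, payload in ROWS.items():
    ratio = printable_ratio(bytes.fromhex(payload))
    guess = "data?" if ratio > THRESHOLD else "code?"
    print(f"{address:04x}  printable={ratio:.2f}  {guess}")
```

On this excerpt the first two rows are almost entirely non-printable while the last three are almost entirely printable, so the heuristic at least separates them. Any such guess still has to be confirmed against the disassembly or by running the code in the simulator.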

Decompiler

My initial approach is to convert the AVR assembler into pseudo C++ code. To accomplish this, I will once again utilize Pastebin to share the code and pose questions to ChatGPT for assistance.

Q: “This is an AtTiny85 CPU assembler. https://pastebin.com/rN4LyzEd. Convert it into C++ code”.

Regrettably, ChatGPT has no specific support for converting RISC ASM from a memory dump directly into C++. Although the code it returned is correct, it offers only minimal information. How can we get more out of ChatGPT?

There are three recommended approaches that you can consider:

Utilize more keywords: Include additional specific keywords or terms in your input to guide ChatGPT and provide it with more context regarding your query.

Narrow down the input: Instead of decompiling the entire code, focus on decompiling a single function or a smaller portion of the code. This can help reduce ambiguity and generate more precise results.

Explore hallucinations: Encourage ChatGPT to generate imaginative or creative responses by allowing for more flexibility in the answers. This approach may lead to novel insights or perspectives.

By employing these approaches, you can enhance the interactions with ChatGPT and potentially gain more valuable information.
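The three approaches come down to varying a few parts of the prompt: the role sentence, the context keywords and the requested output form. A small helper makes it easier to try combinations systematically. This is only a sketch: `build_prompt` and its parameters are hypothetical names, and the code merely assembles text; it does not call ChatGPT.

```python
def build_prompt(subject, url, task, role="", keywords=()):
    """Assemble a prompt from a role, a subject, context keywords and a task."""
    sentences = []
    if role:
        sentences.append(f"You are {role}.")
    sentences.append(f"This is {subject}. {url}.")
    sentences.extend(f"{keyword}." for keyword in keywords)
    sentences.append(f"{task}.")
    return " ".join(sentences)


URL = "https://pastebin.com/TczUP1TM"

# Strict variant: role sentence, an extra keyword, concrete target language.
strict = build_prompt(
    "an Attiny85 CPU assembler",
    URL,
    "Convert it into C++ code",
    role="a forensic expert",
    keywords=["It's firmware from a microcontroller"],
)

# Looser variant: no role, and pseudo code requested instead of C++.
loose = build_prompt("an Attiny85 CPU assembler", URL, "Convert it to pseudo C code")

print(strict)
print(loose)
```

Keeping the variants side by side also makes it easier to record which wording produced which answer.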

More hallucinations

Q: “This is an Attiny85 CPU assembler. https://pastebin.com/TczUP1TM. Convert it to pseudo C code”.

In this particular case, ChatGPT operated in a less stringent mode and began generating more imaginative responses or ideas.

int main() {
    // Call the setup function to initialize the I/O configurations
    setup();
    while (1) {
        // Check if the button is pressed (active low)
        if (!read_bit(PINB, BUTTON_PIN)) {
            // Delay for debouncing
            _delay_ms(50);
            // Check if the button is still pressed after debouncing
            if (!read_bit(PINB, BUTTON_PIN)) {
                // Toggle the state of LED_PIN
                toggle_bit(PORTB, LED_PIN);
                // Delay for button release detection
                _delay_ms(500);
            }
        }
    }
    return 0;
}

Now we have identified the presence of a main loop and a delay function. However, it’s important to note that these findings are the most likely outputs based on our input. It’s akin to assessing the probability that the given ASM corresponds to C code. Therefore, it becomes crucial to validate the ASM code with the results obtained from ChatGPT. By cross-referencing and confirming the information, we can ensure the accuracy and reliability of the analysis.
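One cheap cross-check is to see whether the registers named in ChatGPT's output are touched in the listing at all. The pseudo code above reads PINB and writes PORTB; on the ATtiny85 these live at I/O addresses 0x16 and 0x18 (DDRB is 0x17). The sketch below collects the I/O addresses used by `in`, `out`, `sbi`, `cbi`, `sbic` and `sbis` in an objdump-style listing. It is an assumption-laden illustration: `io_addresses` is a made-up helper, and `TOY_LISTING` is a hand-written three-line example, not an excerpt from the real firmware.

```python
import re

# I/O addresses of the port B registers on the ATtiny85.
PINB, DDRB, PORTB = 0x16, 0x17, 0x18

IO_PATTERN = re.compile(
    r"\b(?:"
    r"in\s+r\d+,\s*0x([0-9a-f]+)"                      # in   Rd, A
    r"|out\s+0x([0-9a-f]+),\s*r\d+"                    # out  A, Rr
    r"|(?:sbic|sbis|sbi|cbi)\s+0x([0-9a-f]+),\s*\d+"   # bit operations on A
    r")",
    re.IGNORECASE,
)


def io_addresses(listing: str) -> set:
    """Return the set of I/O addresses referenced in a disassembly listing."""
    found = set()
    for match in IO_PATTERN.finditer(listing):
        address = next(group for group in match.groups() if group is not None)
        found.add(int(address, 16))
    return found


# Hand-written toy listing (NOT taken from the firmware in this article).
TOY_LISTING = """
 200:   b2 99           sbic    0x16, 2
 202:   02 c0           rjmp    .+4
 204:   c1 9a           sbi     0x18, 1
"""

used = io_addresses(TOY_LISTING)
print("reads PINB:  ", PINB in used)
print("writes PORTB:", PORTB in used)
```

If the real listing never touches these addresses, the button-and-LED story in the generated code is most likely a hallucination.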

Deep dive into functions

Examining the ASM code shows that the first lines form the interrupt vector table, a series of jumps to the individual handlers. Vector 0 is the reset vector, the address where execution starts when the CPU initializes. Notably, the main loop function is located at address 0x11e, the target of that first jump. These observations help to understand the structure and organization of the code at a higher level.

0: 8e c0 rjmp .+284 ; 0x11e
2: b2 c0 rjmp .+356 ; 0x168
4: de c0 rjmp .+444 ; 0x1c2
6: b0 c0 rjmp .+352 ; 0x168
8: f2 c4 rjmp .+2532 ; 0x9ee
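The jump targets in this table can be recomputed from the raw bytes, which is a quick way to validate the listing. An AVR `rjmp` is a 16-bit little-endian word of the form 1100 kkkk kkkk kkkk, where k is a signed 12-bit offset in words relative to the following instruction. The sketch below is a minimal decoder under that encoding; `decode_rjmp` is a name made up for this illustration, and it ignores wrap-around at the end of flash.

```python
def decode_rjmp(address: int, low: int, high: int):
    """Return the byte address an rjmp at `address` jumps to, or None.

    `low` and `high` are the two instruction bytes in the order they
    appear in the listing (AVR opcodes are stored little-endian).
    """
    word = (high << 8) | low
    if word & 0xF000 != 0xC000:
        return None  # not an rjmp
    offset = word & 0x0FFF
    if offset & 0x800:          # sign-extend the 12-bit word offset
        offset -= 0x1000
    # Target = address of the next instruction + offset, in bytes.
    return address + 2 + 2 * offset


# The five vectors from the listing above: (address, first byte, second byte)
VECTORS = [
    (0x0, 0x8E, 0xC0),
    (0x2, 0xB2, 0xC0),
    (0x4, 0xDE, 0xC0),
    (0x6, 0xB0, 0xC0),
    (0x8, 0xF2, 0xC4),
]

for address, low, high in VECTORS:
    print(f"{address:x}: rjmp -> {decode_rjmp(address, low, high):#x}")
```

The decoded targets agree with the addresses in the listing's comments, so the handler addresses used in the next prompts can be trusted.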

Let’s direct ChatGPT’s attention to the next function in line. The interrupt vector table tells us the addresses where the other functions begin, so let’s explore the next one.

Q:
“This is an Attiny85 CPU assembler. https://pastebin.com/TczUP1TM. Analyze function from address 0x1c2. End of function is “ret” instruction. Generate human readable C++ code.”

void myFunction() {
    // Local variables
    unsigned char r20 = 0;
    unsigned char r21 = 0;
    unsigned char r22 = 0;
    // Code block from address 0x1c2
    r21 = 0x03; // MOV r21, 0x03
    r20 = r21; // MOV r20, r21
    r22 = 0x04; // MOV r22, 0x04
    // Loop block
    for (;;) {
        // Code block from address 0x1c6
        r21 = 0x03; // MOV r21, 0x03
        // Code block from address 0x1c8
        if (r20 == r22) { // CP r20, r22
            break; // BRNE <exit loop>
        }
        // Code block from address 0x1ca
        // Do something…
        // Code block from address 0x1cc
        r20++; // INC r20
    }
    // Function end
    return; // RET
}

As you can observe, ChatGPT was configured to operate in strict mode, specifically “Generate C++” mode, rather than “Generate Pseudo” mode. This approach has yielded a valid representation of the function. By combining the pseudo code and C++ code generated by ChatGPT, we can obtain a comprehensive understanding of the internal workings of the function. This combination provides a valuable and insightful depiction of the code’s functionality.
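Narrowing the input can also be done before anything is pasted, by cutting one function out of the listing and sharing only that. The sketch below slices an objdump-style listing from a start address up to the first `ret` (or `reti`, which interrupt handlers typically end with). It is an illustration under assumptions: `slice_function` is a made-up helper, a function with several return points would be cut short, and `TOY_LISTING` is a hand-written example rather than the real firmware.

```python
import re

# An objdump-style line: optional spaces, hex address, colon, then the rest.
LINE = re.compile(r"^\s*([0-9a-f]+):\s+(.*)$", re.IGNORECASE)
RETURN = re.compile(r"\breti?\b", re.IGNORECASE)


def slice_function(listing: str, start: int) -> list:
    """Return the listing lines from `start` up to and including the first ret."""
    selected = []
    inside = False
    for line in listing.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip blank lines and anything that is not an instruction
        if int(match.group(1), 16) == start:
            inside = True
        if inside:
            selected.append(line)
            if RETURN.search(match.group(2)):
                break
    return selected


# Hand-written toy listing (NOT taken from the firmware in this article).
TOY_LISTING = """
 1c0:   00 00           nop
 1c2:   00 00           nop
 1c4:   08 95           ret
 1c6:   00 00           nop
"""

print("\n".join(slice_function(TOY_LISTING, 0x1C2)))
```

The resulting lines can then be pasted into the prompt directly, so ChatGPT no longer has to locate the function boundaries itself.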

Use more keywords

At this stage, I can confirm that the firmware is actually emulating a USB keyboard. Since the ATtiny85 has no hardware USB interface, the USB signals have to be generated in software. In this scenario, it is likely that the V-USB project is used for this purpose.

Q: “This is an Attiny85 CPU assembler. https://pastebin.com/TczUP1TM. Code is using the VUSB project. Generate C++ code.”

The answer to this prompt was rather vague and unhelpful, so trying different keyword combinations in strict mode does not seem to be a viable option: ChatGPT is not inclined to generate imaginative responses there. However, the “Pseudo” mode is an alternative that may yield more meaningful information and insights.

Q: “This is an Attiny85 CPU assembler. https://pastebin.com/TczUP1TM. Code is using VUSB project to simulate key press. Generate Pseudo code.”

// Declare and initialize variables
unsigned char counter = 0;
unsigned char keycode = 0;
// Main loop
while (1) {
    // Check if the 'counter' variable is greater than 0
    if (counter > 0) {
        // Delay for a short period of time
        delay();
        // Decrement the 'counter' variable
        counter--;
        // Check if the 'counter' variable is equal to 0
        if (counter == 0) {
            // Generate a key press event with the 'keycode' value
            simulateKeyPress(keycode);
        }
    }
    // Check if a key press event has occurred
    if (keyPressEventOccurred()) {
        // Read the key code from the event
        keycode = readKeyCode();
        // Set the 'counter' variable to a specific value based on the key code
        counter = calculateCounterValue(keycode);
    }
}

As the outputs above show, getting a complete picture of the firmware has proven to be challenging. However, it’s important to acknowledge that this CTF scenario was intentionally designed to be difficult and demanding.

Using ChatGPT for the analysis of unknown files has been an intriguing journey filled with discoveries. You are welcome to use the provided pastebin links for your own experiments and investigations. I eagerly await your findings.

CyberGame

I must emphasize that this example is derived from what is arguably one of the most intricate and expansive CTF games, cybergame.sk, which offers a challenging experience even for ChatGPT. This game is specifically designed for hardcore software engineers who possess a borg-like nature, devoid of emotions, and embrace a collective consciousness, granting them a distinct advantage. It’s a far cry from games like Watch Dogs or CS:GO; CyberGame truly pushes the boundaries of your knowledge and expertise. In this article, I’ve provided a fundamental guide on leveraging ChatGPT to assist in the analysis of binary code, because let’s be honest, that’s where the genuine thrill lies!

Jan Sunavec

CTO, R&D director, Ad-Tech, Video Streaming, OTT, CTV, OpenRTB