Cocojunk


Format string attack

Published: Sat May 03 2025 19:23:38 GMT+0000 (Coordinated Universal Time) Last Updated: 5/3/2025, 7:23:38 PM


Step into the shadows and prepare to explore one of the foundational techniques of low-level exploitation – a true classic often omitted from standard curricula. This isn't just about crashing programs; it's about bending the control flow and data of a running process to your will, using seemingly innocuous printing functions. Welcome to the world of the Format String Attack.


The Forbidden Code: Underground Programming Techniques They Won’t Teach You in School

Volume 1: Mastering Memory Manipulation Through Misused Functions

Chapter 3: The Format String Exploitation - Hijacking the Print Stream

Introduction: When Printing Becomes Perilous

In the standard world of programming, functions like printf are your friends. They help you output information, debug your code, and interact with the user. But what if these helpful tools could be turned against the program itself? What if the very string used to format output could become a weapon to read sensitive data or even write arbitrary values into memory?

This is the essence of the Format String Attack – a vulnerability that arises when a program uses user-supplied input directly as the format string argument to a function from the printf family (like printf, sprintf, fprintf, vprintf, etc.). It's a vulnerability rooted in the C language's flexible, yet powerful, variadic functions and the specific behaviors of format specifiers. While many modern systems have protections against it, understanding this attack is crucial for anyone delving into low-level security, reverse engineering, and exploit development.

Definition: Format String Attack

A software vulnerability that occurs when a program uses external input as the format string argument to a printf-style function. This allows an attacker to potentially read from or write to arbitrary memory locations, crash the program, or execute arbitrary code by manipulating the format string and leveraging the function's interaction with the program's stack.

The Core Vulnerability: Misunderstanding Variadic Functions

To understand the format string attack, you first need to grasp how printf and similar functions work under the hood, particularly their handling of variable arguments.

Functions like printf are variadic, meaning they can accept a variable number of arguments after their initial fixed arguments (like the format string). They rely on the format string to tell them what kind of arguments to expect and how many.

Consider a safe use:

int count = 5;
char name[] = "Alice";
printf("Hello %s, you have %d items.\n", name, count);

Here, printf sees %s and %d in the format string. It knows to look for two more arguments on the stack (or in registers, depending on the calling convention) corresponding to a string pointer and an integer. It correctly retrieves name and count.

Now, consider the vulnerable pattern:

char input[256];
gets(input); // <-- gets is unsafe for other reasons, but illustrates the point
printf(input); // <-- The vulnerability!

If the user enters "Hello World\n", printf correctly prints "Hello World". But what if the user enters something unexpected, like "%x %x %x %x"?

Since the input string "%x %x %x %x" is now the format string, printf sees four %x specifiers and expects four more arguments, each to be printed as a hexadecimal integer. However, the call printf(input) passed exactly one argument: the pointer to input, which serves as the format string. No values were supplied for the specifiers to consume.

printf has no way to detect this. It simply fetches each "argument" from wherever the calling convention says the next argument would be. On 32-bit x86 that is the stack memory directly above the format string pointer, which belongs to the caller's stack frame. On x86-64 System V, the first five directives read the argument registers (rsi, rdx, rcx, r8, r9) and later ones read the caller's stack. This is where the danger lies. The stack holds function arguments, local variables, saved registers, and return addresses. By supplying more format specifiers than there are arguments, an attacker can trick printf into reading and printing values it was never meant to see.

Additional Context: The Stack and Calling Conventions

When a function is called, its arguments are pushed onto the stack or passed in registers, and the call instruction then saves the return address. The callee sets up its own stack frame and allocates space for its local variables within it. Calling conventions (like cdecl and stdcall on 32-bit x86, or the System V ABI on x86-64) dictate the order and method of passing arguments. A variadic function like printf keeps no record of how many optional arguments it received. It walks the argument area one slot at a time, and the format string alone decides how many slots it reads and how it interprets each one. A format string vulnerability exploits this mechanism, causing the function to read beyond its actual arguments into other parts of the caller's stack.
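To make the mechanism concrete, here is a minimal sketch of a toy formatter built on <stdarg.h>. The names mini_format and append_text are hypothetical and exist only for this illustration; real printf implementations are far more elaborate. The point is that the loop pulls one argument with va_arg for every directive it finds in fmt and never checks how many arguments the caller actually supplied.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Append text to out (capacity cap, current length used); returns new length. */
static size_t append_text(char *out, size_t cap, size_t used, const char *text) {
    size_t room = cap - used - 1;          /* keep space for the terminator */
    size_t len = strlen(text);
    if (len > room) len = room;
    memcpy(out + used, text, len);
    out[used + len] = '\0';
    return used + len;
}

/* Toy formatter supporting only %d, %s and %%.
 * Returns how many variadic arguments it consumed. */
int mini_format(char *out, size_t cap, const char *fmt, ...) {
    va_list ap;
    size_t used = 0;
    int consumed = 0;

    if (cap == 0) return 0;
    out[0] = '\0';
    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0'; p++) {
        char piece[32];
        if (*p == '%' && p[1] == 'd') {
            /* The format string, not the caller, decides that an int is read. */
            snprintf(piece, sizeof piece, "%d", va_arg(ap, int));
            used = append_text(out, cap, used, piece);
            consumed++;
            p++;
        } else if (*p == '%' && p[1] == 's') {
            used = append_text(out, cap, used, va_arg(ap, const char *));
            consumed++;
            p++;
        } else {
            if (*p == '%' && p[1] == '%') p++;   /* "%%" prints one '%' */
            piece[0] = *p;
            piece[1] = '\0';
            used = append_text(out, cap, used, piece);
        }
    }
    va_end(ap);
    return consumed;
}
```

Calling mini_format(buf, sizeof buf, "Hello %s, you have %d items.", "Alice", 5) consumes two arguments. If the same format were passed with no arguments at all, the function would still call va_arg twice and read whatever happened to occupy those slots, which is undefined behavior and exactly the flaw the attack relies on.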

The Tools of the Trade: Format Specifiers and Their Abuses

Format specifiers are the directives within the format string that tell printf how to interpret and print corresponding arguments. While intended for printing, some specifiers, particularly %n, can be weaponized.

Here are key specifiers and how they are relevant to the attack:

  • %s: Reads a string from a given address. Abuse: If an attacker can place a controlled pointer on the stack at a location printf will read as an argument for %s, they can cause the program to print the content of arbitrary memory locations, potentially leaking sensitive data. A pointer into unmapped memory will crash the program; a NULL pointer may crash as well, although glibc prints (null) instead.
  • %d, %i, %u, %x, %p: Read an integer or pointer argument and print it as signed decimal (%d, %i), unsigned decimal (%u), hexadecimal (%x), or in pointer format (%p). Abuse: By repeating these specifiers (e.g., "%x %x %x %x %x %x %x %x"), an attacker can dump the contents of the stack one argument slot at a time. This is crucial for reconnaissance – finding interesting values like return addresses, pointers to important data structures, or locations of attacker-controlled data on the stack. %p is particularly useful on 64-bit targets because it prints the full pointer width, whereas %x prints only a 32-bit unsigned int.
  • %n: This is the true exploit primitive. Instead of reading an argument to print, %n writes the number of bytes (or characters) printed so far by the printf call to the memory location pointed to by the corresponding argument. Abuse: This allows an attacker to write an arbitrary value (the number of bytes printed) to an arbitrary memory location (the address provided as an argument).

Let's look at %n more closely:

int chars_printed;
printf("Hello World%n\n", &chars_printed);
// After this line, chars_printed will be set to 11

In a format string attack, the attacker controls the format string and potentially some data on the stack. If they can arrange the format string to contain %n and ensure that an address they want to write to is the corresponding argument on the stack, they can write the number of bytes printed to that address.

  • %hn and %hhn: These are variations of %n for writing smaller values. %hn writes a short (typically 2 bytes), and %hhn writes a char (1 byte). Abuse: These are essential for writing arbitrary 4-byte or 8-byte values. Instead of printing billions of characters to write a large number with %n, an attacker can use four %hhn or two %hn specifiers targeting different bytes/shorts of the desired destination address, carefully controlling the byte count printed before each write.

  • Position Specifiers (%m$): Some printf implementations (like glibc) support position specifiers (e.g., %2$x to print the 2nd argument as hex, %1$s to print the 1st argument as a string). Abuse: If available, these simplify exploitation significantly as the attacker doesn't need to pad the format string with dummy specifiers to reach a desired argument on the stack. They can directly target the Nth argument.
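The difference between %n and %hhn can be checked with ordinary, safe code that passes the destination pointer explicitly. The sketch below is an illustration only: count_before_marker and store_count_byte are hypothetical helper names, and the code assumes a C library that honors %n (some hardened libraries restrict or disable it).

```c
#include <stdio.h>

/* Returns the count stored by %n after "Hello " plus name has been emitted. */
int count_before_marker(const char *name) {
    char scratch[128];
    int count = 0;
    snprintf(scratch, sizeof scratch, "Hello %s%n!", name, &count);
    return count;
}

/* Emits `count` spaces into a scratch buffer, then uses %hhn to store the
 * running total into bytes[index]. Only that single byte is modified, and
 * it receives the total modulo 256. */
void store_count_byte(unsigned char bytes[4], int index, int count) {
    char scratch[1024];
    if (index < 0 || index > 3) return;
    if (count < 0 || count >= (int)sizeof scratch) return;
    snprintf(scratch, sizeof scratch, "%*s%hhn",
             count, "", (signed char *)&bytes[index]);
}
```

With a four-byte array initialized to AA BB CC DD, calling store_count_byte(bytes, 1, 0x122) changes only the second byte, to 0x22, because 0x122 (290) modulo 256 is 0x22. This wraparound is what lets an attacker produce any byte value without printing a huge amount of output.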

Exploitation - Reading Memory (%x, %p)

The first step in many format string exploits is reconnaissance: figuring out the layout of the stack and finding useful addresses.

Imagine a vulnerable program:

#include <stdio.h>
#include <string.h>

int main() {
    char buffer[256];
    printf("Enter your input: ");
    fflush(stdout);
    fgets(buffer, sizeof(buffer), stdin);
    buffer[strcspn(buffer, "\n")] = 0; // Remove newline

    // Vulnerable call if buffer contains format specifiers
    printf(buffer);
    printf("\n"); // To make output cleaner
    return 0;
}

If you run this and enter AAAA%x.%x.%x.%x.%x.%x.%x.%x, the output might look like this (values will vary due to ASLR, compilation, etc.):

AAAAf7f2f040.f7f2f040.bffff6c8.41414141.f7dd0987.0.bffff748.80485a1

Let's break this down:

  1. AAAA is printed first.
  2. Then, printf encounters %x. It looks for the first argument after the format string. Since there were no explicit arguments, it reads the first value off the stack after the format string pointer.
  3. It then encounters the next %x, reads the next value off the stack, and so on.

The output f7f2f040.f7f2f040.bffff6c8.41414141.f7dd0987.0.bffff748.80485a1 shows the hexadecimal representation of several values pulled directly from the stack.

Notice the 41414141. This is the hexadecimal ASCII representation of AAAA. This tells us that the input string itself is present on the stack, and it's located at a specific offset from where printf starts looking for arguments. In this example, it's the 4th value read by %x. This offset is crucial. If we want to place an address on the stack for a %n write, we need to know which argument number corresponds to our controlled data on the stack.

By experimenting with more %x specifiers, an attacker can map out parts of the stack and find interesting values:

  • Pointers to input buffers (containing attacker-controlled data).
  • Return addresses (where the function will return when it finishes).
  • Saved base pointers.
  • Addresses within the program's code or libraries.
  • Pointers used by the program.

This reading phase is essential before attempting a write, as it reveals the stack layout and helps locate targets and required data.
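Locating the marker in a leak like the one above is easy to automate. The helper below is a minimal sketch with a hypothetical name (find_marker_offset). It assumes the leaked values are separated by dots, as in the AAAA%x.%x... probe, and that the echoed AAAA prefix has already been stripped from the front of the captured output.

```c
#include <string.h>

/* Returns the 1-based position of marker_hex among the '.'-separated
 * fields of leak, or 0 if the marker does not appear. */
int find_marker_offset(const char *leak, const char *marker_hex) {
    size_t want = strlen(marker_hex);
    int position = 1;
    const char *field = leak;

    for (;;) {
        size_t len = strcspn(field, ".");      /* length of the current field */
        if (len == want && strncmp(field, marker_hex, len) == 0)
            return position;
        if (field[len] == '\0')
            return 0;                           /* reached the end, no match */
        field += len + 1;                       /* skip the field and the dot */
        position++;
    }
}
```

For the sample output shown earlier, find_marker_offset("f7f2f040.f7f2f040.bffff6c8.41414141.f7dd0987.0.bffff748.80485a1", "41414141") returns 4, matching the manual count.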

Exploitation - Writing Memory (%n, %hn, %hhn)

This is where the real power of the format string attack lies: the ability to write arbitrary data to arbitrary memory locations. The %n specifier is the key.

The goal is typically to:

  1. Determine the address you want to write to (e.g., a return address on the stack, a Global Offset Table (GOT) entry for a library function, a function pointer).
  2. Determine the value you want to write to that address (e.g., the address of attacker-controlled shellcode, the address of a function like system, a new function pointer).
  3. Craft a format string that places the target address on the stack (usually by including it in the input buffer before the format specifiers).
  4. Craft the format string with %n (or %hn/%hhn) specifiers such that:
    • The %n specifier(s) correspond to the stack location(s) where the target address (or parts of it) are placed.
    • The number of bytes printed before each %n corresponds to the value (or parts of the value) you want to write.

Example Scenario (Simplified): Overwriting a Variable

Let's say you want to change a global integer variable can_access_admin = 0; to 1. You find its address (e.g., using a debugger or information leaks). Let's say the address is 0x0804a030.

You need to get this address onto the stack where printf will see it as an argument to %n. A common technique is to put the address before the format string in your input buffer.

Input: \x30\xa0\x04\x08 + [padding] + [format string]

Using the reading technique (%x), you figure out that the 0x0804a030 address you just put at the start of your input buffer appears as the 10th argument printf reads off the stack.

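Now you want to write the value 1 to 0x0804a030. Remember that %n stores the total number of bytes printed so far, and that total includes the four raw address bytes at the start of the input (plus any padding). A format string such as A%10$n would therefore store 5 or more, not 1. If the program only tests whether the variable is non-zero, that is already enough. To store exactly 1, use %hhn, which writes only the low byte of the count, and print 257 bytes in total (257 mod 256 = 1).

Assuming no padding is needed, the format string could look like: %1$253x%10$hhn.

  • The four address bytes at the start of the input have already printed 4 bytes.
  • %1$253x prints one stack value padded to 253 characters, bringing the total to 257.
  • %10$hhn uses the position specifier (if available) to target the 10th argument on the stack (which is 0x0804a030). It then writes the low byte of the count (257 mod 256 = 1) to the address 0x0804a030.

Combined input: \x30\xa0\x04\x08 + %1$253x%10$hhn

After this printf call executes, the byte at 0x0804a030 holds 1. Because the variable was previously 0 and x86 is little-endian, the whole integer now reads as 1.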

Writing Arbitrary 4-Byte (or 8-Byte) Values with Partial Writes

Writing larger values requires a bit more finesse. You can't just print 0x41414141 (over a billion) bytes. Instead, you use %hn (write 2 bytes) or %hhn (write 1 byte) and write the value piece by piece.

Suppose you want to write 0x44332211 to address 0x0804a030 (assuming a little-endian architecture, where 0x11 is stored at the lowest address, 0x22 at the next, and so on).

  1. Put the four target addresses on the stack, one per byte: 0x0804a030, 0x0804a031, 0x0804a032, and 0x0804a033. Let's assume for simplicity that they land at stack arguments 10, 11, 12, and 13. Printing these 16 raw address bytes already advances the count to 16.
  2. Write 0x11 to 0x0804a030: The total must reach 0x11 (17), so print 17 - 16 = 1 more byte, then use %hhn targeting stack argument 10.
  3. Write 0x22 to 0x0804a031: You've already printed 17 bytes and need 0x22 (34) in total, so print 34 - 17 = 17 more bytes. Then use %hhn targeting stack argument 11.
  4. Write 0x33 to 0x0804a032: Print 51 - 34 = 17 more bytes, then use %hhn targeting stack argument 12.
  5. Write 0x44 to 0x0804a033: Print 68 - 51 = 17 more bytes, then use %hhn targeting stack argument 13.

If a later byte is smaller than the current total, let the count wrap around: %hhn stores the total modulo 256, so you print just enough bytes for the total modulo 256 to equal the wanted byte.

The format string becomes complex, involving print width specifiers (%[width]) to control the number of bytes printed between %hn/%hhn writes and padding to reach the desired stack arguments.

Example format string structure (conceptual): [address1][address2][address3][address4][padding]%[pad1]x%[offset1]$hhn%[pad2]x%[offset2]$hhn%[pad3]x%[offset3]$hhn%[pad4]x%[offset4]$hhn

  • [address1]-[address4] are the target addresses (0x0804a030, 0x0804a031, 0x0804a032, 0x0804a033) placed at the start of the input, which itself lives on the stack.
  • [padding] aligns the addresses to argument-slot boundaries so that %[offset]$ targets them correctly. Any padding bytes are printed too and count toward the total.
  • %[pad1]x: Prints enough characters to bring the running total to the first byte value (0x11 = 17). With 16 address bytes already printed and no padding, that is 1 more character.
  • %[offset1]$hhn: Writes the total bytes printed so far (17) to the address at stack argument offset1 (where 0x0804a030 is located).
  • %[pad2]x: Prints 34 - 17 = 17 characters.
  • %[offset2]$hhn: Writes the total bytes printed (34) to the address at stack argument offset2 (where 0x0804a031 is located).
  • ... and so on.

A width is only a minimum: %x prints up to 8 hex digits for a 32-bit value even when the width is smaller. When a required pad is below 8, add 256 to it (the low byte of the total is unchanged) or use %c, which prints exactly as many characters as its width.

This requires careful calculation of byte counts and stack offsets, often aided by exploit development tools or scripting.
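The byte-count arithmetic is the part most often scripted. The function below is a minimal sketch of that calculation only, not a payload generator. plan_hhn_padding is a hypothetical name, and the function assumes four single-byte %hhn writes performed in order from the lowest address to the highest.

```c
#include <stdint.h>

/* For each of the four bytes of value (least significant first), compute
 * how many extra characters must be printed before the matching %hhn so
 * that the running total, modulo 256, equals that byte.
 * already_printed is the number of characters emitted before the first
 * padding directive (for example, 16 for four 4-byte addresses). */
void plan_hhn_padding(uint32_t value, unsigned already_printed, unsigned pad[4]) {
    unsigned total = already_printed;
    for (int i = 0; i < 4; i++) {
        unsigned want = (value >> (8 * i)) & 0xFFu;
        pad[i] = (want - total) & 0xFFu;   /* unsigned wraparound gives mod 256 */
        total += pad[i];
    }
}
```

For the value 0x44332211 with 16 bytes already printed, the result is 1, 17, 17, 17. For a value whose bytes decrease, such as 0x11223344, the result is 52, 239, 239, 239: each later write relies on the count wrapping past 256. A pad of 0 means no extra output is needed before that write, and small pads may need 256 added because a field width is only a minimum.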

The Attack Flow: From Discovery to Control

A typical format string exploitation process looks like this:

  1. Identify the Vulnerability: Find a call to printf, sprintf, etc., where the format string argument comes directly or indirectly from user-controlled input (e.g., a buffer read from standard input, a network socket, an environment variable).
  2. Determine the Stack Offset: Use printing specifiers (like AAAA%x.%x.%x...) to dump stack contents. Look for the attacker-supplied data (like the AAAA marker) to find the offset (which argument number corresponds to the start of your input buffer on the stack). This offset is needed to correctly target addresses you place on the stack for writing.
  3. Locate Target Address: Determine where you want to write. Common targets include:
    • A return address on the stack (to hijack control flow after the current function returns).
    • A function pointer in a global variable.
    • A Global Offset Table (GOT) entry for a frequently called library function (like exit or strlen) to redirect its execution to attacker-controlled code.
    • A Virtual Table (VTable) entry in C++ objects.
  4. Determine Write Value: Figure out the value you want to write to the target location (e.g., the address of your shellcode, the address of system(), etc.).
  5. Craft the Payload: Construct the input string containing:
    • The target address(es) you want to write to (placed at the beginning or strategically within padding).
    • Padding bytes to align the format string and ensure the target addresses land at the desired stack argument positions determined in step 2.
    • The format string itself, containing %n or %hn/%hhn specifiers, potentially with width and position specifiers, carefully ordered and calculated to print the correct number of bytes before each write to achieve the desired final value at the target address.
  6. Deliver the Payload: Send the crafted input to the vulnerable program.
  7. Gain Control: If successful, the program's execution flow will be altered, potentially leading to arbitrary code execution (e.g., running shellcode placed elsewhere in memory).

Finding the Vulnerability in the Wild

How do you spot this "forbidden code" in existing programs?

  • Source Code Review (Static Analysis): Search the codebase for calls to printf, sprintf, fprintf, vprintf, vfprintf, vsprintf, snprintf, vsnprintf, etc. Examine the format string argument, whose position varies: it is the first argument of printf, the second of fprintf and sprintf, and the third of snprintf. If it's a variable or a function call result derived from external input, it's potentially vulnerable.
  • Dynamic Analysis (Fuzzing): Feed crafted input strings containing format specifiers (e.g., "%s%n%x" or long strings of %x) to program inputs that you suspect are used in printf calls. Look for crashes (due to %s on a bad pointer or %n attempting to write to an invalid address) or information leaks (stack dumps from %x). This can help identify potential format string sinks.
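For the dynamic approach, a harmless read-only probe plus a check of the echoed response is often enough for a first triage. The sketch below is hypothetical: PROBE and response_suggests_format_sink are illustrative names, and how the probe is sent and the response captured depends entirely on the program under test.

```c
#include <string.h>

/* Hypothetical probe: a unique marker followed by read-only directives. */
const char PROBE[] = "FSPROBE.%p.%p.%p";

/* Returns 1 if the response echoes the marker but the %p directives are
 * gone, which suggests the input was interpreted as a format string.
 * Returns 0 if the input was not echoed or was echoed literally. */
int response_suggests_format_sink(const char *response) {
    const char *marker = strstr(response, "FSPROBE.");
    if (marker == NULL)
        return 0;                          /* input not reflected at all */
    return strstr(marker, "%p") == NULL;   /* directives were consumed */
}
```

A response such as FSPROBE.0xf7f2f040.0x10.(nil) is flagged, while a literal echo of FSPROBE.%p.%p.%p is not. A positive result is only a hint that still needs confirmation in a debugger or by source review.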

Real-World Impact: What an Attacker Can Achieve

A successful format string attack can have severe consequences:

  • Denial of Service (DoS): Easily achieved by using %s with a stack offset pointing to an unmapped memory region or by causing %n to write to an unmapped or read-only address, leading to a crash.
  • Information Disclosure: Reading stack data (%x, %p) can reveal pointers, saved registers, buffer contents, and other sensitive information helpful for further exploitation (like bypassing ASLR).
  • Arbitrary Memory Write: This is the most powerful primitive. It allows an attacker to:
    • Modify program variables (e.g., grant admin privileges).
    • Overwrite function pointers (e.g., hijack program control flow).
    • Overwrite Global Offset Table (GOT) entries to redirect library calls (a common way to achieve arbitrary code execution).
    • Overwrite stack return addresses (less common on modern systems with stack cookies, but a classic technique).
  • Arbitrary Code Execution (ACE): By combining arbitrary write with the ability to put shellcode into memory (e.g., in an input buffer), an attacker can often redirect execution to their code, taking full control of the vulnerable process.

The Flip Side: Defending Against the Forbidden

Just as understanding the attack is crucial, so is understanding how to prevent it. Developers must implement secure coding practices:

  1. Never Use User Input as the Format String: This is the golden rule. Always use a static, literal string as the format argument of printf-style functions. If you need to include user-supplied text in the output, pass it as an argument using an appropriate specifier:
    • Safe: printf("User input: %s\n", user_string);
    • Vulnerable: printf(user_string);
  2. Use Type-Safe Output Functions: For simple printing of user-supplied strings, use functions like puts() or fputs(), which do not interpret format specifiers.
    • Safer: puts(user_string);
  3. Input Validation: While difficult for arbitrary format strings, validate input where possible before it reaches sensitive functions. This is a secondary defense.
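  4. Compiler Warnings and Features: Modern compilers can often detect potential format string vulnerabilities at compile time. In GCC and Clang, -Wall enables -Wformat, which checks literal format strings against their arguments, and -Wformat-security additionally warns when a non-literal format string is passed with no arguments, as in printf(user_string).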
  5. Operating System Protections: System-level defenses make exploitation harder, though they don't fix the underlying code vulnerability:
    • Address Space Layout Randomization (ASLR): Randomizes memory addresses (stack, heap, libraries), making it difficult for attackers to guess target addresses for reading or writing. Requires information leakage (like from %x) to defeat.
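    • Data Execution Prevention (DEP) / No-Execute (NX): Marks memory regions (like the stack and heap) as non-executable, preventing attackers from placing shellcode in data buffers and executing it directly. Forces attackers to use ROP (Return-Oriented Programming) or to redirect a pointer (such as a GOT entry) to code that is already executable, like system().
    • Stack Cookies/Canaries: Place a random value on the stack before the return address and abort the program if that value has changed when the function returns. Canaries are designed for linear buffer overflows, which must run over the cookie to reach the return address. A %n write targets one precise address and can overwrite the return address without touching the cookie, so canaries offer little protection against format string writes.
  6. FormatGuard and Library Hardening: FormatGuard, a patched glibc developed for Immunix Linux, counted the arguments actually passed to printf-like calls and rejected format strings that asked for more. On current glibc systems, compiling with -D_FORTIFY_SOURCE=2 makes the printf family abort when a format string stored in writable memory contains %n. Full RELRO additionally makes the GOT read-only, removing one of the most common write targets.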
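The first and fourth defenses combine naturally in a small wrapper. The sketch below uses hypothetical names (log_format, log_user_text). The format attribute is a GCC/Clang extension that lets the compiler check every call to the wrapper as if it were printf itself, and untrusted text only ever travels as a %s argument.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical logging wrapper. The attribute tells GCC/Clang that
 * argument 3 is a printf-style format and that its values start at
 * argument 4, so -Wformat can check every call site. */
#if defined(__GNUC__)
__attribute__((format(printf, 3, 4)))
#endif
int log_format(char *out, size_t cap, const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(out, cap, fmt, ap);
    va_end(ap);
    return n;
}

/* Untrusted text is always passed as data, never as the format. */
int log_user_text(char *out, size_t cap, const char *user_text) {
    return log_format(out, cap, "user said: %s", user_text);
}
```

Passing the hostile string %x%x%n to log_user_text produces the output user said: %x%x%n with the directives copied verbatim, since they are never parsed as a format.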

Conclusion: Understanding the Foundation

The format string attack is a powerful reminder that even seemingly benign library functions can become attack vectors if their inputs are not handled securely. While less common in its classic form today due to modern defenses, understanding the mechanics of how it leverages stack layout, calling conventions, and format specifiers is invaluable. It provides fundamental insights into how low-level memory corruption works and how attackers can turn data manipulation into code execution. For those exploring the "forbidden" corners of programming, mastering the format string attack is a rite of passage into the world of process exploitation.

See Also