## GPU Powered Malware

#### Daniel Reynaud

LORIA - Nancy - France

Ruxcon 2008




- GPGPU (General Purpose programming on Graphics Processing Units) is no longer an obscure area
- Most consumer hardware is now fully programmable in C
- No need to be a specialist to tap into the computing power of GPUs
- What if malware authors start coding on GPUs?


# Outline

- 1 GPGPU Technologies
  - CUDA
  - Stream Computing
  - OpenCL
  - Larrabee
- 2 How could that be used in malware?
- 3 Reverse Engineering
  - Disassembling
  - Debugging
  - Emulation
- 4 Packing
- 5 Conclusion


# NVIDIA's Compute Unified Device Architecture

- Requires recent NVIDIA hardware with a CUDA driver
- Easily programmable with an extension of the C language
- The device code is compiled to an intermediate assembly language, PTX, and then assembled into the cubin file format (undocumented)

Here is the simplified compilation process:




# AMD's Stream Computing

- Requires recent ATI/AMD hardware with a Stream Computing driver
- Easily programmable with (another) extension of the C language
- The device code is compiled to (another) intermediate assembly language, AMD IL


# Apple's OpenCL

- Submitted as a standard by Apple, supported by everybody except Microsoft
- Will be shipped with Mac OS X Snow Leopard
- No reference/documentation for the moment



#### OpenCL

Another powerful Snow Leopard technology, OpenCL (Open Computing Language), makes it possible for developers to efficiently tap the vast gigaflops of computing power currently locked up in the graphics processing unit (GPU). With GPUs approaching processing speeds of a trillion operations per second, they're capable of considerably more than just drawing pictures. OpenCL takes that power and redirects it for general-purpose computing.


# Intel's Larrabee

- Announced by Intel at SIGGRAPH 2008
- Based on the x86 architecture plus Larrabee-specific extensions
- Will also come in the form of an add-in card managed by an operating system driver
- No reference/documentation for the moment

# How could that be used in malware?

### Quick Answer (credits: ThreatExpert.com)

[Figure: annotated x86 disassembly of the routine that generates the C&C domain names. The annotations describe the following steps:]

- A seed is created in EBX, and a large number is prepared in EAX (rotated right by 8 bits and masked with ESI)
- The remainder of a division by 6 (a number from 0 to 5) plus 7 gives the domain name a variable length from 7 to 12, which is stored in a counter
- For each character, the seed is advanced (imul ebx, 41C64E6Dh, then add ebx, edi with EDI=12435), moved into EAX, rotated right by 8 bits and masked with ESI=32767
- The remainder of a division by 26 (a number from 0 to 25) is used as an offset from the character 'a', so the code simply picks a random ASCII character from 'a' to 'z' and appends it to the name
- The counter is decremented and the loop repeats until the name is complete
- Once the domain name of the C&C is generated, a dot (".") is appended, followed by one of 7 suffixes; the first 3 are doubled to double their chance of being picked, so the list has 10 entries

# Algorithm Hiding

- The code on the previous slide is part of the Kraken botnet
- It is the algorithm that generates the list of C&C servers that the bots try to contact
- Once this list is known, the servers can be shut down and the botnet can be infiltrated
- This is the kind of algorithm that might end up being executed on GPUs
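To make concrete how little code is being protected, here is a minimal Python sketch of a generator of that shape, reconstructed from the annotations on the disassembly. The multiplier, the increment (12435, as annotated for EDI) and the mask (32767, as annotated for ESI) are read from the listing. The suffix table is a hypothetical placeholder, since the real suffixes are not shown, and the way the name length and the suffix index are drawn is simplified to one extra draw each, which is an assumption.

```python
MASK32 = 0xFFFFFFFF
MULTIPLIER = 0x41C64E6D   # constant from "imul ebx, 41C64E6Dh"
INCREMENT = 12435         # value the annotation gives for EDI
OUTPUT_MASK = 32767       # value the annotation gives for ESI

# Hypothetical placeholder table: the real suffixes are not in the listing.
# 7 distinct suffixes, the first 3 doubled, for 10 entries in total.
SUFFIXES = [
    "suffix1.example", "suffix1.example",
    "suffix2.example", "suffix2.example",
    "suffix3.example", "suffix3.example",
    "suffix4.example", "suffix5.example",
    "suffix6.example", "suffix7.example",
]


def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & MASK32


def draw(seed):
    """Advance the seed once; return (new_seed, 15-bit pseudo-random value)."""
    seed = (seed * MULTIPLIER + INCREMENT) & MASK32
    return seed, ror32(seed, 8) & OUTPUT_MASK


def generate_domain(seed):
    """Build one name: 7 to 12 letters, a dot, then a suffix from the table."""
    # Simplification (assumption): one extra draw picks the length.
    seed, value = draw(seed)
    length = 7 + value % 6
    letters = []
    for _ in range(length):
        seed, value = draw(seed)
        letters.append(chr(ord("a") + value % 26))
    # Simplification (assumption): one extra draw picks the suffix.
    seed, value = draw(seed)
    return "".join(letters) + "." + SUFFIXES[value % len(SUFFIXES)]
```

Calling `generate_domain(seed)` with the same seed always yields the same name, which is exactly why an analyst who recovers these few lines can enumerate the servers in advance.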

# Reverse Engineering

# Disassembling

- GPGPU software comes in the form of fat binaries (CUDA terminology), i.e. native executables with embedded device code
- The goal is to extract the device code and obtain a dump of the instructions
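As an illustration of the extraction step, here is a minimal Python sketch. It rests on a loud assumption: that the intermediate-language code is stored inside the host executable as plain ASCII text beginning with the PTX `.version` directive. The function name `extract_ptx` is made up for this example, and other toolchains or formats would need a different signature.

```python
import re

# Assumption: the PTX is stored as plain ASCII text inside the host
# executable and starts with its ".version" directive. The run ends at
# the first byte that is not printable ASCII, tab, LF or CR.
PTX_BLOB = re.compile(rb"\.version[\x09\x0a\x0d\x20-\x7e]+")


def extract_ptx(image):
    """Return every printable run in image that starts with '.version'."""
    return [match.group(0).decode("ascii") for match in PTX_BLOB.finditer(image)]
```

A typical use would be `extract_ptx(open(path, "rb").read())`, where `path` is the executable under analysis. This only recovers the intermediate representation; the device-specific binary needs its own tooling.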



- Depends heavily on the underlying GPGPU technology

- Ability to recover the device-specific representation and/or the intermediate language representation
- Usually very different from x86 assembly


#### Sample PTX code:

```
.entry __globfunc__Z6kernelPci
{
     .reg .u16 %rh<8>;
     .reg .u32 %r<23>;
     .reg .pred %p<11>;
     .param .u32 __cudaparm___globfunc__Z6kernelPci_a_d;
     .param .s32 __cudaparm___globfunc__Z6kernelPci_n;
     .loc 14 61 0
$LBB1___globfunc__Z6kernelPci:
     .loc 14 41 0
     ld.param.u32 %r1, [__cudaparm___globfunc__Z6kernelPci_a_d]; // id:77 __cudaparm___globfunc__Z6kernelPci_a_d+0x0
     mov.s32 %r2, %r1;
     .loc 14 23 0
     ld.global.s8 %r3, [%r1+0]; // id:78
     mov.s32 %r4, %r3;
     mov.u32 %r5, 0;
     setp.eq.s32 %p1, %r3, %r5;
     @%p1 bra $Lt_0_39;
     ld.const.s8 %r6, [__constant432+0]; // id:79 g C10
     setp.ne.s32 %p2, %r6, %r3;
     @%p2 bra $Lt_0_40;
     mov.u32 %r7, __constant432;
$L_0_23:
     //<loop> Loop body line 24
     .loc 14 24 0
     add.u32 %r2, %r2, 1;
```


#### Sample AMD IL code:

[Screenshot: GPU ShaderAnalyzer compiling the Brook+ kernel hello_brook_check, which sets output to 1.0f if input > val and to 0.0f otherwise, to IL assembly. The visible object code is a series of func blocks built from instructions such as lt, if_logicalnz, mov, call, endif and ret operating on numbered registers, for example "mov r267.x, r273.x000" followed by "call 35".]


# Debugging

- Short version: GPUs do not support hardware debugging
- This means: no breakpoints, no single-stepping, no debugger-based tracing

- However, developers want to debug applications, so the answer is the emulation mode...

#### An excerpt of the CUDA documentation:

4.5.2.9 Debugging using the Device Emulation Mode

The programming environment does not include any native debug support for code that runs on the device, but comes with a device emulation mode for the purpose of debugging. When compiling an application in this mode (using the **-deviceemu** option), the device code is compiled for and runs on the host, allowing the programmer to use the host's native debugging support to debug the application as if it were a host application. The preprocessor macro **\_\_DEVICE\_EMULATION\_\_** is defined in this mode. All code for an application, including any libraries used, must be compiled consistently either for device emulation or for device execution.


#### And an excerpt of the Stream Computing documentation:

2.2.4 Debugging

When debugging an application, debugging happens on the generated C++ source, not on the original Brook+ source. For a complete example, see Section 2.4, "Example of Generated C++ Code for sum.br," page 2-12.

There is no hardware debugging of stream kernels (for example: \_\_sum\_cal\_desc); it is not possible to step through the kernel code. The kernel inputs and outputs can be inspected (before a streamRead and after a streamWrite). Kernels can be written so that intermediate data can be output to streams and inspected.

Alternatively, kernels can be stepped through and debugged as usual using the CPU emulation mode (for example: \_\_sum\_cpu and \_\_sum\_cpu\_inner).




- So developers can debug their applications if they compile them with an emulation option
- This means no debugging without the source code

- But at least, we have emulation, right?


# Emulation

- Let's read the CUDA documentation again: "When compiling an application in this mode (using the -deviceemu option), the device code is compiled for and runs on the host"
- This means that no GPU code is produced, everything is compiled for the CPU
- Therefore, no emulation without the source code

- This is bad news for malware analysts, because a full-software GPU emulator would allow the use of breakpoints, single-stepping and tracing (as with Bochs)

# Packing

## Motivation

- Packing is a software protection method that generates code dynamically (turns data into code)
- To unpack a program, you generally have to either set a breakpoint at the entry point of the dynamically created code, or emulate the program and compare the current instruction address with the addresses that have been written
- No debugging in GPUs + no emulators (yet) = really strong packing
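The emulation-based approach mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the trace format (a sequence of `("write", address)` and `("exec", address)` pairs produced by some emulator) and the function name `find_unpacked_entry` are made up for the example.

```python
def find_unpacked_entry(trace):
    """Return the first executed address that was written earlier, or None.

    trace is an iterable of (kind, address) pairs, where kind is
    "write" or "exec", in the order the emulator observed them.
    """
    written = set()
    for kind, address in trace:
        if kind == "write":
            # Remember every address the program has stored data to.
            written.add(address)
        elif kind == "exec" and address in written:
            # Data has just turned into code: this is the unpacked entry.
            return address
    return None
```

The whole technique depends on having such a trace, which is precisely what is missing when the packed code runs on a GPU with neither a debugger nor an emulator.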

# Based on the Underlying Hardware

- The lowest-level execution target that is still hardware-independent is the intermediate language (such as PTX or AMD IL)
- To write self-modifying code, we need data-transfer instructions and control-flow instructions that can reach the same targets
- But...


#### Excerpt of the PTX documentation:

Chapter 7. Instruction Set

| BRA           | Branch to a target and continue execution there. |
|---------------|--------------------------------------------------|
| Syntax        | <pre>bra[.uni] target; // target is a label</pre> |
|               | <pre>bra[.uni] a; // indirect branch through register 'a'</pre> |
| Description   | Continue execution at the target. Conditional branches are specified by using a guard predicate. |
| Semantics     | <pre>pc = target;</pre> |
|               | <pre>pc = a;</pre> |
| Notes         | A bra is assumed to be divergent unless the .uni suffix is present, indicating that the branch is guaranteed to be non-divergent. |
| Release Notes | Indirect branch through a register is unimplemented. |
| Examples      | <pre>bra.uni L_exit; // uniform unconditional jump</pre> |
|               | <pre>@q bra L23; // conditional branch</pre> |
|               | <pre>mov.b32 %r, Done;</pre> |
|               | <pre>bra %r; // indirect branch</pre> |

#### Table 49. Control Flow Instructions: BRA


#### And an excerpt of the AMD IL documentation:

AMD COMPUTE ABSTRACTION LAYER (CAL) TECHNOLOGY

| Instructions | CALL |
|--------------|------|
| Syntax       | <pre>call &lt;integer label&gt;</pre> |
| Description  | CALL pushes the address of the next instruction in the kernel onto the return address stack and transfers control to the FUNC block identified by *&lt;integer label&gt;*. CALLs can be nested up to 32 levels deep. If the return address stack already contains 32 addresses, the CALL is skipped and execution continues at the next instruction in the kernel. Recursion is permitted. |
| Format       | 0-input, 0-output. |
| Opcode       | Token 1: code (bits 15:0) is IL_OP_CALL; control (bits 29:16), sec_modifier_present (bit 30) and pri_modifier_present (bit 31) must be zero.<br>Token 2: must be zero.<br>Token 3: unsigned integer representing the label of the subroutine. |
| Related      | CALL_LOGICALZ, CALL_LOGICALNZ. |


## Based on a Virtual Machine

- There seems to be no natural / documented way to write self-modifying code with PTX or AMD IL
- However, even if the underlying environment does not support self-modifying code, it is still possible to develop a virtual execution environment in device code
- Since we control the virtual execution environment, everything is possible, including self-modifying code
- Not malware-specific: DRM systems may use it in the future (GPU-Themida and GPU-VMProtect?)
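The idea can be illustrated with a toy interpreter. The sketch below is in Python for readability; on a GPU the same loop would be written in device code, with `memory` living in device memory. The four-opcode instruction set (HALT, SET, ADD, JMP) and its three-cell encoding are invented for this example and do not correspond to any real packer.

```python
HALT, SET, ADD, JMP = 0, 1, 2, 3


def run(memory, max_steps=1000):
    """Interpret memory in place; code and data share one address space."""
    pc = 0
    for _ in range(max_steps):
        op, a, b = memory[pc], memory[pc + 1], memory[pc + 2]
        if op == HALT:
            return memory
        if op == SET:      # data transfer: may overwrite an instruction
            memory[a] = b
            pc += 3
        elif op == ADD:    # arithmetic on a memory cell
            memory[a] += b
            pc += 3
        elif op == JMP:    # control flow: may target rewritten cells
            pc = a
        else:
            raise ValueError(f"unknown opcode {op} at address {pc}")
    raise RuntimeError("step limit reached")


program = [
    SET, 3, ADD,    # address 0: rewrite the opcode stored at address 3
    HALT, 9, 41,    # address 3: HALT in the static image, ADD [9], 41 once patched
    HALT, 0, 0,     # address 6: real end of the program
    1,              # address 9: data cell
]
```

Running `run(list(program))` leaves 42 in cell 9, although a static reading of the image shows a HALT at address 3. The interpreter itself never modifies its own code, so it fits within what PTX or AMD IL allow; only the interpreted program is self-modifying.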


# Conclusion





- Current GPGPU technologies offer programmable hardware black boxes
- If one of these technologies becomes a standard, available by default, it will be used by malware and DRM
- GPU-based packers will be particularly efficient due to the lack of hardware debugging and emulators