Mac OS X PPC Shellcode Tricks
H D Moore
Chapter 3
PowerPC Basics
The PowerPC (PPC) architecture uses a reduced instruction set consisting of 32-bit fixed-width opcodes. Each opcode is exactly four bytes long and can only be executed by the processor if the opcode is word-aligned in memory.
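The fixed-width, word-aligned layout described above can be sketched in a few lines of Python. This is only an illustration: the helper names `is_word_aligned` and `split_opcodes` are hypothetical, and the buffer is assumed to be big-endian, as on 32-bit Mac OS X PPC.

```python
import struct
from typing import List

WORD_SIZE = 4  # every PowerPC opcode is exactly four bytes


def is_word_aligned(address: int) -> bool:
    """Return True when the address is a multiple of four bytes."""
    return address % WORD_SIZE == 0


def split_opcodes(code: bytes) -> List[int]:
    """Split a big-endian code buffer into 32-bit opcode words."""
    if len(code) % WORD_SIZE != 0:
        raise ValueError("code length must be a multiple of four bytes")
    count = len(code) // WORD_SIZE
    return list(struct.unpack(">%dI" % count, code))
```

Because every opcode has the same width, a buffer can be walked one word at a time without decoding instruction lengths first, which is not possible on IA32.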
3.1 Registers
PowerPC processors have thirty-two 32-bit general-purpose registers (r0-r31), thirty-two 64-bit floating-point registers (f0-f31), a link register (lr), a count register (ctr), and a handful of other registers for tracking things like branch conditions, integer overflows, and various machine state flags. (64-bit PowerPC processors have 64-bit general-purpose registers, but still use 32-bit opcodes.) Some PowerPC processors also contain a vector-processing unit (AltiVec, etc.), which can add another thirty-two 128-bit registers to the set.
On the Darwin/Mac OS X platform, r0 is used to store the system call number, r1 is used as a stack pointer, and r3 to r7 are used to pass arguments to a system call. General-purpose registers between r3 and r12 are considered volatile and should be preserved before the execution of any system call or library function.
;;
;; Demonstrate execution of the reboot system call
;;
main:
    li r0, 55               ; #define SYS_reboot 55
    sc
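As a rough illustration of how the two instructions above become opcodes, the following Python sketch encodes them by hand. It assumes the standard D-form layout for `li` (a simplified mnemonic for `addi` with a zero base) and the fixed encoding of `sc`; the helper names `encode_li` and `assemble_syscall_stub` are hypothetical.

```python
import struct

SC_OPCODE = 0x44000002  # fixed encoding of the "sc" instruction


def encode_li(rd: int, value: int) -> int:
    """Encode "li rd, value", which is "addi rd, 0, value" (primary opcode 14)."""
    if not 0 <= rd <= 31:
        raise ValueError("register number must be between 0 and 31")
    if not -0x8000 <= value <= 0x7FFF:
        raise ValueError("immediate must fit in a signed 16-bit field")
    return (14 << 26) | (rd << 21) | (value & 0xFFFF)


def assemble_syscall_stub(number: int) -> bytes:
    """Return the big-endian bytes for "li r0, number" followed by "sc"."""
    return struct.pack(">II", encode_li(0, number), SC_OPCODE)
```

For system call 55 this yields the words 0x38000037 and 0x44000002. Both contain zero bytes, which matters for payloads that must survive string-handling functions.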
3.2 Branches
Unlike the IA32 platform, PowerPC does not have a call or jmp instruction. Execution flow is controlled by one of the many branch instructions. A branch can redirect execution to a relative address, absolute address, or the value stored in either the link or count registers. Conditional branches are performed based on one of the four bits (less than, greater than, equal, summary overflow) in a condition register field. The count register can also be used as a condition for branching and some instructions will automatically decrement the count register. A branch instruction can automatically set the link register to be the address following the branch, which is a very simple way to get the absolute address of any relative location in memory.
;;
;; Demonstrate GetPC() through a branch and link instruction
;;
main:
    xor. r5, r5, r5         ; xor r5 with r5, storing the value in r5
                            ; the condition register is updated by the . modifier
ppcGetPC:
    bnel ppcGetPC           ; branch if condition is not-equal, which will be false
                            ; the address of ppcGetPC+4 is now in the link register
    mflr r5                 ; move the link register to r5, which points back here
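The GetPC trick works because a branch with the link bit set writes the link register whether or not the branch is taken. A minimal Python model of that behaviour, with a hypothetical function name and made-up addresses, looks like this:

```python
from typing import Tuple


def branch_conditional_link(pc: int, target: int, condition_true: bool) -> Tuple[int, int]:
    """Model a conditional branch with the link bit set (for example bnel).

    Returns (next_pc, link_register).
    """
    link_register = pc + 4  # written even when the branch is not taken
    next_pc = target if condition_true else pc + 4
    return next_pc, link_register


# The bnel in the listing sits at some address and targets itself. The
# condition is false, so execution falls through to mflr, and the link
# register holds the address of that same mflr instruction.
next_pc, lr = branch_conditional_link(pc=0x1004, target=0x1004, condition_true=False)
```

Here both `next_pc` and `lr` equal 0x1008, the address of the instruction after the branch.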
3.3 Memory
Memory access on PowerPC is performed through the load and store instructions. Immediate values can be loaded to a register or stored to a location in memory, but the immediate value is limited to 16 bits. When using a load instruction on a non-immediate value, a base register is used, followed by an offset from that register to the desired location. Store instructions work in a similar fashion; the value to be stored is placed into a register, and the store instruction then writes that value to the address formed by a base register plus an offset value. (Multi-word memory instructions exist, but are considered bad practice to use, since they may not be supported in future PowerPC processors.)
Since each PowerPC instruction is 32 bits wide, it is not possible to load a 32-bit address into a register with a single instruction. The standard method of loading a full 32-bit value requires a load-immediate-shifted (lis) followed by an or-immediate (ori). The first instruction loads the high 16 bits, while the second loads the lower 16 bits. (Some people prefer to use add-immediate-shifted against the r0 general-purpose register. The r0 register has a special property: whenever it is used as the base operand of an add-immediate instruction or of a load/store address calculation, it is treated as zero, regardless of its current value. 64-bit PowerPC processors require five separate instructions to load a full 64-bit immediate value into a general-purpose register.) This 16-bit limitation also applies to relative branches and every other instruction that uses an immediate value.
;;
;; Load a 32-bit immediate value and store it to the stack
;;
main:
    lis r5, 0x1122          ; load the high bits of the value
                            ; r5 contains 0x11220000
    ori r5, r5, 0x3344      ; load the low bits of the value
                            ; r5 now contains 0x11223344
    stw r5, 20(r1)          ; store this value to SP+20
    lwz r3, 20(r1)          ; load this value back to r3
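The arithmetic behind the lis/ori pair can be checked with a short Python sketch. It models a 32-bit register only, and the function names are hypothetical stand-ins for the instructions rather than any real assembler API.

```python
from typing import Tuple


def split_immediate(value: int) -> Tuple[int, int]:
    """Split a 32-bit value into the 16-bit halves used by lis and ori."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def lis(imm16: int) -> int:
    """Model "lis rD, imm16": place the immediate in the high half, low half zero."""
    return (imm16 & 0xFFFF) << 16


def ori(register: int, imm16: int) -> int:
    """Model "ori rD, rS, imm16": OR a zero-extended immediate into the low half."""
    return register | (imm16 & 0xFFFF)


hi, lo = split_immediate(0x11223344)
r5 = lis(hi)      # 0x11220000
r5 = ori(r5, lo)  # 0x11223344
```

Because ori zero-extends its immediate, the two halves can be taken straight from the value. A sequence that finishes with an add-immediate would need the high half adjusted whenever the low half has its top bit set.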
3.4 L1 Cache
The PowerPC processor uses one or more on-chip memory caches to accelerate access to frequently referenced data and instructions. This cache memory is separated into a distinct data and instruction cache. Although the data cache operates in coherent mode on Mac OS X, shellcode developers need to be aware of how the data cache and the instruction cache interoperate when executing self-modifying code.
As a superscalar architecture, the PowerPC processor contains multiple execution units, each of which has a pipeline. The pipeline can be described as a conveyor belt in a factory; as an instruction moves down the belt, specific steps are performed. To increase the efficiency of the pipeline, multiple instructions can be put on the belt at the same time, one behind another. The processor will attempt to predict which direction a branch instruction will take and then feed the pipeline with instructions from the predicted path. If the prediction was wrong, the contents of the pipeline are trashed and the correct instructions are loaded into the pipeline instead.
This pipelined execution means that more than one instruction can be processed at the same time in each execution unit. If one instruction requires the output of another, a gap can occur in the pipeline while these dependencies are satisfied.
In the case of a store instruction, the contents of the data cache will be updated before the results are flushed back to main memory. If a load instruction is executed directly after the store, it will obtain the newly updated value. This occurs because the load instruction will read the value from the data cache, where it has already been updated.
The instruction cache is a different beast altogether. On the PowerPC platform, the instruction cache is incoherent. If an executable region of memory is modified and that region is already loaded into the instruction cache, the modified instructions will not be executed unless the cache is specifically flushed. The instruction cache is filled from main memory, not the data cache. If you attempt to modify executable code through a store instruction, flush the cache, and then attempt to execute that code, there is still a chance that the original, unmodified code will be executed instead. This can occur because the data cache was not flushed back to main memory before the instruction cache was filled.
The solution is a bit tricky. You must use the "dcbf" instruction to flush each modified block of memory out of the data cache, wait for the flush to complete with the "sync" instruction, and then invalidate the instruction cache for that block with "icbi". Finally, the "isync" instruction needs to be executed before the modified code is actually used. Placing these instructions in any other order may result in stale data being left in the instruction cache. Due to these restrictions, self-modifying shellcode on the PowerPC platform is rare and often unreliable.
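The required ordering can be written out as a small Python sketch that lists the cache operations for a modified address range. The function name is hypothetical, and the 32-byte cache block size is an assumption that differs between processor models.

```python
from typing import List, Optional, Tuple

CacheOp = Tuple[str, Optional[int]]


def cache_sync_sequence(start: int, length: int, block_size: int = 32) -> List[CacheOp]:
    """List the cache operations for a modified range, in the order described above.

    block_size is an assumed cache block size in bytes, not a universal value.
    """
    if length <= 0 or block_size <= 0:
        raise ValueError("length and block_size must be positive")
    first_block = start - (start % block_size)
    last_byte = start + length - 1
    last_block = last_byte - (last_byte % block_size)

    ops: List[CacheOp] = []
    for block in range(first_block, last_block + 1, block_size):
        ops.append(("dcbf", block))  # push the modified data out to main memory
        ops.append(("sync", None))   # wait for that flush to complete
        ops.append(("icbi", block))  # discard the stale copy in the instruction cache
    ops.append(("isync", None))      # discard any instructions already fetched
    return ops
```

A 32-byte modification starting at 0x1010 spans two 32-byte blocks, so the list holds two dcbf/sync/icbi groups followed by a single isync.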
The example below is a working PowerPC shellcode decoder included with the Metasploit Framework (OSXPPCLongXOR).
;;
;; Demonstrate a cache-safe payload decoder
;; Based on Dino Dai Zovi's PPC decoder (20030821)
;;
main:
    xor. r5, r5, r5         ; Ensure that the cr0 flag is always 'equal'
    bnel main               ; Branch if cr0 is not-equal (never taken) and link
    mflr r31                ; Move the linked address (this instruction) into r31
    addi r31, r31, 68+1974  ; 68 = distance from branch -> payload
                            ; 1974 is null eliding constant
    subi r5, r5, 1974       ; We need this for the dcbf and icbi
    lis r6, 0x9999          ; XOR key = hi16(0x99999999)
    ori r6, r6, 0x9999      ; XOR key = lo16(0x99999999)
    addi r4, r5, 1974 + 4   ; Move the number of words to code into r4
    mtctr r4                ; Set the count register to the word count
xorlp:
    lwz r4, -1974(r31)      ; Load the encoded word from memory
    xor r4, r4, r6          ; XOR this word against our key in r6
    stw r4, -1974(r31)      ; Store the modified word back to memory
    dcbf r5, r31            ; Flush the modified word to main memory
    .long 0x7cff04ac        ; Wait for the data block flush (sync)
    icbi r5, r31            ; Invalidate prefetched block from i-cache
    subi r30, r5, -1978     ; Move to next word without using a NULL
    add. r31, r31, r30
    bdnz- xorlp             ; Branch if --count != 0
    .long 0x4cff012c        ; Wait for i-cache to synchronize (isync)
    ; Insert XORed payload here
    .long (0x7fe00008 ^ 0x99999999)
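The encoding side of this decoder, and the reason the listing emits sync as a raw .long, can be sketched in Python. The helper names are hypothetical; the key and the sample payload word are taken from the listing, and 0x7c0004ac is the standard sync encoding with its unused bits left clear.

```python
import struct
from typing import List

XOR_KEY = 0x99999999  # the key the decoder builds in r6


def xor_words(words: List[int], key: int = XOR_KEY) -> List[int]:
    """XOR each 32-bit word with the key; applying it twice restores the input."""
    return [(word ^ key) & 0xFFFFFFFF for word in words]


def has_null_byte(word: int) -> bool:
    """Return True if the big-endian encoding of the word contains a zero byte."""
    return 0 in struct.pack(">I", word)


payload = [0x7FE00008]        # the sample payload word from the listing
encoded = xor_words(payload)  # what the final .long line assembles to
decoded = xor_words(encoded)  # what the xorlp loop writes back
```

The sample word contains a zero byte, while its encoded form, 0xe6799991, does not. The same check explains the raw sync word: 0x7c0004ac contains a zero byte and 0x7cff04ac does not.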