The story of exploiting kmalloc() overflows

Written by : qobaiashi

Summary

Intro

Recently many kernel level heap (kmalloc) overflows have been discovered whose exploitability was rated "unclear". This paper aims at explaining the kernel's heap management with security in mind, and at showing how heap overflows in kernel space can be exploited.

The Slab allocator

The slab allocator serves the kernel's need for small, dynamic and thus fragmented memory portions. Pages requested from the buddy system (0x1000 bytes each) are further split into caches, which hold slabs, which in turn hold objects - the smallest unit of memory, pointed to by a returned kmalloc() pointer. To increase performance the slab allocator keeps several objects of the same size together in so-called slabs: a just-freed object can - without touching the buddy system - be handed straight back to, for example, a driver module that repeatedly allocates objects of the same size.
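One practical consequence for us: kmalloc() does not hand out exactly the requested size, it rounds the request up to the next general-purpose cache (size-32, size-64, ... size-131072 in the listing below), so an overflowing object always sits inside a cache of that rounded size. A minimal userspace sketch of this rounding - the function name is invented here, this is not kernel code:

```c
#include <stddef.h>

/* Userspace sketch (NOT kernel code): which general-purpose size-* cache
 * a kmalloc() request of 'requested' bytes would land in. The size-*
 * caches grow in powers of two from 32 up to 131072 bytes. */
static size_t kmalloc_cache_size(size_t requested)
{
        size_t cache = 32;              /* smallest size-* cache */

        while (cache < requested && cache < 131072)
                cache <<= 1;            /* next power-of-two cache */
        return cache;
}
```

So a 100-byte request is served from size-128, which also tells you how far your overflowing object is from its neighbour.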

Caches

  +-------+--->+-------+----->+-------+
  | CACHE |    | CACHE |      | CACHE |
  |       |<---|       |<-----|       |
  +---+---+    +---+---+      +---+---+
                   |             
                   V
        +--------------------+
        | SLAB | SLAB | SLAB |
        |------|------|------|
        |  O   |  B   |  J   |
        |------|------|------|
        |  E   |  C   |  T   |
        |------|------|------|
        |  S   |      |      |
        +--------------------+

A list of all active caches is at

qobaiashi@cocoon:~> cat /proc/slabinfo <----
slabinfo - version: 1.1 (SMP)
kmem_cache            80     80    244    5    5    1 :  252  126
fib6_nodes           113    113     32    1    1    1 :  252  126
ip6_dst_cache         20     20    192    1    1    1 :  252  126
ndisc_cache           30     30    128    1    1    1 :  252  126
hpsb_packet            0      0    100    0    0    1 :  252  126
ip_fib_hash          113    113     32    1    1    1 :  252  126
clip_arp_cache         0      0    128    0    0    1 :  252  126
ip_mrt_cache           0      0     96    0    0    1 :  252  126
tcp_tw_bucket         30     30    128    1    1    1 :  252  126
tcp_bind_bucket      113    113     32    1    1    1 :  252  126
tcp_open_request      40     40     96    1    1    1 :  252  126
ip_dst_cache          20     20    192    1    1    1 :  252  126
arp_cache             30     30    128    1    1    1 :  252  126
blkdev_requests     1560   1560     96   39   39    1 :  252  126
nfs_write_data         0      0    384    0    0    1 :  124   62
nfs_read_data          0      0    352    0    0    1 :  124   62
nfs_page               0      0     96    0    0    1 :  252  126
ext2_xattr             0      0     44    0    0    1 :  252  126
kioctx                 0      0    128    0    0    1 :  252  126
kiocb                  0      0     96    0    0    1 :  252  126
eventpoll pwq          0      0     36    0    0    1 :  252  126
eventpoll epi          0      0     96    0    0    1 :  252  126
dnotify_cache        338    338     20    2    2    1 :  252  126
file_lock_cache       40     40     96    1    1    1 :  252  126
async poll table       0      0    144    0    0    1 :  252  126
fasync_cache         126    202     16    1    1    1 :  252  126
uid_cache            113    113     32    1    1    1 :  252  126
skbuff_head_cache     80     80    192    4    4    1 :  252  126
sock                 216    216   1344   72   72    1 :   60   30
sigqueue              28     28    136    1    1    1 :  252  126
kiobuf                 0      0     64    0    0    1 :  252  126
cdev_cache           531    531     64    9    9    1 :  252  126
bdev_cache            59     59     64    1    1    1 :  252  126
mnt_cache             59     59     64    1    1    1 :  252  126
inode_cache        20808  20808    512 2601 2601    1 :  124   62
dentry_cache       23010  23010    128  767  767    1 :  252  126
dquot                  0      0    128    0    0    1 :  252  126
filp                1110   1110    128   37   37    1 :  252  126
names_cache            8      8   4096    8    8    1 :   60   30
buffer_head        32310  32310    128 1077 1077    1 :  252  126
mm_struct             48     48    160    2    2    1 :  252  126
vm_area_struct      1904   2408     68   43   43    1 :  252  126
fs_cache              59     59     64    1    1    1 :  252  126
files_cache           54     54    416    6    6    1 :  124   62
signal_act            51     51   1312   17   17    1 :   60   30
pae_pgd              113    113     32    1    1    1 :  252  126
size-131072(DMA)       0      0 131072    0    0   32 :    0    0
size-131072            0      0 131072    0    0   32 :    0    0
size-65536(DMA)        0      0  65536    0    0   16 :    0    0
size-65536            20     20  65536   20   20   16 :    0    0
size-32768(DMA)        0      0  32768    0    0    8 :    0    0
size-32768             3      3  32768    3    3    8 :    0    0
size-16384(DMA)        0      0  16384    0    0    4 :    0    0
size-16384             0      5  16384    0    5    4 :    0    0
size-8192(DMA)         0      0   8192    0    0    2 :    0    0
size-8192              5     10   8192    5   10    2 :    0    0
size-4096(DMA)         0      0   4096    0    0    1 :   60   30
size-4096             40     40   4096   40   40    1 :   60   30
size-2048(DMA)         0      0   2048    0    0    1 :   60   30
size-2048             20     20   2048   10   10    1 :   60   30
size-1024(DMA)         0      0   1024    0    0    1 :  124   62
size-1024             92     92   1024   23   23    1 :  124   62
size-512(DMA)          0      0    512    0    0    1 :  124   62
size-512             104    104    512   13   13    1 :  124   62
size-256(DMA)          0      0    256    0    0    1 :  252  126
size-256              75     75    256    5    5    1 :  252  126
size-128(DMA)          0      0    128    0    0    1 :  252  126
size-128             900    900    128   30   30    1 :  252  126
size-64(DMA)           0      0     64    0    0    1 :  252  126
size-64             3835   3835     64   65   65    1 :  252  126
size-32(DMA)           0      0     32    0    0    1 :  252  126
size-32              904    904     32    8    8    1 :  252  126

This list tells you the cache name, the number of active objects, the total object count, the size of the managed objects, the number of objects inside a slab and the pages per slab. Here you can see that the kernel allocates special caches for filesystem drivers, network structures etc. as well as general-purpose caches (size-*), each available for DMA and for ordinary memory access. All caches are kept in a doubly linked list, which makes it easy to traverse them in order to find a suitable cache to allocate from.

Every cache holds three linked lists of slabs: one for free, one for partially free and one for full slabs. Additionally, every cache has a per-CPU array pointing to free objects in the slabs, managed LIFO (just-kfree'd objects should be handed out again as soon as possible, while they are still cache-hot).
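This LIFO behaviour is exactly what makes grooming possible: the object you just freed is the object the next allocation of that size gets. A toy userspace model of the per-CPU free-object array - all names here are invented for the simulation:

```c
#include <stddef.h>

#define AVAIL_MAX 16

/* Toy model of a per-CPU cache: a stack of recently freed objects. */
struct cpucache_sim {
        unsigned int avail;             /* number of cached free objects */
        void *objs[AVAIL_MAX];          /* most recently freed on top */
};

static void sim_kfree(struct cpucache_sim *cc, void *obj)
{
        if (cc->avail < AVAIL_MAX)
                cc->objs[cc->avail++] = obj;            /* push */
}

static void *sim_kmalloc(struct cpucache_sim *cc)
{
        return cc->avail ? cc->objs[--cc->avail] : NULL; /* pop */
}
```

Free an object, trigger an allocation of the same size, and the victim structure lands in the memory you just gave up.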

+-------+ 
| CACHE |
|-------|                   +-------+    +----------------+
|       |------------------>| CPU_0 |--->| Array w/ ptrs  |
|       |                   | CPU_N |    | to unused objs |
|  free |-->[ SLAB HEAD ]   +-------+    | in slabs       |
|       |                                +----------------+
|partial|---------------------->+--------+---->+--------+---->+--------+---->+-
|       |<----------------------|  SLAB  |<----|  SLAB  |<----|  SLAB  |<----| 
|  full |--[ SLAB HEAD ]        |  HEAD  |     |  HEAD  |     |  HEAD  |     |
|       |                       +--------+     +--------+     +--------+     +
+-------+                       |        |     |        |     |        |     |
                                |   obj  |     |   obj  |     |   obj  |     |
                                .        .     .        .     .        .     .
                                .        .     .        .     .        .     .

slab.c

/*
 * kmem_cache_t
 *
 * manages a cache.
 */

#define CACHE_NAMELEN   20      /* max name length for a slab cache */

struct kmem_cache_s {
/* 1) each alloc & free */
        /* full, partial first, then free */
        struct list_head        slabs_full;
        struct list_head        slabs_partial;
        struct list_head        slabs_free;
        unsigned int            objsize;
        unsigned int            flags;  /* constant flags */
        unsigned int            num;    /* # of objs per slab */
        spinlock_t              spinlock;
#ifdef CONFIG_SMP
        unsigned int            batchcount;
#endif

/* 2) slab additions /removals */
        /* order of pgs per slab (2^n) */
        unsigned int            gfporder;

        /* force GFP flags, e.g. GFP_DMA */
        unsigned int            gfpflags;

        size_t                  colour;         /* cache colouring range */
        unsigned int            colour_off;     /* colour offset */
        unsigned int            colour_next;    /* cache colouring */
        kmem_cache_t            *slabp_cache;
        unsigned int            growing;
        unsigned int            dflags;         /* dynamic flags */

        /* constructor func */
        void (*ctor)(void *, kmem_cache_t *, unsigned long);

        /* de-constructor func */
        void (*dtor)(void *, kmem_cache_t *, unsigned long);

        unsigned long           failures;

/* 3) cache creation/removal */
        char                    name[CACHE_NAMELEN];
        struct list_head        next;
#ifdef CONFIG_SMP
/* 4) per-cpu data */
        cpucache_t              *cpudata[NR_CPUS];
#endif
#if STATS
        unsigned long           num_active;
        unsigned long           num_allocations;
        unsigned long           high_mark;
        unsigned long           grown;
        unsigned long           reaped;
        unsigned long           errors;
#ifdef CONFIG_SMP
        atomic_t                allochit;
        atomic_t                allocmiss;
        atomic_t                freehit;
        atomic_t                freemiss;
#endif
#endif
};

Slabs

As seen in Fig.I slabs hold objects of the same size. The head of each slab looks as follows:

list.h

struct list_head {
        struct list_head *next, *prev;
};

typedef struct list_head list_t;

slab.c

typedef struct slab_s {
        struct list_head        list;           
        unsigned long           colouroff;  
        void                    *s_mem;     
        unsigned int            inuse;      
        kmem_bufctl_t           free;  //typedef unsigned int kmem_bufctl_t;     
} slab_t;

Each header is located PAGE_SIZE aligned at the beginning of an (on-slab) slab. Every object in the slab is sizeof(void *) aligned to increase access speed. After this header follows an array containing an int value for every object. These values only matter for currently free objects, where they serve as the index of the next free object in the slab.

A value called BUFCTL_END (slab.c: #define BUFCTL_END 0xffffFFFF) marks the end of this array. "colouroff" describes "offsetting the slab_t structure into the slab area to maximize cache alignment." (slab.c)
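To make the bufctl mechanism concrete, here is a userspace sketch of the index chain - bufctl[i] holds the index of the next free object, slab->free points at the first one, and BUFCTL_END terminates the chain (the variable names besides BUFCTL_END are made up):

```c
#include <stddef.h>

#define BUFCTL_END 0xffffFFFF           /* as in slab.c */
#define NUM_OBJS 4

static unsigned int bufctl[NUM_OBJS];   /* one entry per object */
static unsigned int slab_free;          /* index of first free object */

/* fresh slab: the free chain is simply 0 -> 1 -> 2 -> 3 -> BUFCTL_END */
static void slab_init(void)
{
        unsigned int i;

        for (i = 0; i < NUM_OBJS - 1; i++)
                bufctl[i] = i + 1;
        bufctl[NUM_OBJS - 1] = BUFCTL_END;
        slab_free = 0;
}

/* returns the object index handed out, or BUFCTL_END when the slab is full */
static unsigned int slab_alloc(void)
{
        unsigned int idx = slab_free;

        if (idx != BUFCTL_END)
                slab_free = bufctl[idx];        /* advance the chain */
        return idx;
}
```

Note that in a fresh slab objects are handed out in increasing address order - useful to know when you want your overflow target allocated directly after your own object.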

It is also possible for the slab header to be stored "off-slab" in a separate object. Thanks to the s_mem member of the slab_t struct it does not matter where the slab header lives, since s_mem always points to the beginning of the slab's objects. The decision between on- and off-slab is made in kmem_cache_create():

        /* Determine if the slab management is 'on' or 'off' slab. */
        if (size >= (PAGE_SIZE>>3)) // if (size-requested >= 512)
                /*
                 * Size is large, assume best to place the slab management obj
                 * off-slab (should allow better packing of objs).
                 */
                flags |= CFLGS_OFF_SLAB; //a special flag was set

So the header goes off-slab if the requested object size is >= 512 bytes (PAGE_SIZE>>3 with 4096-byte pages). BUT:

        /*
         * If the slab has been placed off-slab, and we have enough space then
         * move it on-slab. This is at the expense of any extra colouring.
         */
        if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
                flags &= ~CFLGS_OFF_SLAB;
                left_over -= slab_size;
        }

If the header fits into the left-over slab space, it is moved back on-slab after all!!
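The two snippets above boil down to a single decision, which can be sketched in userspace like this (PAGE_SIZE_SIM and the function name are made up for illustration):

```c
#include <stddef.h>

#define PAGE_SIZE_SIM 4096

/* Sketch of the on/off-slab decision in kmem_cache_create():
 * returns nonzero if the slab header ends up off-slab. */
static int header_off_slab(size_t objsize, size_t left_over, size_t slab_size)
{
        /* large objects: start with the header off-slab */
        int off_slab = (objsize >= (PAGE_SIZE_SIM >> 3));

        /* but if the left-over space can hold it, move it back on-slab */
        if (off_slab && left_over >= slab_size)
                off_slab = 0;
        return off_slab;
}
```

For exploitation this matters because an on-slab header sits inside the same pages as the objects you are overflowing.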

Exploitation notes

As we are writing towards higher memory addresses it is NOT possible to overwrite the current slab's header, which lies below our object (and was written before it):

<--0x00                             .-overflowing obj          0xff ->
[SLAB HEADER][COLOUR][obj1 ][obj 2][ctrl_][aaaaa][aaaaa][aaaaa][aaaaa]

What we CAN overwrite are objects allocated after ours inside our slab. Here we must keep in mind that the kernel tries to balance the load between all slabs, so all "partial" slabs should have roughly the same count of in-use objects.....

The problem is that by default we only overflow *random* other objects - whatever other kernel routines happen to allocate after our bug trigger.

Randomness or increasing reliability

By carefully consuming objects of a certain cache you can - with the above stuff in mind - control quite well which object ends up next to our overflowing object, and thus quite reliably exploit the use of that neighbouring object...
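The core of the technique - adjacent objects in one slab, so bytes written past our object land in the neighbour - can be modelled in a few lines of userspace C (the fake slab and all names here are invented for the simulation):

```c
#include <string.h>

#define OBJ_SIZE 32
#define OBJS_PER_SLAB 4

/* one fake slab: OBJS_PER_SLAB adjacent objects of OBJ_SIZE bytes */
static char slab_mem[OBJS_PER_SLAB * OBJ_SIZE];

static char *obj(int idx)
{
        return slab_mem + idx * OBJ_SIZE;
}

/* overflow object 'idx' by 'extra' bytes of 'A' - the extra bytes
 * land at the start of the NEXT object in the slab */
static void overflow_obj(int idx, int extra)
{
        memset(obj(idx), 'A', OBJ_SIZE + extra);
}
```

If you have groomed the cache so that the neighbouring object is something with a function pointer or a length field in its first bytes, those are exactly the bytes you control.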

Got it.. use your brains!

It's time to find kernel functions you can trigger from userspace that allocate objects in the same cache your overflow lives in.

And erm successful exploitation is actually possible!

Have fun!

-q