Once upon a free()...

Written by : Anonymous
First published on : Phrack

On the Unix system, and later in the C standard library there are functions to handle variable amounts of memory in a dynamic way. This allows programs to dynamically request memory blocks from the system. The operating system only provides a very rough system call 'brk' to change the size of a big memory chunk, which is known as the heap.

On top of this system call the malloc interface is located, which provides a layer between the application and the system call. It can dynamically split the large single block into smaller chunks, free those chunks on request of the application and avoid fragmentation while doing so. You can compare the malloc interface to a linear file system on a large, but dynamically sized raw device.

There are a few design goals which have to be met by the malloc interface:

There are only a few common malloc implementations. The most common ones are the System V one, implemented by AT&T, the GNU C Library implementation and the malloc-similar interface of the Microsoft operating systems (RtlHeap*).

Here is a table of algorithms and which operating systems use them:

Algorithm               | Operating System
------------------------+--------------------------------------------------
BSD kingsley            | 4.4BSD, AIX (compatibility), Ultrix
BSD phk                 | BSDI, FreeBSD, OpenBSD
GNU Lib C (Doug Lea)    | Hurd, Linux
System V AT&T           | Solaris, IRIX
Yorktown                | AIX (default)
RtlHeap*                | Microsoft Windows *
------------------------+--------------------------------------------------

It is interesting to see that most of the malloc implementations are very easy to port and that they are architecture independent. Most of those implementations just build an interface with the 'brk' system call. You can change this behaviour with a #define. All of the implementations I have come across are written in ANSI C and just do very minimal or even no sanity checking. Most of them have a special compilation define that includes asserts and extra checks. Those are turned off by default in the final build for performance reasons. Some of the implementations also offer extra reliability checks that will detect buffer overflows. Those are made to detect overflows while development, not to stop exploitation in the final release.

Storing management info in-band

Most malloc implementations share the behaviour of storing their own management information, such as lists of used or free blocks, sizes of memory blocks and other useful data within the heap space itself. Since the whole idea of malloc/free is based on the dynamic requirements the application has, the management info itself occupies a variable amount of data too. Because of this, the implementation can seldomly just reserve a certain amount of memory for its own purposes, but stores the management information "in-band", right after and before the blocks of memory that are used by the application.

Some applications do request a block of memory using the malloc interface, which later happens to be vulnerable to a buffer overflow. This way, the data behind the chunk can be changed. Possibly the malloc management structures can be compromised. This has been demonstrated first by Solar Designer's wizard-like exploit [1].

The central attack of exploiting malloc allocated buffer overflows is to modify this management information in a way that will allow arbitrary memory overwrites afterwards. This way pointers can be overwritten within the writeable process memory, hence allowing modification of return addresses, linkage tables or application level data.

To mount such an attack, we have to take a deep look within the internal workings of the implementation we want to exploit. This article discusses the commonly used GNU C Library and the System V implementation and how to gain control over a process using buffer overflows which occur in malloced buffers under Linux, Solaris and IRIX systems.

System V malloc implementation

IRIX and Solaris use an implementation which is based on self-adjusting binary trees. The theoretical background of this implementation has been described in [2].

The basic idea of this implementation is to keep lists of equally sized malloc chunks within a binary tree. If you allocate two chunks of the same size, they will be within the same node and within the same list of this node. The tree is ordered by the size of its elements.

The TREE structure

The definition of the TREE structure can be found in the mallint.h, along with some easy-to-use macros to access its elements. The mallint.h file can be found in the source distribution of the Solaris operating system [4]. Although I cannot verify that IRIX is based on the same source, there are several similarities which indicated this. The malloc interface internally creates the same memory layout and functions, besides some 64 bit alignments. You can utilize the Solaris source for your IRIX exploits, too.

To allow each tree element to be used for a different purpose to avoid overhead and force an alignment, each TREE structure element is defined as a union:

/* the proto-word; size must be ALIGN bytes */
typedef union _w_ {
	size_t		w_i;		/* an unsigned int */
	struct _t_	*w_p;		/* a pointer */
	char		w_a[ALIGN];	/* to force size */
} WORD;

Central TREE structure definition:

/* structure of a node in the free tree */
typedef struct _t_ {
	WORD	t_s;	/* size of this element */
	WORD	t_p;	/* parent node */
	WORD	t_l;	/* left child */
	WORD	t_r;	/* right child */
	WORD	t_n;	/* next in link list */
	WORD	t_d;	/* dummy to reserve space for self-pointer */
} TREE;

The 't_s' element of the chunk header contains the rounded up value of the size the user requested when he called malloc. Since this size is always rounded up to a word boundary, at least the lower two bits of the 't_s' elements are unused - they normally would have the value of zero all the time. Instead of being zero, they are ignored for all size-related operations. They are used as flag elements.

From the malloc.c source it reads:

   BIT0: 1 for busy (block is in use), 0 for free.

   BIT1: if the block is busy, this bit is 1 if the preceding block in
         contiguous memory is free. Otherwise, it is always 0.

TREE Access macros:

/* usable # of bytes in the block */
#define	SIZE(b)		(((b)->t_s).w_i)

/* free tree pointers */
#define	PARENT(b)	(((b)->t_p).w_p)
#define	LEFT(b)		(((b)->t_l).w_p)
#define	RIGHT(b)	(((b)->t_r).w_p)

/* forward link in lists of small blocks */
#define	AFTER(b)	(((b)->t_p).w_p)

/* forward and backward links for lists in the tree */
#define	LINKFOR(b)	(((b)->t_n).w_p)
#define	LINKBAK(b)	(((b)->t_p).w_p)

For all allocation operations a certain alignment and minimum size is enforced, which is defined here:

#define	WORDSIZE	(sizeof (WORD))
#define	MINSIZE		(sizeof (TREE) - sizeof (WORD))
#define	ROUND(s)	if (s % WORDSIZE) s += (WORDSIZE - (s % WORDSIZE))

The tree structure is the central element of each allocated chunk. Normally only the 't_s' and 't_p' elements are used, and user data is stored from 't_l' on. Once the node is freed, this changes and the data is reused to manage the free elements more efficiently. The chunk represents an element within the splay tree. As more chunks get freed, the malloc implementation tries to merge the free chunks right next to it. At most FREESIZE (32 by default) chunks can be in this dangling free state at the same time. They are all stored within the 'flist' array. If a call to free is made while the list is already full, the old element at this place falls out and is forwarded to realfree. The place is then occupied by the newly freed element.

This is done to speed up and avoid defragmentation in cases where a lot of calls to free are made in a row. The real merging process is done by realfree. It inserts the chunk into the central tree, which starts at the 'Root' pointer. The tree is ordered by the size of its elements and is self-balancing. It is a so called "splay tree", in which the elements cycle in a special way to speed up searches (see google.com "splay tree" for further information). This is not much of importance here, but keep in mind that there are two stages of free chunks: one being within the flist array, and one within the free-elements tree starting at 'Root'.

There are some special management routines for allocating small chunks of memory, which happen to have a size below 40 bytes. Those are not considered here, but the basic idea is to have extra lists of them, not keeping them within a tree but in lists, one for each WORD matching size below 40.

There is more than one way to exploit a malloc based buffer overflow, however here is one method which works against both, IRIX and Solaris.

As a chunk is realfree'd, it is checked whether the neighbor-chunks are already within the realfree'd tree. If it is the case, the only thing that has to be done is to logically merge the two chunks and reorder its position within the tree, as the size has changed.

This merging process involves pointer modification within the tree, which consists of nodes. These nodes are represented by the chunk header itself. Pointers to other tree elements are stored there. If we can overwrite them, we can possibly modify the operation when merging the chunks.

Here is, how it is done in malloc.c (modified to show the interesting part of it) :

static void
realfree(void *old)
{
	TREE	*tp, *sp, *np;
	size_t	ts, size;

	/* pointer to the block */
	tp = BLOCK(old);
	ts = SIZE(tp);
	if (!ISBIT0(ts))
		return;
	CLRBITS01(SIZE(tp));

	/* see if coalescing with next block is warranted */
	np = NEXT(tp);
	if (!ISBIT0(SIZE(np))) {
		if (np != Bottom)
			t_delete(np);
		SIZE(tp) += SIZE(np) + WORDSIZE;
	}

We remember NEXT points to the chunk directly following the current one. So we have this memory layout:

          tp               old              np
          |                |                |
          [chunk A header] [chunk A data] | [chunk B or free ....]
                                          |
                                          chunk boundary

In the usual situation the application has allocated some space and got a pointer (old) from malloc. It then messes up and allows a buffer overflow of the chunk data. We cross the chunk boundary by overflowing and hit the data behind, which is either free space or another used chunk.

	np = NEXT(tp);

Since we can only overflow data behind 'old', we cannot modify the header of our own chunk. Therefore we cannot influence the 'np' pointer in any way. It always points to the chunk boundary.

Now a check is made to test if it is possible to merge forward, that is our chunk and the chunk behind it. Remember that we can control the chunk to the right of us.

	if (!ISBIT0(SIZE(np))) {
		if (np != Bottom)
			t_delete(np);
		SIZE(tp) += SIZE(np) + WORDSIZE;
	}

BIT0 is zero if the chunk is free and within the free elements tree. So if it is free and not the last chunk, the special 'Bottom' chunk, it is deleted from the tree. Then the sizes of both chunks are added and later in the code of the realfree function the whole resized chunk is reinserted into the tree.

One important part is that the overflowed chunk must not be the last chunk within the malloc space, condition:

1. Overflowed chunk must not be the last chunk

Here is how the 't_delete' function works:

static void
t_delete(TREE *op)
{
	TREE	*tp, *sp, *gp;

	/* if this is a non-tree node */
	if (ISNOTREE(op)) {
		tp = LINKBAK(op);
		if ((sp = LINKFOR(op)) != NULL)
			LINKBAK(sp) = tp;
		LINKFOR(tp) = sp;
		return;
	}

There are other cases, but this is the one easiest to exploit. As I am already tired of this, I will just explain this one here. The others are very similar (look at malloc.c).

ISNOTREE compares the 't_l' element of the TREE structure with -1. -1 is the special marker for non-tree nodes, which are used as doubly linked list, but that does not matter.

Anyway, this is the first condition we have to obey:

        2. fake->t_l = -1;

Now the unlinking between FOR (t_n) and BAK (t_p) is done, which can be rewritten as:

	t1 = fake->t_p
	t2 = fake->t_n
	t2->t_p = t1
	t1->t_n = t2

Which is (written in pseudo-raw-assignments which happen at the same time):

	[t_n + (1 * sizeof (WORD))] = t_p
	[t_p + (4 * sizeof (WORD))] = t_n

This way we can write to arbitrary addresses together with valid addresses at the same time. We choose to use this:

	t_p = retloc - 4 * sizeof (WORD)
	t_n = retaddr

This way retloc will be overwritten with retaddr and *(retaddr + 8) will be overwritten with retloc. If there is code at retaddr, there should be a small jump over the bytes 8-11 to not execute this address as code. Also, the addresses can be swapped if that fits the situation better.

Finally our overwrite buffer looks like this:

  | <t_s> <t_p> <t_l> <j: t_r> <t_n> <j: t_d>
  |
  chunk boundary

Where: t_s = some small size with lower two bits zeroed out
       t_p = retloc - 4 * sizeof (WORD)
       t_l = -1
       t_r = junk
       t_n = retaddr
       t_d = junk

Note that although all of the data is stored as 32 bit pointers, each structure element occupies eight bytes. This is because of the WORD union, which forces at least ALIGN bytes to be used for each element. ALIGN is defined to eight by default.

So a real overflow buffer behind the chunk boundary might look like:

ff ff ff f0 41 41 41 41  ef ff fc e0 41 41 41 41  | ....AAAA....AAAA
ff ff ff ff 41 41 41 41  41 41 41 41 41 41 41 41  | ....AAAAAAAAAAAA
ef ff fc a8 41 41 41 41  41 41 41 41 41 41 41 41  | ....AAAAAAAAAAAA

All 'A' characters can be set arbitrarily. The 't_s' element has been replaced with a small negative number to avoid NUL bytes. If you want to use NUL bytes, use very few. Otherwise the realfree function will crash later.

The buffer above will overwrite:

	[0xeffffce0 + 32] = 0xeffffca8
	[0xeffffca8 +  8] = 0xeffffce0

See the example code (mxp.c) for a more in-depth explanation.

To summarize down the guts if you happen to exploit a malloc based buffer overflow on IRIX or Solaris:
  1. Create a fake chunk behind the one you overflow
  2. The fake chunk is merged with the one you overflow as it is passed to realfree
  3. To make it pass to realfree it has to call malloc() again or there have to be a lot of successive free() calls
  4. The overflowed chunk must not be the last chunk (the one before Bottom)
  5. Prepend the shellcode/nop-space with jump-aheads to not execute the unavoidable unlink-overwrite address as code
  6. Using the t_splay routines attacks like this are possible too, so if you cannot use the attack described here (say you cannot write 0xff bytes), use the source luke.

There are a lot of other ways to exploit System V malloc management, way more than there are available in the GNU implementation. This is a result of the dynamic tree structure, which also makes it difficult to understand sometimes. If you have read until here, I am sure you can find your own ways to exploit malloc based buffer overflows.

GNU C Library implementation

The GNU C library keeps the information about the memory slices the application requests in so called 'chunks'. They look like this (adapted from malloc.c):

             +----------------------------------+
    chunk -> | prev_size                        |
             +----------------------------------+
             | size                             |
             +----------------------------------+
      mem -> | data                             |
             : ...                              :
             +----------------------------------+
nextchunk -> | prev_size ...                    |
             :                                  :

Where mem is the pointer you get as return value from malloc(). So if you do a:

        unsigned char *	mem = malloc (16);

Then 'mem' is equal to the pointer in the figure, and (mem - 8) would be equal to the 'chunk' pointer.

The 'prev_size' element has a special function: If the chunk before the current one is unused (it was free'd), it contains the length of the chunk before. In the other case - the chunk before the current one is used - 'prev_size' is part of the 'data' of it, saving four bytes.

The 'size' field has a special meaning. As you would expect, it contains the length of the current block of memory, the data section. As you call malloc(), four is added to the size you pass to it and afterwards the size is padded up to the next double-word boundary. So a malloc(7) will become a malloc(16), and a malloc(20) will become malloc(32). For malloc(0) it will be padded to malloc(8). The reason for this behaviour will be explained in the latter.

Since this padding implies that the lower three bits are always zero and are not used for real length, they are used another way. They are used to indicate special attributes of the chunk. The lowest bit, called PREV_INUSE, indicates whether the previous chunk is used or not. It is set if the next chunk is in use. The second least significant bit is set if the memory area is mmap'ed -- a special case which we will not consider. The third least significant bit is unused.

To test whether the current chunk is in use or not, we have to check the next chunk's PREV_INUSE bit within its size value.

Once we free() the chunk, using free(mem), some checks take place and the memory is released. If its neighbour blocks are free, too (checked using the PREV_INUSE flag), they will be merged to keep the number of reuseable blocks low, but their sizes as large as possible. If a merge is not possible, the next chunk is tagged with a cleared PREV_INUSE bit, and the chunk changes a bit:

             +----------------------------------+
    chunk -> | prev_size                        |
             +----------------------------------+
             | size                             |
             +----------------------------------+
      mem -> | fd                               |
             +----------------------------------+
             | bk                               |
             +----------------------------------+
             | (old memory, can be zero bytes)  |
             :                                  :

nextchunk -> | prev_size ...                    |
             :                                  :

You can see that there are two new values, where our data was previously stored (at the 'mem' pointer). Those two values, called 'fd' and 'bk' - forward and backward, that is, are pointers. They point into a double linked list of unconsolidated blocks of free memory. Every time a new free is issued, the list will be checked, and possibly unconsolidated blocks are merged. The whole memory gets defragmented from time to time to release some memory.

Since the malloc size is always at least 8 bytes, there is enough space for both pointers. If there is old data remaining behind the 'bk' pointer, it remains unused until it gets malloc'd again.

The interesting thing regarding this management, is that the whole internal information is held in-band -- a clear channeling problem. (just as with format string commands within the string itself, as control channels in breakable phonelines, as return addresses within stack memory, etc).

Since we can overwrite this internal management information if we can overwrite a malloced area, we should take a look at how it is processed later on. As every malloc'ed area is free()'d again in any sane program, we take a look at free, which is a wrapper to chunk_free() within malloc.c (simplified a bit, took out #ifdef's):

static void
chunk_free(arena *ar_ptr, mchunkptr p)
{
  size_t     hd = p->size; /* its head field */
  size_t     sz;           /* its size */
  int        idx;          /* its bin index */
  mchunkptr  next;         /* next contiguous chunk */
  size_t     nextsz;       /* its size */
  size_t     prevsz;       /* size of previous contiguous chunk */
  mchunkptr  bck;          /* misc temp for linking */
  mchunkptr  fwd;          /* misc temp for linking */
  int        islr;         /* track whether merging with last_remainder */

  check_inuse_chunk(ar_ptr, p);

  sz = hd & ~PREV_INUSE;
  next = chunk_at_offset(p, sz);
  nextsz = chunksize(next);

Since the malloc management keeps chunks within special structures called 'arenas', it is now tested whether the current chunk that should be free directly borders to the 'top' chunk -- a special chunk. The 'top' chunk is always the top-most available memory chunk within an arena, it is the border of the available memory. The whole if-block is not interesting for typical buffer overflows within the malloc space.

  if (next == top(ar_ptr))                         /* merge with top */
  {
    sz += nextsz;

    if (!(hd & PREV_INUSE))                    /* consolidate backward */
    {
      prevsz = p->prev_size;
      p = chunk_at_offset(p, -(long)prevsz);
      sz += prevsz;
      unlink(p, bck, fwd);
    }

    set_head(p, sz | PREV_INUSE);
    top(ar_ptr) = p;

      if ((unsigned long)(sz) >= (unsigned long)trim_threshold)
        main_trim(top_pad);
    return;
  }

Now the 'size' of the current chunk is tested whether the previous chunk is unused (testing for the PREV_INUSE flag). If this is the case, both chunks are merged.

  islr = 0;

  if (!(hd & PREV_INUSE))                    /* consolidate backward */
  {
    prevsz = p->prev_size;
    p = chunk_at_offset(p, -(long)prevsz);
    sz += prevsz;

    if (p->fd == last_remainder(ar_ptr))     /* keep as last_remainder */
      islr = 1;
    else
      unlink(p, bck, fwd);
  }

Now the same is done vice versa. It is checked whether the chunk in front of the current chunk is free (testing for the PREV_INUSE flag of the size two chunks ahead). If this is the case the chunk is also merged into the current one.

  if (!(inuse_bit_at_offset(next, nextsz)))   /* consolidate forward */
  {
    sz += nextsz;

    if (!islr && next->fd == last_remainder(ar_ptr))
                                              /* re-insert last_remainder */
    {
      islr = 1;
      link_last_remainder(ar_ptr, p);
    }
    else
      unlink(next, bck, fwd);

    next = chunk_at_offset(p, sz);
  }
  else
    set_head(next, nextsz);                  /* clear inuse bit */

  set_head(p, sz | PREV_INUSE);
  next->prev_size = sz;
  if (!islr)
    frontlink(ar_ptr, p, sz, idx, bck, fwd);
}

As Solar Designer showed us, it is possible to use the 'unlink' macro to overwrite arbitrary memory locations. Here is how to do:

A usual buffer overflow situation might look like:

        mem = malloc (24);
        gets (mem);
        ...
        free (mem);

This way the malloc'ed chunk looks like this:

[ prev_size ] [ size P] [ 24 bytes ... ] (next chunk from now)
       [ prev_size ] [ size P] [ fd ] [ bk ] or [ data ... ]

As you can see, the next chunk directly borders to our chunk we overflow. We can overwrite anything behind the data region of our chunk, including the header of the following chunk.

If we take a closer look at the end of the chunk_free function, we see this code:

  if (!(inuse_bit_at_offset(next, nextsz)))   /* consolidate forward */
  {
    sz += nextsz;

    if (!islr && next->fd == last_remainder(ar_ptr))
                                              /* re-insert last_remainder */
    {
      islr = 1;
      link_last_remainder(ar_ptr, p);
    }
    else
      unlink(next, bck, fwd);

    next = chunk_at_offset(p, sz);
  }

The inuse_bit_at_offset, is defined as macro in the beginning of malloc.c:

#define inuse_bit_at_offset(p, s)\
 (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)

Since we control the header of the 'next' chunk we can trigger the whole if block at will. The inner if statement is uninteresting, except our chunk is bordering to the top-most chunk. So if we choose to trigger the outer if statement, we will call unlink, which is defined as macro, too:

#define unlink(P, BK, FD)                                                \
{                                                                        \
  BK = P->bk;                                                            \
  FD = P->fd;                                                            \
  FD->bk = BK;                                                           \
  BK->fd = FD;                                                           \
}

The unlink is called with a pointer to a free chunk and two temporary pointer variables, called bck and fwd. It does this to the 'next' chunk header:

  *(next->fd + 12) = next->bk
  *(next->bk + 8) = next->fd

They are not swapped, but the 'fd' and 'bk' pointers point to other chunks. This two chunks being pointed to are linked, zapping the current chunk from the table.

So to exploit a malloc based buffer overflow, we have to write a bogus header in the following chunk and then wait for our chunk getting free'd.

  [buffer .... ] | [ prev_size ] [ size ] [ fd ] [ bk ]

  '|' is the chunk boundary.
The values we set for 'prev_size' and 'size' do not matter, but two conditions have to be met, in case it should work:
  1. The least significant bit of 'size' has to be zero
  2. Both, 'prev_size' and 'size' should be add-safe to a pointer that is read from. So either use very small values up to a few thousand, or - to avoid NUL bytes - use big values such as 0xfffffffc.
  3. You have to ensure that at (chunk_boundary + size + 4) the lowest bit is zeroed out (0xfffffffc will work just fine)

'fd' and 'bk' can be set this way (as used in Solar Designers Netscape Exploit):

  fd = retloc - 12
  bk = retaddr

But beware, that (retaddr + 8) is being written to and the content there will be destroyed. You can circumvent this by a simple '\xeb\x0c' at retaddr, which will jump twelve bytes ahead, over the destroyed content.

Well, however, exploitation is pretty straight forward now:

<jmp-ahead, 2> <6> <4 bogus> <nop> <shellcode> |
        \xff\xff\xff\xfc \xff\xff\xff\xfc <retloc - 12> <retaddr>

Where '|' is the chunk boundary (from that point we overflow). Now, the next two negative numbers are just to survive a few checks in free() and to avoid NUL bytes. Then we store (retloc - 12) properly encoded and then the return address, which will return to the 'jmp-ahead'. The buffer before the '|' is the same as with any x86 exploit, except for the first 12 bytes, because we have to take care of the extra write operation by the unlink macro.

Off-by-one / Off-by-five

It is possible to overwrite arbitrary pointers, even in cases where you can overwrite only five bytes, or - in special cases - only one byte. When overwriting five bytes the memory layout has to look like:

        [chunk a] [chunk b]

Where chunk a is under your control and overflowable. Chunk b is already allocated as the overflow happens. By overwriting the first five bytes of chunk b, we trash the 'prev_size' element of the chunks header and the least significant byte of the 'size' element. Now, as chunk b is free()'d, backward consolidation pops in, since 'size' has the PREV_INUSE flag cleared (see below). If we supply a small value for 'prev_size', which is smaller than the size of chunk a, we create a fake chunk structure:

        [chunk a ... fakechunk ...] [chunk b]
                     |
                     p 

Where prev_size of chunk b points relativly negative to the fake chunk. The code which is exploitable through this setting was already discussed:

  if (!(hd & PREV_INUSE))                    /* consolidate backward */
  {
    prevsz = p->prev_size;
    p = chunk_at_offset(p, -(long)prevsz);
    sz += prevsz;

    if (p->fd == last_remainder(ar_ptr))     /* keep as last_remainder */
      islr = 1;
    else
      unlink(p, bck, fwd);
  }

'hd' is the size element of chunk b. When we overwrite it, we clear out the lower two bits, so PREV_INUSE is cleared and the if condition is matched (NUL will do it in fact). In the next few instructions 'p', which was a pointer to chunk b originally, is relocated to our fakechunk. Then the unlink macro is called and we can overwrite the pointers as usual. We use backward consolidation now, while in the previous description we have used forward consolidation. Is this all confusing? Well, when exploiting malloc overflows, do not worry about the details, they will become clearer as you understand the malloc functions from a broader scope.

For a really well done overview and description of the malloc implementation in the GNU C Library, take a look at the GNU C Library reference manual [3]. It makes a good read for non-malloc related things, too.

Possible obstacles and how to get over with them

As with any new exploitation technique people will show up which have the 'perfect' solution to the problem in their head or in form of a patch to the malloc functions. Those people - often ones who have never written an exploit themselves - are misleading into a wrong sense of security and I want to leave a few words about those approaches and why they seldomly work.

There are three host based stages where you can stop a buffer overflow resulting in a compromise:
  1. The bug/overflow stage : This is the place where the real overflow happens, where data is overwritten. If this place is known, the origin of the problem can be fixed (at source level). However, most approaches argue that this place is not known and therefore the problem cannot be fixed yet.
  2. The activation stage : After the overflow happened parts of the data of the application are corrupted. It does not matter what kind of data, whether it is a stack frame, a malloc management record or static data behind a buffer. The process is still running its own path of code, the overwritten data is still passive. This stage is everything after the overflow itself and before the seize of execution control. This is where the natural, non-artificially introduced hurdles for the attacker lies, code which must be passed in a certain way.
  3. The seized stage : This is everything after control has been redirected from its original path of execution. This is the stage where nopspace and shellcode is executed, where no real exploitation hurdles are left.

Now for the protection systems. Most of the "non-exec stack" and "non-exec heap" patches try to catch the switch from stage two to three, where execution is seized, while some proprietary systems check for the origin of a system call from within kernel space. They do not forbid you to run code this way, they try to limit what code can be run.

Those systems which allow you to redirect execution in the first place are fundamentally flawed. They try to limit the exploitation in a black-listing way, by trying to plug the places you may want to go to. But if you can execute legal code within the process space its almost everytime enough to compromise the process as a whole.

Now for the more challenging protections, which try to gripe you in stage two. Those include - among others - libsafe, StackGuard, FormatGuard, and any compiler or library based patches. They usually require a recompilation or relinking of your existing code, to insert their security 'measures' into your code. This includes canary values, barriers of check bytes or reordering and extensive checking of sanity before doing things which might be bad. While sanity checking in general is a good policy for security, it cannot fix stuff that was broken before. Every of those protections is assuming a certain situation of a bug which might appear in your program and try to predict the results of an attacker abusing the bug. They setup traps which they assume you will or have to trigger to exploit the bug. This is done before your control is active, so you cannot influence it much except by choosing the input data. Those are, of course much more tight than protection systems which only monitor stage three, but still there are ways around them. A couple of ways have been discussed in the past, so I will not go into depth here. Rather, I will briefly address on a protection which I already see on the horizon under a name like 'MallocGuard'.

Such a protection would not change the mechanism of malloc management chunks much, since the current code has proved to be effective. The malloc function plays a key role in overall system performance, so you cannot tweak freely here. Such a protection can only introduce a few extra checks, it cannot verify the entire consistency everytime malloc() is called. And this is where it is flawed: Once you seize control over one malloc chunk information, you can seize control over other chunks too. Because chunks are 'walked' by using either stored pointers (SysV) or stored lengths (GlibC), it is possible to 'create' new chunks. Since a sanity check would have to assume inconsistency of all chunks in the worst case, it would have to check all chunks by walking them. But this would eat up too much performance, so its impossible to check for malloc overflows easily while still keep a good performance. So, there will be no 'MallocGuard', or it will be a useless guard, in the tradition of useless pseudo protections. As a friend puts it - 'for every protection there is an anti-protection'.

Thanks

I would like to thank all proofreaders and correctors. For some really needed corrections I thank MaXX, who wrote the more detailed article about GNU C Library malloc in this issue of Phrack, kudos to him ! :)

References

[1] Solar Designer,
    http://www.openwall.com/advisories/OW-002-netscape-jpeg.txt
[2] DD Sleator, RE Tarjan, "Self-Adjusting Binary Trees", 1985,
    http://www.acm.org/pubs/citations/journals/jacm/1985-32-3/p652-sleator/
    http://www.math.tau.ac.il/~haimk/adv-ds-2000/sleator-tarjan-splay.pdf
[3] The GNU C Library
    http://www.gnu.org/manual/glibc-2.2.3/html_node/libc_toc.html
[4] Solaris 8 Foundation Source Program
    http://www.sun.com/software/solaris/source/