
BPF Subsystem Core Part 1

Note: Line numbers may vary based on the version of the Linux kernel source code.

As we already know, BPF has 11 registers. The first 10 registers are general-purpose registers (R0-R9) and the last register is the frame pointer (R10). R0 stores return values from functions, and R1-R5 pass arguments to functions. R6-R9 are callee-saved registers, which means that if a function uses them, it must save their original values and restore them before returning. (Lines 47-58)

core.c also defines named registers of two kinds: instruction-based names and special-purpose names.

Instruction-based names

#define DST  regs[insn->dst_reg]  // Destination register from current instruction
#define SRC  regs[insn->src_reg]  // Source register from current instruction
#define OFF  insn->off            // Offset field (e.g., for memory access)
#define IMM  insn->imm            // Immediate value (constant in instruction)

Example: If the instruction is r1 = r2 + 5, DST will be r1, SRC will be r2, and IMM will be 5.

Special purpose names

#define FP   regs[BPF_REG_FP]   // Frame Pointer = R10 (stack access)
#define AX   regs[BPF_REG_AX]   // Auxiliary register (hidden from user!)
#define ARG1 regs[BPF_REG_ARG1] // First argument = R1
#define CTX  regs[BPF_REG_CTX]  // Context register = R1 (program input)

FP is used to access the stack, AX is used for intermediate calculations, ARG1 is used to pass the first argument to helper functions, and CTX holds the context of the BPF program.

Something interesting here is the AX register. It is hidden from the user, which means that BPF programs cannot directly access or modify it. The JIT uses it for constant blinding as a security measure. (Constant blinding is a technique that randomizes constant values in a BPF program to prevent certain attacks, such as JIT spraying.) The interpreter also uses it as a temporary in DIV/MOD operations to avoid modifying source registers, and the verifier uses it for special instruction rewrites.

bpf_internal_load_pointer_neg_helper (Line 77)

This function is related to legacy packet access, handling the special negative offsets classic BPF used for reaching network packet data.

ptr = skb_network_header(skb) + k - SKF_NET_OFF; gives you access to the network layer (e.g., the IP header) of the packet, while ptr = skb_mac_header(skb) + k - SKF_LL_OFF; gives you access to the link layer (e.g., the Ethernet header). Classic BPF used these negative offsets to access different layers of packet data.

bpf_prog_alloc_no_stats (Line 99)

This is one of the critical functions of the BPF subsystem as it creates the skeleton of a BPF program object. Think of it as malloc + initialization for a BPF program.

gfp_t gfp_flags = bpf_memcg_flags(GFP_KERNEL | __GFP_ZERO | gfp_extra_flags); sets up the allocation flags. Here, GFP_KERNEL indicates that the allocation is happening in a normal kernel context, __GFP_ZERO ensures that the allocated memory is zeroed out, and gfp_extra_flags allows for additional customization of the allocation behavior.

size = round_up(size, PAGE_SIZE); rounds the requested size up to the next page boundary. This page alignment is important for later operations like JIT compilation, which may require page-aligned memory for executable code.

fp = __vmalloc(size, gfp_flags); allocates the main program structure using vmalloc instead of kmalloc, which suits larger allocations because the memory does not need to be physically contiguous.

kzalloc(sizeof(*aux), bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags)); allocates and zeroes out the auxiliary structure for the BPF program, which holds additional metadata and state information.

fp->active = alloc_percpu_gfp(int, bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags)); allocates the per-CPU active counter, which tracks nested BPF program calls. Each CPU gets its own counter to avoid synchronization overhead.

fp->jit_requested = ebpf_jit_enabled();
fp->blinding_requested = bpf_jit_blinding_enabled(fp);

These lines record whether the JIT should be used: jit_requested asks whether we should compile to native code, and blinding_requested asks whether we should apply constant blinding for security.

mutex_init(&fp->aux->used_maps_mutex);  // Protects map references
mutex_init(&fp->aux->ext_mutex);        // Protects program extensions
mutex_init(&fp->aux->dst_mutex);        // Protects destination program

These mutexes ensure thread-safe access to various parts of the BPF program's auxiliary data, such as map references, program extensions, and destination programs for tail calls.

Finally, the function returns the newly allocated BPF program structure, ready for further initialization and use.

bpf_prog_alloc (Line 147)

This function is a wrapper around bpf_prog_alloc_no_stats that adds initialization for BPF program statistics tracking. It first calls bpf_prog_alloc_no_stats to allocate the basic BPF program structure. Then prog->stats = alloc_percpu_gfp(struct bpf_prog_stats, gfp_flags); allocates per-CPU stats. Each CPU tracks its own count of how many times the program ran (cnt), total execution time in nanoseconds (nsecs), and missed executions (misses).

for_each_possible_cpu(cpu) {
    struct bpf_prog_stats *pstats = per_cpu_ptr(prog->stats, cpu);
    u64_stats_init(&pstats->syncp);
}

This loop initializes the synchronization primitive in each CPU's stats structure to ensure safe concurrent updates. u64_stats_init sets up the syncp field, which protects the stats from concurrent access and allows reading 64-bit stats on 32-bit systems without tearing (atomic-like reads).

Note: Why per-CPU? To avoid cache-line bouncing and lock contention when programs run simultaneously on multiple CPUs.

Finally, the function returns the fully initialized BPF program structure with stats tracking enabled.