Understanding Cilium's XDP Datapath: A Deep Dive into bpf_xdp.c

Tags: cilium, xdp, ebpf, networking, datapath, bpf_xdp.c

Exploring the architecture and performance of Cilium's XDP datapath implementation.

Introduction

XDP (eXpress Data Path) represents the earliest possible packet processing point in the Linux networking stack. Running directly in the network driver before the kernel even allocates an sk_buff, XDP provides unprecedented performance for packet filtering and load balancing operations. Cilium leverages XDP to implement a high-performance datapath capable of dropping malicious traffic in under 100 nanoseconds and performing NodePort load balancing with sub-microsecond latency.

This article dissects Cilium's XDP implementation (bpf/bpf_xdp.c), explaining how it achieves such remarkable performance through careful design and optimization.

Architecture Overview

The XDP program in Cilium serves three primary functions:

  1. Prefiltering - CIDR-based packet filtering to drop unwanted traffic at the earliest possible stage
  2. NodePort Load Balancing - Early service translation and backend selection
  3. Metadata Transfer - Passing state to subsequent TC (Traffic Control) BPF programs

The program is structured as a layered pipeline, with each layer performing progressively more complex operations:

┌─────────────────────────────────────────────────────────┐
│              Packet Arrives at NIC Driver               │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                  cil_xdp_entry()                        │
│         • Extract trace ID from IP options              │
│         • Initialize packet processing                  │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                 check_filters()                         │
│         • Validate Ethernet header                      │
│         • Clear metadata markers                        │
│         • Dispatch by protocol (IPv4/IPv6)              │
└────────────────────┬────────────────────────────────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
┌──────────────────┐   ┌──────────────────┐
│   check_v4()     │   │   check_v6()     │
│  • Prefilter     │   │  • Prefilter     │
│  • CIDR drops    │   │  • CIDR drops    │
└────────┬─────────┘   └────────┬─────────┘
         │                      │
         ▼                      ▼
┌──────────────────┐   ┌──────────────────┐
│ check_v4_lb()    │   │ check_v6_lb()    │
│  • Tail call     │   │  • Tail call     │
└────────┬─────────┘   └────────┬─────────┘
         │                      │
         ▼                      ▼
┌──────────────────┐   ┌──────────────────┐
│ tail_lb_ipv4()   │   │ tail_lb_ipv6()   │
│  • Parse tunnel  │   │  • Parse tunnel  │
│  • NodePort LB   │   │  • NodePort LB   │
│  • DNAT/DSR      │   │  • DNAT/DSR      │
│  • CT update     │   │  • CT update     │
│  • FIB lookup    │   │  • FIB lookup    │
└────────┬─────────┘   └────────┬─────────┘
         │                      │
         └──────────┬───────────┘
                    ▼
┌─────────────────────────────────────────────────────────┐
│                  bpf_xdp_exit()                         │
│         • Transfer metadata to TC                       │
│         • Return verdict (DROP/PASS/REDIRECT)           │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│              Kernel Acts on Verdict                     │
│    • XDP_DROP: Discard immediately                      │
│    • XDP_PASS: Continue to TC hook                      │
│    • XDP_REDIRECT: Send to target interface             │
└─────────────────────────────────────────────────────────┘

Entry Point: cil_xdp_entry()

Every packet's journey begins at cil_xdp_entry(), the function marked with __section_entry that gets attached to the XDP hook:

__section_entry
int cil_xdp_entry(struct __ctx_buff *ctx)
{
    check_and_store_ip_trace_id(ctx);
    return check_filters(ctx);
}

Despite its simplicity, this function establishes two critical operations:

1. Trace ID Extraction

check_and_store_ip_trace_id(ctx);

For distributed tracing and debugging, Cilium supports embedding custom trace IDs in IPv4 options. This function:

  • Checks if the trace ID feature is enabled via runtime configuration
  • Parses IPv4 options (if present) looking for the configured option type
  • Extracts 16, 32, or 64-bit trace IDs
  • Stores the ID in a per-CPU map (cilium_percpu_trace_id) for later retrieval

Why extract it first? The trace ID should be captured even if the packet is dropped in subsequent stages, ensuring complete observability.

2. Main Processing Pipeline

return check_filters(ctx);

This initiates the main packet processing logic, which we'll explore in detail.

The Dispatcher: check_filters()

The check_filters() function acts as a protocol dispatcher and metadata initializer:

static __always_inline int check_filters(struct __ctx_buff *ctx)
{
    int ret = CTX_ACT_OK;
    __u16 proto;

    // Step 1: Validate Ethernet header and extract protocol
    if (!validate_ethertype(ctx, &proto))
        return CTX_ACT_OK;  // Unknown protocol → let kernel handle it

    // Step 2: Initialize metadata for XDP→TC transfer
    ctx_store_meta(ctx, XFER_MARKER, 0);
    ctx_skip_nodeport_clear(ctx);

    // Step 3: Optional early hook (default: no-op)
    ret = xdp_early_hook(ctx, proto);
    if (ret != CTX_ACT_OK)
        return ret;

    // Step 4: Dispatch by protocol
    switch (proto) {
    case bpf_htons(ETH_P_IP):
        ret = check_v4(ctx);
        break;
    case bpf_htons(ETH_P_IPV6):
        ret = check_v6(ctx);
        break;
    default:
        break;  // ARP, etc. → pass through
    }

    return bpf_xdp_exit(ctx, ret);
}

Metadata Management

XDP programs operate before the kernel allocates an sk_buff, so there's no control buffer (cb[]) to store temporary state. Cilium solves this with two mechanisms:

  1. Per-CPU scratch map (cilium_xdp_scratch) - Used during XDP processing
  2. Adjusted metadata area - Transferred to TC via ctx_move_xfer()

The metadata markers include:

  • XFER_MARKER - Flags to pass to TC (e.g., "skip service lookup")
  • RECIRC_MARKER - Prevents double-processing after tail calls

Protocol Validation

The validate_ethertype() function performs critical bounds checking:

if (!validate_ethertype(ctx, &proto))
    return CTX_ACT_OK;

This verifies:

  • The packet has a complete Ethernet header (14 bytes)
  • The EtherType is supported (IPv4, IPv6, ARP)
  • For L3 devices (tunnels, WireGuard), extracts protocol from skb->protocol

Why return XDP_PASS on failure? Unknown protocols (ARP, LLDP, custom protocols) should be handled by the kernel stack, not dropped.

Prefiltering: The First Line of Defense

Before expensive load balancing operations, Cilium performs CIDR-based prefiltering to drop unwanted traffic immediately. This is where the program achieves its sub-100ns drop performance.

IPv4 Prefilter Implementation

static __always_inline int check_v4(struct __ctx_buff *ctx)
{
    void *data_end = ctx_data_end(ctx);
    void *data = ctx_data(ctx);
    struct iphdr *ipv4_hdr = data + sizeof(struct ethhdr);
    struct lpm_v4_key pfx;

    // Bounds check
    if (ctx_no_room(ipv4_hdr + 1, data_end))
        return CTX_ACT_DROP;

    // Prepare lookup key
    memcpy(pfx.lpm.data, &ipv4_hdr->saddr, sizeof(pfx.addr));
    pfx.lpm.prefixlen = 32;

    // Check dynamic CIDR ranges (LPM trie)
    if (map_lookup_elem(&cilium_cidr_v4_dyn, &pfx))
        return CTX_ACT_DROP;

    // Check fixed CIDR list (hash map)
    return map_lookup_elem(&cilium_cidr_v4_fix, &pfx) ?
        CTX_ACT_DROP : check_v4_lb(ctx);
}

Two-Tier Map Architecture

Cilium uses two different map types for prefiltering, each optimized for different use cases:

1. Hash Map (cilium_cidr_v4_fix)

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, struct lpm_v4_key);
    __type(value, struct lpm_val);
    __uint(max_entries, 16384);
    __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_RDONLY_PROG_COND);
} cilium_cidr_v4_fix;

Use case: Exact IP addresses or specific known-bad hosts
Performance: O(1) lookup via hash
Typical entries: Blacklisted IPs, honeypot addresses

2. LPM Trie (cilium_cidr_v4_dyn)

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __type(key, struct lpm_v4_key);
    __type(value, struct lpm_val);
    __uint(max_entries, 16384);
    __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_RDONLY_PROG_COND);
} cilium_cidr_v4_dyn;

Use case: CIDR ranges configured by network policies
Performance: O(log n) via trie traversal, finds longest prefix match
Typical entries: "10.0.0.0/8", "192.168.0.0/16"

Read-Only Protection

Notice the BPF_F_RDONLY_PROG_COND flag on both maps. This makes them read-only from the BPF program, preventing accidental corruption. Only the Cilium agent (userspace) can update these maps.

This design ensures:

  • Security: BPF programs can't be exploited to modify filtering rules
  • Correctness: No race conditions between BPF program and agent updates
  • Performance: Compiler can optimize knowing values won't change

NodePort Load Balancing: The Heavy Lifter

After prefiltering, packets proceed to NodePort load balancing via a tail call mechanism:

static __always_inline int check_v4_lb(struct __ctx_buff *ctx)
{
    __s8 ext_err = 0;
    int ret;

    ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_NETDEV, &ext_err);
    return send_drop_notify_error_ext(ctx, UNKNOWN_ID, ret, ext_err, 
                                      METRIC_INGRESS);
}

Why Tail Calls?

BPF programs face instruction complexity limits. Tail calls allow breaking complex logic into multiple programs:

  • Main XDP program: Fast path (prefilter)
  • Tail call target: Complex path (load balancing)

The tail call replaces the current execution context - it's a jump, not a function call. This avoids stack depth issues.

The Load Balancing Function

__declare_tail(CILIUM_CALL_IPV4_FROM_NETDEV)
int tail_lb_ipv4(struct __ctx_buff *ctx)
{
    bool punt_to_stack = false;
    int ret = CTX_ACT_OK;
    __s8 ext_err = 0;

    if (!ctx_skip_nodeport(ctx)) {
        // ... complex processing ...
        ret = nodeport_lb4(ctx, ip4, l3_off, UNKNOWN_ID, 
                          &punt_to_stack, &ext_err, &is_dsr);
    }

    if (IS_ERR(ret))
        return send_drop_notify_error_ext(ctx, UNKNOWN_ID, ret, 
                                          ext_err, METRIC_INGRESS);

    return bpf_xdp_exit(ctx, ret);
}

Recirculation Prevention

if (!ctx_skip_nodeport(ctx)) {

The ctx_skip_nodeport() check reads the RECIRC_MARKER metadata. This prevents infinite loops when packets are recirculated through tail calls (e.g., for NAT46 translation).

DSR with Geneve Encapsulation

One of the most complex features is Direct Server Return (DSR) with Geneve tunneling:

#if defined(ENABLE_DSR) && DSR_ENCAP_MODE == DSR_ENCAP_GENEVE
{
    // Parse outer headers
    if (ip4->protocol != IPPROTO_UDP)
        goto no_encap;
    
    // Verify tunnel port
    if (dport != bpf_htons(TUNNEL_PORT))
        goto no_encap;
    
    // Verify zero checksum (Cilium uses BPF_F_ZERO_CSUM_TX)
    if (udp_csum != 0)
        goto no_encap;
    
    // Parse Geneve header
    if (geneve.protocol_type != bpf_htons(ETH_P_TEB))
        goto no_encap;
    
    // Point to inner IP header
    l3_off = inner_l2_off + ETH_HLEN;
    
    // Re-validate data with new offset
    if (!revalidate_data_l3_off(ctx, &data, &data_end, &ip4, l3_off))
        return DROP_INVALID;
}
no_encap:
#endif

This tunnel parsing allows XDP to perform load balancing on encapsulated traffic - the inner packet gets DNAT'd, while the outer tunnel headers remain intact. This is crucial for DSR in overlay networks.

Why check for zero checksum? Updating the inner packet would invalidate the outer UDP checksum. Recalculating checksums is expensive and error-prone in XDP. Cilium uses BPF_F_ZERO_CSUM_TX to disable UDP checksumming on tunnel traffic.

The Core Load Balancer

The actual load balancing happens in nodeport_lb4() (from lib/nodeport.h):

  1. Extract L4 tuple: Source/dest IP, source/dest port, protocol
  2. Service lookup: Check if destination matches a Kubernetes Service
  3. Backend selection:
    • Random selection for ClusterIP
    • Maglev consistent hashing for better distribution
    • Respect externalTrafficPolicy=Local
  4. NAT operation:
    • DNAT: Rewrite destination IP/port to backend
    • DSR: Encode original service IP/port in IP options or Geneve header
  5. Connection tracking: Create/update entry for stateful processing
  6. FIB lookup: Determine next-hop interface and MAC address
  7. Return verdict:
    • XDP_REDIRECT: Send directly to target interface
    • XDP_PASS: Let kernel stack route it
    • XDP_DROP: Backend unavailable or policy denied

NAT46 Support

if (ret == NAT_46X64_RECIRC)
    ret = tail_call_internal(ctx, CILIUM_CALL_IPV6_FROM_NETDEV, &ext_err);

For dual-stack services, the load balancer might translate IPv4 to IPv6 (or vice versa). The special return code NAT_46X64_RECIRC triggers another tail call to the IPv6 handler.

Metadata Transfer and Exit

The final stage transfers state to the TC hook:

static __always_inline int
bpf_xdp_exit(struct __ctx_buff *ctx, const int verdict)
{
    if (verdict == CTX_ACT_OK)
        ctx_move_xfer(ctx);

    return verdict;
}

The ctx_move_xfer() Magic

static __always_inline void ctx_move_xfer(struct xdp_md *ctx)
{
    __u32 meta_xfer = ctx_load_meta(ctx, XFER_MARKER);
    
    if (meta_xfer) {
        // Grow metadata area by 4 bytes before packet data
        if (!ctx_adjust_meta(ctx, -(int)sizeof(meta_xfer))) {
            __u32 *data_meta = ctx_data_meta(ctx);
            __u32 *data = ctx_data(ctx);
            
            // Write flags to metadata area
            if (!ctx_no_room(data_meta + 1, data))
                data_meta[XFER_FLAGS] = meta_xfer;
        }
    }
}

This function:

  1. Reads flags from the per-CPU scratch map
  2. Adjusts the XDP metadata area (space before packet data)
  3. Writes the flags to this area
  4. TC programs can then read it via ctx_data_meta()

Flags that get transferred:

  • XFER_PKT_NO_SVC: Skip service lookup in TC (already done in XDP)
  • XFER_PKT_SNAT_DONE: SNAT already performed

This avoids duplicate processing and ensures consistency between XDP and TC layers.

Conditional Compilation: One Source, Many Programs

A key design principle in bpf_xdp.c is compile-time feature toggling. The same source code generates different programs based on configuration:

#ifdef ENABLE_NODEPORT_ACCELERATION
static __always_inline int check_v4_lb(struct __ctx_buff *ctx)
{
    // Full load balancing implementation
    ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_NETDEV, &ext_err);
    return send_drop_notify_error_ext(ctx, UNKNOWN_ID, ret, ext_err, 
                                      METRIC_INGRESS);
}
#else
static __always_inline int check_v4_lb(struct __ctx_buff *ctx)
{
    return CTX_ACT_OK;  // No-op stub
}
#endif

Benefits

  1. Zero runtime overhead: No feature flags checked at runtime
  2. Smaller programs: Only compiled code for enabled features
  3. Instruction budget: Stays within BPF complexity limits
  4. Flexibility: Same source for minimal and full deployments

Common Configuration Flags

Flag                         | Enables              | Set When
ENABLE_IPV4                  | IPv4 support         | IPv4 enabled in cluster
ENABLE_IPV6                  | IPv6 support         | IPv6 enabled in cluster
ENABLE_PREFILTER             | CIDR filtering       | Prefilter policy configured
ENABLE_NODEPORT_ACCELERATION | XDP load balancing   | --enable-xdp-acceleration=true
CIDR4_FILTER                 | IPv4 CIDR maps       | Prefilter CIDRs configured
ENABLE_DSR                   | Direct Server Return | DSR mode enabled

Complete Packet Flow

Let's trace a complete packet through the XDP datapath:

Step 1: Packet arrives at NIC
   └─> Driver DMA's packet to memory
   └─> XDP hook triggers before sk_buff allocation

Step 2: cil_xdp_entry(ctx)
   ├─> check_and_store_ip_trace_id(ctx)
   │   ├─> Check if CONFIG(tracing_ip_option_type) != 0
   │   ├─> Parse IPv4 options field
   │   ├─> Extract 16/32/64-bit trace ID
   │   └─> Store in cilium_percpu_trace_id[cpu]
   │
   └─> check_filters(ctx)
       ├─> validate_ethertype(ctx, &proto)
       │   ├─> Verify packet >= 14 bytes (Ethernet header)
       │   ├─> Extract proto = eth->h_proto
       │   └─> Return true if IPv4/IPv6
       │
       ├─> ctx_store_meta(ctx, XFER_MARKER, 0)
       │   └─> Write 0 to cilium_xdp_scratch[cpu][6]
       │
       ├─> ctx_skip_nodeport_clear(ctx)
       │   └─> Write 0 to cilium_xdp_scratch[cpu][5]
       │
       └─> switch(proto)

Step 3: case ETH_P_IP → check_v4(ctx)
   ├─> Get data pointers
   │   ├─> data = ctx_data(ctx)
   │   ├─> data_end = ctx_data_end(ctx)
   │   └─> ipv4_hdr = data + 14
   │
   ├─> Bounds check
   │   └─> if (ipv4_hdr + 1 > data_end) → DROP
   │
   ├─> Prefilter check #1 (LPM trie)
   │   ├─> pfx.addr = ipv4_hdr->saddr
   │   ├─> pfx.prefixlen = 32
   │   ├─> lookup cilium_cidr_v4_dyn[pfx]
   │   └─> if found → DROP (⚡ ~50ns drop time)
   │
   ├─> Prefilter check #2 (hash map)
   │   ├─> lookup cilium_cidr_v4_fix[pfx]
   │   └─> if found → DROP (⚡ ~30ns drop time)
   │
   └─> check_v4_lb(ctx)

Step 4: check_v4_lb(ctx)
   └─> tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_NETDEV)
       └─> Jump to tail_lb_ipv4()

Step 5: tail_lb_ipv4(ctx)
   ├─> if (!ctx_skip_nodeport(ctx))
   │   ├─> Read RECIRC_MARKER
   │   └─> First time: false, continue
   │
   ├─> revalidate_data(ctx, &data, &data_end, &ip4)
   │   └─> Re-establish packet pointers for verifier
   │
   ├─> #if ENABLE_DSR && DSR_ENCAP_GENEVE
   │   ├─> Parse outer UDP/Geneve headers
   │   ├─> Locate inner IPv4 header
   │   ├─> Update l3_off to inner packet
   │   └─> revalidate_data_l3_off() for inner IP
   │
   └─> nodeport_lb4(ctx, ip4, l3_off, ...)

Step 6: nodeport_lb4() [from lib/nodeport.h]
   ├─> Extract L4 tuple
   │   ├─> tuple.saddr = ip4->saddr
   │   ├─> tuple.daddr = ip4->daddr
   │   ├─> tuple.sport = tcp/udp->source
   │   ├─> tuple.dport = tcp/udp->dest
   │   └─> tuple.nexthdr = ip4->protocol
   │
   ├─> Service lookup
   │   ├─> key.address = tuple.daddr
   │   ├─> key.dport = tuple.dport
   │   ├─> key.proto = tuple.nexthdr
   │   └─> svc = lb4_lookup_service(&key)
   │
   ├─> if (svc) {
   │   ├─> Backend selection
   │   │   ├─> Select algorithm: random vs Maglev
   │   │   ├─> Check externalTrafficPolicy
   │   │   └─> backend = lb4_select_backend(svc)
   │   │
   │   ├─> Connection tracking
   │   │   ├─> ct_create4(tuple, CT_EGRESS)
   │   │   └─> Store NAT mapping
   │   │
   │   ├─> DNAT or DSR
   │   │   ├─> if (DSR mode)
   │   │   │   └─> Encode service IP/port in IP options
   │   │   └─> else
   │   │       ├─> ip4->daddr = backend->address
   │   │       └─> tcp/udp->dest = backend->port
   │   │
   │   ├─> FIB lookup
   │   │   ├─> fib_params.dst = backend->address
   │   │   ├─> ret = fib_lookup(&fib_params)
   │   │   └─> Get next-hop interface & MAC
   │   │
   │   └─> Return XDP_REDIRECT or XDP_PASS
   │   }
   │
   └─> else → Return CTX_ACT_OK (no service match)

Step 7: bpf_xdp_exit(ctx, ret)
   ├─> if (ret == CTX_ACT_OK/XDP_PASS)
   │   └─> ctx_move_xfer(ctx)
   │       ├─> meta_xfer = ctx_load_meta(ctx, XFER_MARKER)
   │       ├─> ctx_adjust_meta(ctx, -4)
   │       │   └─> Grow metadata area by 4 bytes
   │       ├─> data_meta = ctx_data_meta(ctx)
   │       └─> data_meta[0] = meta_xfer
   │           └─> TC can now read these flags!
   │
   └─> return ret

Step 8: Kernel acts on verdict
   ├─> XDP_DROP (1): Driver recycles the frame immediately — no sk_buff ever existed (~10ns)
   ├─> XDP_PASS (2): Allocate sk_buff, continue to TC
   ├─> XDP_TX (3): Bounce packet back out same interface
   └─> XDP_REDIRECT (4): dev_map_enqueue() to target interface

Performance Characteristics

The XDP datapath achieves remarkable performance through several optimizations:

Drop Performance

  • Prefilter drop: 30-100 nanoseconds
    • No sk_buff allocation
    • No connection tracking
    • No policy lookup
    • Single map lookup and return

Load Balancing Performance

  • XDP NodePort: 200-500 nanoseconds
    • Service lookup: ~50ns (hash map)
    • Backend selection: ~30ns (random) or ~100ns (Maglev)
    • Connection tracking: ~80ns (LRU map update)
    • FIB lookup: ~40ns
    • Total: <500ns for fast path

Comparison with Traditional Stack

Operation        | Traditional Stack | XDP Datapath | Speedup
Drop bad traffic | 5-10 μs           | 50 ns        | 100-200x
NodePort LB      | 10-20 μs          | 300 ns       | 30-60x
Service routing  | 15-25 μs          | 500 ns       | 30-50x

Why So Fast?

  1. No sk_buff allocation: Saves ~1-2 μs
  2. No GRO/GSO processing: Saves ~500 ns
  3. No netfilter traversal: Saves ~1-3 μs
  4. Direct packet modification: In-place rewrites
  5. Cache locality: Per-CPU maps and hot data
  6. JIT compilation: eBPF → native machine code

Limitations and Trade-offs

While XDP provides excellent performance, it has constraints:

BPF Instruction Limits

  • Original limit: 4,096 instructions
  • Modern limit: ~1 million with complexity bounds
  • Solution: Tail calls to split complex logic

No Fragmentation

  • XDP can't fragment packets
  • MTU violations must be handled in TC or kernel
  • DSR ICMP "frag needed" messages generated in TC

Limited Checksum Support

  • Hardware checksum offload only
  • Can't recalculate checksums reliably
  • Why Cilium uses BPF_F_ZERO_CSUM_TX for tunnels

No Socket Context

  • No access to socket state
  • Can't check socket options
  • Socket-layer policies handled by cgroup BPF

Verifier Complexity

  • Must prove all memory accesses are safe
  • Requires careful pointer arithmetic
  • Bounds checks throughout code

Integration with TC Datapath

XDP is just the first stage. Packets that pass proceed to TC (Traffic Control) where more complex operations occur:

TC Responsibilities

  1. Policy enforcement: L3/L4 and L7 network policies
  2. Encryption: IPsec, WireGuard key management
  3. Tunneling: VXLAN/Geneve encapsulation
  4. Local delivery: Routing to container network namespaces
  5. Identity extraction: From tunnel headers or endpoint maps
  6. L7 proxy redirection: To Envoy for HTTP/gRPC policies

XDP → TC Handoff

The metadata transfer via ctx_move_xfer() ensures TC knows what XDP already did:

// TC program (bpf_host.c)
__u32 xfer_flags = ctx_get_xfer(ctx, XFER_FLAGS);

if (xfer_flags & XFER_PKT_NO_SVC) {
    // XDP already did service translation, skip it
    goto skip_service_lookup;
}

if (xfer_flags & XFER_PKT_SNAT_DONE) {
    // XDP already did SNAT, don't do it again
    ctx_snat_done_set(ctx);
}

This coordination prevents:

  • Double service lookups
  • Duplicate NAT operations
  • Wasted CPU cycles
  • State inconsistencies

Deployment Scenarios

Cilium's conditional compilation enables different XDP configurations:

Minimal XDP (Prefilter Only)

clang -DENABLE_IPV4=1 \
      -DENABLE_PREFILTER=1 \
      -DCIDR4_FILTER=1 \
      bpf_xdp.c -o bpf_xdp_minimal.o

Use case: DDoS mitigation, basic traffic filtering
Size: ~2,000 BPF instructions
Features: CIDR-based drops only

NodePort Acceleration

clang -DENABLE_IPV4=1 \
      -DENABLE_PREFILTER=1 \
      -DENABLE_NODEPORT_ACCELERATION=1 \
      bpf_xdp.c -o bpf_xdp_nodeport.o

Use case: High-performance Kubernetes NodePort
Size: ~15,000 BPF instructions
Features: Prefilter + load balancing

Full XDP with DSR

clang -DENABLE_IPV4=1 \
      -DENABLE_IPV6=1 \
      -DENABLE_PREFILTER=1 \
      -DENABLE_NODEPORT_ACCELERATION=1 \
      -DENABLE_DSR=1 \
      -DDSR_ENCAP_MODE=DSR_ENCAP_GENEVE \
      bpf_xdp.c -o bpf_xdp_full.o

Use case: Maximum performance with Direct Server Return
Size: ~25,000 BPF instructions
Features: Everything including DSR tunnel parsing

Debugging and Observability

Trace Events

The send_drop_notify_error_ext() function sends events to Cilium's monitoring infrastructure:

if (IS_ERR(ret))
    return send_drop_notify_error_ext(ctx, UNKNOWN_ID, ret, ext_err,
                                      METRIC_INGRESS);

These events are visible via:

  • cilium monitor - Real-time event stream
  • Hubble UI - Graphical service map
  • Prometheus metrics - Drop counters by reason

Drop Reasons

Common drop reasons in XDP:

Code               | Reason           | Meaning
DROP_INVALID       | Malformed packet | Truncated or invalid headers
DROP_NO_SERVICE    | No backend       | Service has zero backends
DROP_POLICY_DENIED | Policy drop      | Network policy blocked traffic
CTX_ACT_DROP       | Prefilter        | CIDR-based drop

Conclusion

Cilium's XDP datapath (bpf_xdp.c) represents a masterclass in high-performance packet processing. Through careful layering, conditional compilation, and intelligent use of BPF features, it achieves:

  • Sub-100ns drop latency for unwanted traffic
  • Sub-microsecond load balancing for Kubernetes Services
  • Zero-copy packet processing with direct hardware access
  • Flexible deployment from minimal to full-featured

The key insights are:

  1. Layer performance-critical operations early - Prefilter before load balancing
  2. Use the right data structure - Hash for exact match, LPM for ranges
  3. Coordinate between stages - XDP → TC metadata transfer
  4. Compile for the deployment - Feature flags eliminate unused code
  5. Observe everything - Rich monitoring for debugging

As eBPF continues to evolve, XDP will only become more powerful. Cilium's implementation shows how to harness this power effectively, creating one of the fastest datapaths in the cloud-native ecosystem.

Author's Note: This article is based on the latest development version of Cilium source code. Implementation details may vary across versions.