Understanding Cilium's XDP Datapath: A Deep Dive into bpf_xdp.c
Introduction
XDP (eXpress Data Path) represents the earliest possible packet processing point in the Linux networking stack. Running directly in the network driver before the kernel even allocates an sk_buff, XDP provides unprecedented performance for packet filtering and load balancing operations. Cilium leverages XDP to implement a high-performance datapath capable of dropping malicious traffic in under 100 nanoseconds and performing NodePort load balancing with sub-microsecond latency.
This article dissects Cilium's XDP implementation (bpf/bpf_xdp.c), explaining how it achieves such remarkable performance through careful design and optimization.
Architecture Overview
The XDP program in Cilium serves three primary functions:
- Prefiltering - CIDR-based packet filtering to drop unwanted traffic at the earliest possible stage
- NodePort Load Balancing - Early service translation and backend selection
- Metadata Transfer - Passing state to subsequent TC (Traffic Control) BPF programs
The program is structured as a layered pipeline, with each layer performing progressively more complex operations:
┌─────────────────────────────────────────────────────────┐
│ Packet Arrives at NIC Driver │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ cil_xdp_entry() │
│ • Extract trace ID from IP options │
│ • Initialize packet processing │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ check_filters() │
│ • Validate Ethernet header │
│ • Clear metadata markers │
│ • Dispatch by protocol (IPv4/IPv6) │
└────────────────────┬────────────────────────────────────┘
│
┌──────────┴──────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ check_v4() │ │ check_v6() │
│ • Prefilter │ │ • Prefilter │
│ • CIDR drops │ │ • CIDR drops │
└────────┬─────────┘ └────────┬─────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ check_v4_lb() │ │ check_v6_lb() │
│ • Tail call │ │ • Tail call │
└────────┬─────────┘ └────────┬─────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ tail_lb_ipv4() │ │ tail_lb_ipv6() │
│ • Parse tunnel │ │ • Parse tunnel │
│ • NodePort LB │ │ • NodePort LB │
│ • DNAT/DSR │ │ • DNAT/DSR │
│ • CT update │ │ • CT update │
│ • FIB lookup │ │ • FIB lookup │
└────────┬─────────┘ └────────┬─────────┘
│ │
└──────────┬───────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ bpf_xdp_exit() │
│ • Transfer metadata to TC │
│ • Return verdict (DROP/PASS/REDIRECT) │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Kernel Acts on Verdict │
│ • XDP_DROP: Discard immediately │
│ • XDP_PASS: Continue to TC hook │
│ • XDP_REDIRECT: Send to target interface │
└─────────────────────────────────────────────────────────┘
Entry Point: cil_xdp_entry()
Every packet's journey begins at cil_xdp_entry(), the function marked with __section_entry that gets attached to the XDP hook:
__section_entry
int cil_xdp_entry(struct __ctx_buff *ctx)
{
    check_and_store_ip_trace_id(ctx);
    return check_filters(ctx);
}
Despite its simplicity, this function establishes two critical operations:
1. Trace ID Extraction
check_and_store_ip_trace_id(ctx);
For distributed tracing and debugging, Cilium supports embedding custom trace IDs in IPv4 options. This function:
- Checks if the trace ID feature is enabled via runtime configuration
- Parses IPv4 options (if present) looking for the configured option type
- Extracts 16, 32, or 64-bit trace IDs
- Stores the ID in a per-CPU map (cilium_percpu_trace_id) for later retrieval
Why extract it first? The trace ID should be captured even if the packet is dropped in subsequent stages, ensuring complete observability.
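The IPv4-options walk behind this step can be sketched in plain userspace C. The code below is an illustrative model, not Cilium's implementation: the function name find_trace_id, the TLV layout assumptions, and the option type value are ours.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define IPOPT_END  0
#define IPOPT_NOOP 1

/* Scan `opts` (the bytes after the fixed 20-byte IPv4 header, so
 * opts_len = ihl*4 - 20) for the TLV option `wanted_type` and copy its
 * 2/4/8-byte payload into *trace_id. Returns 1 on success, 0 if the
 * option is absent or the options area is malformed. */
static int find_trace_id(const uint8_t *opts, size_t opts_len,
                         uint8_t wanted_type, uint64_t *trace_id)
{
    size_t off = 0;

    while (off < opts_len) {
        uint8_t type = opts[off];

        if (type == IPOPT_END)
            break;
        if (type == IPOPT_NOOP) {       /* single-byte padding option */
            off++;
            continue;
        }
        if (off + 2 > opts_len)
            return 0;                   /* truncated TLV header */

        uint8_t len = opts[off + 1];    /* length covers type+len bytes */

        if (len < 2 || off + len > opts_len)
            return 0;                   /* malformed length field */

        if (type == wanted_type) {
            size_t payload = len - 2;

            if (payload != 2 && payload != 4 && payload != 8)
                return 0;               /* only 16/32/64-bit IDs */
            uint64_t id = 0;
            memcpy(&id, opts + off + 2, payload);
            *trace_id = id;
            return 1;
        }
        off += len;                     /* skip to next option */
    }
    return 0;
}
```

Note how every length field is validated before it is trusted; the in-kernel version must satisfy the BPF verifier with equivalent bounds checks.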
2. Main Processing Pipeline
return check_filters(ctx);
This initiates the main packet processing logic, which we'll explore in detail.
The Dispatcher: check_filters()
The check_filters() function acts as a protocol dispatcher and metadata initializer:
static __always_inline int check_filters(struct __ctx_buff *ctx)
{
    int ret = CTX_ACT_OK;
    __u16 proto;

    // Step 1: Validate Ethernet header and extract protocol
    if (!validate_ethertype(ctx, &proto))
        return CTX_ACT_OK; // Unknown protocol → let kernel handle it

    // Step 2: Initialize metadata for XDP→TC transfer
    ctx_store_meta(ctx, XFER_MARKER, 0);
    ctx_skip_nodeport_clear(ctx);

    // Step 3: Optional early hook (default: no-op)
    ret = xdp_early_hook(ctx, proto);
    if (ret != CTX_ACT_OK)
        return ret;

    // Step 4: Dispatch by protocol
    switch (proto) {
    case bpf_htons(ETH_P_IP):
        ret = check_v4(ctx);
        break;
    case bpf_htons(ETH_P_IPV6):
        ret = check_v6(ctx);
        break;
    default:
        break; // ARP, etc. → pass through
    }

    return bpf_xdp_exit(ctx, ret);
}
Metadata Management
XDP programs operate before the kernel allocates an sk_buff, so there's no control buffer (cb[]) to store temporary state. Cilium solves this with two mechanisms:
- Per-CPU scratch map (cilium_xdp_scratch) - Used during XDP processing
- Adjusted metadata area - Transferred to TC via ctx_move_xfer()
The metadata markers include:
- XFER_MARKER - Flags to pass to TC (e.g., "skip service lookup")
- RECIRC_MARKER - Prevents double-processing after tail calls
Protocol Validation
The validate_ethertype() function performs critical bounds checking:
if (!validate_ethertype(ctx, &proto))
    return CTX_ACT_OK;
This verifies:
- The packet has a complete Ethernet header (14 bytes)
- The EtherType is supported (IPv4, IPv6, ARP)
- For L3 devices (tunnels, WireGuard), extracts protocol from skb->protocol
Why return XDP_PASS on failure? Unknown protocols (ARP, LLDP, custom protocols) should be handled by the kernel stack, not dropped.
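The same check can be modeled in userspace. This is a hedged sketch of the pattern, not Cilium's validate_ethertype(); the struct, helper names, and byte-swap are ours:

```c
#include <stdint.h>
#include <stddef.h>

#define ETH_HLEN   14
#define ETH_P_IP   0x0800
#define ETH_P_IPV6 0x86DD

struct ethhdr_min {
    uint8_t  h_dest[6];
    uint8_t  h_source[6];
    uint16_t h_proto;       /* network byte order on the wire */
};

static uint16_t my_ntohs(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Returns 1 and fills *proto (host order) only if the buffer holds a
 * complete Ethernet header carrying IPv4 or IPv6. The pointer
 * comparison mirrors the ctx_no_room() check in bpf_xdp.c. */
static int validate_ethertype_model(const uint8_t *data,
                                    const uint8_t *data_end,
                                    uint16_t *proto)
{
    if (data + ETH_HLEN > data_end)
        return 0;                       /* truncated frame */

    const struct ethhdr_min *eth = (const struct ethhdr_min *)data;
    uint16_t p = my_ntohs(eth->h_proto);

    if (p != ETH_P_IP && p != ETH_P_IPV6)
        return 0;                       /* ARP, LLDP, … → caller passes up */
    *proto = p;
    return 1;
}
```

A failed check here maps to "return CTX_ACT_OK" in the real program: the frame is handed to the kernel stack rather than dropped.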
Prefiltering: The First Line of Defense
Before expensive load balancing operations, Cilium performs CIDR-based prefiltering to drop unwanted traffic immediately. This is where the program achieves its sub-100ns drop performance.
IPv4 Prefilter Implementation
static __always_inline int check_v4(struct __ctx_buff *ctx)
{
    void *data_end = ctx_data_end(ctx);
    void *data = ctx_data(ctx);
    struct iphdr *ipv4_hdr = data + sizeof(struct ethhdr);
    struct lpm_v4_key pfx;

    // Bounds check
    if (ctx_no_room(ipv4_hdr + 1, data_end))
        return CTX_ACT_DROP;

    // Prepare lookup key
    memcpy(pfx.lpm.data, &ipv4_hdr->saddr, sizeof(pfx.addr));
    pfx.lpm.prefixlen = 32;

    // Check dynamic CIDR ranges (LPM trie)
    if (map_lookup_elem(&cilium_cidr_v4_dyn, &pfx))
        return CTX_ACT_DROP;

    // Check fixed CIDR list (hash map)
    return map_lookup_elem(&cilium_cidr_v4_fix, &pfx) ?
           CTX_ACT_DROP : check_v4_lb(ctx);
}
Two-Tier Map Architecture
Cilium uses two different map types for prefiltering, each optimized for different use cases:
1. Hash Map (cilium_cidr_v4_fix)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, struct lpm_v4_key);
    __type(value, struct lpm_val);
    __uint(max_entries, 16384);
    __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_RDONLY_PROG_COND);
} cilium_cidr_v4_fix;
- Use case: Exact IP addresses or specific known-bad hosts
- Performance: O(1) lookup via hash
- Typical entries: Blacklisted IPs, honeypot addresses
2. LPM Trie (cilium_cidr_v4_dyn)
struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __type(key, struct lpm_v4_key);
    __type(value, struct lpm_val);
    __uint(max_entries, 16384);
    __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_RDONLY_PROG_COND);
} cilium_cidr_v4_dyn;
- Use case: CIDR ranges configured by network policies
- Performance: O(key length) trie traversal, returning the longest prefix match
- Typical entries: "10.0.0.0/8", "192.168.0.0/16"
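The longest-prefix-match semantics that BPF_MAP_TYPE_LPM_TRIE provides can be shown with a deliberately naive linear scan. This is a teaching model only: a real trie finds the answer in time proportional to the key width, and the struct and function names here are ours, not Cilium's.

```c
#include <stdint.h>
#include <stddef.h>

struct cidr4 {
    uint32_t net;   /* network address, host byte order */
    uint8_t  plen;  /* prefix length, 0..32 */
};

static uint32_t mask_of(uint8_t plen)
{
    return plen ? 0xFFFFFFFFu << (32 - plen) : 0;
}

/* Among all stored CIDRs containing `addr`, return the longest matching
 * prefix length, or -1 if no CIDR covers the address. This is exactly
 * what the kernel's LPM trie computes, minus the efficient data
 * structure. */
static int lpm_lookup(const struct cidr4 *tbl, size_t n, uint32_t addr)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        uint32_t m = mask_of(tbl[i].plen);

        if ((addr & m) == (tbl[i].net & m) && (int)tbl[i].plen > best)
            best = tbl[i].plen;
    }
    return best;
}
```

With 10.0.0.0/8 and 10.1.0.0/16 both present, an address like 10.1.2.3 matches the /16 (the more specific rule wins), which is why LPM maps suit policy CIDRs while plain hash maps suit exact addresses.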
Read-Only Protection
Notice the BPF_F_RDONLY_PROG_COND flag on both maps. This makes them read-only from the BPF program, preventing accidental corruption. Only the Cilium agent (userspace) can update these maps.
This design ensures:
- Security: BPF programs can't be exploited to modify filtering rules
- Correctness: No race conditions between BPF program and agent updates
- Performance: Compiler can optimize knowing values won't change
NodePort Load Balancing: The Heavy Lifter
After prefiltering, packets proceed to NodePort load balancing via a tail call mechanism:
static __always_inline int check_v4_lb(struct __ctx_buff *ctx)
{
    __s8 ext_err = 0;
    int ret;

    ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_NETDEV, &ext_err);
    return send_drop_notify_error_ext(ctx, UNKNOWN_ID, ret, ext_err,
                                      METRIC_INGRESS);
}
Why Tail Calls?
BPF programs face instruction complexity limits. Tail calls allow breaking complex logic into multiple programs:
- Main XDP program: Fast path (prefilter)
- Tail call target: Complex path (load balancing)
The tail call replaces the current execution context - it's a jump, not a function call. This avoids stack depth issues.
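A userspace analogy makes the "jump, not call" point concrete: think of the tail-call program array as a table of entry points indexed by call ID. Everything below is an illustrative model (the names and IDs are ours); real BPF uses bpf_tail_call() against a BPF_MAP_TYPE_PROG_ARRAY.

```c
#include <stdint.h>
#include <stddef.h>

#define CALL_IPV4_FROM_NETDEV_ID 0
#define N_PROGS 1

typedef int (*bpf_prog_t)(void *ctx);

/* Stand-in for tail_lb_ipv4(): returns a verdict, here 2 (XDP_PASS). */
static int tail_lb_ipv4_model(void *ctx)
{
    (void)ctx;
    return 2;
}

/* Stand-in for the tail-call program array populated by the agent. */
static bpf_prog_t prog_array[N_PROGS] = { tail_lb_ipv4_model };

/* The target's return value becomes the whole program's verdict: the
 * caller's frame is conceptually gone, which is why tail calls do not
 * deepen the stack. */
static int tail_call_model(void *ctx, int id)
{
    if (id < 0 || id >= N_PROGS || !prog_array[id])
        return -1;  /* missing slot: real bpf_tail_call() falls through */
    return prog_array[id](ctx);
}
```

In real BPF a missing program-array slot makes bpf_tail_call() return and execution falls through to the instruction after the call, which is how Cilium's error path (send_drop_notify_error_ext) gets reached.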
The Load Balancing Function
__declare_tail(CILIUM_CALL_IPV4_FROM_NETDEV)
int tail_lb_ipv4(struct __ctx_buff *ctx)
{
    bool punt_to_stack = false;
    int ret = CTX_ACT_OK;
    __s8 ext_err = 0;

    if (!ctx_skip_nodeport(ctx)) {
        // ... complex processing ...
        ret = nodeport_lb4(ctx, ip4, l3_off, UNKNOWN_ID,
                           &punt_to_stack, &ext_err, &is_dsr);
    }

    if (IS_ERR(ret))
        return send_drop_notify_error_ext(ctx, UNKNOWN_ID, ret,
                                          ext_err, METRIC_INGRESS);
    return bpf_xdp_exit(ctx, ret);
}
Recirculation Prevention
if (!ctx_skip_nodeport(ctx)) {
The ctx_skip_nodeport() check reads the RECIRC_MARKER metadata. This prevents infinite loops when packets are recirculated through tail calls (e.g., for NAT46 translation).
DSR with Geneve Encapsulation
One of the most complex features is Direct Server Return (DSR) with Geneve tunneling:
#if defined(ENABLE_DSR) && DSR_ENCAP_MODE == DSR_ENCAP_GENEVE
    {
        // Parse outer headers
        if (ip4->protocol != IPPROTO_UDP)
            goto no_encap;

        // Verify tunnel port
        if (dport != bpf_htons(TUNNEL_PORT))
            goto no_encap;

        // Verify zero checksum (Cilium uses BPF_F_ZERO_CSUM_TX)
        if (udp_csum != 0)
            goto no_encap;

        // Parse Geneve header
        if (geneve.protocol_type != bpf_htons(ETH_P_TEB))
            goto no_encap;

        // Point to inner IP header
        l3_off = inner_l2_off + ETH_HLEN;

        // Re-validate data with new offset
        if (!revalidate_data_l3_off(ctx, &data, &data_end, &ip4, l3_off))
            return DROP_INVALID;
    }
no_encap:
#endif
This tunnel parsing allows XDP to perform load balancing on encapsulated traffic - the inner packet gets DNAT'd, while the outer tunnel headers remain intact. This is crucial for DSR in overlay networks.
Why check for zero checksum? Updating the inner packet would invalidate the outer UDP checksum. Recalculating checksums is expensive and error-prone in XDP. Cilium uses BPF_F_ZERO_CSUM_TX to disable UDP checksumming on tunnel traffic.
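The underlying arithmetic is easy to demonstrate. The Internet checksum is a ones'-complement sum over the covered bytes, so changing even one inner byte makes the stored outer checksum stale; RFC 768 lets UDP over IPv4 transmit 0 to mean "no checksum computed", which is the escape hatch BPF_F_ZERO_CSUM_TX relies on. The helper below is a standard reference implementation, not Cilium code:

```c
#include <stdint.h>
#include <stddef.h>

/* RFC 1071-style Internet checksum over an arbitrary byte buffer. */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)buf[i] << 8 | buf[i + 1];
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;  /* pad trailing odd byte */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);  /* fold carries back in */
    return (uint16_t)~sum;
}
```

Rewriting any covered byte (as DNAT of the inner packet does) changes the sum, so a non-zero outer checksum would have to be recomputed in XDP; sending it as zero sidesteps the problem entirely.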
The Core Load Balancer
The actual load balancing happens in nodeport_lb4() (from lib/nodeport.h):
- Extract L4 tuple: Source/dest IP, source/dest port, protocol
- Service lookup: Check if destination matches a Kubernetes Service
- Backend selection:
   - Random selection for ClusterIP
   - Maglev consistent hashing for better distribution
   - Respect externalTrafficPolicy=Local
- NAT operation:
   - DNAT: Rewrite destination IP/port to backend
   - DSR: Encode original service IP/port in IP options or Geneve header
- Connection tracking: Create/update entry for stateful processing
- FIB lookup: Determine next-hop interface and MAC address
- Return verdict:
   - XDP_REDIRECT: Send directly to target interface
   - XDP_PASS: Let kernel stack route it
   - XDP_DROP: Backend unavailable or policy denied
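The Maglev step deserves a sketch. In Maglev hashing, each backend fills a shared lookup table by walking its own permutation of the slots; the packet path is then a single hash plus one table read. The version below is a heavily simplified model: the table size, the per-backend offset/skip derivation, and all names are illustrative (Cilium's default table size is a much larger prime, and real offsets/skips come from hashing the backend address).

```c
#include <stdint.h>

#define M 13            /* table size: a small prime for the sketch */
#define MAX_BACKENDS 8

/* Populate table[] so each backend owns roughly M/n slots, and most
 * slots keep their backend when one backend is added or removed. */
static void maglev_build(int table[M], int n_backends)
{
    int next[MAX_BACKENDS] = {0};  /* per-backend permutation cursor */
    int filled = 0;

    for (int i = 0; i < M; i++)
        table[i] = -1;

    while (filled < M) {
        for (int b = 0; b < n_backends && filled < M; b++) {
            /* stand-ins for hashes of the backend's address; skip must
             * be nonzero and M prime so the walk covers every slot */
            int offset = (b * 7) % M;
            int skip   = (b * 5) % (M - 1) + 1;
            int slot;

            do {
                slot = (offset + next[b] * skip) % M;
                next[b]++;
            } while (table[slot] != -1);   /* claim first free slot */
            table[slot] = b;
            filled++;
        }
    }
}

/* Packet path: one flow hash, one array read. */
static int maglev_select(const int table[M], uint32_t flow_hash)
{
    return table[flow_hash % M];
}
```

The build cost is paid in the agent at service-update time; the XDP fast path only ever executes the equivalent of maglev_select(), which is why consistent hashing is affordable at these latencies.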
NAT46 Support
if (ret == NAT_46X64_RECIRC)
    ret = tail_call_internal(ctx, CILIUM_CALL_IPV6_FROM_NETDEV, &ext_err);
For dual-stack services, the load balancer might translate IPv4 to IPv6 (or vice versa). The special return code NAT_46X64_RECIRC triggers another tail call to the IPv6 handler.
Metadata Transfer and Exit
The final stage transfers state to the TC hook:
static __always_inline int
bpf_xdp_exit(struct __ctx_buff *ctx, const int verdict)
{
    if (verdict == CTX_ACT_OK)
        ctx_move_xfer(ctx);

    return verdict;
}
The ctx_move_xfer() Magic
static __always_inline void ctx_move_xfer(struct xdp_md *ctx)
{
    __u32 meta_xfer = ctx_load_meta(ctx, XFER_MARKER);

    if (meta_xfer) {
        // Grow metadata area by 4 bytes before packet data
        if (!ctx_adjust_meta(ctx, -(int)sizeof(meta_xfer))) {
            __u32 *data_meta = ctx_data_meta(ctx);
            __u32 *data = ctx_data(ctx);

            // Write flags to metadata area
            if (!ctx_no_room(data_meta + 1, data))
                data_meta[XFER_FLAGS] = meta_xfer;
        }
    }
}
This function:
- Reads flags from the per-CPU scratch map
- Adjusts the XDP metadata area (space before packet data)
- Writes the flags to this area
- TC programs can then read it via
ctx_data_meta()
Flags that get transferred:
- XFER_PKT_NO_SVC: Skip service lookup in TC (already done in XDP)
- XFER_PKT_SNAT_DONE: SNAT already performed
This avoids duplicate processing and ensures consistency between XDP and TC layers.
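End to end, the flag hand-off is just a bit-OR on the XDP side and a bit-test on the TC side. The model below compresses that into a few lines; the struct, flag bit positions, and function names are illustrative stand-ins (the real flag values live in Cilium's headers):

```c
#include <stdint.h>

/* Illustrative flag bits; real values are defined in Cilium headers. */
#define XFER_PKT_NO_SVC    (1u << 0)  /* service lookup already done */
#define XFER_PKT_SNAT_DONE (1u << 1)  /* SNAT already performed      */

struct meta_model {
    uint32_t scratch;    /* stands in for cilium_xdp_scratch[cpu] */
    uint32_t data_meta;  /* stands in for the XDP metadata area   */
};

/* XDP side: accumulate flags in the per-CPU scratch word. */
static void xdp_set_flag(struct meta_model *m, uint32_t flag)
{
    m->scratch |= flag;
}

/* bpf_xdp_exit() path: publish the word where TC can see it. */
static void move_xfer(struct meta_model *m)
{
    if (m->scratch)
        m->data_meta = m->scratch;
}

/* TC side: test an individual bit and skip redundant work. */
static int tc_flag_set(const struct meta_model *m, uint32_t flag)
{
    return (m->data_meta & flag) != 0;
}
```

One 32-bit word is enough to carry every "already done" fact across the hook boundary, which is why the metadata area only needs to grow by 4 bytes.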
Conditional Compilation: One Source, Many Programs
A key design principle in bpf_xdp.c is compile-time feature toggling. The same source code generates different programs based on configuration:
#ifdef ENABLE_NODEPORT_ACCELERATION
static __always_inline int check_v4_lb(struct __ctx_buff *ctx)
{
    // Full load balancing implementation
    ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_NETDEV, &ext_err);
    return send_drop_notify_error_ext(ctx, UNKNOWN_ID, ret, ext_err,
                                      METRIC_INGRESS);
}
#else
static __always_inline int check_v4_lb(struct __ctx_buff *ctx)
{
    return CTX_ACT_OK; // No-op stub
}
#endif
Benefits
- Zero runtime overhead: No feature flags checked at runtime
- Smaller programs: Only compiled code for enabled features
- Instruction budget: Stays within BPF complexity limits
- Flexibility: Same source for minimal and full deployments
Common Configuration Flags
| Flag | Enables | Set When |
|---|---|---|
| ENABLE_IPV4 | IPv4 support | IPv4 enabled in cluster |
| ENABLE_IPV6 | IPv6 support | IPv6 enabled in cluster |
| ENABLE_PREFILTER | CIDR filtering | Prefilter policy configured |
| ENABLE_NODEPORT_ACCELERATION | XDP load balancing | --enable-xdp-acceleration=true |
| CIDR4_FILTER | IPv4 CIDR maps | Prefilter CIDRs configured |
| ENABLE_DSR | Direct Server Return | DSR mode enabled |
Complete Packet Flow
Let's trace a complete packet through the XDP datapath:
Step 1: Packet arrives at NIC
└─> Driver DMA's packet to memory
└─> XDP hook triggers before sk_buff allocation
Step 2: cil_xdp_entry(ctx)
├─> check_and_store_ip_trace_id(ctx)
│ ├─> Check if CONFIG(tracing_ip_option_type) != 0
│ ├─> Parse IPv4 options field
│ ├─> Extract 16/32/64-bit trace ID
│ └─> Store in cilium_percpu_trace_id[cpu]
│
└─> check_filters(ctx)
├─> validate_ethertype(ctx, &proto)
│ ├─> Verify packet >= 14 bytes (Ethernet header)
│ ├─> Extract proto = eth->h_proto
│ └─> Return true if IPv4/IPv6
│
├─> ctx_store_meta(ctx, XFER_MARKER, 0)
│ └─> Write 0 to cilium_xdp_scratch[cpu][6]
│
├─> ctx_skip_nodeport_clear(ctx)
│ └─> Write 0 to cilium_xdp_scratch[cpu][5]
│
└─> switch(proto)
Step 3: case ETH_P_IP → check_v4(ctx)
├─> Get data pointers
│ ├─> data = ctx_data(ctx)
│ ├─> data_end = ctx_data_end(ctx)
│ └─> ipv4_hdr = data + 14
│
├─> Bounds check
│ └─> if (ipv4_hdr + 1 > data_end) → DROP
│
├─> Prefilter check #1 (LPM trie)
│ ├─> pfx.addr = ipv4_hdr->saddr
│ ├─> pfx.prefixlen = 32
│ ├─> lookup cilium_cidr_v4_dyn[pfx]
│ └─> if found → DROP (⚡ ~50ns drop time)
│
├─> Prefilter check #2 (hash map)
│ ├─> lookup cilium_cidr_v4_fix[pfx]
│ └─> if found → DROP (⚡ ~30ns drop time)
│
└─> check_v4_lb(ctx)
Step 4: check_v4_lb(ctx)
└─> tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_NETDEV)
└─> Jump to tail_lb_ipv4()
Step 5: tail_lb_ipv4(ctx)
├─> if (!ctx_skip_nodeport(ctx))
│ ├─> Read RECIRC_MARKER
│ └─> First time: false, continue
│
├─> revalidate_data(ctx, &data, &data_end, &ip4)
│ └─> Re-establish packet pointers for verifier
│
├─> #if ENABLE_DSR && DSR_ENCAP_GENEVE
│ ├─> Parse outer UDP/Geneve headers
│ ├─> Locate inner IPv4 header
│ ├─> Update l3_off to inner packet
│ └─> revalidate_data_l3_off() for inner IP
│
└─> nodeport_lb4(ctx, ip4, l3_off, ...)
Step 6: nodeport_lb4() [from lib/nodeport.h]
├─> Extract L4 tuple
│ ├─> tuple.saddr = ip4->saddr
│ ├─> tuple.daddr = ip4->daddr
│ ├─> tuple.sport = tcp/udp->source
│ ├─> tuple.dport = tcp/udp->dest
│ └─> tuple.nexthdr = ip4->protocol
│
├─> Service lookup
│ ├─> key.address = tuple.daddr
│ ├─> key.dport = tuple.dport
│ ├─> key.proto = tuple.nexthdr
│ └─> svc = lb4_lookup_service(&key)
│
├─> if (svc) {
│ ├─> Backend selection
│ │ ├─> Select algorithm: random vs Maglev
│ │ ├─> Check externalTrafficPolicy
│ │ └─> backend = lb4_select_backend(svc)
│ │
│ ├─> Connection tracking
│ │ ├─> ct_create4(tuple, CT_EGRESS)
│ │ └─> Store NAT mapping
│ │
│ ├─> DNAT or DSR
│ │ ├─> if (DSR mode)
│ │ │ └─> Encode service IP/port in IP options
│ │ └─> else
│ │ ├─> ip4->daddr = backend->address
│ │ └─> tcp/udp->dest = backend->port
│ │
│ ├─> FIB lookup
│ │ ├─> fib_params.dst = backend->address
│ │ ├─> ret = fib_lookup(&fib_params)
│ │ └─> Get next-hop interface & MAC
│ │
│ └─> Return XDP_REDIRECT or XDP_PASS
│ }
│
└─> else → Return CTX_ACT_OK (no service match)
Step 7: bpf_xdp_exit(ctx, ret)
├─> if (ret == CTX_ACT_OK/XDP_PASS)
│ └─> ctx_move_xfer(ctx)
│ ├─> meta_xfer = ctx_load_meta(ctx, XFER_MARKER)
│ ├─> ctx_adjust_meta(ctx, -4)
│ │ └─> Grow metadata area by 4 bytes
│ ├─> data_meta = ctx_data_meta(ctx)
│ └─> data_meta[0] = meta_xfer
│ └─> TC can now read these flags!
│
└─> return ret
Step 8: Kernel acts on verdict
├─> XDP_DROP (1): Driver recycles the DMA buffer immediately (~10ns; no sk_buff was ever allocated)
├─> XDP_PASS (2): Allocate sk_buff, continue to TC
├─> XDP_TX (3): Bounce packet back out same interface
└─> XDP_REDIRECT (4): dev_map_enqueue() to target interface
Performance Characteristics
The XDP datapath achieves remarkable performance through several optimizations:
Drop Performance
- Prefilter drop: 30-100 nanoseconds
- No sk_buff allocation
- No connection tracking
- No policy lookup
- Single map lookup and return
Load Balancing Performance
- XDP NodePort: 200-500 nanoseconds
- Service lookup: ~50ns (hash map)
- Backend selection: ~30ns (random) or ~100ns (Maglev)
- Connection tracking: ~80ns (LRU map update)
- FIB lookup: ~40ns
- Total: <500ns for fast path
Comparison with Traditional Stack
| Operation | Traditional Stack | XDP Datapath | Speedup |
|---|---|---|---|
| Drop bad traffic | 5-10 μs | 50 ns | 100-200x |
| NodePort LB | 10-20 μs | 300 ns | 30-60x |
| Service routing | 15-25 μs | 500 ns | 30-50x |
Why So Fast?
- No sk_buff allocation: Saves ~1-2 μs
- No GRO/GSO processing: Saves ~500 ns
- No netfilter traversal: Saves ~1-3 μs
- Direct packet modification: In-place rewrites
- Cache locality: Per-CPU maps and hot data
- JIT compilation: eBPF → native machine code
Limitations and Trade-offs
While XDP provides excellent performance, it has constraints:
BPF Instruction Limits
- Original limit: 4,096 instructions
- Modern limit: ~1 million with complexity bounds
- Solution: Tail calls to split complex logic
No Fragmentation
- XDP can't fragment packets
- MTU violations must be handled in TC or kernel
- DSR ICMP "frag needed" messages generated in TC
Limited Checksum Support
- Hardware checksum offload only
- Can't recalculate checksums reliably
- Why Cilium uses BPF_F_ZERO_CSUM_TX for tunnels
No Socket Context
- No access to socket state
- Can't check socket options
- Socket-layer policies handled by cgroup BPF
Verifier Complexity
- Must prove all memory accesses are safe
- Requires careful pointer arithmetic
- Bounds checks throughout code
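The verifier-mandated pattern is the same everywhere in bpf_xdp.c: prove a header fits between data and data_end before dereferencing it. A userspace rendering of that discipline (struct and names are illustrative, not Cilium's):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-in for the start of an IPv4 header. */
struct ipv4_min {
    uint8_t  ver_ihl;
    uint8_t  tos;
    uint16_t tot_len;
};

/* Return a pointer to the L3 header at `l3_off`, or NULL if it would
 * run past data_end. The `hdr + sizeof(...) > data_end` comparison is
 * exactly what ctx_no_room() expresses in the BPF source; the verifier
 * rejects any load that is not dominated by such a check. */
static const struct ipv4_min *
parse_l3(const uint8_t *data, const uint8_t *data_end, size_t l3_off)
{
    const uint8_t *hdr = data + l3_off;

    if (hdr + sizeof(struct ipv4_min) > data_end)
        return NULL;
    return (const struct ipv4_min *)hdr;
}
```

This is also why revalidate_data() appears after every ctx_adjust_meta() or offset change: any operation that can move the packet boundaries invalidates the verifier's previously proven bounds.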
Integration with TC Datapath
XDP is just the first stage. Packets that pass proceed to TC (Traffic Control) where more complex operations occur:
TC Responsibilities
- Policy enforcement: L3/L4 and L7 network policies
- Encryption: IPsec, WireGuard key management
- Tunneling: VXLAN/Geneve encapsulation
- Local delivery: Routing to container network namespaces
- Identity extraction: From tunnel headers or endpoint maps
- L7 proxy redirection: To Envoy for HTTP/gRPC policies
XDP → TC Handoff
The metadata transfer via ctx_move_xfer() ensures TC knows what XDP already did:
// TC program (bpf_host.c)
__u32 xfer_flags = ctx_get_xfer(ctx, XFER_FLAGS);

if (xfer_flags & XFER_PKT_NO_SVC) {
    // XDP already did service translation, skip it
    goto skip_service_lookup;
}

if (xfer_flags & XFER_PKT_SNAT_DONE) {
    // XDP already did SNAT, don't do it again
    ctx_snat_done_set(ctx);
}
This coordination prevents:
- Double service lookups
- Duplicate NAT operations
- Wasted CPU cycles
- State inconsistencies
Deployment Scenarios
Cilium's conditional compilation enables different XDP configurations:
Minimal XDP (Prefilter Only)
clang -DENABLE_IPV4=1 \
      -DENABLE_PREFILTER=1 \
      -DCIDR4_FILTER=1 \
      bpf_xdp.c -o bpf_xdp_minimal.o
- Use case: DDoS mitigation, basic traffic filtering
- Size: ~2,000 BPF instructions
- Features: CIDR-based drops only
NodePort Acceleration
clang -DENABLE_IPV4=1 \
      -DENABLE_PREFILTER=1 \
      -DENABLE_NODEPORT_ACCELERATION=1 \
      bpf_xdp.c -o bpf_xdp_nodeport.o
- Use case: High-performance Kubernetes NodePort
- Size: ~15,000 BPF instructions
- Features: Prefilter + load balancing
Full XDP with DSR
clang -DENABLE_IPV4=1 \
      -DENABLE_IPV6=1 \
      -DENABLE_PREFILTER=1 \
      -DENABLE_NODEPORT_ACCELERATION=1 \
      -DENABLE_DSR=1 \
      -DDSR_ENCAP_MODE=DSR_ENCAP_GENEVE \
      bpf_xdp.c -o bpf_xdp_full.o
- Use case: Maximum performance with Direct Server Return
- Size: ~25,000 BPF instructions
- Features: Everything including DSR tunnel parsing
Debugging and Observability
Trace Events
The send_drop_notify_error_ext() function sends events to Cilium's monitoring infrastructure:
if (IS_ERR(ret))
    return send_drop_notify_error_ext(ctx, UNKNOWN_ID, ret, ext_err,
                                      METRIC_INGRESS);
These events are visible via:
- cilium monitor - Real-time event stream
- Hubble UI - Graphical service map
- Prometheus metrics - Drop counters by reason
Drop Reasons
Common drop reasons in XDP:
| Code | Reason | Meaning |
|---|---|---|
| DROP_INVALID | Malformed packet | Truncated or invalid headers |
| DROP_NO_SERVICE | No backend | Service has zero backends |
| DROP_POLICY_DENIED | Policy drop | Network policy blocked traffic |
| CTX_ACT_DROP | Prefilter | CIDR-based drop |
Conclusion
Cilium's XDP datapath (bpf_xdp.c) represents a masterclass in high-performance packet processing. Through careful layering, conditional compilation, and intelligent use of BPF features, it achieves:
- Sub-100ns drop latency for unwanted traffic
- Sub-microsecond load balancing for Kubernetes Services
- Zero-copy packet processing with direct hardware access
- Flexible deployment from minimal to full-featured
The key insights are:
- Layer performance-critical operations early - Prefilter before load balancing
- Use the right data structure - Hash for exact match, LPM for ranges
- Coordinate between stages - XDP → TC metadata transfer
- Compile for the deployment - Feature flags eliminate unused code
- Observe everything - Rich monitoring for debugging
As eBPF continues to evolve, XDP will only become more powerful. Cilium's implementation shows how to harness this power effectively, creating one of the fastest datapaths in the cloud-native ecosystem.
Author's Note: This article is based on the latest development version of Cilium source code. Implementation details may vary across versions.