
Understanding Cilium's TC LXC Datapath: A Deep Dive into bpf_lxc.c

Tags: cilium, tc, ebpf, networking, datapath, bpf_lxc.c

Exploring the architecture and performance of Cilium's TC LXC datapath implementation.


Introduction

While XDP provides the fastest packet processing at the driver level, the TC (Traffic Control) hook is where Cilium implements the majority of its networking intelligence. The bpf_lxc.c program attaches to the container side of veth pairs, intercepting all traffic entering and leaving containers. This is where policy enforcement, service mesh integration, encryption, and local delivery decisions happen.

This article provides a comprehensive analysis of Cilium's TC LXC (Linux Container) datapath, explaining how it achieves secure, policy-driven networking for Kubernetes pods.

Architecture Context

Veth Pair Topology

Every Kubernetes pod gets a virtual Ethernet pair:

┌──────────────────────────────────────────┐
│           Container Namespace            │
│                                          │
│   ┌──────────────┐                       │
│   │     eth0     │  Container interface  │
│   │  (bpf_lxc)   │  TC ingress/egress    │
│   └──────┬───────┘                       │
└──────────┼───────────────────────────────┘
           │ veth pair
┌──────────┼───────────────────────────────┐
│          │       Host Namespace          │
│   ┌──────┴───────┐                       │
│   │   lxcXXXX    │  Host-side interface  │
│   │  (bpf_host)  │  TC ingress/egress    │
│   └──────────────┘                       │
│                                          │
│   ┌──────────────┐                       │
│   │ cilium_host  │  Cilium internal      │
│   │  (bpf_host)  │  interface            │
│   └──────┬───────┘                       │
└──────────┼───────────────────────────────┘
           │
    ┌──────┴───────┐
    │ Physical NIC │
    └──────────────┘

BPF Program Attachment Points

// bpf_lxc.c attaches to:
// 1. TC ingress on container veth (from container)
// 2. TC egress on container veth (to container)

/* Entry point for egress (from container) */
__section_entry
int cil_from_container(struct __ctx_buff *ctx)

/* Entry point for ingress (to container) */  
__section_entry
int cil_to_container(struct __ctx_buff *ctx)

/* Policy enforcement entry point */
__section_entry
int cil_lxc_policy(struct __ctx_buff *ctx)

/* L7 proxy egress entry point */
__section_entry
int cil_lxc_policy_egress(struct __ctx_buff *ctx)

Program Initialization and Headers

Lines 1-60: Configuration and Feature Flags

// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
/* Copyright Authors of Cilium */

#include <bpf/ctx/skb.h>
#include <bpf/api.h>

Context type: <bpf/ctx/skb.h> defines:

  • __ctx_buff as __sk_buff (not xdp_md)
  • TC-specific helpers: skb_load_bytes(), skb_store_bytes()
  • Access to skb->mark, skb->cb[] control buffer
#include <bpf/config/node.h>
#include <bpf/config/global.h>
#include <bpf/config/endpoint.h>
#include <bpf/config/lxc.h>

Configuration hierarchy:

  • node.h: Cluster-wide config (CLUSTER_ID, NODE_ID, encryption settings)
  • global.h: Global features (IPv4/IPv6 enabled, tunnel mode)
  • endpoint.h: Generic endpoint definitions
  • lxc.h: Container-specific config - Generated per pod!

Key values in lxc.h (illustrative - the real header is generated per endpoint by the Cilium agent):

#define LXC_ID 1234                    // Unique endpoint ID
#define SECLABEL 5678                  // Security identity
#define SECLABEL_IPV4 5678
#define SECLABEL_IPV6 5679
#define LXC_IPV4 "10.0.1.5"           // Pod IP
#define LXC_IPV6 "fd00::5"
#define IS_BPF_LXC 1

#define EFFECTIVE_EP_ID LXC_ID
#define EVENT_SOURCE LXC_ID

Program type marker: Tells library code this is a container-side TC program. Affects:

  • Policy map selection
  • Connection tracking scope
  • Metrics labeling

#define USE_LOOPBACK_LB		1

Loopback load balancing: Enables special handling when a container talks to a service that backends to itself. Prevents martian source drops.

#undef LB_SELECTION
#define LB_SELECTION LB_SELECTION_RANDOM

Override load balancer algorithm:

  • XDP uses Maglev for external traffic (better distribution)
  • bpf_lxc forces RANDOM for in-cluster traffic
  • Why? Maglev requires precomputed lookup tables. For ClusterIP services (thousands of them), maintaining Maglev tables for each service would consume excessive memory. Random selection is simpler and sufficient for east-west traffic.

Lines 20-58: Library Includes

#include "lib/auth.h"          // Mutual auth between endpoints
#include "lib/tailcall.h"      // Tail call infrastructure
#include "lib/policy.h"        // L3/L4 policy enforcement
#include "lib/lb.h"            // Service load balancing
#include "lib/nat.h"           // NAT for services
#include "lib/encap.h"         // VXLAN/Geneve encapsulation
#include "lib/local_delivery.h" // Routing to local endpoints

Each library provides critical functionality. The order matters for macro definitions and dependencies.

Entry Point 1: cil_from_container() - Egress Path

This is the hottest path - every packet leaving a container goes through here.

Complete Function Flow

__section_entry
int cil_from_container(struct __ctx_buff *ctx)
{
    __u16 proto = 0;
    __u32 sec_label = SECLABEL;
    __s8 ext_err = 0;
    int ret;
    bool valid_ethertype = validate_ethertype(ctx, &proto);

Initial setup:

  • sec_label = SECLABEL: This container's security identity (compiled in)
  • ext_err: Extended error code for detailed drop reasons
  • validate_ethertype(): Extract protocol (IPv4/IPv6/ARP) from Ethernet header

    bpf_clear_meta(ctx);
    check_and_store_ip_trace_id(ctx);

Metadata management:

  • bpf_clear_meta(): Zero out skb->cb[] control buffer
    • Why? The skb might be recycled from another context
    • Ensures clean state for this processing pipeline
  • check_and_store_ip_trace_id(): Extract trace ID from IP options (same as XDP)

    /* Workaround for GH-18311 where veth driver might have recorded
     * veth's RX queue mapping instead of leaving it at 0. This can
     * cause issues on the phys device where all traffic would only
     * hit a single TX queue (given veth device had a single one and
     * mapping was left at 1). Reset so that stack picks a fresh queue.
     * Kernel fix is at 710ad98c363a ("veth: Do not record rx queue
     * hint in veth_xmit").
     */
    ctx->queue_mapping = 0;

Veth driver bug workaround:

  • Problem: Veth driver records RX queue mapping, but veth typically has only one queue
  • Impact: Physical NIC would use that queue hint, causing all traffic to hit one TX queue
  • Fix: Reset to 0, let kernel stack pick queue based on hash
  • Kernel fix: Mainline kernel 5.16+ doesn't have this issue

    send_trace_notify(ctx, TRACE_FROM_LXC, sec_label, UNKNOWN_ID,
                      TRACE_EP_ID_UNKNOWN, TRACE_IFINDEX_UNKNOWN,
                      TRACE_REASON_UNKNOWN, TRACE_PAYLOAD_LEN, proto);

Send monitoring event:

  • TRACE_FROM_LXC: Event type
  • sec_label: Source identity (this container)
  • UNKNOWN_ID: Destination not yet known
  • Goes to Hubble for observability

    if (!valid_ethertype) {
        ret = DROP_UNSUPPORTED_L2;
        goto out;
    }

Protocol validation: Drop non-IP/ARP traffic with proper error code.

    switch (proto) {
#ifdef ENABLE_IPV6
    case bpf_htons(ETH_P_IPV6):
        edt_set_aggregate(ctx, LXC_ID);
        ret = tail_call_internal(ctx, CILIUM_CALL_IPV6_FROM_LXC, &ext_err);
        sec_label = SECLABEL_IPV6;
        break;
#endif

Protocol dispatch with EDT:

  • edt_set_aggregate(ctx, LXC_ID): Set Earliest Departure Time aggregate
    • Purpose: Traffic shaping and QoS
    • Groups packets by endpoint for fair queuing
    • Prevents one container from monopolizing bandwidth
  • tail_call_internal(): Jump to IPv6 handler
  • Why tail call? Complexity limit - IPv6 processing is too large for one program

#ifdef ENABLE_IPV4
    case bpf_htons(ETH_P_IP):
        edt_set_aggregate(ctx, LXC_ID);
        ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_LXC, &ext_err);
        sec_label = SECLABEL_IPV4;
        break;

IPv4 path: Same pattern as IPv6.

#ifdef ENABLE_ARP_PASSTHROUGH
    case bpf_htons(ETH_P_ARP):
        ret = CTX_ACT_OK;
        break;
#elif defined(ENABLE_ARP_RESPONDER)
    case bpf_htons(ETH_P_ARP):
        ret = tail_call_internal(ctx, CILIUM_CALL_ARP, &ext_err);
        break;
#endif
#endif /* ENABLE_IPV4 */
    default:
        ret = DROP_UNKNOWN_L3;
        break;
    }

ARP handling modes:

  1. ENABLE_ARP_PASSTHROUGH: Let the kernel handle ARP (simple)
  2. ENABLE_ARP_RESPONDER: BPF answers ARP directly (faster, fewer context switches)

out:
    if (IS_ERR(ret))
        return send_drop_notify_ext(ctx, sec_label, UNKNOWN_ID, LXC_ID,
                                    ret, ext_err, METRIC_EGRESS);
    return ret;
}

Error handling: Send detailed drop notification to monitoring before returning error verdict.

The IPv4 Egress Pipeline

After tail call from cil_from_container(), execution continues in the IPv4 handler. Let's trace the complete path:

Complete egress pipeline flow:

┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Per-Packet Load Balancing                              │
│ __per_packet_lb_svc_xlate_4()                                   │
│   ├─> Service lookup                                            │
│   ├─> Backend selection (RANDOM)                                │
│   ├─> DNAT (rewrite dst IP/port)                                │
│   └─> lb4_ctx_store_state() → save to skb->cb[]                 │
│        └─> tail_call(CILIUM_CALL_IPV4_CT_EGRESS)                │
└─────────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: Connection Tracking (Egress)                           │
│ tail_ipv4_ct_egress() [generated by TAIL_CT_LOOKUP4 macro]     │
│   ├─> lb4_ctx_restore_state() → read from skb->cb[]            │
│   ├─> ct_lookup4(CT_EGRESS) → CT_NEW/CT_ESTABLISHED/CT_REPLY   │
│   ├─> map_update_elem(cilium_tail_call_buffer4) → save results │
│   └─> tail_call(CILIUM_CALL_IPV4_FROM_LXC_CONT)                │
└─────────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: Policy Enforcement + Forwarding                        │
│ tail_handle_ipv4_cont() → handle_ipv4_from_lxc()               │
│   ├─> map_lookup_elem(cilium_tail_call_buffer4) → read results │
│   ├─> switch (ct_status):                                      │
│   │   ├─> CT_NEW/CT_ESTABLISHED:                               │
│   │   │   ├─> policy_can_egress4() → allow/deny/proxy          │
│   │   │   ├─> ct_create4() if CT_NEW                           │
│   │   │   └─> ipv4_forward_to_destination()                    │
│   │   └─> CT_REPLY/CT_RELATED:                                 │
│   │       └─> skip policy, forward directly                    │
│   └─> Return verdict                                           │
└─────────────────────────────────────────────────────────────────┘

Key inter-step communication:

  1. Step 1 → Step 2: skb->cb[] metadata (via lb4_ctx_store_state/restore)
  2. Step 2 → Step 3: cilium_tail_call_buffer4 map (shared per-CPU buffer)

Step 1: Per-Packet Load Balancing

#ifdef ENABLE_PER_PACKET_LB
static __always_inline int 
__per_packet_lb_svc_xlate_4(void *ctx, struct iphdr *ip4, __s8 *ext_err)
{
    struct ipv4_ct_tuple tuple = {};
    struct ct_state ct_state_new = {};
    const struct lb4_service *svc;
    struct lb4_key key = {};
    __u16 proxy_port = 0;
    __u32 cluster_id = 0;

Why per-packet LB?

  • Socket-layer LB (bpf_sock.c) handles most service translation
  • But some cases need packet-level handling:
    • L7 services (need redirect to Envoy proxy)
    • SCTP protocol (not supported in socket BPF)
    • First packet before socket established

Extract L4 tuple:

    tuple.nexthdr = ip4->protocol;
    tuple.daddr = ip4->daddr;
    tuple.saddr = ip4->saddr;
    
    l4_off = ETH_HLEN + ipv4_hdrlen(ip4);
    // Parse TCP/UDP headers for port numbers
    ret = lb4_extract_tuple(ctx, ip4, fraginfo, l4_off, &tuple);

Service lookup:

    lb4_fill_key(&key, &tuple);
    svc = lb4_lookup_service(&key, is_defined(ENABLE_NODEPORT));

Check for L7 load balancer:

#if defined(ENABLE_L7_LB)
    if (lb4_svc_is_l7_loadbalancer(svc)) {
        proxy_port = (__u16)svc->l7_lb_proxy_port;
        goto skip_service_lookup;
    }
#endif /* ENABLE_L7_LB */

L7 services:

  • Marked with SVC_FLAG_L7_LOADBALANCER in service map
  • Don't DNAT here - need to redirect to Envoy proxy
  • Store proxy_port for later redirect
  • Envoy will handle HTTP/gRPC load balancing

Backend selection and DNAT:

    ret = lb4_local(get_ct_map4(&tuple), ctx, fraginfo,
                    l4_off, &key, &tuple, svc, &ct_state_new,
                    &backend, ext_err);
    
    if (tuple.saddr == backend->address) {
        /* Loopback: container talking to service backed by itself */
        ct_state_new.loopback = 1;
    }
    
    ret = lb4_dnat_request(ctx, backend, ETH_HLEN, fraginfo,
                          l4_off, &key, &tuple, ct_state_new.loopback);

DNAT operation:

  • Rewrites ip4->daddr to backend IP
  • Rewrites TCP/UDP destination port
  • Updates checksums (IP + L4)
  • Loopback handling: If source == destination after DNAT, set loopback flag

State preservation:

skip_service_lookup:
    lb4_ctx_store_state(ctx, &ct_state_new, proxy_port, cluster_id);
    return tail_call_internal(ctx, CILIUM_CALL_IPV4_CT_EGRESS, ext_err);
}

Store state in skb->cb[]:

  • ct_state_new.rev_nat_index: For reverse NAT on reply
  • proxy_port: If L7 proxy redirect needed
  • cluster_id: For multi-cluster routing
  • Why store? Next tail call needs this info, but can't pass parameters

Step 2: Connection Tracking (Egress)

The CILIUM_CALL_IPV4_CT_EGRESS tail call leads to connection tracking. This uses a C macro to generate the actual function code.

Understanding the Macro Pattern

In bpf_lxc.c (line 451), there's a macro definition:

#define TAIL_CT_LOOKUP4(ID, NAME, DIR, CONDITION, TARGET_ID, TARGET_NAME)

This is invoked as:

TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_EGRESS,        // ID - tail call number
                tail_ipv4_ct_egress,                // NAME - function name
                CT_EGRESS,                          // DIR - direction
                is_defined(ENABLE_PER_PACKET_LB),   // CONDITION
                CILIUM_CALL_IPV4_FROM_LXC_CONT,     // TARGET_ID
                tail_handle_ipv4_cont)              // TARGET_NAME

Macro expansion flow:

Source Code (bpf_lxc.c)
    ↓
    TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_EGRESS, tail_ipv4_ct_egress, ...)
    ↓
C Preprocessor (clang -E)
    ↓
    Expands macro → replaces parameters
    ↓
Expanded Code (what compiler actually sees)
    ↓
    __declare_tail(CILIUM_CALL_IPV4_CT_EGRESS)
    static __always_inline
    int tail_ipv4_ct_egress(struct __ctx_buff *ctx)
    {
        // Full function body with DIR=CT_EGRESS everywhere
        ...
    }
    ↓
Compiler (clang)
    ↓
BPF Bytecode

What the C Preprocessor Does

The C preprocessor expands this macro before compilation, replacing the parameters:

// Before macro expansion (what you write):
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_EGRESS, tail_ipv4_ct_egress, ...)

// After macro expansion (what the compiler sees):
__declare_tail(CILIUM_CALL_IPV4_CT_EGRESS)
static __always_inline
int tail_ipv4_ct_egress(struct __ctx_buff *ctx)
{
    enum ct_scope scope = SCOPE_BIDIR;
    struct ct_buffer4 ct_buffer = {};
    struct ipv4_ct_tuple *tuple;
    struct ct_state *ct_state;
    void *data, *data_end;
    struct iphdr *ip4;
    __s8 ext_err = 0;
    __u32 zero = 0;
    void *map;
    
    ct_state = (struct ct_state *)&ct_buffer.ct_state;
    tuple = (struct ipv4_ct_tuple *)&ct_buffer.tuple;
    
    if (!revalidate_data(ctx, &data, &data_end, &ip4))
        return drop_for_direction(ctx, CT_EGRESS, DROP_INVALID, ext_err);
    
    tuple->nexthdr = ip4->protocol;
    tuple->daddr = ip4->daddr;
    tuple->saddr = ip4->saddr;
    // ... (full function body)
}

Why use macros?

  • Code reuse: Same pattern for egress/ingress, IPv4/IPv6
  • Compile-time parameters: DIR (direction) is known at compile time
  • Reduces errors: Write the logic once, instantiate multiple times

The actual macro definition (simplified):

#define TAIL_CT_LOOKUP4(ID, NAME, DIR, CONDITION, TARGET_ID, TARGET_NAME) \
__declare_tail(ID)                                                        \
static __always_inline                                                    \
int NAME(struct __ctx_buff *ctx)                                          \
{                                                                         \
    /* Variable declarations */                                           \
    struct ct_buffer4 ct_buffer = {};                                     \
    void *map;                                                            \
    /* ... */                                                             \
                                                                          \
    /* Extract packet tuple */                                            \
    tuple->nexthdr = ip4->protocol;                                       \
    tuple->daddr = ip4->daddr;                                            \
    tuple->saddr = ip4->saddr;                                            \
                                                                          \
    /* Select CT map - DIR parameter used here */                         \
    map = select_ct_map4(ctx, DIR, tuple);                                \
                                                                          \
    /* Do CT lookup - DIR parameter used here */                          \
    ct_buffer.ret = ct_lookup4(map, tuple, ctx, ip4, l4_off,              \
                               DIR, scope, ct_state, &monitor);           \
                                                                          \
    /* Save results to shared buffer */                                   \
    map_update_elem(&cilium_tail_call_buffer4, &zero, &ct_buffer, 0);    \
                                                                          \
    /* Chain to next stage - TARGET_ID/TARGET_NAME used here */           \
    if (CONDITION)                                                        \
        ret = tail_call_internal(ctx, TARGET_ID, &ext_err);               \
    else                                                                  \
        ret = TARGET_NAME(ctx);                                           \
                                                                          \
    return ret;                                                           \
}

Parameter substitution example:

// When you write:
map = select_ct_map4(ctx, DIR, tuple);

// After macro expansion with DIR=CT_EGRESS:
map = select_ct_map4(ctx, CT_EGRESS, tuple);

// After macro expansion with DIR=CT_INGRESS:
map = select_ct_map4(ctx, CT_INGRESS, tuple);

This creates different functions from the same template!

The Generated Function Logic

After macro expansion, tail_ipv4_ct_egress() contains:

Select CT map:

    map = select_ct_map4(ctx, CT_EGRESS, tuple);

Multi-cluster support:

  • Each cluster can have its own CT map
  • Prevents IP overlap issues
  • select_ct_map4() checks CB_CLUSTER_ID_EGRESS metadata

Restore load balancer state:

    if (is_defined(ENABLE_PER_PACKET_LB) && DIR == CT_EGRESS) {
        struct ct_state ct_state_new = {};
        __u32 cluster_id;
        __u16 proxy_port;
        
        lb4_ctx_restore_state(ctx, &ct_state_new, &proxy_port,
                             &cluster_id, false);
        if (ct_state_new.rev_nat_index)
            scope = SCOPE_FORWARD;
    }

Connection tracking lookup:

    ret = ct_lazy_lookup4(map, tuple, ctx, fraginfo, l4_off,
                         CT_EGRESS, scope, CT_ENTRY_ANY,
                         ct_state, &monitor);

CT lookup results:

  • CT_NEW: First packet of connection
  • CT_ESTABLISHED: Existing connection
  • CT_REPLY: Reply packet (reverse direction)
  • CT_RELATED: Related connection (e.g., FTP data channel)

Deferred CT creation: the lookup stage does not insert an entry on CT_NEW. ct_create4() runs later, in Step 3 (handle_ipv4_from_lxc()), and only after the policy verdict allows the connection - this keeps denied flows out of the CT table.

What gets stored in CT entry:

struct ct_entry {
    __u64 rx_packets;
    __u64 tx_packets;
    __u64 rx_bytes;
    __u64 tx_bytes;
    __u32 lifetime;          // TTL in seconds
    __u16 rx_closing:1,      // FIN seen in RX direction
          tx_closing:1,      // FIN seen in TX direction
          nat46:1,           // NAT46 translation
          lb_loopback:1,     // Loopback service
          seen_non_syn:1,    // Non-SYN packet seen
          node_port:1,       // NodePort service
          proxy_redirect:1,  // Proxy redirect
          dsr:1,             // Direct Server Return
          // ... more flags
    __u16 rev_nat_index;     // Reverse NAT index
    __u16 ifindex;           // Source interface
};

Tail call to next stage:

    // Save CT results to shared buffer for next stage
    if (map_update_elem(&cilium_tail_call_buffer4, &zero, &ct_buffer, 0) < 0)
        return drop_for_direction(ctx, CT_EGRESS, DROP_INVALID_TC_BUFFER, ext_err);
    
    // The macro substitutes TARGET_ID and TARGET_NAME parameters:
    if (CONDITION)  // CONDITION = is_defined(ENABLE_PER_PACKET_LB)
        ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_LXC_CONT, &ext_err);
    else
        ret = tail_handle_ipv4_cont(ctx);  // Direct call if no per-packet LB
    
    if (IS_ERR(ret))
        return drop_for_direction(ctx, CT_EGRESS, ret, ext_err);
    
    return ret;
}

The CT → Policy linkage:

// After tail_ipv4_ct_egress() completes:
// 1. CT results saved in cilium_tail_call_buffer4
// 2. Tail call (or direct call) to tail_handle_ipv4_cont()

__declare_tail(CILIUM_CALL_IPV4_FROM_LXC_CONT)
static __always_inline
int tail_handle_ipv4_cont(struct __ctx_buff *ctx)
{
    __u32 dst_sec_identity = 0;
    __s8 ext_err = 0;
    
    // This function does policy enforcement + forwarding
    int ret = handle_ipv4_from_lxc(ctx, &dst_sec_identity, &ext_err);
    
    if (IS_ERR(ret))
        return send_drop_notify_ext(ctx, SECLABEL_IPV4, dst_sec_identity,
                                    TRACE_EP_ID_UNKNOWN, ret, ext_err,
                                    METRIC_EGRESS);
    return ret;
}

How to verify macro expansion yourself:

# Generate preprocessed output (macros expanded)
cd /home/sai/foss/cilium
clang -E -I bpf/include -I bpf -D__x86_64__ \
      -DENABLE_IPV4 -DENABLE_IPV6 \
      bpf/bpf_lxc.c > /tmp/bpf_lxc_expanded.c

# Search for the generated function
grep -A 50 "tail_ipv4_ct_egress" /tmp/bpf_lxc_expanded.c

Other instances of this macro:

// Ingress CT lookup
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_INGRESS,
                tail_ipv4_ct_ingress,
                CT_INGRESS, ...)

// IPv6 egress CT lookup  
TAIL_CT_LOOKUP6(CILIUM_CALL_IPV6_CT_EGRESS,
                tail_ipv6_ct_egress,
                CT_EGRESS, ...)

Each invocation generates a complete, separate function with the logic customized by the parameters.

Key Takeaways

  1. Macros are code templates: TAIL_CT_LOOKUP4 is a template that generates function code
  2. Preprocessing happens first: Before C compilation, the preprocessor replaces macros with actual code
  3. Parameters customize behavior: Same logic, different constants (DIR, TARGET_ID, etc.)
  4. Result: Multiple similar functions without code duplication

Mental model:

Think of it like a function that writes functions!

TAIL_CT_LOOKUP4(...) is NOT a function call
                     ↓
It's a CODE GENERATOR that creates a new function
                     ↓
The generated function gets compiled into BPF bytecode

Comparison:

// WITHOUT macros (you'd have to write):
int tail_ipv4_ct_egress(...) {
    map = select_ct_map4(ctx, CT_EGRESS, tuple);
    // 50 lines of code
}

int tail_ipv4_ct_ingress(...) {
    map = select_ct_map4(ctx, CT_INGRESS, tuple);
    // Same 50 lines with CT_EGRESS → CT_INGRESS
}

// WITH macros (you write):
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_EGRESS, tail_ipv4_ct_egress, CT_EGRESS, ...)
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_INGRESS, tail_ipv4_ct_ingress, CT_INGRESS, ...)

// Result: Both functions exist, but you only wrote the logic once!

Step 3: Policy Enforcement (Egress)

The CILIUM_CALL_IPV4_FROM_LXC_CONT tail call (or direct call) leads to tail_handle_ipv4_cont(), which calls handle_ipv4_from_lxc(). This function enforces egress network policies after retrieving CT results:

static __always_inline int
handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *dst_sec_identity,
                     __s8 *ext_err)
{
    struct ipv4_ct_tuple *tuple;
    struct ct_state *ct_state;
    enum ct_status ct_status;
    __u32 zero = 0;

Retrieve CT results from shared buffer:

    // Step 2 (CT lookup) saved results here
    ct_buffer = map_lookup_elem(&cilium_tail_call_buffer4, &zero);
    if (!ct_buffer)
        return DROP_INVALID_TC_BUFFER;
    
    tuple = (struct ipv4_ct_tuple *)&ct_buffer->tuple;
    ct_state = (struct ct_state *)&ct_buffer->ct_state;
    ct_status = ct_buffer->ret;  // CT_NEW, CT_ESTABLISHED, CT_REPLY, etc.
    l4_off = ct_buffer->l4_off;

The data flow between Step 2 and Step 3:

Step 2: tail_ipv4_ct_egress()
    ├─> ct_lookup4() → returns CT_NEW/CT_ESTABLISHED/CT_REPLY
    ├─> Store result in cilium_tail_call_buffer4:
    │   └─> ct_buffer.ret = CT_NEW (or other status)
    │   └─> ct_buffer.tuple = {saddr, daddr, sport, dport, proto}
    │   └─> ct_buffer.ct_state = {flags, rev_nat_index, ...}
    └─> tail_call → CILIUM_CALL_IPV4_FROM_LXC_CONT
                     ↓
Step 3: tail_handle_ipv4_cont() → handle_ipv4_from_lxc()
    ├─> Read cilium_tail_call_buffer4
    ├─> ct_status = ct_buffer->ret
    └─> Policy enforcement based on ct_status

Policy enforcement based on CT state:

    switch (ct_status) {
    case CT_NEW:
    case CT_ESTABLISHED:
        /* Forward to L7 proxy if needed */
        if (proxy_port > 0)
            break;  // Skip policy, L7 LB handles it
        
        /* Skip policy for hairpin (pod to itself via service) */
        if (hairpin_flow)
            break;
        
        /* Egress policy check */
        verdict = policy_can_egress4(ctx, tuple, l4_off, SECLABEL_IPV4,
                                     *dst_sec_identity, &policy_match_type,
                                     &audited, ext_err, &proxy_port, &cookie);

Egress policy function:

int policy_can_egress4(ctx, tuple, l4_off, src_label, dst_label, ...)
{
    // Lookup in egress policy map
    struct policy_key key = {
        .sec_label = src_label,     // This endpoint's identity
        .egress = 1,                // Direction: egress
        .protocol = tuple->nexthdr, // TCP/UDP/ICMP
        .dport = tuple->dport       // Destination port
    };
    
    struct policy_entry *policy = map_lookup_elem(&cilium_policy, &key);
    
    if (!policy)
        return DROP_POLICY;  // Default deny
    
    if (policy->deny)
        return DROP_POLICY_DENY;
    
    if (policy->proxy_port)
        return POLICY_ACT_PROXY_REDIRECT;
    
    return CTX_ACT_OK;
}

Policy map structure:

struct policy_key {
    __u32 sec_label;      // Source identity
    __u16 dport;          // Destination port (or 0 for any)
    __u8  protocol;       // IP protocol (or 0 for any)
    __u8  egress:1,       // Direction
          pad:7;
};

struct policy_entry {
    __u16 proxy_port;     // If L7 proxy redirect
    __u8  deny:1,         // Deny entry (checked by policy_can_egress4)
          pad:7;
    __u8  auth_type;      // Required authentication type, if any
    __u64 packets;        // Metrics
    __u64 bytes;
};

Policy lookup order (most specific first):

  1. (src_id, dst_port, protocol)
  2. (src_id, dst_port, 0)
  3. (src_id, 0, protocol)
  4. (src_id, 0, 0)

Policy verdicts:

switch (verdict) {
case CTX_ACT_OK:
    // Policy allows, continue to forwarding
    break;
    
case DROP_POLICY:
case DROP_POLICY_DENY:
    // Optionally send ICMP unreachable
    if (CONFIG(policy_deny_response_enabled)) {
        ctx_store_meta(ctx, CB_VERDICT, verdict);
        return tail_call_internal(ctx, CILIUM_CALL_IPV4_POLICY_DENIED, ext_err);
    }
    return verdict;
    
case DROP_POLICY_AUTH_REQUIRED:
    // Mutual auth required, check auth cache
    auth_type = (__u8)*ext_err;
    verdict = auth_lookup(ctx, SECLABEL_IPV4, *dst_sec_identity,
                         tunnel_endpoint, auth_type);
    break;
    
case POLICY_ACT_PROXY_REDIRECT:
    // Will redirect to L7 proxy after CT creation
    proxy_port = policy->proxy_port;
    break;
}

Authentication check:

if (verdict == DROP_POLICY_AUTH_REQUIRED) {
    __u32 tunnel_endpoint = 0;
    
    if (info)
        tunnel_endpoint = info->tunnel_endpoint.ip4;
    
    // Check if mutual auth is cached
    verdict = auth_lookup(ctx, SECLABEL_IPV4, *dst_sec_identity,
                         tunnel_endpoint, auth_type);
}

Emit verdict notification:

/* Emit verdict if drop or if allow for CT_NEW */
if (verdict != CTX_ACT_OK || ct_status != CT_ESTABLISHED) {
    send_policy_verdict_notify(ctx, *dst_sec_identity, tuple->dport,
                              tuple->nexthdr, POLICY_EGRESS, 0,
                              verdict, proxy_port,
                              policy_match_type, audited,
                              auth_type, cookie);
}

Create CT entry for new connections:

    case CT_NEW:
        ct_state_new.src_sec_id = SECLABEL_IPV4;
        ct_state_new.proxy_redirect = proxy_port > 0;
        ct_state_new.from_l7lb = from_l7lb;
        
        ret = ct_create4(ct_map, ct_related_map, tuple, ctx,
                        CT_EGRESS, &ct_state_new, ext_err);
        if (IS_ERR(ret))
            return ret;
        break;
    
    case CT_REPLY:
    case CT_RELATED:
        /* Return traffic - no policy check needed */
        break;
    }
    
    /* Forward packet */
    return ipv4_forward_to_destination(ctx, ip4, tuple, *dst_sec_identity,
                                      ct_state, ct_status, info, skip_tunnel,
                                      hairpin_flow, from_l7lb, proxy_port,
                                      cluster_id, &trace, ext_err);
}

Key points:

  • Egress policy is enforced AFTER CT lookup but BEFORE forwarding
  • Reply traffic (CT_REPLY) skips policy - only forward direction is checked
  • Hairpin traffic skips policy - pod talking to itself via service
  • L7 LB traffic bypasses BPF policy - Envoy handles L7 enforcement
  • Policy lookup uses source identity (SECLABEL) and destination identity from endpoint lookup

Step 4: Routing Decision and L7 Proxy Redirect

After policy enforcement, ipv4_forward_to_destination() handles forwarding:

static __always_inline int
ipv4_forward_to_destination(ctx, ip4, tuple, dst_sec_identity,
                           ct_state, ct_status, info, skip_tunnel,
                           hairpin_flow, from_l7lb, proxy_port,
                           cluster_id, trace, ext_err)
{
    hairpin_flow |= ct_state->loopback;

L7 proxy redirect (if proxy_port set by policy):

    /* L7 LB does L7 policy enforcement, so we only redirect packets
     * NOT from L7 LB. */
    if (!from_l7lb && proxy_port > 0) {
        send_trace_notify(ctx, TRACE_TO_PROXY, SECLABEL_IPV4, UNKNOWN_ID,
                         bpf_ntohs(proxy_port), TRACE_IFINDEX_UNKNOWN,
                         trace->reason, trace->monitor, bpf_htons(ETH_P_IP));
        return ctx_redirect_to_proxy4(ctx, tuple, proxy_port, false);
    }

Proxy hairpin flow:

1. Container sends: src=10.0.1.5:12345, dst=10.0.2.6:80
2. Policy verdict: POLICY_ACT_PROXY_REDIRECT, proxy_port=15001
3. BPF redirect to: dst=127.0.0.1:15001 (Envoy listener)
4. Envoy processes HTTP, applies L7 policies
5. Envoy makes new connection: src=10.0.1.5:random, dst=10.0.2.6:80
6. BPF sees MARK_MAGIC_PROXY_EGRESS, bypasses policy
7. Packet forwarded normally

Routing decision:

#ifdef ENABLE_ROUTING
    ret = encap_and_redirect_lxc(ctx, tunnel_endpoint, encrypt_key,
                                 sec_label, &monitor);
#else
    ret = lxc_redirect_to_host(ctx, src_sec_identity, proto, &trace);
#endif

Routing modes:

  1. Tunnel mode (ENABLE_ROUTING undefined):
static __always_inline int
lxc_redirect_to_host(struct __ctx_buff *ctx, __u32 src_sec_identity,
                    __be16 proto, struct trace_ctx *trace)
{
    send_trace_notify(ctx, TRACE_TO_HOST, src_sec_identity, HOST_ID,
                     TRACE_EP_ID_UNKNOWN, CILIUM_NET_IFINDEX,
                     trace->reason, trace->monitor, proto);
    return ctx_redirect(ctx, CILIUM_NET_IFINDEX, BPF_F_INGRESS);
}
  • Redirect to cilium_host interface
  • bpf_host program handles tunneling
  • Used when tunnel: vxlan or tunnel: geneve
  2. Direct routing (ENABLE_ROUTING defined):
static __always_inline int
encap_and_redirect_lxc(ctx, tunnel_endpoint, encrypt_key, ...)
{
    if (tunnel_endpoint) {
        // Encapsulate in VXLAN/Geneve
        ret = __encap_with_nodeid(ctx, tunnel_endpoint, ...);
    }
    
    if (encrypt_key) {
        // IPsec encryption
        ret = set_ipsec_encrypt(ctx, encrypt_key, ...);
    }
    
    // FIB lookup for next hop
    struct bpf_fib_lookup fib_params = {};
    ret = fib_lookup(ctx, &fib_params, ...);
    
    // Redirect to output interface
    return ctx_redirect(ctx, fib_params.ifindex, 0);
}
  • Direct L3 routing without cilium_host
  • Faster path (fewer hops)
  • Requires routable pod IPs

Entry Point 2: cil_to_container() - Ingress Path

Traffic entering a container is the second hottest path, handling all packets destined to this container from the network.

Complete Ingress Function Flow

__section_entry
int cil_to_container(struct __ctx_buff *ctx)
{
    enum trace_point trace = TRACE_FROM_STACK;
    __u32 magic, identity = 0;
    __u32 sec_label = SECLABEL;
    __s8 ext_err = 0;
    __u16 proto;
    int ret;

Initial setup:

  • trace = TRACE_FROM_STACK: Assume from kernel network stack
  • identity = 0: Will be extracted from packet mark
  • sec_label = SECLABEL: This container's identity (destination)

Protocol Validation

    if (!validate_ethertype(ctx, &proto)) {
        ret = DROP_UNSUPPORTED_L2;
        goto out;
    }

Extract Ethernet protocol: Same as egress - IPv4, IPv6, or ARP.

Metadata Initialization

    bpf_clear_meta(ctx);
    check_and_store_ip_trace_id(ctx);

Clean slate for ingress:

  • Clear skb->cb[] to avoid stale data from previous processing
  • Extract trace ID for observability correlation

L7 Proxy Egress Handling

#if defined(ENABLE_L7_LB)
    if ((ctx->mark & MARK_MAGIC_HOST_MASK) == MARK_MAGIC_PROXY_EGRESS_EPID) {
        __u16 lxc_id = get_epid(ctx);
        
        ctx->mark = 0;
        ret = tail_call_egress_policy(ctx, lxc_id);
        return send_drop_notify(ctx, lxc_id, sec_label, LXC_ID,
                               ret, METRIC_INGRESS);
    }
#endif

Special case: Traffic from L7 proxy going to another endpoint.

  • MARK_MAGIC_PROXY_EGRESS_EPID: Proxy marked packet with target endpoint ID
  • Extract endpoint ID from mark
  • Jump to egress policy enforcement for that endpoint
  • Why? Proxy-originated traffic needs policy check in egress direction

Identity Extraction from Packet Mark

    magic = inherit_identity_from_host(ctx, &identity);
    if (magic == MARK_MAGIC_PROXY_INGRESS || 
        magic == MARK_MAGIC_PROXY_EGRESS)
        trace = TRACE_FROM_PROXY;

Critical function: inherit_identity_from_host()

static __always_inline __u32
inherit_identity_from_host(struct __ctx_buff *ctx, __u32 *identity)
{
    __u32 magic = ctx->mark & MARK_MAGIC_HOST_MASK;
    
    *identity = get_identity(ctx);  // Extract lower 16 bits
    ctx->mark = 0;  // Clear mark after reading
    
    return magic;
}

Packet mark encoding:

┌────────────────────────────────────────────┐
│           32-bit skb->mark field           │
├───────────┬────────────┬───────────────────┤
│ Magic (8) │ Unused (8) │   Identity (16)   │
└───────────┴────────────┴───────────────────┘

Magic values:
  0x0A00_0000 = MARK_MAGIC_PROXY_INGRESS
  0x0B00_0000 = MARK_MAGIC_PROXY_EGRESS  
  0x0F00_0000 = MARK_MAGIC_IDENTITY

Who sets this mark?

  • bpf_host: Encodes source identity before redirecting to container
  • bpf_overlay: After tunnel decapsulation
  • bpf_sock: For locally-generated host traffic
  • Envoy proxy: For L7-processed traffic

Send Trace Event

    send_trace_notify(ctx, trace, identity, sec_label, LXC_ID,
                     ctx->ingress_ifindex, TRACE_REASON_UNKNOWN,
                     TRACE_PAYLOAD_LEN, proto);

Observability hook:

  • trace: TRACE_FROM_STACK or TRACE_FROM_PROXY
  • identity: Source identity (from mark)
  • sec_label: Destination identity (this container)
  • Goes to Hubble/cilium monitor

Host Firewall Integration

#if defined(ENABLE_HOST_FIREWALL) && !defined(ENABLE_ROUTING)
    /* If the packet comes from the hostns and per-endpoint routes are enabled,
     * jump to bpf_host to enforce egress host policies before anything else.
     */
    if (identity == HOST_ID) {
        ctx_store_meta(ctx, CB_FROM_HOST, 1);
        ctx_store_meta(ctx, CB_DST_ENDPOINT_ID, LXC_ID);
        
        ret = tail_call_policy(ctx, CONFIG(host_ep_id));
        return send_drop_notify(ctx, identity, sec_label, LXC_ID,
                               DROP_HOST_NOT_READY, METRIC_INGRESS);
    }
#endif

Host firewall scenario:

Host process (identity=HOST_ID) → Container
    ↓
    Need to check: Can host egress to this container?
    ↓
    Tail call to bpf_host policy program
    ↓
    After policy check, return here to continue

Why this complexity?

  • Host firewall policies are in bpf_host program
  • Container policies are in bpf_lxc program
  • This coordination allows both to be enforced

Protocol Dispatch

    switch (proto) {
#if defined(ENABLE_ARP_PASSTHROUGH) || defined(ENABLE_ARP_RESPONDER)
    case bpf_htons(ETH_P_ARP):
        ret = CTX_ACT_OK;  // Let kernel handle ARP
        break;
#endif

ARP handling: Usually passthrough for ingress (container doesn't respond to ARP).

#ifdef ENABLE_IPV6
    case bpf_htons(ETH_P_IPV6):
        sec_label = SECLABEL_IPV6;
        ctx_store_meta(ctx, CB_SRC_LABEL, identity);
        ret = tail_call_internal(ctx, CILIUM_CALL_IPV6_CT_INGRESS, &ext_err);
        break;
#endif

IPv6 path: Store source identity in metadata and jump to CT.

#ifdef ENABLE_IPV4
    case bpf_htons(ETH_P_IP):
        sec_label = SECLABEL_IPV4;
        ctx_store_meta(ctx, CB_SRC_LABEL, identity);
        ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_CT_INGRESS, &ext_err);
        break;
#endif

IPv4 path: Same pattern - store identity and tail call.

Why store identity in CB_SRC_LABEL?

  • Tail calls can't pass parameters
  • CT and policy programs need source identity
  • skb->cb[] is the inter-tail-call communication channel

    default:
        ret = DROP_UNKNOWN_L3;
        break;
    }

out:
    if (IS_ERR(ret))
        return send_drop_notify_ext(ctx, identity, sec_label, LXC_ID, ret,
                                   ext_err, METRIC_INGRESS);
    return ret;
}

The IPv4 Ingress Pipeline

After tail call from cil_to_container(), execution continues through multiple stages:

Complete ingress pipeline flow:

┌─────────────────────────────────────────────────────────────────┐
│ Entry: cil_to_container()                                      │
│   ├─> Validate protocol                                        │
│   ├─> inherit_identity_from_host() → extract src identity      │
│   ├─> ctx_store_meta(CB_SRC_LABEL, identity)                   │
│   └─> tail_call(CILIUM_CALL_IPV4_CT_INGRESS)                   │
└─────────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Connection Tracking (Ingress)                          │
│ tail_ipv4_ct_ingress() [generated by TAIL_CT_LOOKUP4 macro]    │
│   ├─> ctx_load_meta(CB_SRC_LABEL) → retrieve src identity      │
│   ├─> ct_lookup4(CT_INGRESS) → CT_NEW/CT_REPLY/CT_ESTABLISHED  │
│   ├─> map_update_elem(cilium_tail_call_buffer4) → save results │
│   └─> tail_call(CILIUM_CALL_IPV4_TO_ENDPOINT)                  │
└─────────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: Policy Enforcement (Ingress)                           │
│ tail_ipv4_to_endpoint() → ipv4_policy()                        │
│   ├─> map_lookup_elem(cilium_tail_call_buffer4) → read CT      │
│   ├─> Lookup source identity from ipcache (if needed)          │
│   ├─> policy_can_ingress4(src_id, LXC_ID, port, proto)         │
│   ├─> if ALLOW: continue                                       │
│   ├─> if DENY: drop + notification                             │
│   └─> if PROXY: ctx_redirect_to_proxy_hairpin()                │
└─────────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: Local Delivery                                         │
│   ├─> update_metrics(METRIC_INGRESS)                           │
│   ├─> send_trace_notify(TRACE_TO_LXC)                          │
│   └─> CTX_ACT_OK → deliver to container namespace              │
└─────────────────────────────────────────────────────────────────┘

Key inter-step communication:

  • Entry → Step 1: CB_SRC_LABEL in skb->cb[]
  • Step 1 → Step 2: cilium_tail_call_buffer4 map (CT results)

Step 1: Connection Tracking (Ingress)

Generated by the same TAIL_CT_LOOKUP4 macro:

TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_INGRESS,  // ID
                tail_ipv4_ct_ingress,          // NAME
                CT_INGRESS,                    // DIR - different from egress!
                1,                             // CONDITION - always tail call
                CILIUM_CALL_IPV4_TO_ENDPOINT,  // TARGET_ID
                tail_ipv4_to_endpoint)         // TARGET_NAME

Key differences from egress CT:

  • DIR = CT_INGRESS instead of CT_EGRESS
  • Targets CILIUM_CALL_IPV4_TO_ENDPOINT instead of CILIUM_CALL_IPV4_FROM_LXC_CONT

The generated function performs:

int tail_ipv4_ct_ingress(struct __ctx_buff *ctx)
{
    struct ct_buffer4 ct_buffer = {};
    struct ipv4_ct_tuple *tuple;
    struct ct_state *ct_state;
    void *map;
    
    tuple = &ct_buffer.tuple;
    ct_state = &ct_buffer.ct_state;
    
    // Extract tuple from packet
    tuple->nexthdr = ip4->protocol;
    tuple->daddr = ip4->daddr;    // This container's IP
    tuple->saddr = ip4->saddr;    // Source IP
    
    // Select CT map
    map = select_ct_map4(ctx, CT_INGRESS, tuple);
    
    // CT lookup in INGRESS direction
    ct_buffer.ret = ct_lookup4(map, tuple, ctx, ip4, l4_off,
                               CT_INGRESS, scope, ct_state, &monitor);

Ingress CT lookup logic:

Packet: src=10.0.2.6:8080, dst=10.0.1.5:45678

CT lookup searches for:
  Direction: CT_INGRESS
  Tuple: (saddr=10.0.2.6, sport=8080, daddr=10.0.1.5, dport=45678, proto=TCP)

Results:
  - CT_REPLY: This is reply traffic for an existing egress connection
    └─> Original: 10.0.1.5:45678 → 10.0.2.6:8080 (egress)
    └─> Reply:    10.0.2.6:8080 → 10.0.1.5:45678 (ingress)
  
  - CT_NEW: New inbound connection (never seen before)
  
  - CT_ESTABLISHED: Existing ingress connection

Save results and tail call:

    // Save CT results for next stage
    map_update_elem(&cilium_tail_call_buffer4, &zero, &ct_buffer, 0);
    
    // Always tail call to endpoint delivery (CONDITION=1)
    ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_TO_ENDPOINT, &ext_err);
    
    return ret;
}

Step 2: Policy Enforcement (Ingress)

The CILIUM_CALL_IPV4_TO_ENDPOINT tail call leads to tail_ipv4_to_endpoint():

__declare_tail(CILIUM_CALL_IPV4_TO_ENDPOINT)
int tail_ipv4_to_endpoint(struct __ctx_buff *ctx)
{
    __u32 src_sec_identity = ctx_load_and_clear_meta(ctx, CB_SRC_LABEL);
    void *data, *data_end;
    struct iphdr *ip4;
    __u16 proxy_port = 0;
    __s8 ext_err = 0;
    int ret;

Retrieve source identity: Read from CB_SRC_LABEL (stored in cil_to_container).

Identity Refinement

    if (!revalidate_data(ctx, &data, &data_end, &ip4)) {
        ret = DROP_INVALID;
        goto out;
    }
    
    /* Packets from the proxy will already have a real identity. */
    if (identity_is_reserved(src_sec_identity)) {
        const struct remote_endpoint_info *info;
        
        info = lookup_ip4_remote_endpoint(ip4->saddr, 0);
        if (info != NULL) {
            __u32 sec_identity = info->sec_identity;
            
            /* When SNAT is enabled on traffic ingressing into Cilium,
             * all traffic from the world will have a source IP of the host.
             * It will only actually be from the host if "src_sec_identity"
             * reports the src as the host. So we can ignore the ipcache
             * if it reports the source as HOST_ID.
             */
            if (sec_identity != HOST_ID)
                src_sec_identity = sec_identity;
        }
        cilium_dbg(ctx, info ? DBG_IP_ID_MAP_SUCCEED4 : DBG_IP_ID_MAP_FAILED4,
                  ip4->saddr, src_sec_identity);
    }

Identity resolution priority:

  1. If identity from mark is reserved (WORLD_ID, UNMANAGED_ID, etc.):

    • Lookup IP in ipcache (remote endpoint map)
    • Use more specific identity if found
    • Exception: Ignore if ipcache says HOST_ID but mark doesn't
  2. Why needed?

    • SNAT can hide original source
    • ipcache maps external IPs to identities
    • Enables policy enforcement for internet traffic

Metrics and Policy Check

    cilium_dbg(ctx, DBG_LOCAL_DELIVERY, LXC_ID, SECLABEL_IPV4);
    
#ifdef LOCAL_DELIVERY_METRICS
    update_metrics(ctx_full_len(ctx), METRIC_INGRESS, REASON_FORWARDED);
#endif
    
    ret = ipv4_policy(ctx, ip4, src_sec_identity, NULL, &ext_err,
                     &proxy_port, false);

The ipv4_policy() function (defined in bpf_lxc.c):

static __always_inline int
ipv4_policy(struct __ctx_buff *ctx, struct iphdr *ip4,
           __u32 src_sec_identity, struct ipv4_ct_tuple *tuple_out,
           __s8 *ext_err, __u16 *proxy_port, bool from_host)
{
    struct ipv4_ct_tuple tuple = {};
    __u8 policy_match_type = POLICY_MATCH_NONE;
    __u8 audited = 0;
    int ret, verdict;
    
    // Extract L4 tuple
    tuple.nexthdr = ip4->protocol;
    tuple.daddr = ip4->daddr;
    tuple.saddr = ip4->saddr;
    // ... extract ports
    
    // Ingress policy check
    verdict = policy_can_ingress4(ctx, &tuple, l4_off,
                                  src_sec_identity, SECLABEL_IPV4,
                                  &policy_match_type, &audited,
                                  ext_err, proxy_port, &cookie);

Ingress policy lookup:

struct policy_key key = {
    .sec_label = src_sec_identity,  // Source (external)
    .egress = 0,                    // Direction: INGRESS
    .protocol = tuple.nexthdr,
    .dport = tuple.dport            // Our port
};

Policy map lookup order (most specific first):

  1. (src_id, our_port, protocol) - "Allow src_id to port 80/TCP"
  2. (src_id, our_port, 0) - "Allow src_id to port 80/any proto"
  3. (src_id, 0, protocol) - "Allow src_id to any port/TCP"
  4. (src_id, 0, 0) - "Allow src_id to any port/any proto"

Policy Verdict Handling

    switch (verdict) {
    case POLICY_ACT_PROXY_REDIRECT:
        if (!revalidate_data(ctx, &data, &data_end, &ip4)) {
            ret = DROP_INVALID;
            goto out;
        }
        
        ret = ctx_redirect_to_proxy_hairpin_ipv4(ctx, ip4, proxy_port);
        ctx->mark = ctx_load_meta(ctx, CB_PROXY_MAGIC);
        break;
        
    case CTX_ACT_OK:
        break;  // Allowed - continue to delivery
        
    default:
        break;  // Denied - will drop
    }

L7 proxy redirect on ingress:

External → Container:80
    ↓
Policy: Requires L7 inspection
    ↓
Redirect to Envoy:15001
    ↓
Envoy checks HTTP path/headers
    ↓
If allowed: Envoy → Container:80

Step 3: Local Delivery

After policy allows the packet:

out:
    if (IS_ERR(ret))
        return send_drop_notify_ext(ctx, src_sec_identity, SECLABEL_IPV4,
                                   LXC_ID, ret, ext_err, METRIC_INGRESS);
    
    return ret;  // CTX_ACT_OK
}

Return CTX_ACT_OK:

  • TC subsystem delivers packet to container network namespace
  • Packet appears on container's eth0 interface
  • Application receives via socket

Final trace event:

send_trace_notify4(ctx, TRACE_TO_LXC, src_label, SECLABEL_IPV4, orig_sip,
                  LXC_ID, ifindex, trace.reason, trace.monitor);

Ingress vs Egress Comparison

Aspect                 | Egress (from container)      | Ingress (to container)
-----------------------|------------------------------|-----------------------------
Entry point            | cil_from_container()         | cil_to_container()
Identity source        | Compiled-in SECLABEL         | Extracted from skb->mark
Identity destination   | Lookup in endpoint map       | Compiled-in SECLABEL (self)
Service translation    | Yes (DNAT to backend)        | No (already DNATed)
CT direction           | CT_EGRESS                    | CT_INGRESS
CT lookup semantics    | Match outbound connections   | Match inbound + replies
Policy direction       | egress = 1                   | egress = 0 (ingress)
Policy key             | (SECLABEL, dst_port, proto)  | (src_id, our_port, proto)
Reverse NAT            | On reply (CT_REPLY)          | On reply (CT_REPLY)
Typical latency        | ~700 ns                      | ~400 ns (no LB needed)

Key insight: Ingress is simpler because:

  • Service translation already happened (in egress or bpf_host)
  • No backend selection needed
  • Just policy check and delivery

Entry Point 3: cil_lxc_policy() - Policy-Only Entry

This entry point handles packets that already went through initial processing:

__section_entry
int cil_lxc_policy(struct __ctx_buff *ctx)
{
    __u32 src_label = ctx_load_meta(ctx, CB_SRC_LABEL);
    __u32 sec_label = SECLABEL;

Used by:

  • bpf_host: Packets from other nodes (via tunnel)
  • bpf_overlay: Decapsulated tunnel traffic
  • bpf_lxc: Traffic from other local containers

Why separate entry?:

  • Avoids duplicate service translation
  • CT already done by sender
  • Only needs policy check

Flow:

switch (proto) {
case bpf_htons(ETH_P_IP):
    ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_CT_INGRESS_POLICY_ONLY,
                            &ext_err);
    break;
}

The POLICY_ONLY variant:

  • Skips service lookup
  • Runs CT in forward direction only
  • Applies ingress policy
  • Delivers to endpoint

Special Features

1. ARP Responder

#ifdef ENABLE_ARP_RESPONDER
__declare_tail(CILIUM_CALL_ARP)
int tail_handle_arp(struct __ctx_buff *ctx)
{
    union macaddr mac = CONFIG(interface_mac);
    union macaddr smac;
    __be32 sip;
    __be32 tip;
    int ret;
    
    if (!arp_validate(ctx, &mac, &smac, &sip, &tip))
        return CTX_ACT_OK;  // Not an ARP request
    
    // Respond for any IP except container's own
    if (tip == CONFIG(endpoint_ipv4).be32)
        return CTX_ACT_OK;
    
    ret = arp_respond(ctx, &mac, tip, &smac, sip, 0);
    return ret;
}
#endif

Why respond to all IPs?:

  • Container might have stale gateway config
  • After Cilium restart, gateway IP might change
  • Responding to all IPs (except container's own) ensures connectivity
  • Prevents IP duplicate detection false positives

ARP response construction:

int arp_respond(ctx, src_mac, src_ip, dst_mac, dst_ip, vlan_id)
{
    // Build ARP reply
    arp->ar_op = bpf_htons(ARPOP_REPLY);
    arp->ar_sha = *src_mac;
    arp->ar_sip = src_ip;
    arp->ar_tha = *dst_mac;
    arp->ar_tip = dst_ip;
    
    // Swap ethernet addresses
    eth->h_dest = *dst_mac;
    eth->h_source = *src_mac;
    
    return ctx_redirect(ctx, ctx->ingress_ifindex, 0);
}

2. Loopback Service Handling

if (tuple.saddr == backend->address) {
    /* Special loopback case: The origin endpoint has transmitted to a
     * service which is being translated back to the source. This would
     * result in a packet with identical source and destination address.
     * Linux considers such packets as martian source and will drop unless
     * received on a loopback device. Perform NAT on the source address
     * to make it appear from an outside address.
     */
    ct_state_new.loopback = 1;
}

Scenario:

Pod A (10.0.1.5) → Service (10.96.0.1) → Backend: Pod A (10.0.1.5)

Problem:

  • After DNAT: src=10.0.1.5, dst=10.0.1.5
  • Kernel drops as martian source

Solution:

  • Set loopback flag in CT entry
  • CT performs SNAT to magic IP (e.g., 169.254.169.254)
  • Packet: src=169.254.169.254, dst=10.0.1.5
  • Reply gets reverse NAT back to service IP

3. L7 Proxy Hairpin

case POLICY_ACT_PROXY_REDIRECT:
    if (!revalidate_data(ctx, &data, &data_end, &ip4))
        return DROP_INVALID;
    
    ret = ctx_redirect_to_proxy_hairpin_ipv4(ctx, ip4, proxy_port);
    ctx->mark = ctx_load_meta(ctx, CB_PROXY_MAGIC);
    break;

Proxy hairpin flow:

1. Container sends: src=10.0.1.5:12345, dst=10.0.2.6:80
2. BPF redirect to: dst=127.0.0.1:15001 (Envoy)
3. Envoy processes HTTP, makes new connection
4. Envoy sends: src=10.0.1.5:random, dst=10.0.2.6:80
5. BPF sees MARK_MAGIC_PROXY_EGRESS, allows through

Mark magic values:

#define MARK_MAGIC_HOST_MASK       0xFF000000
#define MARK_MAGIC_PROXY_INGRESS   0x0A000000
#define MARK_MAGIC_PROXY_EGRESS    0x0B000000
#define MARK_MAGIC_IDENTITY        0x0F000000

4. Encryption Integration

#ifdef ENABLE_IPSEC
    encrypt_key = get_encryption_key(tunnel_endpoint, ...)
    if (encrypt_key) {
        set_encrypt_key_mark(ctx, encrypt_key);
        set_identity_mark(ctx, src_sec_identity);
    }
#endif

IPsec datapath:

  • BPF marks packets for encryption
  • XFRM (kernel crypto) handles actual encryption
  • Encrypted packets go to physical NIC
  • Remote node decrypts and delivers

WireGuard datapath:

#ifdef ENABLE_WIREGUARD
    if (is_wireguard_enabled(...)) {
        ret = wg_maybe_redirect_to_encrypt(ctx);
        if (ret == CTX_ACT_REDIRECT)
            return ret;  // Redirected to cilium_wg0
    }
#endif

Complete Packet Flows

Flow 1: Container → Internet (Egress)

1. Application sends packet
   └─> sk_buff created in container netns

2. TC egress on container veth
   └─> cil_from_container(ctx)
       ├─> validate_ethertype() → ETH_P_IP
       ├─> edt_set_aggregate() → QoS
       └─> tail_call(CILIUM_CALL_IPV4_FROM_LXC)

3. Service translation (if applicable)
   └─> __per_packet_lb_svc_xlate_4()
       ├─> Extract tuple (src/dst IP, src/dst port)
       ├─> Service lookup
       ├─> Backend selection (random)
       ├─> DNAT (rewrite dst IP/port)
       └─> lb4_ctx_store_state() → save to skb->cb[]

4. Connection tracking
   └─> tail_ipv4_ct_egress()
       ├─> lb4_ctx_restore_state() → read from skb->cb[]
       ├─> ct_lazy_lookup4(CT_EGRESS)
       ├─> if CT_NEW: ct_create4()
       └─> tail_call(CILIUM_CALL_IPV4_TO_LXC_POLICY_ONLY)

5. Policy enforcement
   └─> tail_ipv4_policy()
       ├─> Lookup destination endpoint
       ├─> __policy_can_access(src_id, dst_id, port, proto)
       ├─> if DENY: send_icmp4_policy_denied() + DROP
       ├─> if PROXY: ctx_redirect_to_proxy_hairpin()
       └─> if ALLOW: continue

6. Routing decision
   └─> lxc_redirect_to_host()
       ├─> send_trace_notify(TRACE_TO_HOST)
       └─> ctx_redirect(CILIUM_NET_IFINDEX, BPF_F_INGRESS)

7. Host-side processing
   └─> bpf_host TC ingress
       ├─> Masquerade (SNAT to node IP)
       ├─> Encryption (if enabled)
       ├─> FIB lookup
       └─> Forward to physical NIC

8. Packet exits node

Flow 2: Internet → Container (Ingress)

1. Packet arrives at NIC
   └─> XDP prefilter (optional)
   └─> TC ingress on physical device

2. bpf_host processes
   ├─> Decryption (if encrypted)
   ├─> De-encapsulation (if tunneled)
   ├─> Endpoint lookup by dst IP
   ├─> Extract source identity from tunnel
   └─> ctx_redirect(lxc_veth, BPF_F_INGRESS)

3. TC ingress on container veth
   └─> cil_to_container(ctx)
       ├─> inherit_identity_from_host() → read skb->mark
       ├─> send_trace_notify(TRACE_FROM_STACK)
       ├─> ctx_store_meta(CB_SRC_LABEL, identity)
       └─> tail_call(CILIUM_CALL_IPV4_CT_INGRESS)

4. Connection tracking
   └─> tail_ipv4_ct_ingress()
       ├─> ct_lazy_lookup4(CT_INGRESS)
       ├─> if CT_REPLY: reverse NAT
       └─> tail_call(CILIUM_CALL_IPV4_TO_ENDPOINT)

5. Policy enforcement
   └─> tail_ipv4_policy()
       ├─> src_identity from CB_SRC_LABEL
       ├─> dst_identity = LXC_ID
       ├─> __policy_can_access(src_id, LXC_ID, port, proto)
       └─> if ALLOW: continue

6. Local delivery
   └─> tail_ipv4_to_endpoint()
       ├─> update_metrics(METRIC_INGRESS)
       ├─> ipv4_policy() → final check
       └─> TC_ACT_OK → deliver to container

7. Packet enters container namespace
   └─> Application receives

Flow 3: Pod → Pod (Same Node)

1. Source container egress
   └─> cil_from_container(pod-A)
       └─> ... (same as Flow 1, steps 2-5)

2. Routing decision
   └─> Destination endpoint lookup
       ├─> ep = __lookup_ip4_endpoint(dst_ip)
       ├─> if (ep && ep->flags & ENDPOINT_F_HOST) → local
       └─> ctx_redirect(ep->ifindex, BPF_F_INGRESS)

3. Direct redirect to destination veth
   └─> TC ingress on destination veth
       └─> cil_to_container(pod-B)
           └─> ... (same as Flow 2, steps 3-6)

4. No host networking stack traversal!
   └─> Zero-copy forwarding
   └─> ~1 microsecond latency

Performance Characteristics

Latency Breakdown

Path    | Operation           | Time
--------|---------------------|------------
Egress  | Validate + metadata | 50-100 ns
        | Service translation | 200-300 ns
        | CT lookup/create    | 150-200 ns
        | Policy lookup       | 100-150 ns
        | Routing decision    | 50-100 ns
        | Total egress        | ~700 ns
Ingress | Identity extraction | 30-50 ns
        | CT lookup           | 100-150 ns
        | Policy lookup       | 100-150 ns
        | Local delivery      | 50-100 ns
        | Total ingress       | ~400 ns
Pod→Pod | Egress + Ingress    | ~1.1 μs

Map Access Patterns

Hot path maps (accessed every packet):

  • Connection tracking: LRU hash (100-200 ns)
  • Policy: Hash (100-150 ns)
  • Endpoints: Hash (50-100 ns)

Cold path maps (accessed occasionally):

  • Services: Hash (200-300 ns, only for new connections)
  • Identity: Array (20-30 ns, cached)

Tail Call Overhead

  • Tail call: ~50 ns overhead
  • Why acceptable? Avoids stack depth, enables modular code
  • Typical packet: 3-5 tail calls
  • Total overhead: ~200 ns

Debugging and Observability

Trace Events

Every packet can generate events:

send_trace_notify(ctx, TRACE_FROM_LXC, sec_label, UNKNOWN_ID,
                 TRACE_EP_ID_UNKNOWN, TRACE_IFINDEX_UNKNOWN,
                 TRACE_REASON_UNKNOWN, TRACE_PAYLOAD_LEN, proto);

Event types:

  • TRACE_FROM_LXC: Leaving container
  • TRACE_TO_LXC: Entering container
  • TRACE_TO_PROXY: Redirected to Envoy
  • TRACE_FROM_PROXY: Coming from Envoy
  • TRACE_TO_HOST: Sent to host stack

Drop Notifications

send_drop_notify_ext(ctx, src_label, dst_label, dst_id,
                    reason, ext_err, direction);

Drop reasons:

  • DROP_POLICY_DENIED: Network policy
  • DROP_INVALID_SRC_MAC: L2 validation
  • DROP_NO_SERVICE: Service has no backends
  • DROP_CT_NO_MAP_FOUND: CT map missing

Metrics

update_metrics(ctx_full_len(ctx), METRIC_EGRESS, REASON_FORWARDED);

Metrics tracked:

  • Packets/bytes per direction
  • Drops by reason
  • Policy hits
  • Service translations

Access via:

cilium bpf metrics list
cilium monitor --type drop
hubble observe --pod my-pod

Integration Points

With bpf_host

// bpf_lxc redirects to host
lxc_redirect_to_host(ctx, ...)
    └─> ctx_redirect(CILIUM_NET_IFINDEX, BPF_F_INGRESS)

// bpf_host redirects to lxc
ctx_redirect(lxc_ifindex, BPF_F_INGRESS)
    └─> cil_to_container() entry point

With bpf_sock

// Socket LB marks packets
skb->mark = MARK_MAGIC_IDENTITY | identity

// bpf_lxc reads mark
identity = inherit_identity_from_host(ctx, ...)

With Envoy Proxy

// BPF redirects to proxy
ctx_redirect_to_proxy_hairpin(ctx, proxy_port)

// Envoy processes, sends back with mark
skb->mark = MARK_MAGIC_PROXY_EGRESS

// BPF recognizes and allows
if (magic == MARK_MAGIC_PROXY_EGRESS)
    skip_policy = true;

Conclusion

The bpf_lxc.c program is the workhorse of Cilium's datapath, implementing:

Core Functions:

  • Per-packet service load balancing
  • Stateful connection tracking
  • L3/L4 network policy enforcement
  • L7 proxy integration
  • Local pod-to-pod optimization

Advanced Features:

  • Multi-cluster routing
  • IPsec/WireGuard encryption
  • Loopback service handling
  • ARP responder
  • QoS via EDT

Performance:

  • Sub-microsecond egress latency
  • Zero-copy pod-to-pod forwarding
  • Efficient map-based lookups

The program demonstrates sophisticated eBPF programming:

  • Tail call chains for complex logic
  • Metadata passing via skb->cb[]
  • Conditional compilation for flexibility
  • Tight integration with kernel networking

Combined with bpf_host.c (host-side) and bpf_xdp.c (prefilter), bpf_lxc.c forms a complete, high-performance datapath that powers Kubernetes networking, service mesh, and network security.


Author's Note: This deep dive is based on the source code up to the date of writing. Implementation details may vary by version and configuration.