Understanding Cilium's TC LXC Datapath: A Deep Dive into bpf_lxc.c
Introduction
While XDP provides the fastest packet processing at the driver level, the TC (Traffic Control) hook is where Cilium implements the majority of its networking intelligence. The bpf_lxc.c program attaches to the container side of veth pairs, intercepting all traffic entering and leaving containers. This is where policy enforcement, service mesh integration, encryption, and local delivery decisions happen.
This article provides a comprehensive analysis of Cilium's TC LXC (Linux Container) datapath, explaining how it achieves secure, policy-driven networking for Kubernetes pods.
Architecture Context
Veth Pair Topology
Every Kubernetes pod gets a virtual Ethernet pair:
┌──────────────────────────────────────────┐
│ Container Namespace │
│ │
│ ┌──────────────┐ │
│ │ eth0 │ Container interface │
│ │ (bpf_lxc) │ TC ingress/egress │
│ └──────┬───────┘ │
└──────────┼──────────────────────────────┘
│ veth pair
┌──────────┼──────────────────────────────┐
│ │ Host Namespace │
│ ┌──────┴───────┐ │
│ │ lxc_health │ Host-side interface │
│ │ (bpf_host) │ TC ingress/egress │
│ └──────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ cilium_host │ Cilium internal │
│ │ (bpf_host) │ interface │
│ └──────┬───────┘ │
└──────────┼──────────────────────────────┘
│
┌──────┴───────┐
│ Physical NIC │
└──────────────┘
BPF Program Attachment Points
// bpf_lxc.c attaches to:
// 1. TC ingress on container veth (from container)
// 2. TC egress on container veth (to container)
/* Entry point for egress (from container) */
__section_entry
int cil_from_container(struct __ctx_buff *ctx)
/* Entry point for ingress (to container) */
__section_entry
int cil_to_container(struct __ctx_buff *ctx)
/* Policy enforcement entry point */
__section_entry
int cil_lxc_policy(struct __ctx_buff *ctx)
/* L7 proxy egress entry point */
__section_entry
int cil_lxc_policy_egress(struct __ctx_buff *ctx)
Program Initialization and Headers
Lines 1-60: Configuration and Feature Flags
// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
/* Copyright Authors of Cilium */
#include <bpf/ctx/skb.h>
#include <bpf/api.h>
Context type: <bpf/ctx/skb.h> defines:
- __ctx_buff as __sk_buff (not xdp_md)
- TC-specific helpers: skb_load_bytes(), skb_store_bytes()
- Access to skb->mark and the skb->cb[] control buffer
#include <bpf/config/node.h>
#include <bpf/config/global.h>
#include <bpf/config/endpoint.h>
#include <bpf/config/lxc.h>
Configuration hierarchy:
- node.h: Cluster-wide config (CLUSTER_ID, NODE_ID, encryption settings)
- global.h: Global features (IPv4/IPv6 enabled, tunnel mode)
- endpoint.h: Generic endpoint definitions
- lxc.h: Container-specific config - generated per pod!
Key values in lxc.h:
#define LXC_ID 1234 // Unique endpoint ID
#define SECLABEL 5678 // Security identity
#define SECLABEL_IPV4 5678
#define SECLABEL_IPV6 5679
#define LXC_IPV4 "10.0.1.5" // Pod IP
#define LXC_IPV6 "fd00::5"
#define IS_BPF_LXC 1
#define EFFECTIVE_EP_ID LXC_ID
#define EVENT_SOURCE LXC_ID
Program type marker: Tells library code this is a container-side TC program. Affects:
- Policy map selection
- Connection tracking scope
- Metrics labeling
#define USE_LOOPBACK_LB 1
Loopback load balancing: Enables special handling when a container talks to a service that backends to itself. Prevents martian source drops.
#undef LB_SELECTION
#define LB_SELECTION LB_SELECTION_RANDOM
Override load balancer algorithm:
- XDP uses Maglev for external traffic (better distribution)
- bpf_lxc forces RANDOM for in-cluster traffic
- Why? Maglev requires precomputed lookup tables. For ClusterIP services (thousands of them), maintaining Maglev tables for each service would consume excessive memory. Random selection is simpler and sufficient for east-west traffic.
Lines 20-58: Library Includes
#include "lib/auth.h" // Mutual auth between endpoints
#include "lib/tailcall.h" // Tail call infrastructure
#include "lib/policy.h" // L3/L4 policy enforcement
#include "lib/lb.h" // Service load balancing
#include "lib/nat.h" // NAT for services
#include "lib/encap.h" // VXLAN/Geneve encapsulation
#include "lib/local_delivery.h" // Routing to local endpoints
Each library provides critical functionality. The order matters for macro definitions and dependencies.
Entry Point 1: cil_from_container() - Egress Path
This is the hottest path - every packet leaving a container goes through here.
Complete Function Flow
__section_entry
int cil_from_container(struct __ctx_buff *ctx)
{
__u16 proto = 0;
__u32 sec_label = SECLABEL;
__s8 ext_err = 0;
int ret;
bool valid_ethertype = validate_ethertype(ctx, &proto);
Initial setup:
- sec_label = SECLABEL: This container's security identity (compiled in)
- ext_err: Extended error code for detailed drop reasons
- validate_ethertype(): Extract the protocol (IPv4/IPv6/ARP) from the Ethernet header
bpf_clear_meta(ctx);
check_and_store_ip_trace_id(ctx);
Metadata management:
- bpf_clear_meta(): Zero out the skb->cb[] control buffer. Why? The skb might be recycled from another context; this ensures clean state for this processing pipeline.
- check_and_store_ip_trace_id(): Extract the trace ID from IP options (same as XDP)
/* Workaround for GH-18311 where veth driver might have recorded
* veth's RX queue mapping instead of leaving it at 0. This can
* cause issues on the phys device where all traffic would only
* hit a single TX queue (given veth device had a single one and
* mapping was left at 1). Reset so that stack picks a fresh queue.
* Kernel fix is at 710ad98c363a ("veth: Do not record rx queue
* hint in veth_xmit").
*/
ctx->queue_mapping = 0;
Veth driver bug workaround:
- Problem: Veth driver records RX queue mapping, but veth typically has only one queue
- Impact: Physical NIC would use that queue hint, causing all traffic to hit one TX queue
- Fix: Reset to 0, let kernel stack pick queue based on hash
- Kernel fix: Mainline kernel 5.16+ doesn't have this issue
send_trace_notify(ctx, TRACE_FROM_LXC, sec_label, UNKNOWN_ID,
TRACE_EP_ID_UNKNOWN, TRACE_IFINDEX_UNKNOWN,
TRACE_REASON_UNKNOWN, TRACE_PAYLOAD_LEN, proto);
Send monitoring event:
- TRACE_FROM_LXC: Event type
- sec_label: Source identity (this container)
- UNKNOWN_ID: Destination not yet known
- Goes to Hubble for observability
if (!valid_ethertype) {
ret = DROP_UNSUPPORTED_L2;
goto out;
}
Protocol validation: Drop non-IP/ARP traffic with proper error code.
switch (proto) {
#ifdef ENABLE_IPV6
case bpf_htons(ETH_P_IPV6):
edt_set_aggregate(ctx, LXC_ID);
ret = tail_call_internal(ctx, CILIUM_CALL_IPV6_FROM_LXC, &ext_err);
sec_label = SECLABEL_IPV6;
break;
#endif
Protocol dispatch with EDT:
- edt_set_aggregate(ctx, LXC_ID): Set the Earliest Departure Time aggregate
- Purpose: Traffic shaping and QoS - groups packets by endpoint for fair queuing, preventing one container from monopolizing bandwidth
- tail_call_internal(): Jump to the IPv6 handler. Why a tail call? The BPF complexity limit - IPv6 processing is too large for one program.
#ifdef ENABLE_IPV4
case bpf_htons(ETH_P_IP):
edt_set_aggregate(ctx, LXC_ID);
ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_LXC, &ext_err);
sec_label = SECLABEL_IPV4;
break;
IPv4 path: Same pattern as IPv6.
#ifdef ENABLE_ARP_PASSTHROUGH
case bpf_htons(ETH_P_ARP):
ret = CTX_ACT_OK;
break;
#elif defined(ENABLE_ARP_RESPONDER)
case bpf_htons(ETH_P_ARP):
ret = tail_call_internal(ctx, CILIUM_CALL_ARP, &ext_err);
break;
#endif
ARP handling modes:
- ENABLE_ARP_PASSTHROUGH: Let the kernel handle ARP (simple)
- ENABLE_ARP_RESPONDER: BPF responds to ARP (faster, fewer context switches)
out:
if (IS_ERR(ret))
return send_drop_notify_ext(ctx, sec_label, UNKNOWN_ID, LXC_ID,
ret, ext_err, METRIC_EGRESS);
return ret;
}
Error handling: Send detailed drop notification to monitoring before returning error verdict.
The IPv4 Egress Pipeline
After tail call from cil_from_container(), execution continues in the IPv4 handler. Let's trace the complete path:
Complete egress pipeline flow:
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Per-Packet Load Balancing │
│ __per_packet_lb_svc_xlate_4() │
│ ├─> Service lookup │
│ ├─> Backend selection (RANDOM) │
│ ├─> DNAT (rewrite dst IP/port) │
│ └─> lb4_ctx_store_state() → save to skb->cb[] │
│ └─> tail_call(CILIUM_CALL_IPV4_CT_EGRESS) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: Connection Tracking (Egress) │
│ tail_ipv4_ct_egress() [generated by TAIL_CT_LOOKUP4 macro] │
│ ├─> lb4_ctx_restore_state() → read from skb->cb[] │
│ ├─> ct_lookup4(CT_EGRESS) → CT_NEW/CT_ESTABLISHED/CT_REPLY │
│ ├─> map_update_elem(cilium_tail_call_buffer4) → save results │
│ └─> tail_call(CILIUM_CALL_IPV4_FROM_LXC_CONT) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: Policy Enforcement + Forwarding │
│ tail_handle_ipv4_cont() → handle_ipv4_from_lxc() │
│ ├─> map_lookup_elem(cilium_tail_call_buffer4) → read results │
│ ├─> switch (ct_status): │
│ │ ├─> CT_NEW/CT_ESTABLISHED: │
│ │ │ ├─> policy_can_egress4() → allow/deny/proxy │
│ │ │ ├─> ct_create4() if CT_NEW │
│ │ │ └─> ipv4_forward_to_destination() │
│ │ └─> CT_REPLY/CT_RELATED: │
│ │ └─> skip policy, forward directly │
│ └─> Return verdict │
└─────────────────────────────────────────────────────────────────┘
Key inter-step communication:
- Step 1 → Step 2: skb->cb[] metadata (via lb4_ctx_store_state/restore)
- Step 2 → Step 3: the cilium_tail_call_buffer4 map (shared per-CPU buffer)
Step 1: Per-Packet Load Balancing
#ifdef ENABLE_PER_PACKET_LB
static __always_inline int
__per_packet_lb_svc_xlate_4(void *ctx, struct iphdr *ip4, __s8 *ext_err)
{
struct ipv4_ct_tuple tuple = {};
struct ct_state ct_state_new = {};
const struct lb4_service *svc;
struct lb4_key key = {};
__u16 proxy_port = 0;
__u32 cluster_id = 0;
Why per-packet LB?
- Socket-layer LB (bpf_sock.c) handles most service translation
- But some cases need packet-level handling:
  - L7 services (need redirect to the Envoy proxy)
  - SCTP protocol (not supported in socket BPF)
  - First packet before the socket is established
Extract L4 tuple:
tuple.nexthdr = ip4->protocol;
tuple.daddr = ip4->daddr;
tuple.saddr = ip4->saddr;
l4_off = ETH_HLEN + ipv4_hdrlen(ip4);
// Parse TCP/UDP headers for port numbers
ret = lb4_extract_tuple(ctx, ip4, fraginfo, l4_off, &tuple);
Service lookup:
lb4_fill_key(&key, &tuple);
svc = lb4_lookup_service(&key, is_defined(ENABLE_NODEPORT));
Check for L7 load balancer:
#if defined(ENABLE_L7_LB)
if (lb4_svc_is_l7_loadbalancer(svc)) {
proxy_port = (__u16)svc->l7_lb_proxy_port;
goto skip_service_lookup;
}
L7 services:
- Marked with SVC_FLAG_L7_LOADBALANCER in the service map
- Don't DNAT here - need to redirect to the Envoy proxy
- Store proxy_port for the later redirect
- Envoy will handle HTTP/gRPC load balancing
Backend selection and DNAT:
ret = lb4_local(get_ct_map4(&tuple), ctx, fraginfo,
l4_off, &key, &tuple, svc, &ct_state_new,
&backend, ext_err);
if (tuple.saddr == backend->address) {
/* Loopback: container talking to service backed by itself */
ct_state_new.loopback = 1;
}
ret = lb4_dnat_request(ctx, backend, ETH_HLEN, fraginfo,
l4_off, &key, &tuple, ct_state_new.loopback);
DNAT operation:
- Rewrites ip4->daddr to the backend IP
- Rewrites the TCP/UDP destination port
- Updates checksums (IP + L4)
- Loopback handling: If source == destination after DNAT, set the loopback flag
State preservation:
skip_service_lookup:
lb4_ctx_store_state(ctx, &ct_state_new, proxy_port, cluster_id);
return tail_call_internal(ctx, CILIUM_CALL_IPV4_CT_EGRESS, ext_err);
}
Store state in skb->cb[]:
- ct_state_new.rev_nat_index: For reverse NAT on the reply
- proxy_port: If an L7 proxy redirect is needed
- cluster_id: For multi-cluster routing
- Why store? The next tail call needs this info, but tail calls can't pass parameters
Step 2: Connection Tracking (Egress)
The CILIUM_CALL_IPV4_CT_EGRESS tail call leads to connection tracking. This uses a C macro to generate the actual function code.
Understanding the Macro Pattern
In bpf_lxc.c (line 451), there's a macro definition:
#define TAIL_CT_LOOKUP4(ID, NAME, DIR, CONDITION, TARGET_ID, TARGET_NAME)
This is invoked as:
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_EGRESS, // ID - tail call number
tail_ipv4_ct_egress, // NAME - function name
CT_EGRESS, // DIR - direction
1, // CONDITION - always true
CILIUM_CALL_IPV4_TO_LXC_POLICY_ONLY, // TARGET_ID
tail_ipv4_policy) // TARGET_NAME
Macro expansion flow:
Source Code (bpf_lxc.c)
↓
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_EGRESS, tail_ipv4_ct_egress, ...)
↓
C Preprocessor (clang -E)
↓
Expands macro → replaces parameters
↓
Expanded Code (what compiler actually sees)
↓
__declare_tail(CILIUM_CALL_IPV4_CT_EGRESS)
static __always_inline
int tail_ipv4_ct_egress(struct __ctx_buff *ctx)
{
// Full function body with DIR=CT_EGRESS everywhere
...
}
↓
Compiler (clang)
↓
BPF Bytecode
What the C Preprocessor Does
The C preprocessor expands this macro before compilation, replacing the parameters:
// Before macro expansion (what you write):
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_EGRESS, tail_ipv4_ct_egress, ...)
// After macro expansion (what the compiler sees):
__declare_tail(CILIUM_CALL_IPV4_CT_EGRESS)
static __always_inline
int tail_ipv4_ct_egress(struct __ctx_buff *ctx)
{
enum ct_scope scope = SCOPE_BIDIR;
struct ct_buffer4 ct_buffer = {};
struct ipv4_ct_tuple *tuple;
struct ct_state *ct_state;
void *data, *data_end;
struct iphdr *ip4;
__s8 ext_err = 0;
__u32 zero = 0;
void *map;
ct_state = (struct ct_state *)&ct_buffer.ct_state;
tuple = (struct ipv4_ct_tuple *)&ct_buffer.tuple;
if (!revalidate_data(ctx, &data, &data_end, &ip4))
return drop_for_direction(ctx, CT_EGRESS, DROP_INVALID, ext_err);
tuple->nexthdr = ip4->protocol;
tuple->daddr = ip4->daddr;
tuple->saddr = ip4->saddr;
// ... (full function body)
}
Why use macros?
- Code reuse: Same pattern for egress/ingress, IPv4/IPv6
- Compile-time parameters: DIR (direction) is known at compile time
- Reduces errors: Write the logic once, instantiate multiple times
The actual macro definition (simplified):
#define TAIL_CT_LOOKUP4(ID, NAME, DIR, CONDITION, TARGET_ID, TARGET_NAME) \
__declare_tail(ID) \
static __always_inline \
int NAME(struct __ctx_buff *ctx) \
{ \
/* Variable declarations */ \
struct ct_buffer4 ct_buffer = {}; \
void *map; \
/* ... */ \
\
/* Extract packet tuple */ \
tuple->nexthdr = ip4->protocol; \
tuple->daddr = ip4->daddr; \
tuple->saddr = ip4->saddr; \
\
/* Select CT map - DIR parameter used here */ \
map = select_ct_map4(ctx, DIR, tuple); \
\
/* Do CT lookup - DIR parameter used here */ \
ct_buffer.ret = ct_lookup4(map, tuple, ctx, ip4, l4_off, \
DIR, scope, ct_state, &monitor); \
\
/* Save results to shared buffer */ \
map_update_elem(&cilium_tail_call_buffer4, &zero, &ct_buffer, 0); \
\
/* Chain to next stage - TARGET_ID/TARGET_NAME used here */ \
if (CONDITION) \
ret = tail_call_internal(ctx, TARGET_ID, &ext_err); \
else \
ret = TARGET_NAME(ctx); \
\
return ret; \
}
Parameter substitution example:
// When you write:
map = select_ct_map4(ctx, DIR, tuple);
// After macro expansion with DIR=CT_EGRESS:
map = select_ct_map4(ctx, CT_EGRESS, tuple);
// After macro expansion with DIR=CT_INGRESS:
map = select_ct_map4(ctx, CT_INGRESS, tuple);
This creates different functions from the same template!
The Generated Function Logic
After macro expansion, tail_ipv4_ct_egress() contains:
Select CT map:
map = select_ct_map4(ctx, CT_EGRESS, tuple);
Multi-cluster support:
- Each cluster can have its own CT map
- Prevents IP overlap issues
- select_ct_map4() checks the CB_CLUSTER_ID_EGRESS metadata
Restore load balancer state:
if (is_defined(ENABLE_PER_PACKET_LB) && DIR == CT_EGRESS) {
struct ct_state ct_state_new = {};
__u32 cluster_id;
__u16 proxy_port;
lb4_ctx_restore_state(ctx, &ct_state_new, &proxy_port,
&cluster_id, false);
if (ct_state_new.rev_nat_index)
scope = SCOPE_FORWARD;
}
Connection tracking lookup:
ret = ct_lazy_lookup4(map, tuple, ctx, fraginfo, l4_off,
CT_EGRESS, scope, CT_ENTRY_ANY,
ct_state, &monitor);
CT lookup results:
- CT_NEW: First packet of a connection
- CT_ESTABLISHED: Existing connection
- CT_REPLY: Reply packet (reverse direction)
- CT_RELATED: Related connection (e.g., FTP data channel)
CT creation for new connections:
if (ret == CT_NEW) {
ret = ct_create4(map, NULL, tuple, ctx, CT_EGRESS,
ct_state, ext_err);
if (IS_ERR(ret))
return drop_for_direction(ctx, CT_EGRESS, ret, ext_err);
}
What gets stored in CT entry:
struct ct_entry {
__u64 rx_packets;
__u64 tx_packets;
__u64 rx_bytes;
__u64 tx_bytes;
__u32 lifetime; // TTL in seconds
__u16 rx_closing:1, // FIN seen in RX direction
tx_closing:1, // FIN seen in TX direction
nat46:1, // NAT46 translation
lb_loopback:1, // Loopback service
seen_non_syn:1, // Non-SYN packet seen
node_port:1, // NodePort service
proxy_redirect:1, // Proxy redirect
dsr:1, // Direct Server Return
// ... more flags
__u16 rev_nat_index; // Reverse NAT index
__u16 ifindex; // Source interface
};
Tail call to next stage:
// Save CT results to shared buffer for next stage
if (map_update_elem(&cilium_tail_call_buffer4, &zero, &ct_buffer, 0) < 0)
return drop_for_direction(ctx, CT_EGRESS, DROP_INVALID_TC_BUFFER, ext_err);
// The macro substitutes TARGET_ID and TARGET_NAME parameters:
if (CONDITION) // CONDITION = is_defined(ENABLE_PER_PACKET_LB)
ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_FROM_LXC_CONT, &ext_err);
else
ret = tail_handle_ipv4_cont(ctx); // Direct call if no per-packet LB
if (IS_ERR(ret))
return drop_for_direction(ctx, CT_EGRESS, ret, ext_err);
return ret;
}
The CT → Policy linkage:
// After tail_ipv4_ct_egress() completes:
// 1. CT results saved in cilium_tail_call_buffer4
// 2. Tail call (or direct call) to tail_handle_ipv4_cont()
__declare_tail(CILIUM_CALL_IPV4_FROM_LXC_CONT)
static __always_inline
int tail_handle_ipv4_cont(struct __ctx_buff *ctx)
{
__u32 dst_sec_identity = 0;
__s8 ext_err = 0;
// This function does policy enforcement + forwarding
int ret = handle_ipv4_from_lxc(ctx, &dst_sec_identity, &ext_err);
if (IS_ERR(ret))
return send_drop_notify_ext(ctx, SECLABEL_IPV4, dst_sec_identity,
TRACE_EP_ID_UNKNOWN, ret, ext_err,
METRIC_EGRESS);
return ret;
}
How to verify macro expansion yourself:
# Generate preprocessed output (macros expanded)
cd /home/sai/foss/cilium
clang -E -I bpf/include -I bpf -D__x86_64__ \
-DENABLE_IPV4 -DENABLE_IPV6 \
bpf/bpf_lxc.c > /tmp/bpf_lxc_expanded.c
# Search for the generated function
grep -A 50 "tail_ipv4_ct_egress" /tmp/bpf_lxc_expanded.c
Other instances of this macro:
// Ingress CT lookup
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_INGRESS,
tail_ipv4_ct_ingress,
CT_INGRESS, ...)
// IPv6 egress CT lookup
TAIL_CT_LOOKUP6(CILIUM_CALL_IPV6_CT_EGRESS,
tail_ipv6_ct_egress,
CT_EGRESS, ...)
Each invocation generates a complete, separate function with the logic customized by the parameters.
Key Takeaways
- Macros are code templates: TAIL_CT_LOOKUP4 is a template that generates function code
- Preprocessing happens first: Before C compilation, the preprocessor replaces macros with actual code
- Parameters customize behavior: Same logic, different constants (DIR, TARGET_ID, etc.)
- Result: Multiple similar functions without code duplication
Mental model:
Think of it like a function that writes functions!
TAIL_CT_LOOKUP4(...) is NOT a function call
↓
It's a CODE GENERATOR that creates a new function
↓
The generated function gets compiled into BPF bytecode
Comparison:
// WITHOUT macros (you'd have to write):
int tail_ipv4_ct_egress(...) {
map = select_ct_map4(ctx, CT_EGRESS, tuple);
// 50 lines of code
}
int tail_ipv4_ct_ingress(...) {
map = select_ct_map4(ctx, CT_INGRESS, tuple);
// Same 50 lines with CT_EGRESS → CT_INGRESS
}
// WITH macros (you write):
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_EGRESS, tail_ipv4_ct_egress, CT_EGRESS, ...)
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_INGRESS, tail_ipv4_ct_ingress, CT_INGRESS, ...)
// Result: Both functions exist, but you only wrote the logic once!
Step 3: Policy Enforcement (Egress)
The CILIUM_CALL_IPV4_FROM_LXC_CONT tail call (or direct call) leads to tail_handle_ipv4_cont(), which calls handle_ipv4_from_lxc(). This function enforces egress network policies after retrieving CT results:
static __always_inline int
handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *dst_sec_identity,
__s8 *ext_err)
{
struct ipv4_ct_tuple *tuple;
struct ct_state *ct_state;
enum ct_status ct_status;
__u32 zero = 0;
Retrieve CT results from shared buffer:
// Step 2 (CT lookup) saved results here
ct_buffer = map_lookup_elem(&cilium_tail_call_buffer4, &zero);
if (!ct_buffer)
return DROP_INVALID_TC_BUFFER;
tuple = (struct ipv4_ct_tuple *)&ct_buffer->tuple;
ct_state = (struct ct_state *)&ct_buffer->ct_state;
ct_status = ct_buffer->ret; // CT_NEW, CT_ESTABLISHED, CT_REPLY, etc.
l4_off = ct_buffer->l4_off;
The data flow between Step 2 and Step 3:
Step 2: tail_ipv4_ct_egress()
├─> ct_lookup4() → returns CT_NEW/CT_ESTABLISHED/CT_REPLY
├─> Store result in cilium_tail_call_buffer4:
│ └─> ct_buffer.ret = CT_NEW (or other status)
│ └─> ct_buffer.tuple = {saddr, daddr, sport, dport, proto}
│ └─> ct_buffer.ct_state = {flags, rev_nat_index, ...}
└─> tail_call → CILIUM_CALL_IPV4_FROM_LXC_CONT
↓
Step 3: tail_handle_ipv4_cont() → handle_ipv4_from_lxc()
├─> Read cilium_tail_call_buffer4
├─> ct_status = ct_buffer->ret
└─> Policy enforcement based on ct_status
Policy enforcement based on CT state:
switch (ct_status) {
case CT_NEW:
case CT_ESTABLISHED:
/* Forward to L7 proxy if needed */
if (proxy_port > 0)
break; // Skip policy, L7 LB handles it
/* Skip policy for hairpin (pod to itself via service) */
if (hairpin_flow)
break;
/* Egress policy check */
verdict = policy_can_egress4(ctx, tuple, l4_off, SECLABEL_IPV4,
*dst_sec_identity, &policy_match_type,
&audited, ext_err, &proxy_port, &cookie);
Egress policy function:
int policy_can_egress4(ctx, tuple, l4_off, src_label, dst_label, ...)
{
// Lookup in egress policy map
struct policy_key key = {
.sec_label = src_label, // This endpoint's identity
.egress = 1, // Direction: egress
.protocol = tuple->nexthdr, // TCP/UDP/ICMP
.dport = tuple->dport // Destination port
};
struct policy_entry *policy = map_lookup_elem(&cilium_policy, &key);
if (!policy)
return DROP_POLICY; // Default deny
if (policy->deny)
return DROP_POLICY_DENY;
if (policy->proxy_port)
return POLICY_ACT_PROXY_REDIRECT;
return CTX_ACT_OK;
}
Policy map structure:
struct policy_key {
__u32 sec_label; // Source identity
__u16 dport; // Destination port (or 0 for any)
__u8 protocol; // IP protocol (or 0 for any)
__u8 egress:1, // Direction
pad:7;
};
struct policy_entry {
__u16 proxy_port; // If L7 proxy redirect
__u16 pad;
__u64 packets; // Metrics
__u64 bytes;
};
Policy lookup order (most specific first):
- (src_id, dst_port, protocol)
- (src_id, dst_port, 0)
- (src_id, 0, protocol)
- (src_id, 0, 0)
Policy verdicts:
switch (verdict) {
case CTX_ACT_OK:
// Policy allows, continue to forwarding
break;
case DROP_POLICY:
case DROP_POLICY_DENY:
// Optionally send ICMP unreachable
if (CONFIG(policy_deny_response_enabled)) {
ctx_store_meta(ctx, CB_VERDICT, verdict);
return tail_call_internal(ctx, CILIUM_CALL_IPV4_POLICY_DENIED, ext_err);
}
return verdict;
case DROP_POLICY_AUTH_REQUIRED:
// Mutual auth required, check auth cache
auth_type = (__u8)*ext_err;
verdict = auth_lookup(ctx, SECLABEL_IPV4, *dst_sec_identity,
tunnel_endpoint, auth_type);
break;
case POLICY_ACT_PROXY_REDIRECT:
// Will redirect to L7 proxy after CT creation
proxy_port = policy->proxy_port;
break;
}
Authentication check:
if (verdict == DROP_POLICY_AUTH_REQUIRED) {
__u32 tunnel_endpoint = 0;
if (info)
tunnel_endpoint = info->tunnel_endpoint.ip4;
// Check if mutual auth is cached
verdict = auth_lookup(ctx, SECLABEL_IPV4, *dst_sec_identity,
tunnel_endpoint, auth_type);
}
Emit verdict notification:
/* Emit verdict if drop or if allow for CT_NEW */
if (verdict != CTX_ACT_OK || ct_status != CT_ESTABLISHED) {
send_policy_verdict_notify(ctx, *dst_sec_identity, tuple->dport,
tuple->nexthdr, POLICY_EGRESS, 0,
verdict, proxy_port,
policy_match_type, audited,
auth_type, cookie);
}
Create CT entry for new connections:
case CT_NEW:
ct_state_new.src_sec_id = SECLABEL_IPV4;
ct_state_new.proxy_redirect = proxy_port > 0;
ct_state_new.from_l7lb = from_l7lb;
ret = ct_create4(ct_map, ct_related_map, tuple, ctx,
CT_EGRESS, &ct_state_new, ext_err);
if (IS_ERR(ret))
return ret;
break;
case CT_REPLY:
case CT_RELATED:
/* Return traffic - no policy check needed */
break;
}
/* Forward packet */
return ipv4_forward_to_destination(ctx, ip4, tuple, *dst_sec_identity,
ct_state, ct_status, info, skip_tunnel,
hairpin_flow, from_l7lb, proxy_port,
cluster_id, &trace, ext_err);
}
Key points:
- Egress policy is enforced AFTER CT lookup but BEFORE forwarding
- Reply traffic (CT_REPLY) skips policy - only forward direction is checked
- Hairpin traffic skips policy - pod talking to itself via service
- L7 LB traffic bypasses BPF policy - Envoy handles L7 enforcement
- Policy lookup uses source identity (SECLABEL) and destination identity from endpoint lookup
Step 4: Routing Decision and L7 Proxy Redirect
After policy enforcement, ipv4_forward_to_destination() handles forwarding:
static __always_inline int
ipv4_forward_to_destination(ctx, ip4, tuple, dst_sec_identity,
ct_state, ct_status, info, skip_tunnel,
hairpin_flow, from_l7lb, proxy_port,
cluster_id, trace, ext_err)
{
hairpin_flow |= ct_state->loopback;
L7 proxy redirect (if proxy_port set by policy):
/* L7 LB does L7 policy enforcement, so we only redirect packets
* NOT from L7 LB. */
if (!from_l7lb && proxy_port > 0) {
send_trace_notify(ctx, TRACE_TO_PROXY, SECLABEL_IPV4, UNKNOWN_ID,
bpf_ntohs(proxy_port), TRACE_IFINDEX_UNKNOWN,
trace->reason, trace->monitor, bpf_htons(ETH_P_IP));
return ctx_redirect_to_proxy4(ctx, tuple, proxy_port, false);
}
Proxy hairpin flow:
1. Container sends: src=10.0.1.5:12345, dst=10.0.2.6:80
2. Policy verdict: POLICY_ACT_PROXY_REDIRECT, proxy_port=15001
3. BPF redirect to: dst=127.0.0.1:15001 (Envoy listener)
4. Envoy processes HTTP, applies L7 policies
5. Envoy makes new connection: src=10.0.1.5:random, dst=10.0.2.6:80
6. BPF sees MARK_MAGIC_PROXY_EGRESS, bypasses policy
7. Packet forwarded normally
Routing decision:
#ifdef ENABLE_ROUTING
ret = encap_and_redirect_lxc(ctx, tunnel_endpoint, encrypt_key,
sec_label, &monitor);
#else
ret = lxc_redirect_to_host(ctx, src_sec_identity, proto, &trace);
#endif
Routing modes:
- Tunnel mode (ENABLE_ROUTING undefined):
static __always_inline int
lxc_redirect_to_host(struct __ctx_buff *ctx, __u32 src_sec_identity,
__be16 proto, struct trace_ctx *trace)
{
send_trace_notify(ctx, TRACE_TO_HOST, src_sec_identity, HOST_ID,
TRACE_EP_ID_UNKNOWN, CILIUM_NET_IFINDEX,
trace->reason, trace->monitor, proto);
return ctx_redirect(ctx, CILIUM_NET_IFINDEX, BPF_F_INGRESS);
}
- Redirect to the cilium_host interface
- The bpf_host program handles tunneling
- Used when tunnel: vxlan or tunnel: geneve is configured
- Direct routing (ENABLE_ROUTING defined):
static __always_inline int
encap_and_redirect_lxc(ctx, tunnel_endpoint, encrypt_key, ...)
{
if (tunnel_endpoint) {
// Encapsulate in VXLAN/Geneve
ret = __encap_with_nodeid(ctx, tunnel_endpoint, ...);
}
if (encrypt_key) {
// IPsec encryption
ret = set_ipsec_encrypt(ctx, encrypt_key, ...);
}
// FIB lookup for next hop
struct bpf_fib_lookup fib_params = {};
ret = fib_lookup(ctx, &fib_params, ...);
// Redirect to output interface
return ctx_redirect(ctx, fib_params.ifindex, 0);
}
- Direct L3 routing without cilium_host
- Faster path (fewer hops)
- Requires routable pod IPs
Entry Point 2: cil_to_container() - Ingress Path
Traffic entering a container is the second hottest path, handling all packets destined to this container from the network.
Complete Ingress Function Flow
__section_entry
int cil_to_container(struct __ctx_buff *ctx)
{
enum trace_point trace = TRACE_FROM_STACK;
__u32 magic, identity = 0;
__u32 sec_label = SECLABEL;
__s8 ext_err = 0;
__u16 proto;
int ret;
Initial setup:
- trace = TRACE_FROM_STACK: Assume the packet came from the kernel network stack
- identity = 0: Will be extracted from the packet mark
- sec_label = SECLABEL: This container's identity (destination)
Protocol Validation
if (!validate_ethertype(ctx, &proto)) {
ret = DROP_UNSUPPORTED_L2;
goto out;
}
Extract Ethernet protocol: Same as egress - IPv4, IPv6, or ARP.
Metadata Initialization
bpf_clear_meta(ctx);
check_and_store_ip_trace_id(ctx);
Clean slate for ingress:
- Clear skb->cb[] to avoid stale data from previous processing
- Extract the trace ID for observability correlation
L7 Proxy Egress Handling
#if defined(ENABLE_L7_LB)
if ((ctx->mark & MARK_MAGIC_HOST_MASK) == MARK_MAGIC_PROXY_EGRESS_EPID) {
__u16 lxc_id = get_epid(ctx);
ctx->mark = 0;
ret = tail_call_egress_policy(ctx, lxc_id);
return send_drop_notify(ctx, lxc_id, sec_label, LXC_ID,
ret, METRIC_INGRESS);
}
#endif
Special case: Traffic from the L7 proxy going to another endpoint.
- MARK_MAGIC_PROXY_EGRESS_EPID: The proxy marked the packet with the target endpoint ID
- Extract the endpoint ID from the mark
- Jump to egress policy enforcement for that endpoint
- Why? Proxy-originated traffic needs policy check in egress direction
Identity Extraction from Packet Mark
magic = inherit_identity_from_host(ctx, &identity);
if (magic == MARK_MAGIC_PROXY_INGRESS ||
magic == MARK_MAGIC_PROXY_EGRESS)
trace = TRACE_FROM_PROXY;
Critical function: inherit_identity_from_host()
static __always_inline __u32
inherit_identity_from_host(struct __ctx_buff *ctx, __u32 *identity)
{
__u32 magic = ctx->mark & MARK_MAGIC_HOST_MASK;
*identity = get_identity(ctx); // Extract lower 16 bits
ctx->mark = 0; // Clear mark after reading
return magic;
}
Packet mark encoding:
┌────────────────────────────────────────┐
│ 32-bit skb->mark field │
├──────────────────┬─────────────────────┤
│ Magic (16) │ Identity (16) │
└──────────────────┴─────────────────────┘
Magic values:
0x0A00_0000 = MARK_MAGIC_PROXY_INGRESS
0x0B00_0000 = MARK_MAGIC_PROXY_EGRESS
0x0F00_0000 = MARK_MAGIC_IDENTITY
Who sets this mark?
- bpf_host: Encodes the source identity before redirecting to the container
- bpf_overlay: After tunnel decapsulation
- bpf_sock: For locally-generated host traffic
- Envoy proxy: For L7-processed traffic
Send Trace Event
send_trace_notify(ctx, trace, identity, sec_label, LXC_ID,
ctx->ingress_ifindex, TRACE_REASON_UNKNOWN,
TRACE_PAYLOAD_LEN, proto);
Observability hook:
- trace: TRACE_FROM_STACK or TRACE_FROM_PROXY
- identity: Source identity (from the mark)
- sec_label: Destination identity (this container)
- Goes to Hubble/cilium monitor
Host Firewall Integration
#if defined(ENABLE_HOST_FIREWALL) && !defined(ENABLE_ROUTING)
/* If the packet comes from the hostns and per-endpoint routes are enabled,
* jump to bpf_host to enforce egress host policies before anything else.
*/
if (identity == HOST_ID) {
ctx_store_meta(ctx, CB_FROM_HOST, 1);
ctx_store_meta(ctx, CB_DST_ENDPOINT_ID, LXC_ID);
ret = tail_call_policy(ctx, CONFIG(host_ep_id));
return send_drop_notify(ctx, identity, sec_label, LXC_ID,
DROP_HOST_NOT_READY, METRIC_INGRESS);
}
#endif
Host firewall scenario:
Host process (identity=HOST_ID) → Container
↓
Need to check: Can host egress to this container?
↓
Tail call to bpf_host policy program
↓
After policy check, return here to continue
Why this complexity?
- Host firewall policies are in the bpf_host program
- Container policies are in the bpf_lxc program
- This coordination allows both to be enforced
Protocol Dispatch
switch (proto) {
#if defined(ENABLE_ARP_PASSTHROUGH) || defined(ENABLE_ARP_RESPONDER)
case bpf_htons(ETH_P_ARP):
ret = CTX_ACT_OK; // Let kernel handle ARP
break;
#endif
ARP handling: Usually passthrough for ingress (container doesn't respond to ARP).
#ifdef ENABLE_IPV6
case bpf_htons(ETH_P_IPV6):
sec_label = SECLABEL_IPV6;
ctx_store_meta(ctx, CB_SRC_LABEL, identity);
ret = tail_call_internal(ctx, CILIUM_CALL_IPV6_CT_INGRESS, &ext_err);
break;
#endif
IPv6 path: Store source identity in metadata and jump to CT.
#ifdef ENABLE_IPV4
case bpf_htons(ETH_P_IP):
sec_label = SECLABEL_IPV4;
ctx_store_meta(ctx, CB_SRC_LABEL, identity);
ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_CT_INGRESS, &ext_err);
break;
#endif
IPv4 path: Same pattern - store identity and tail call.
Why store identity in CB_SRC_LABEL?
- Tail calls can't pass parameters
- CT and policy programs need source identity
- `skb->cb[]` is the inter-tail-call communication channel
default:
ret = DROP_UNKNOWN_L3;
break;
}
out:
if (IS_ERR(ret))
return send_drop_notify_ext(ctx, identity, sec_label, LXC_ID, ret,
ext_err, METRIC_INGRESS);
return ret;
}
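The dispatch logic above can be modeled in plain userspace C (a sketch only — the enum values and helper are illustrative, not the real BPF constants): given the ethertype, pick a tail-call target or verdict.

```c
#include <stdint.h>
#include <arpa/inet.h>
#include <assert.h>

/* Hypothetical userspace model of the protocol dispatch in
 * cil_to_container(): map an ethertype to the next processing step.
 * The enum names are illustrative stand-ins for the real verdicts. */
enum dispatch {
    DISPATCH_ARP_PASSTHROUGH, /* CTX_ACT_OK: let the kernel handle ARP */
    DISPATCH_IPV4_CT_INGRESS, /* tail call CILIUM_CALL_IPV4_CT_INGRESS */
    DISPATCH_IPV6_CT_INGRESS, /* tail call CILIUM_CALL_IPV6_CT_INGRESS */
    DISPATCH_DROP_UNKNOWN_L3, /* default: DROP_UNKNOWN_L3 */
};

#define ETH_P_IP   0x0800
#define ETH_P_IPV6 0x86DD
#define ETH_P_ARP  0x0806

/* proto_be is the on-wire (big-endian) ethertype, as in the BPF code. */
static enum dispatch dispatch_proto(uint16_t proto_be)
{
    switch (ntohs(proto_be)) {
    case ETH_P_ARP:  return DISPATCH_ARP_PASSTHROUGH;
    case ETH_P_IP:   return DISPATCH_IPV4_CT_INGRESS;
    case ETH_P_IPV6: return DISPATCH_IPV6_CT_INGRESS;
    default:         return DISPATCH_DROP_UNKNOWN_L3;
    }
}
```

The key property mirrored here is that anything other than ARP, IPv4, or IPv6 falls through to a drop verdict.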
The IPv4 Ingress Pipeline
After tail call from cil_to_container(), execution continues through multiple stages:
Complete ingress pipeline flow:
┌─────────────────────────────────────────────────────────────────┐
│ Entry: cil_to_container() │
│ ├─> Validate protocol │
│ ├─> inherit_identity_from_host() → extract src identity │
│ ├─> ctx_store_meta(CB_SRC_LABEL, identity) │
│ └─> tail_call(CILIUM_CALL_IPV4_CT_INGRESS) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Connection Tracking (Ingress) │
│ tail_ipv4_ct_ingress() [generated by TAIL_CT_LOOKUP4 macro] │
│ ├─> ctx_load_meta(CB_SRC_LABEL) → retrieve src identity │
│ ├─> ct_lookup4(CT_INGRESS) → CT_NEW/CT_REPLY/CT_ESTABLISHED │
│ ├─> map_update_elem(cilium_tail_call_buffer4) → save results │
│ └─> tail_call(CILIUM_CALL_IPV4_TO_ENDPOINT) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: Policy Enforcement (Ingress) │
│ tail_ipv4_to_endpoint() → ipv4_policy() │
│ ├─> map_lookup_elem(cilium_tail_call_buffer4) → read CT │
│ ├─> Lookup source identity from ipcache (if needed) │
│ ├─> policy_can_ingress4(src_id, LXC_ID, port, proto) │
│ ├─> if ALLOW: continue │
│ ├─> if DENY: drop + notification │
│ └─> if PROXY: ctx_redirect_to_proxy_hairpin() │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: Local Delivery │
│ ├─> update_metrics(METRIC_INGRESS) │
│ ├─> send_trace_notify(TRACE_TO_LXC) │
│ └─> CTX_ACT_OK → deliver to container namespace │
└─────────────────────────────────────────────────────────────────┘
Key inter-step communication:
- Entry → Step 1: `CB_SRC_LABEL` in `skb->cb[]`
- Step 1 → Step 2: `cilium_tail_call_buffer4` map (CT results)
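The `skb->cb[]` handoff can be sketched in userspace C (assumptions: a fake 5-word scratch area and stage functions standing in for the real tail-called programs):

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical model of skb->cb[] metadata passing between tail calls.
 * The kernel gives each skb five 32-bit scratch words (cb[0..4]);
 * Cilium addresses them via CB_* indices. Values here are illustrative. */
#define CB_SRC_LABEL 0

struct fake_ctx {
    uint32_t cb[5]; /* models skb->cb[] */
};

static void ctx_store_meta(struct fake_ctx *ctx, int slot, uint32_t v)
{
    ctx->cb[slot] = v;
}

/* Read a slot and zero it, mirroring ctx_load_and_clear_meta(). */
static uint32_t ctx_load_and_clear_meta(struct fake_ctx *ctx, int slot)
{
    uint32_t v = ctx->cb[slot];
    ctx->cb[slot] = 0;
    return v;
}

/* "Entry program": records the source identity before the tail call. */
static void entry_stage(struct fake_ctx *ctx, uint32_t identity)
{
    ctx_store_meta(ctx, CB_SRC_LABEL, identity);
}

/* "Policy program": recovers the identity on the other side. */
static uint32_t policy_stage(struct fake_ctx *ctx)
{
    return ctx_load_and_clear_meta(ctx, CB_SRC_LABEL);
}
```

Clearing on read matters: stale metadata in `cb[]` would otherwise leak into later stages that reuse the same slot.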
Step 1: Connection Tracking (Ingress)
Generated by the same TAIL_CT_LOOKUP4 macro:
TAIL_CT_LOOKUP4(CILIUM_CALL_IPV4_CT_INGRESS, // ID
tail_ipv4_ct_ingress, // NAME
CT_INGRESS, // DIR - different from egress!
1, // CONDITION - always tail call
CILIUM_CALL_IPV4_TO_ENDPOINT, // TARGET_ID
tail_ipv4_to_endpoint) // TARGET_NAME
Key differences from egress CT:
- `DIR = CT_INGRESS` instead of `CT_EGRESS`
- Targets `CILIUM_CALL_IPV4_TO_ENDPOINT` instead of `CILIUM_CALL_IPV4_FROM_LXC_CONT`
The generated function performs:
int tail_ipv4_ct_ingress(struct __ctx_buff *ctx)
{
    struct ct_buffer4 ct_buffer = {};
    struct ipv4_ct_tuple *tuple = &ct_buffer.tuple;     /* point into the buffer */
    struct ct_state *ct_state = &ct_buffer.ct_state;
    void *map;

    // Extract tuple from packet (ip4, l4_off, scope, monitor elided here)
    tuple->nexthdr = ip4->protocol;
    tuple->daddr = ip4->daddr;   // This container's IP
    tuple->saddr = ip4->saddr;   // Source IP

    // Select CT map
    map = select_ct_map4(ctx, CT_INGRESS, tuple);

    // CT lookup in INGRESS direction
    ct_buffer.ret = ct_lookup4(map, tuple, ctx, ip4, l4_off,
                               CT_INGRESS, scope, ct_state, &monitor);
Ingress CT lookup logic:
Packet: src=10.0.2.6:8080, dst=10.0.1.5:45678
CT lookup searches for:
Direction: CT_INGRESS
Tuple: (saddr=10.0.2.6, sport=8080, daddr=10.0.1.5, dport=45678, proto=TCP)
Results:
- CT_REPLY: This is reply traffic for an existing egress connection
└─> Original: 10.0.1.5:45678 → 10.0.2.6:8080 (egress)
└─> Reply: 10.0.2.6:8080 → 10.0.1.5:45678 (ingress)
- CT_NEW: New inbound connection (never seen before)
- CT_ESTABLISHED: Existing ingress connection
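The reply-matching rule can be illustrated with a minimal userspace sketch (assumptions: a single-slot "CT map" and simplified 4-tuples instead of Cilium's real keyed BPF map):

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical single-slot "CT map" illustrating how a reply packet is
 * classified: an ingress lookup whose tuple mirrors the stored egress
 * tuple (addresses and ports swapped) hits the original connection. */
struct ct_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

enum ct_status { CT_NEW, CT_REPLY };

static struct ct_tuple ct_entry;   /* the one stored connection */
static int ct_entry_valid;

/* Egress path: record the outbound 4-tuple (the CT_NEW case). */
static void ct_create_egress(const struct ct_tuple *t)
{
    ct_entry = *t;
    ct_entry_valid = 1;
}

/* Ingress path: a packet that is the mirror image of the stored
 * egress tuple is reply traffic; anything else is a new connection. */
static enum ct_status ct_lookup_ingress(const struct ct_tuple *t)
{
    if (ct_entry_valid &&
        t->saddr == ct_entry.daddr && t->daddr == ct_entry.saddr &&
        t->sport == ct_entry.dport && t->dport == ct_entry.sport)
        return CT_REPLY;
    return CT_NEW;
}
```

This is why reply packets for an allowed egress connection never need a fresh ingress policy decision: the reversed-tuple hit identifies them first.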
Save results and tail call:
// Save CT results for next stage
map_update_elem(&cilium_tail_call_buffer4, &zero, &ct_buffer, 0);
// Always tail call to endpoint delivery (CONDITION=1)
ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_TO_ENDPOINT, &ext_err);
return ret;
}
Step 2: Policy Enforcement (Ingress)
The CILIUM_CALL_IPV4_TO_ENDPOINT tail call leads to tail_ipv4_to_endpoint():
__declare_tail(CILIUM_CALL_IPV4_TO_ENDPOINT)
int tail_ipv4_to_endpoint(struct __ctx_buff *ctx)
{
__u32 src_sec_identity = ctx_load_and_clear_meta(ctx, CB_SRC_LABEL);
void *data, *data_end;
struct iphdr *ip4;
__u16 proxy_port = 0;
__s8 ext_err = 0;
int ret;
Retrieve source identity: Read from CB_SRC_LABEL (stored in cil_to_container).
Identity Refinement
if (!revalidate_data(ctx, &data, &data_end, &ip4)) {
ret = DROP_INVALID;
goto out;
}
/* Packets from the proxy will already have a real identity. */
if (identity_is_reserved(src_sec_identity)) {
const struct remote_endpoint_info *info;
info = lookup_ip4_remote_endpoint(ip4->saddr, 0);
if (info != NULL) {
__u32 sec_identity = info->sec_identity;
/* When SNAT is enabled on traffic ingressing into Cilium,
* all traffic from the world will have a source IP of the host.
* It will only actually be from the host if "src_sec_identity"
* reports the src as the host. So we can ignore the ipcache
* if it reports the source as HOST_ID.
*/
if (sec_identity != HOST_ID)
src_sec_identity = sec_identity;
}
cilium_dbg(ctx, info ? DBG_IP_ID_MAP_SUCCEED4 : DBG_IP_ID_MAP_FAILED4,
ip4->saddr, src_sec_identity);
}
Identity resolution priority:
- If the identity from the mark is reserved (WORLD_ID, UNMANAGED_ID, etc.):
  - Look up the source IP in the `ipcache` (remote endpoint map)
  - Use the more specific identity if one is found
  - Exception: ignore an ipcache result of HOST_ID when the mark doesn't also report the host

Why needed?
- SNAT can hide the original source
- The ipcache maps external IPs to identities
- This enables policy enforcement for internet traffic
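The refinement rule can be condensed into a small userspace sketch (the identity constants and the "reserved" threshold below are illustrative, not Cilium's real numbering):

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical model of the identity-refinement logic: reserved
 * identities from the mark may be upgraded by an ipcache hit, except
 * when the ipcache reports HOST_ID. Numeric values are made up. */
#define HOST_ID      1
#define WORLD_ID     2
#define RESERVED_MAX 255   /* pretend identities <= 255 are reserved */

static int identity_is_reserved(uint32_t id)
{
    return id <= RESERVED_MAX;
}

/* ipcache_id == 0 models "no ipcache entry for this source IP". */
static uint32_t refine_identity(uint32_t mark_id, uint32_t ipcache_id)
{
    if (!identity_is_reserved(mark_id))
        return mark_id;             /* e.g. proxy traffic: already refined */
    if (ipcache_id != 0 && ipcache_id != HOST_ID)
        return ipcache_id;          /* ipcache is more specific */
    return mark_id;                 /* keep the reserved identity */
}
```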
Metrics and Policy Check
cilium_dbg(ctx, DBG_LOCAL_DELIVERY, LXC_ID, SECLABEL_IPV4);
#ifdef LOCAL_DELIVERY_METRICS
update_metrics(ctx_full_len(ctx), METRIC_INGRESS, REASON_FORWARDED);
#endif
ret = ipv4_policy(ctx, ip4, src_sec_identity, NULL, &ext_err,
&proxy_port, false);
The ipv4_policy() function (defined in bpf_lxc.c):
static __always_inline int
ipv4_policy(struct __ctx_buff *ctx, struct iphdr *ip4,
__u32 src_sec_identity, struct ipv4_ct_tuple *tuple_out,
__s8 *ext_err, __u16 *proxy_port, bool from_host)
{
struct ipv4_ct_tuple tuple = {};
__u8 policy_match_type = POLICY_MATCH_NONE;
__u8 audited = 0;
int ret, verdict;
// Extract L4 tuple
tuple.nexthdr = ip4->protocol;
tuple.daddr = ip4->daddr;
tuple.saddr = ip4->saddr;
// ... extract ports
// Ingress policy check
verdict = policy_can_ingress4(ctx, &tuple, l4_off,
src_sec_identity, SECLABEL_IPV4,
&policy_match_type, &audited,
ext_err, proxy_port, &cookie);
Ingress policy lookup:
struct policy_key key = {
.sec_label = src_sec_identity, // Source (external)
.egress = 0, // Direction: INGRESS
.protocol = tuple.nexthdr,
.dport = tuple.dport // Our port
};
Policy map lookup order (most specific first):
- (src_id, our_port, protocol) - "Allow src_id to port 80/TCP"
- (src_id, our_port, 0) - "Allow src_id to port 80/any proto"
- (src_id, 0, protocol) - "Allow src_id to any port/TCP"
- (src_id, 0, 0) - "Allow src_id to any port/any proto"
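The most-specific-first probing can be sketched as plain C (assumptions: a tiny linear "map" instead of the real BPF hash map, and simplified key fields):

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical flat "policy map" illustrating the lookup order above.
 * Real Cilium uses a BPF hash map; this sketch just probes the four
 * key shapes from most to least specific. */
struct policy_key {
    uint32_t sec_label;   /* remote identity */
    uint16_t dport;       /* 0 = wildcard port */
    uint8_t  protocol;    /* 0 = wildcard protocol */
};

#define MAX_RULES 8
static struct policy_key rules[MAX_RULES];
static int nrules;

static void policy_add(uint32_t id, uint16_t dport, uint8_t proto)
{
    rules[nrules++] = (struct policy_key){ id, dport, proto };
}

static int map_has(uint32_t id, uint16_t dport, uint8_t proto)
{
    for (int i = 0; i < nrules; i++)
        if (rules[i].sec_label == id && rules[i].dport == dport &&
            rules[i].protocol == proto)
            return 1;
    return 0;
}

/* Probe in the exact order listed in the text. */
static int policy_can_ingress(uint32_t src_id, uint16_t dport, uint8_t proto)
{
    return map_has(src_id, dport, proto) ||  /* exact port + proto */
           map_has(src_id, dport, 0)     ||  /* port, any proto    */
           map_has(src_id, 0, proto)     ||  /* any port, proto    */
           map_has(src_id, 0, 0);            /* full wildcard      */
}
```

Probing specific keys first means a narrow L4 rule always wins over a broad allow-all rule for the same identity.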
Policy Verdict Handling
switch (ret) {
case POLICY_ACT_PROXY_REDIRECT:
if (!revalidate_data(ctx, &data, &data_end, &ip4)) {
ret = DROP_INVALID;
goto out;
}
ret = ctx_redirect_to_proxy_hairpin_ipv4(ctx, ip4, proxy_port);
ctx->mark = ctx_load_meta(ctx, CB_PROXY_MAGIC);
break;
case CTX_ACT_OK:
break; // Allowed - continue to delivery
default:
break; // Denied - will drop
}
L7 proxy redirect on ingress:
External → Container:80
↓
Policy: Requires L7 inspection
↓
Redirect to Envoy:15001
↓
Envoy checks HTTP path/headers
↓
If allowed: Envoy → Container:80
Step 3: Local Delivery
After policy allows the packet:
out:
if (IS_ERR(ret))
return send_drop_notify_ext(ctx, src_sec_identity, SECLABEL_IPV4,
LXC_ID, ret, ext_err, METRIC_INGRESS);
return ret; // CTX_ACT_OK
}
Return CTX_ACT_OK:
- TC subsystem delivers packet to container network namespace
- Packet appears on the container's `eth0` interface
- Application receives it via a socket
Final trace event:
send_trace_notify4(ctx, TRACE_TO_LXC, src_label, SECLABEL_IPV4, orig_sip,
LXC_ID, ifindex, trace.reason, trace.monitor);
Ingress vs Egress Comparison
| Aspect | Egress (from container) | Ingress (to container) |
|---|---|---|
| Entry point | cil_from_container() | cil_to_container() |
| Identity source | Compiled-in SECLABEL | Extracted from skb->mark |
| Identity destination | Lookup in endpoint map | Compiled-in SECLABEL (self) |
| Service translation | Yes (DNAT to backend) | No (already DNATed) |
| CT direction | CT_EGRESS | CT_INGRESS |
| CT lookup semantics | Match outbound connections | Match inbound + replies |
| Policy direction | egress = 1 | egress = 0 (ingress) |
| Policy key | (SECLABEL, dst_port, proto) | (src_id, our_port, proto) |
| Reverse NAT | On reply (CT_REPLY) | On reply (CT_REPLY) |
| Typical latency | ~700 ns | ~400 ns (no LB needed) |
Key insight: Ingress is simpler because:
- Service translation already happened (in egress or bpf_host)
- No backend selection needed
- Just policy check and delivery
Entry Point 3: cil_lxc_policy() - Policy-Only Entry
This entry point handles packets that already went through initial processing:
__section_entry
int cil_lxc_policy(struct __ctx_buff *ctx)
{
__u32 src_label = ctx_load_meta(ctx, CB_SRC_LABEL);
__u32 sec_label = SECLABEL;
Used by:
- `bpf_host`: Packets from other nodes (via tunnel)
- `bpf_overlay`: Decapsulated tunnel traffic
- `bpf_lxc`: Traffic from other local containers
Why a separate entry point?
- Avoids duplicate service translation
- CT already done by sender
- Only needs policy check
Flow:
switch (proto) {
case bpf_htons(ETH_P_IP):
ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_CT_INGRESS_POLICY_ONLY,
&ext_err);
break;
}
The POLICY_ONLY variant:
- Skips service lookup
- Runs CT in forward direction only
- Applies ingress policy
- Delivers to endpoint
Special Features
1. ARP Responder
#ifdef ENABLE_ARP_RESPONDER
__declare_tail(CILIUM_CALL_ARP)
int tail_handle_arp(struct __ctx_buff *ctx)
{
union macaddr mac = CONFIG(interface_mac);
union macaddr smac;
__be32 sip;
__be32 tip;
if (!arp_validate(ctx, &mac, &smac, &sip, &tip))
return CTX_ACT_OK; // Not an ARP request
// Respond for any IP except container's own
if (tip == CONFIG(endpoint_ipv4).be32)
return CTX_ACT_OK;
return arp_respond(ctx, &mac, tip, &smac, sip, 0);
}
#endif
Why respond to all IPs?
- Container might have stale gateway config
- After Cilium restart, gateway IP might change
- Responding to all IPs (except container's own) ensures connectivity
- Prevents IP duplicate detection false positives
ARP response construction:
int arp_respond(ctx, src_mac, src_ip, dst_mac, dst_ip, vlan_id)
{
// Build ARP reply
arp->ar_op = bpf_htons(ARPOP_REPLY);
arp->ar_sha = *src_mac;
arp->ar_sip = src_ip;
arp->ar_tha = *dst_mac;
arp->ar_tip = dst_ip;
// Swap ethernet addresses
eth->h_dest = *dst_mac;
eth->h_source = *src_mac;
return ctx_redirect(ctx, ctx->ingress_ifindex, 0);
}
2. Loopback Service Handling
if (tuple.saddr == backend->address) {
/* Special loopback case: The origin endpoint has transmitted to a
* service which is being translated back to the source. This would
* result in a packet with identical source and destination address.
* Linux considers such packets as martian source and will drop unless
* received on a loopback device. Perform NAT on the source address
* to make it appear from an outside address.
*/
ct_state_new.loopback = 1;
}
Scenario:
Pod A (10.0.1.5) → Service (10.96.0.1) → Backend: Pod A (10.0.1.5)
Problem:
- After DNAT: src=10.0.1.5, dst=10.0.1.5
- Kernel drops as martian source
Solution:
- Set the `loopback` flag in the CT entry
- SNAT the source to a reserved loopback address (`IPV4_LOOPBACK`, 169.254.42.1 by default)
- Packet becomes: src=169.254.42.1, dst=10.0.1.5
- The reply is reverse-NATed back to the service IP
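The hairpin fixup can be modeled in a few lines of userspace C (the loopback address value and the `pkt` struct are illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical model of the loopback-service case: after DNAT the
 * packet would have saddr == daddr, which the kernel drops as a
 * martian source, so the source is rewritten to a reserved loopback
 * address (the value here is illustrative). */
#define IPV4_LOOPBACK 0xA9FE2A01u  /* e.g. 169.254.42.1 */

struct pkt { uint32_t saddr, daddr; };

/* Returns 1 if the loopback flag would be set (SNAT applied). */
static int loopback_snat(struct pkt *p, uint32_t backend_addr)
{
    p->daddr = backend_addr;          /* service DNAT to the backend */
    if (p->saddr == backend_addr) {   /* sender selected itself */
        p->saddr = IPV4_LOOPBACK;     /* hide the hairpin from the kernel */
        return 1;                     /* models ct_state_new.loopback = 1 */
    }
    return 0;
}
```

The flag recorded in the CT entry is what lets the reply path undo this SNAT before the packet reaches the application.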
3. L7 Proxy Hairpin
case POLICY_ACT_PROXY_REDIRECT:
if (!revalidate_data(ctx, &data, &data_end, &ip4))
return DROP_INVALID;
ret = ctx_redirect_to_proxy_hairpin_ipv4(ctx, ip4, proxy_port);
ctx->mark = ctx_load_meta(ctx, CB_PROXY_MAGIC);
break;
Proxy hairpin flow:
1. Container sends: src=10.0.1.5:12345, dst=10.0.2.6:80
2. BPF redirect to: dst=127.0.0.1:15001 (Envoy)
3. Envoy processes HTTP, makes new connection
4. Envoy sends: src=10.0.1.5:random, dst=10.0.2.6:80
5. BPF sees MARK_MAGIC_PROXY_EGRESS, allows through
Mark magic values:
#define MARK_MAGIC_HOST_MASK 0xFF000000
#define MARK_MAGIC_PROXY_INGRESS 0x0A000000
#define MARK_MAGIC_PROXY_EGRESS 0x0B000000
#define MARK_MAGIC_IDENTITY 0x0F000000
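Decoding the mark can be sketched directly from these masks (a simplification: the real Cilium mark layout splits the identity across multiple fields, so the 16-bit extraction below is illustrative only):

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of decoding skb->mark using the mask values from the article:
 * the masked top bits carry the "magic" discriminator, and for an
 * identity-carrying mark the low bits carry the numeric identity. */
#define MARK_MAGIC_HOST_MASK      0xFF000000u
#define MARK_MAGIC_PROXY_INGRESS  0x0A000000u
#define MARK_MAGIC_PROXY_EGRESS   0x0B000000u
#define MARK_MAGIC_IDENTITY       0x0F000000u

static uint32_t mark_magic(uint32_t mark)
{
    return mark & MARK_MAGIC_HOST_MASK;
}

/* Proxy-originated traffic is recognized purely by its magic bits. */
static int mark_is_proxy_egress(uint32_t mark)
{
    return mark_magic(mark) == MARK_MAGIC_PROXY_EGRESS;
}

/* Extract the identity from an identity-carrying mark (simplified). */
static uint32_t mark_identity(uint32_t mark)
{
    return mark & 0x0000FFFFu;
}
```

This is the mechanism behind both `inherit_identity_from_host()` and the "skip policy for proxy traffic" check shown later.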
4. Encryption Integration
#ifdef ENABLE_IPSEC
encrypt_key = get_encryption_key(tunnel_endpoint, ...)
if (encrypt_key) {
set_encrypt_key_mark(ctx, encrypt_key);
set_identity_mark(ctx, src_sec_identity);
}
#endif
IPsec datapath:
- BPF marks packets for encryption
- XFRM (kernel crypto) handles actual encryption
- Encrypted packets go to physical NIC
- Remote node decrypts and delivers
WireGuard datapath:
#ifdef ENABLE_WIREGUARD
if (is_wireguard_enabled(...)) {
ret = wg_maybe_redirect_to_encrypt(ctx);
if (ret == CTX_ACT_REDIRECT)
return ret; // Redirected to cilium_wg0
}
#endif
Complete Packet Flows
Flow 1: Container → Internet (Egress)
1. Application sends packet
└─> sk_buff created in container netns
2. TC egress on container veth
└─> cil_from_container(ctx)
├─> validate_ethertype() → ETH_P_IP
├─> edt_set_aggregate() → QoS
└─> tail_call(CILIUM_CALL_IPV4_FROM_LXC)
3. Service translation (if applicable)
└─> __per_packet_lb_svc_xlate_4()
├─> Extract tuple (src/dst IP, src/dst port)
├─> Service lookup
├─> Backend selection (random)
├─> DNAT (rewrite dst IP/port)
└─> lb4_ctx_store_state() → save to skb->cb[]
4. Connection tracking
└─> tail_ipv4_ct_egress()
├─> lb4_ctx_restore_state() → read from skb->cb[]
├─> ct_lazy_lookup4(CT_EGRESS)
├─> if CT_NEW: ct_create4()
└─> tail_call(CILIUM_CALL_IPV4_TO_LXC_POLICY_ONLY)
5. Policy enforcement
└─> tail_ipv4_policy()
├─> Lookup destination endpoint
├─> __policy_can_access(src_id, dst_id, port, proto)
├─> if DENY: send_icmp4_policy_denied() + DROP
├─> if PROXY: ctx_redirect_to_proxy_hairpin()
└─> if ALLOW: continue
6. Routing decision
└─> lxc_redirect_to_host()
├─> send_trace_notify(TRACE_TO_HOST)
└─> ctx_redirect(CILIUM_NET_IFINDEX, BPF_F_INGRESS)
7. Host-side processing
└─> bpf_host TC ingress
├─> Masquerade (SNAT to node IP)
├─> Encryption (if enabled)
├─> FIB lookup
└─> Forward to physical NIC
8. Packet exits node
Flow 2: Internet → Container (Ingress)
1. Packet arrives at NIC
└─> XDP prefilter (optional)
└─> TC ingress on physical device
2. bpf_host processes
├─> Decryption (if encrypted)
├─> De-encapsulation (if tunneled)
├─> Endpoint lookup by dst IP
├─> Extract source identity from tunnel
└─> ctx_redirect(lxc_veth, BPF_F_INGRESS)
3. TC ingress on container veth
└─> cil_to_container(ctx)
├─> inherit_identity_from_host() → read skb->mark
├─> send_trace_notify(TRACE_FROM_STACK)
├─> ctx_store_meta(CB_SRC_LABEL, identity)
└─> tail_call(CILIUM_CALL_IPV4_CT_INGRESS)
4. Connection tracking
└─> tail_ipv4_ct_ingress()
├─> ct_lazy_lookup4(CT_INGRESS)
├─> if CT_REPLY: reverse NAT
└─> tail_call(CILIUM_CALL_IPV4_TO_ENDPOINT)
5. Policy enforcement
└─> tail_ipv4_policy()
├─> src_identity from CB_SRC_LABEL
├─> dst_identity = LXC_ID
├─> __policy_can_access(src_id, LXC_ID, port, proto)
└─> if ALLOW: continue
6. Local delivery
└─> tail_ipv4_to_endpoint()
├─> update_metrics(METRIC_INGRESS)
├─> ipv4_policy() → final check
└─> TC_ACT_OK → deliver to container
7. Packet enters container namespace
└─> Application receives
Flow 3: Pod → Pod (Same Node)
1. Source container egress
└─> cil_from_container(pod-A)
└─> ... (same as Flow 1, steps 2-5)
2. Routing decision
└─> Destination endpoint lookup
├─> ep = __lookup_ip4_endpoint(dst_ip)
├─> if (ep && !(ep->flags & ENDPOINT_F_HOST)) → local pod
└─> ctx_redirect(ep->ifindex, BPF_F_INGRESS)
3. Direct redirect to destination veth
└─> TC ingress on destination veth
└─> cil_to_container(pod-B)
└─> ... (same as Flow 2, steps 3-6)
4. No host networking stack traversal!
└─> Zero-copy forwarding
└─> ~1 microsecond latency
Performance Characteristics
Latency Breakdown
| Path | Operation | Time |
|---|---|---|
| Egress | Validate + metadata | 50-100 ns |
| Service translation | 200-300 ns | |
| CT lookup/create | 150-200 ns | |
| Policy lookup | 100-150 ns | |
| Routing decision | 50-100 ns | |
| Total egress | ~700 ns | |
| Ingress | Identity extraction | 30-50 ns |
| CT lookup | 100-150 ns | |
| Policy lookup | 100-150 ns | |
| Local delivery | 50-100 ns | |
| Total ingress | ~400 ns | |
| Pod→Pod | Egress + Ingress | ~1.1 μs |
Map Access Patterns
Hot path maps (accessed every packet):
- Connection tracking: LRU hash (100-200 ns)
- Policy: Hash (100-150 ns)
- Endpoints: Hash (50-100 ns)
Cold path maps (accessed occasionally):
- Services: Hash (200-300 ns, only for new connections)
- Identity: Array (20-30 ns, cached)
Tail Call Overhead
- Tail call: ~50 ns overhead
- Why acceptable? Avoids stack depth, enables modular code
- Typical packet: 3-5 tail calls
- Total overhead: ~200 ns
Debugging and Observability
Trace Events
Every packet can generate events:
send_trace_notify(ctx, TRACE_FROM_LXC, sec_label, UNKNOWN_ID,
TRACE_EP_ID_UNKNOWN, TRACE_IFINDEX_UNKNOWN,
TRACE_REASON_UNKNOWN, TRACE_PAYLOAD_LEN, proto);
Event types:
- `TRACE_FROM_LXC`: Leaving container
- `TRACE_TO_LXC`: Entering container
- `TRACE_TO_PROXY`: Redirected to Envoy
- `TRACE_FROM_PROXY`: Coming from Envoy
- `TRACE_TO_HOST`: Sent to host stack
Drop Notifications
send_drop_notify_ext(ctx, src_label, dst_label, dst_id,
reason, ext_err, direction);
Drop reasons:
- `DROP_POLICY_DENIED`: Network policy denial
- `DROP_INVALID_SRC_MAC`: L2 validation failure
- `DROP_NO_SERVICE`: Service has no backends
- `DROP_CT_NO_MAP_FOUND`: CT map missing
Metrics
update_metrics(ctx_full_len(ctx), METRIC_EGRESS, REASON_FORWARDED);
Metrics tracked:
- Packets/bytes per direction
- Drops by reason
- Policy hits
- Service translations
Access via:
cilium bpf metrics list
cilium monitor --type drop
hubble observe --pod my-pod
Integration Points
With bpf_host
// bpf_lxc redirects to host
lxc_redirect_to_host(ctx, ...)
└─> ctx_redirect(CILIUM_NET_IFINDEX, BPF_F_INGRESS)
// bpf_host redirects to lxc
ctx_redirect(lxc_ifindex, BPF_F_INGRESS)
└─> cil_to_container() entry point
With bpf_sock
// Socket LB marks packets
skb->mark = MARK_MAGIC_IDENTITY | identity
// bpf_lxc reads mark
identity = inherit_identity_from_host(ctx, ...)
With Envoy Proxy
// BPF redirects to proxy
ctx_redirect_to_proxy_hairpin(ctx, proxy_port)
// Envoy processes, sends back with mark
skb->mark = MARK_MAGIC_PROXY_EGRESS
// BPF recognizes and allows
if (magic == MARK_MAGIC_PROXY_EGRESS)
skip_policy = true;
Conclusion
The bpf_lxc.c program is the workhorse of Cilium's datapath, implementing:
✅ Core Functions:
- Per-packet service load balancing
- Stateful connection tracking
- L3/L4 network policy enforcement
- L7 proxy integration
- Local pod-to-pod optimization
✅ Advanced Features:
- Multi-cluster routing
- IPsec/WireGuard encryption
- Loopback service handling
- ARP responder
- QoS via EDT
✅ Performance:
- Sub-microsecond egress latency
- Zero-copy pod-to-pod forwarding
- Efficient map-based lookups
The program demonstrates sophisticated eBPF programming:
- Tail call chains for complex logic
- Metadata passing via skb->cb[]
- Conditional compilation for flexibility
- Tight integration with kernel networking
Combined with bpf_host.c (host-side) and bpf_xdp.c (prefilter), bpf_lxc.c forms a complete, high-performance datapath that powers Kubernetes networking, service mesh, and network security.
References
Author's Note: This deep dive is based on the source code up to the date of writing. Implementation details may vary by version and configuration.