
eBPF vs ETW: Two Generations of Kernel Observability

Why Windows ETW emits events and Linux eBPF computes them -- and what eBPF-for-Windows reveals about the convergence of two operating systems.


1. The SOC Analyst Sees the Same Thing Twice

A Security Operations Center analyst opens two Sysmon/Operational event channels side by side. One channel is streaming from a Red Hat Enterprise Linux host; the other is streaming from a Windows Server 2022 domain controller. The XML configuration is the same. The Event IDs are the same. A ProcessCreate record from either host carries the same Image, CommandLine, ParentImage, IntegrityLevel, and Hashes fields. Detection rules written against one channel match the other. To the analyst, the two operating systems are interchangeable.

Underneath, they are not even close.

On the Windows side, every event was emitted by a kernel provider -- Microsoft-Windows-Sysmon, Microsoft-Windows-Threat-Intelligence, Microsoft-Windows-Kernel-Process -- before the Sysmon user-mode service ever ran its XML filter. The kernel produced a fully formatted event, dropped it into a per-CPU ring buffer, and let user space pick it up. Every enabled event made the kernel-to-user trip in full. The filter inside Sysmon's user-mode service is what kept the on-disk log small. The wire between the kernel and the consumer carried the full firehose.

On the Linux side, no kernel module owned by Microsoft is running. The same Sysmon binary is attached to roughly twenty Linux kernel probes through the SysinternalsEBPF library. Each probe is an eBPF program: bytecode that was compiled by clang, verified by the kernel before load, JIT-compiled to native instructions, and attached to a hook inside the kernel. When execve fires, the verified program runs on the producing CPU, reads its arguments out of the kernel context, decides whether the call matches the XML configuration's predicates, and -- only then -- writes a record into a ring buffer. The events that arrive in user space were already filtered inside the kernel. The wire carries only what the configuration cares about.

The output channels match because Sysmon for Linux is engineered to look exactly like Sysmon for Windows. The substrate underneath is engineered for two different decades. ETW is from 2000. eBPF is from 2014. The fourteen-year gap shows up not in features but in how the kernel does its job.

ETW emits. eBPF computes. That gap is the entire generation difference. Everything else in this article is a consequence of it.

This article is about why those two designs exist, why the second one is strictly more powerful, why "strictly more powerful" cost the Linux kernel a new class of CVE, and what Microsoft's microsoft/ebpf-for-windows project -- now in its sixth year of development -- reveals about which design wins at the point of convergence. By the end you will know both substrates well enough to choose between them, understand their failure modes, and see why "two generations" is not marketing language but a literal description of the engineering arc.

2. A Tale of Two Lineages

In 1992, Van Jacobson and Steven McCanne at Lawrence Berkeley Laboratory wrote a small virtual machine for packet filtering. In 2000, a separate Microsoft team shipped a kernel event bus inside Windows 2000. Neither group knew the other existed. Each was solving a different version of the same problem: how do you watch the kernel from user space without owning the kernel?

The two answers ran in parallel for twenty-one years before they collided.

1992 -- The BSD Packet Filter. McCanne and Jacobson published "The BSD Packet Filter: A New Architecture for User-level Packet Capture" at USENIX Winter 1993, describing work that landed in 4.3BSD-Reno earlier in 1992. The motivation was painfully concrete: tcpdump was copying every packet through the kernel-user boundary, then discarding the ones the user did not want. BPF moved that filter into the kernel. A tiny two-register, 32-bit virtual machine evaluated a user-supplied predicate against each packet before any copy; only matching packets crossed into user space. The architectural insight that would survive thirty years is one sentence: filter where the data is produced, not where it is consumed.

eBPF (Extended Berkeley Packet Filter)

A safe, sandboxed virtual machine inside the Linux kernel that runs user-supplied programs at attached hook points. Programs are written in restricted C, compiled to a 64-bit RISC-style bytecode, statically verified before load, and JIT-compiled to native code. The "extended" version, introduced in Linux 3.18 (December 2014), generalized BPF from a packet-filter language into a general kernel-extensibility mechanism.

2000 -- Event Tracing for Windows. Microsoft shipped ETW with Windows 2000. The reference portal describes the design Microsoft had been refining since the late 1990s: a kernel-mediated event bus with three roles -- providers, sessions, and consumers -- and per-CPU lock-free ring buffers. ETW's architectural insight was the inverse of BPF's: event identity and causal order are first-class. A kernel-mediated dispatch makes them cheap. A tcpdump filter wants to throw events away. A security telemetry system wants to keep them, attribute them, and order them.

ETW (Event Tracing for Windows)

A kernel-mediated tracing facility shipped in Windows 2000. Providers (kernel or user-mode components) emit structured events to per-CPU ring buffers; sessions own the buffers and select which providers to enable at which level; consumers receive the event stream either in real time or by reading the on-disk .etl log. ETW is documented at learn.microsoft.com/.../etw/event-tracing-portal.

2003-2005 -- DTrace. Bryan Cantrill, Mike Shapiro, and Adam Leventhal at Sun Microsystems started work in 2003 on what would become the first production-grade dynamic tracing system. DTrace shipped publicly in Solaris 10 in January 2005 and was quickly ported to FreeBSD and macOS. Its central idea -- safe in-kernel scripts attached to probes, with a single language for tracing the entire system -- is the spiritual ancestor of every modern kernel observability tool, including eBPF. Wikipedia gives DTrace's initial public release as January 2005, with Sun's internal development starting around 2003. The "DTrace 2003" claim that appears in some retrospectives conflates project inception with public release; we use the 2005 ship date here and note 2003 only as a development start. Linux could not adopt it directly: DTrace is licensed under the CDDL, which is GPLv2-incompatible.

2005 -- SystemTap. Red Hat attempted to fill the Linux DTrace gap with SystemTap. The architectural compromise that doomed it: SystemTap scripts compile to a kernel module, loaded at runtime. Allowing user-supplied kernel modules to be loaded on demand is a privileged operation by definition, so production SystemTap deployments restricted use to local root. That undercut the observability use case: if you already have root, you can use any debugging tool. SystemTap survives as a niche tracing system; it did not become the Linux answer to DTrace.

1992-2014 -- classic BPF stagnates. The original BPF VM kept finding new jobs. Linux Socket Filtering ported the BSD filter into the Linux kernel in 1997. seccomp-bpf in 2012 gave it a second job: filtering system calls for sandboxing. But the language remained a 32-bit two-register packet-filter VM. It could not be extended to general kernel observability without rewriting the instruction set architecture from the ground up.

2014 -- eBPF. Alexei Starovoitov's "extended BPF" patch series landed in Linux 3.18 in December 2014, described in LWN's contemporaneous article on Starovoitov's eBPF patch set. The rewrite was thorough: 64-bit instruction set, eleven registers, maps for in-kernel state, helper calls into kernel APIs, a JIT compiler, and -- the part that mattered most -- a kernel verifier that statically proves safety before any program runs. The verifier is what turned the packet filter into a general kernel extension mechanism. Without it, every BPF program would have to be trusted; with it, untrusted user code can execute in kernel mode.

By the time eBPF shipped, Windows had ETW everywhere. Linux had auditd's pull-based audit log and a handful of perf events. Then Starovoitov rewrote BPF, and the architectural balance shifted overnight. The next decade of Linux observability was built on the new instruction set. The next decade of Windows observability stayed on ETW. The two designs ran in parallel until 2021, when Microsoft announced that eBPF would also run on Windows.

Timeline of kernel observability primitives, 1992-2026.

The diagram lays the substrate stories side by side. Each arrow is an architectural decision that constrained what came after. The next two sections walk each design end to end -- ETW first, because it is older and emission-only and easier to internalize.

3. ETW: Pure Event Emission

A natural question that turns out to be the wrong one: why didn't Microsoft just keep extending performance counters? By the late 1990s, Windows already had a mature counter facility -- perfmon, the Windows Performance Counters portal. It exposed CPU percentage, page-fault rate, queue lengths, and hundreds of other scalar metrics. If you wanted to know how loaded your system was, perfmon told you.

It also told you almost nothing useful for security telemetry.

The fix was not a faster perfmon. The fix was an entirely different shape of telemetry. ETW was that shape: push-based, per-event, kernel-attributed, with stable schemas declared up front. The contrast between perfmon (a sampling counter) and ETW (an event bus) is not parametric. The two systems answer different questions. Security needs the event-bus answer.

Provider, session, consumer

ETW's data plane has three roles, every one of them a kernel-mediated object.

A provider is a kernel or user-mode component that calls EventWrite or EtwWrite to emit a structured event. Providers identify themselves by GUID. They declare the schema of their events ahead of time: classic providers via MOF, the Vista-and-later manifest format called WEVT, or TraceLogging for self-describing events. The schema is part of the contract: a consumer that knows the provider's manifest knows the field layout of every event the provider will ever emit.

A session is a kernel object created by StartTrace. It owns a set of per-CPU buffers and a list of enabled providers, with per-provider level and keyword masks. Sessions can write events to disk (.etl files) or be consumed in real time. The .etl file extension stands for "Event Trace Log." It is the on-disk format read by Windows Performance Analyzer and by tracerpt.exe for post-hoc analysis.

A consumer is a user-mode process that calls OpenTrace and ProcessTrace and receives event callbacks. EDR agents like Sysmon, Defender, and the third-party agents that ship with Microsoft Defender for Endpoint are real-time consumers.

Provider, Session, Consumer

ETW's three-role architecture. Providers emit events into per-CPU ring buffers. Sessions are kernel objects that own buffers and select which providers to enable. Consumers are user-mode processes that read the buffers in real time or open the on-disk .etl file. The taxonomy is defined in the ETW provider documentation.

The per-CPU ring buffer

The algorithmic core of ETW is a per-CPU lock-free ring buffer. When a provider on CPU 3 calls EventWrite, the kernel formats the event according to the provider's manifest, stamps it with a QPC timestamp, and memcpys the result into the per-CPU buffer for CPU 3. A kernel writer thread drains the buffer asynchronously into the session's destination -- either an .etl file on disk or a consumer's callback queue. The producer-side cost is constant: a function call plus a buffered memcpy, all on the local CPU, with no cross-CPU synchronization.

QPC (QueryPerformanceCounter)

The Windows monotonic timestamp source used for ETW event timestamps. QPC is backed by hardware timers (TSC on modern x86, generic counter on ARM64) and provides a high-resolution counter that does not go backward.

QPC is monotonic per CPU on modern hardware, but cross-CPU ordering relies on the kernel writer thread's serialization when events from different CPUs are merged into a single output stream. Per-event timestamps from different CPUs can be ordered after the fact, but the merge happens in the writer, not in the producer.

ETW dispatch: provider -> per-CPU buffer -> session writer -> consumer.

The cost story

Microsoft's reference portal describes ETW as "high-volume, low-overhead." That qualitative claim has been the consensus practitioner finding for two decades. The most useful practical writeup is Bruce Dawson's ETW Central index, which links to more than forty blog posts on real ETW deployments and measurements. The honest summary -- anchored to Dawson's practical experience plus the architectural reason (per-CPU lock-free buffers and a memcpy per event) -- is that typical telemetry configurations sit in the low single-digit-percent CPU range, while pathological "log everything" configurations can produce measurable user-visible slowdowns, on the order of 5-10% in the worst cases. These are practitioner estimates, not benchmarked figures. Even the BenchmarkDotNet documentation for the EtwProfiler diagnoser acknowledges the cost: "In order to not affect main results we perform a separate run if any diagnoser is used." The overhead is small, but it is not zero.

The cost has a structural cause. ETW has no in-kernel filter. The producer pays the full event-formatting cost on every emission, and the only filter is the session's level and keyword mask. If you enable a provider, every event that provider emits flows through the buffer. Filtering happens at the consumer, in user mode, after the event has crossed the boundary.

The Threat-Intelligence provider

ETW providers are not equal. The most architecturally important one for security is Microsoft-Windows-Threat-Intelligence, a kernel-only provider that emits signals only the kernel can see: image loads, remote-thread creations, VirtualProtect changes that flip memory from data to executable. Only a process running under Protected Process Light with the AntiMalware signer can subscribe. That is why Defender, CrowdStrike Falcon, SentinelOne, and Carbon Black all run as PPL-Antimalware: it is the entry ticket to the kernel-only telemetry that distinguishes serious EDR from script-level monitoring.

Sysmon 6.20, released in 2018, was the version that tied ETW into the modern EDR stack as a turnkey configuration: it added the configuration schema that the cybersecurity community converged on. By 2026, the same XML configuration -- including the ProcessCreate, NetworkConnect, ImageLoad, and FileCreate event IDs -- works on both Sysmon for Windows and Sysmon for Linux. Sysmon, Microsoft's own free reference consumer authored by Mark Russinovich and Thomas Garnier, demonstrated that an XML configuration plus an ETW consumer plus protected-process status was enough to build a useful EDR. Sysmon is not Defender; it is the open shape that the commercial EDR vendors built proprietary versions of.

Closing on ETW

ETW emits. Every enabled event crosses the kernel-user boundary, fully formatted, with no in-kernel filtering language whatsoever. The session's level and keyword mask is a coarse on/off switch, not a programmable filter. Aggregation, sampling, and stack-trace folding happen in user mode, after the event is already across the boundary.

Now you can read the question that drove Starovoitov's 2014 rewrite: what if you could filter in the kernel itself? What if you could compute -- not just emit?

4. eBPF: Programmable In-Kernel Computation

The architectural inversion is one sentence. ETW is the producer telling the consumer what happened. eBPF is the consumer telling the producer what to compute. The producer is the kernel; the consumer is a user-mode process that has compiled, verified, and attached a small program that will run inside the kernel at a chosen hook. The roles are inverted, the data flow is inverted, and the trust model is inverted.

The lifecycle

A canonical eBPF program goes through six stages before it does any useful work. The flow below is the same on every Linux kernel since 3.18, with refinements added over the years for BTF (BPF Type Format), CO-RE (Compile Once, Run Everywhere), and link primitives:

1. clang -target bpf -O2 -c prog.c -o prog.o            # ELF with BTF
2. fd = bpf(BPF_PROG_LOAD, &attr)                       # kernel verifier runs
3. for each map referenced:
       map_fd = bpf(BPF_MAP_CREATE, &attr)
4. link = bpf(BPF_LINK_CREATE, kprobe|tracepoint|xdp|lsm|cgroup, fd)
5. at hook fire: JIT-compiled native code runs on the
   producing CPU, reads context, calls bpf_* helpers,
   writes to map or ringbuf
6. user space mmaps the ringbuf and consumes records

The lifecycle is documented in the canonical kernel BPF documentation index. It is worth lingering on stage 2. Between the user-space bpf() syscall and the moment the kernel hands back a file descriptor for the loaded program, a static analyzer runs. That analyzer is the most consequential piece of code in this entire article. We treat it on its own in section 5.

eBPF load and attach lifecycle: clang -> verifier -> JIT -> kernel hook.

Hooks: where programs attach

The thing that distinguishes eBPF from a packet filter is its hook surface. A hook is a place inside the kernel where a verified program can be attached, fired at the moment something happens. Linux has a lot of hooks.

Hook (eBPF)

An attachment point in kernel code where a verified eBPF program runs. Different hook types receive different context arguments: a kprobe receives the function's CPU registers; an XDP program receives a packet buffer; an LSM hook receives the security operation's parameters. The hook type also determines what helpers and map types the verifier allows.

The hook taxonomy, drawn from the kernel BPF docs and Cilium's BPF architecture reference, is broad:

  • kprobe and kretprobe -- entry and return of any non-inlined kernel function.
  • fentry and fexit -- BPF trampoline replacement for kprobes, with no int3 trap-frame cost.
  • uprobe -- any user-space symbol in any process.
  • tracepoint -- stable kernel tracepoints with version-locked schemas.
  • perf_event -- sampling-profile hooks tied to perf events.
  • XDP -- driver tail-call, before allocation of an sk_buff.
  • TC -- Linux traffic-control qdisc hooks.
  • LSM -- Linux Security Module hooks (mandatory-access-control points), available since Linux 5.7.
  • cgroup, sched, sock_ops -- policy and socket-state hooks.
eBPF hook surface across the Linux kernel.

That hook surface is what makes eBPF the universal Linux instrumentation substrate. Once a developer learns the load-verify-attach lifecycle, the same toolchain instruments a TCP retransmit, a do_sys_open call, an LSM file_open check, and an XDP fast-path drop -- all in the same language with the same verifier and the same JIT.

Maps: in-kernel state

The second piece of architecture eBPF adds over classic BPF is the map -- a kernel-managed key-value store accessible from inside a verified program and from user space. Maps are how eBPF programs hold state between invocations and how they communicate with user space.

BPF Map

A kernel-managed data structure that an eBPF program can read and write from inside the kernel, and a user-space process can read and write through the bpf() syscall. Common map types include hash, array, LRU hash, per-CPU hash, ring buffer, and program array (used for tail calls). Each map has a maximum capacity declared at creation and a verifier-checked size for keys and values.

The kernel hash-map documentation distinguishes shared and per-CPU variants. The decision between them is one of the consequential design choices in writing real eBPF code.

| Map type | Cross-CPU semantics | Update cost | Memory cost | Best for |
| --- | --- | --- | --- | --- |
| BPF_MAP_TYPE_HASH | One value per key, shared across CPUs | Atomic __sync_fetch_and_add or BPF_F_LOCK spinlock | max_entries * (key_size + value_size) | State that must be globally consistent |
| BPF_MAP_TYPE_PERCPU_HASH | Separate value slot per CPU | Non-atomic read-modify-write | max_entries * value_size * num_cpus | Counters and histograms where rate matters and snapshot consistency does not |
| BPF_MAP_TYPE_RINGBUF | Single MPSC ring with global FIFO order | Reservation spinlock on producer | Fixed buffer | Event streams whose user-space order must match cross-CPU producer order |

The per-CPU variant exists because cache-coherence cost on a contended hash slot dominates the time spent updating it; per-CPU maps remove that contention entirely at the price of cross-CPU consistency. A per-CPU counter on a 96-vCPU host occupies 96 * value_size bytes per key, but updates are local loads and stores. A shared counter on the same host is value_size bytes per key, but every increment is an atomic.

BPF Ring Buffer (ringbuf)

A multi-producer single-consumer kernel-to-user transport added in Linux 5.8 and documented at docs.kernel.org/bpf/ringbuf.html. Unlike the legacy perf_event_array (one ring per CPU), the BPF ringbuf is a single ring shared across all CPUs, with cross-CPU producer ordering preserved in the user-visible record stream.

The ringbuf documentation is explicit about why the design exists: "more efficient memory utilization by sharing ring buffer across CPUs; preserving ordering of events that happen sequentially in time, even across multiple CPUs (e.g., fork/exec/exit events for a task)." A security telemetry consumer that needs to see fork on CPU 0 before kill on CPU 1 cannot use a per-CPU ring; it needs a single MPSC ring. The trade-off is real: the producer pays a brief spinlock for slot reservation, where a per-CPU ring would pay nothing. For event streams the trade is worth it; for histograms it is not.

The aggregation pattern

The reason eBPF is strictly more powerful than ETW is captured in one bpftrace one-liner. The DSL bpftrace -- inspired explicitly by DTrace -- compiles a single-line query into a verified eBPF program:

kprobe:vfs_read { @[comm] = hist(arg2); }

This program attaches to the vfs_read kernel function. For every call, it indexes a per-CPU map by the calling process's name (comm), buckets the arg2 value (the read length) into a power-of-two histogram, and increments the bucket. Nothing crosses the kernel-user boundary while vfs_read is firing -- not at 10K calls per second, not at 10M. When the user hits Ctrl-C, bpftrace iterates the per-CPU maps from user space, merges the buckets across CPUs, and prints a histogram.

ETW cannot do this. To produce the same histogram with ETW, a consumer would have to subscribe to every vfs_read-equivalent kernel event, receive each one in user mode, compute its bucket, and update an in-process histogram. The kernel-user wire would carry the full firehose. eBPF carries only the final histogram.

The bpftrace histogram pattern in pseudocode (JavaScript-flavored):

// The bpftrace one-liner:
//   kprobe:vfs_read { @[comm] = hist(arg2); }
// lowers (conceptually) to this kernel-side and user-side flow.

// --- inside the kernel, at every vfs_read call ---
function on_vfs_read(ctx) {
  const comm = bpf_get_current_comm();
  const len  = ctx.regs.rsi;                  // arg2: read length
  const bucket = log2(len);                   // 0..63

  // per-CPU hash keyed by (comm, bucket); no cross-CPU atomics.
  const key = { comm, bucket };
  const slot = percpu_map.lookup_or_init(key, 0);
  *slot += 1;
}

// --- in user space, on Ctrl-C ---
function print_histogram() {
  const merged = {};
  for (const cpu of all_cpus) {
    for (const [key, count] of percpu_map.iter(cpu)) {
      merged[key] = (merged[key] || 0) + count;
    }
  }
  render_power_of_two_histogram(merged);
}


The kernel-side per-event cost is a few instructions plus a non-atomic increment. The user-space cost is paid once, at print time. The wire between kernel and user carries one batch read of the entire per-CPU map. ETW's equivalent would carry every single vfs_read event in full.

The instruction-count and complexity limits

Two distinct limits constrain what the verifier will accept. The constants are easy to confuse, and earlier drafts of this article confused them. The correct distinction comes straight from the kernel headers.

BPF_MAXINSNS is defined as 4096 in include/uapi/linux/bpf_common.h. This is the maximum number of bytecode instructions per program for unprivileged callers. A program longer than 4096 instructions is rejected at load time regardless of what the verifier finds.

BPF_COMPLEXITY_LIMIT_INSNS is defined as 1,000,000 in kernel/bpf/verifier.c. This is the maximum number of explored states the verifier will visit during its symbolic execution. It applies to privileged callers with CAP_BPF, who are allowed to load larger programs but still bound the cost of verifying them.

The two limits answer different questions. BPF_MAXINSNS = 4096 bounds the size of an unprivileged program. BPF_COMPLEXITY_LIMIT_INSNS = 1,000,000 bounds the cost of verification for privileged programs. Conflating them is a common error: production EDRs run with CAP_BPF plus CAP_PERFMON or root and load programs much longer than 4096 instructions, but the verifier's exploration is still bounded.

Linux 5.16 (January 2022) made kernel.unprivileged_bpf_disabled=1 the default. The change followed a series of verifier soundness CVEs, including CVE-2020-8835 and CVE-2021-3490, that were exploitable from unprivileged user space. Production EDRs run with CAP_BPF plus CAP_PERFMON or full root; the unprivileged path is reserved for sandboxed workloads where the kernel team has weighed the risk.

The JIT and the trampoline

Brendan Gregg's BPF Performance Tools, published by Addison-Wesley in 2019 (ISBN-13 9780136554820), reports a 10x to 12x speedup of the JIT over the interpreter on x86-64. The figure is approximate -- the workload, the kernel version, and the program shape all matter -- but the order of magnitude is consistent across kernel docs and measurements. The JIT is what makes eBPF practically usable inside hot kernel paths.

A second performance refinement landed in 2019 with the BPF trampoline patch series. Starovoitov's v1 cover letter introduced fentry and fexit -- BPF program attach points that use a tiny JIT-emitted dispatcher to call the attached programs directly, rather than relying on kprobe's int3 trap mechanism. The framing is worth quoting:

"Unlike k[ret]probe there is practically zero overhead to call a set of BPF programs before or after kernel function." -- Alexei Starovoitov, BPF trampoline cover letter

The v3 patch in the same series explains the structural reason: "To avoid the high cost of retpoline the attached BPF programs are called directly." kprobe goes through an indirect-jump dispatch, which on Spectre-mitigated kernels pays a retpoline penalty per call. The BPF trampoline replaces the indirect jump with a direct call patched in at attach time, eliminating that penalty entirely. The qualitative result is "practically zero overhead" relative to the function call itself. The exact numbers vary; the architectural reason does not.

Tail calls

bpf_tail_call(ctx, &prog_array, index) is a helper that, when the prog_array slot at index contains a loaded program, replaces the current program's execution context with the target program's. The architecture is documented in the Cilium BPF architecture reference, which describes the 33-call nesting ceiling: "This, too, comes with an upper nesting limit of 33 calls, and is usually used to decouple parts of the program logic, for example, into stages." The 33-call cap bounds the worst-case execution time of a chain that the verifier cannot symbolically follow (the destination is a runtime-resolved map slot, not a static call target). We will return to the security implications of tail calls in section 7.

eBPF inverts the observability model. ETW asks the kernel "what happened?" eBPF asks the kernel "compute this and tell me the answer." The asymmetry is the reason a histogram of vfs_read lengths costs nothing on the wire under eBPF, and costs a fully formatted event per call under ETW.

eBPF is strictly more powerful than ETW: programmable filter, programmable aggregation, hooks everywhere. But that power has a cost that does not exist in ETW at all. The verifier.

5. The Verifier: Where Mathematics Meets the Kernel

May 2023. NIST publishes CVE-2023-2163. The advisory describes the eBPF verifier in every Linux kernel since 5.4 quietly accepting programs it should have rejected: "Incorrect verifier pruning in BPF in Linux Kernel >=5.4 leads to unsafe code paths being incorrectly marked as safe, resulting in arbitrary read/write in kernel memory, lateral privilege escalation, and container escape." The fix was a small correction to a state-pruning heuristic. The lesson is bigger than the patch: no in-kernel verifier for a Turing-complete instruction set can be simultaneously sound, complete, and decidable. That is not a bug. It is a theorem.

Rice's theorem in the kernel

Alan Turing proved in 1936 that the halting problem is undecidable: no algorithm can decide, for every possible program, whether that program halts on every input. Henry Gordon Rice extended the result in 1953: any non-trivial semantic property of a program -- including memory safety, type safety, and bounded resource use -- is undecidable for the general case. The verifier has to decide a non-trivial semantic property: does this eBPF program access kernel memory only through valid pointers, with valid offsets, and terminate?

It cannot. Not in general. The verifier has to give up at least one of three properties:

  • Soundness -- never accept an unsafe program.
  • Completeness -- never reject a safe program.
  • Scalability -- run in polynomial time on real programs.

Jia and colleagues at HotOS 2023 formalized this trilemma for in-kernel verifiers. The paper's title is the thesis: "Kernel Extension Verification Is Untenable." The authors argue that any verifier for a kernel extension language with the expressiveness of eBPF must trade off at least one of the three properties, and that real verifiers ship by trading all three approximately.

"Kernel Extension Verification Is Untenable." -- Jia et al., HotOS 2023, sigops.org/s/conferences/hotos/2023/papers/jia.pdf

The soundness-completeness-scalability triangle: a verifier can be at most two of the three.

The Linux verifier ships with all three approximately. PREVAIL, the verifier used by eBPF-for-Windows, ships with stronger soundness and weaker completeness. The two designs occupy different points on the triangle, and the difference shows up in production.

The Linux verifier

The kernel verifier documentation describes the algorithm:

"The safety of the eBPF program is determined in two steps. First step does DAG check to disallow loops and other CFG validation. ... Second step starts from the first insn and descends all possible paths. It simulates execution of every insn and observes the state change of registers and stack."

The state the verifier tracks is a register-state lattice. Each register holds a type from a finite set: PTR_TO_CTX (a pointer to the program's context argument), PTR_TO_MAP_VALUE (a pointer into a map entry), PTR_TO_MAP_VALUE_OR_NULL (the return type of bpf_map_lookup_elem, which can be null), SCALAR_VALUE (an integer with min/max range), and so on. Each register also has a min/max range that tightens at every operation.

Verifier (eBPF)

The kernel-side static analyzer that proves termination and memory safety of every eBPF program before load. The Linux verifier is documented at docs.kernel.org/bpf/verifier.html. It uses a register-state lattice plus min/max range tracking and explores all reachable program paths with state pruning to keep the cost manageable.

Consider the canonical pattern: look up a map value, check for null, dereference. Every eBPF tracing program does some version of this.

struct value *v = bpf_map_lookup_elem(&map, &key);   // r0 := PTR_TO_MAP_VALUE_OR_NULL
if (!v) return 0;                                    // branch on r0 == 0
return v->field;                                     // deref r0 + offset(field)

The verifier traces both branches. On the taken branch (r0 == 0), the type stays nullable, and the program returns. On the not-taken branch, the verifier refines the type from PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE -- the null qualifier is gone, the dereference is bounds-checked against the map's value size, and the program is accepted.

This refinement is exactly the thing that broke in CVE-2023-2163. The bug was not in the dereference logic; it was in the state pruning that keeps the verifier's exploration tractable. Once the verifier has visited a program point with a given abstract state, it prunes subsequent visits from different predecessors with "the same" state. CVE-2023-2163 was a case where the pruner's notion of "the same state" was narrower than the predecessor's true state. The verifier accepted a program in which a register's true type at a join point did not match the type the verifier had pruned against. The program ran with hidden type confusion. Kernel arbitrary read/write followed.

PREVAIL, the abstract-interpretation verifier

PREVAIL, published by Gershuni and colleagues at PLDI 2019, takes a structurally different approach. Where Linux's verifier is a heuristic abstract interpreter with a discrete type lattice, PREVAIL uses numerical abstract interpretation over the zone domain plus intervals.

Abstract Interpretation

A general framework for static analysis, introduced by Patrick and Radhia Cousot in 1977. The analyzer computes over an abstract domain -- intervals, zones, polyhedra, octagons -- rather than concrete program states. A safe abstract operation must over-approximate every possible concrete behavior. The soundness of the analysis reduces to the soundness of the abstract domain operations, which can be proved once and reused.

In the zone domain, the abstract state can express relational constraints between registers and memory base addresses -- not just "register r0 is in [base, base + size)" but "r0 - map_base is in [0, value_size)." That extra expressiveness is what lets PREVAIL prove pointer-arithmetic safety more directly than the Linux verifier's case enumeration. Walking the same null-check program:

Program point | Linux verifier (register lattice) | PREVAIL (zone domain)
After bpf_map_lookup_elem | PTR_TO_MAP_VALUE_OR_NULL | r0 in {0} U [base, base+sz)
Taken branch (r0 == 0) | refined to NULL | r0 = 0 (equality)
Not-taken branch | PTR_TO_MAP_VALUE (qualifier dropped) | r0 - base in [0, sz)
At deref v->field | bounds-checked deref | r0 - base in [off, off+access)

Both verifiers accept the program. The difference is in the proof strategy. Linux's verifier reasons case-by-case over a finite lattice; PREVAIL reasons numerically over an abstract domain whose soundness is proved once and reused. The PREVAIL paper (Gershuni et al., PLDI 2019) showed that the zone-domain approach is sound and runs in polynomial time per fixed abstract domain.

Abstract states at each program point of the null-check pattern: Linux verifier vs PREVAIL.

The trade-off is concrete. PREVAIL accepts a broader class of programs the Linux verifier rejects (some bounded loops, some longer programs), and rejects others the Linux verifier accepts (Linux's heuristic pruning is more aggressive than zone-domain reasoning in some patterns). The contrast is a trade, not a strict ordering. Each verifier is sound with respect to its own abstract domain. The Linux verifier's CVE history is what happens when the domain itself is implemented heuristically rather than from a once-and-for-all soundness proof. The work of Paul Chaignon walks through the architectural differences in more detail.

Four CVEs, one pattern

The Linux verifier has shipped four widely-disclosed soundness bugs, each one a case where the verifier accepted a program it should have rejected.

CVE | Year | Subsystem at fault | Class
CVE-2020-8835 | 2020 | 32-bit register bounds tracking | Out-of-bounds read/write
CVE-2021-3490 | 2021 | ALU32 bitwise-op bounds tracking | Out-of-bounds R/W, arbitrary RCE
CVE-2022-23222 | 2022 | *_OR_NULL type-state tracking | Local privilege escalation via type confusion
CVE-2023-2163 | 2023 | Branch-pruning logic | Arbitrary kernel R/W

The CVE-2020-8835 NVD entry describes a flaw where the verifier "did not properly restrict the register bounds for 32-bit operations, leading to out-of-bounds reads and writes in kernel memory." CVE-2021-3490, also reported on the NVD, identifies the same class of bug in the bitwise-operation paths. The CVE-2022-23222 record is tracked across the SUSE bug, Debian DSA-5050, and the openwall oss-security disclosure thread.

The verifier is a research-grade static analyzer running as kernel code. When it gets the abstract domain wrong, the safety guarantee is a CVE. ETW does not have this failure mode because ETW does not run user-supplied code in the kernel.

ETW has driver signing as its safety mechanism. eBPF has the verifier. Microsoft's eBPF-for-Windows project asked an interesting question: what if you want both?

6. eBPF for Windows: The Convergence

On May 10, 2021, Dave Thaler of Microsoft published a blog post announcing a new project. The opening line is the kind of announcement that sounds modest and is not:

"Today we are excited to announce a new Microsoft open source project to make eBPF work on Windows 10 and Windows Server 2016 and later." -- Dave Thaler, "Making eBPF work on Windows", Microsoft Open Source Blog, May 2021

The promise was a near-source-compatible eBPF surface on NT, so that programs and toolchains written for Linux eBPF -- libbpf, bpftool, BCC, clang -target bpf -- would work on Windows with minimal change. The architectural surprise, visible only once you read the design docs, is that the Linux design does not port directly. The Windows trust model is different. The Windows code-integrity story is different. The choices Microsoft made reveal which parts of eBPF are genuinely portable and which parts are deeply Linux-shaped.

Three execution modes

The microsoft/ebpf-for-windows README decomposes the runtime into three modes:

  1. Native eBPF program (preferred, HVCI-compatible). PREVAIL verifies the bytecode in user mode. On success, the bpf2c tool transliterates each verified BPF instruction to equivalent C, MSVC compiles the C, and the result is a signed .sys kernel driver. The signed driver is what gets loaded into the kernel.
  2. JIT compiler. A user-mode service (eBPFSvc.exe) calls the uBPF JIT to produce x64 or ARM64 native code, loaded into the kernel-mode execution context. Disabled on HVCI hosts because dynamic code generation cannot be SiPolicy-signed.
  3. Interpreter. uBPF's interpreter, debug-only.

The native mode is the architecturally interesting one. It treats eBPF bytecode as a source language for a signed-driver compile, not as a target for a kernel-mode JIT. The choice is forced by Windows' kernel-mode security model.

HVCI (Hypervisor-enforced Code Integrity)

A Windows feature that uses the hypervisor to enforce that only signed code runs in kernel mode. With HVCI on, the kernel will refuse to execute any page that does not match a Code Integrity policy signature. Dynamic code generation -- the kind a JIT does -- is impossible on an HVCI host unless the JIT itself is privileged to bless the pages it produces.

bpf2c: the literal transliterator

The thing that makes the native pipeline work is bpf2c. It takes verified eBPF bytecode and emits portable C that any modern compiler can build into a kernel driver. The transliteration is one bytecode instruction per C statement. A concrete excerpt from droppacket_raw.c, the expected output for the XDP-class droppacket.c sample, shows the shape:

bpf2c output for droppacket.c (excerpt)
// Excerpt from microsoft/ebpf-for-windows
//   tests/bpf2c_tests/expected/droppacket_raw.c
// One verified BPF instruction maps to one C statement.

#pragma code_seg(push, "xdp")
static uint64_t
DropPacket(void* context, const program_runtime_context_t* runtime_context)
{
    uint64_t stack[(UBPF_STACK_SIZE + 7) / 8];
    register uint64_t r0 = 0;
    register uint64_t r1 = 0;
    // ... r2 .. r6, r10 declarations ...

    // EBPF_OP_MOV64_REG pc=0 dst=r6 src=r1 offset=0 imm=0
    r6 = r1;
    // EBPF_OP_MOV64_IMM pc=1 dst=r1 src=r0 offset=0 imm=0
    r1 = IMMEDIATE(0);
    // EBPF_OP_STXDW pc=2 dst=r10 src=r1 offset=-8 imm=0
    WRITE_ONCE_64(r10, (uint64_t)r1, OFFSET(-8));

    // ... one C statement per verified BPF instruction ...

    r0 = runtime_context->helper_data[0].address(r1, r2, r3, r4, r5, context);
}


bpf2c

The eBPF-for-Windows transliterator from verified BPF bytecode to portable C suitable for MSVC compilation. The output is a signed-driver source file, one C statement per BPF instruction, that can be compiled and signed through the same pipeline as any other kernel driver. The golden test corpus lives at microsoft/ebpf-for-windows/tests/bpf2c_tests/expected.

Four things stand out in the excerpt. One BPF instruction maps to one C statement; the // EBPF_OP_* comments name the opcode, and the line below it is the equivalent C. The eBPF VM's eleven registers become eleven C uint64_t locals; MSVC's optimizer assigns them to native registers in the final .sys. The #pragma code_seg(push, "xdp") directive names the program section the same way SEC("xdp") does on Linux. And helper calls dispatch through a runtime table -- runtime_context->helper_data[0].address(...) -- so the signed driver remains portable across helper-ABI changes.

The result is a kernel module that is a signed driver in every Windows sense of the term: HVCI checks pass, Kernel Mode Code Integrity (KMCI) is satisfied, the Authenticode chain validates. eBPF-for-Windows native mode does not invent a new in-kernel trust boundary. It composes with the one Windows already has.

eBPF-for-Windows native mode pipeline: source -> PREVAIL -> bpf2c -> MSVC -> signed .sys driver -> kernel.

The verifier moved

The most consequential architectural choice in eBPF-for-Windows is not visible in the binary. PREVAIL does not run inside the kernel. It runs inside the user-mode eBPFSvc.exe service, which orchestrates verification and the subsequent compile-and-sign pipeline. The kernel never sees an unverified BPF program. By the time anything enters the kernel, it is either a signed driver (native mode) or a JIT-produced buffer that has already passed verification in user space (JIT mode, on non-HVCI hosts).

This is a deliberate divergence from Linux. Linux runs its verifier inside the kernel because the kernel is the only place that can prevent unprivileged user space from loading unsafe programs. Windows can move the verifier out of the kernel because the kernel-mode trust boundary -- the thing that can run -- is already protected by code signing. The verifier becomes a correctness check rather than a safety check at the kernel boundary; safety at the boundary is enforced by HVCI.

Hook coverage as of 2026

The hook surface on Windows is narrower than Linux's. As of 2026, eBPF-for-Windows exposes XDP-class network hooks, BIND, SOCK_OPS, SOCK_ADDR, and process-creation and process-exit hooks via Windows Filtering Platform callouts plus a process hook surface. There is no full kprobe surface. There are no LSM-equivalent hooks. The project README labels itself "work-in-progress." The networking-subset claim in this article is not marketing softening; it is the actual hook list.

The runtime is not the cross-platform abstraction. The verifier is. PREVAIL is the contract; each OS lifts verified bytecode into its own trust model -- in-kernel JIT on Linux, signed-driver compile on Windows. eBPF-for-Windows is not "same kernel hook, different OS"; it is "same bytecode contract, different OS-specific lifting."

Cross-OS eBPF works for the networking subset today. The general kernel observability case -- arbitrary kprobes, full LSM hooks, deep process introspection -- is still Linux-only because the hooks themselves are Linux-internal. eBPF-for-Windows is a real convergence, but it is a subset convergence. Section 7 zooms out and compares the two designs across the full set of dimensions practitioners actually use to choose.

7. Head-to-Head: Performance and Trust Models

Two designs. One emits, one computes. Practitioners need to know what each one costs, where each one's edges cut, and what attack classes each design enables. The right form for that comparison is a table.

Dimension | ETW | Linux eBPF | eBPF for Windows | DTrace
In-kernel filter language | None (level + keyword mask only) | Verified bytecode | Verified bytecode | D scripting language
In-kernel aggregation | None | Maps (per-CPU and shared) | Maps | Aggregations primitive
Producer per-event cost | Constant: format + memcpy to per-CPU buffer | JIT-compiled native code at hook | JIT or signed-driver call at hook | Probe handler call
Verifier | Driver signing only | Linux in-kernel heuristic verifier | PREVAIL in user mode + KMCI | None (D is interpreted, safe-by-construction)
Verifier soundness incidents | Not applicable | 4 widely-disclosed CVEs (2020-2023) | None disclosed | None
Hook coverage | Universal across Windows API surface | Universal: kprobe, uprobe, tracepoint, XDP, TC, LSM, sched | XDP, BIND, SOCK_OPS, SOCK_ADDR, process | Solaris/BSD/macOS provider set
Cross-platform | Windows only | Linux only | Source-compatible with Linux subset | Solaris, FreeBSD, macOS (legacy)
Transport | Per-CPU ring buffer, .etl files | Ringbuf, perf_event_array, maps | Ringbuf, maps | Per-CPU buffers
Trust model | Manifest registration + driver signing | Verifier + CAP_BPF + CAP_PERFMON | Verifier + HVCI + driver signing | Privilege check + safe-by-construction
Adoption pattern | Defender, Sysmon, CrowdStrike, SentinelOne, Carbon Black | Cilium, Falco, Tetragon, Tracee, Pixie, Sysmon for Linux | Pre-production; Azure test deployments | Solaris/macOS legacy + bpftrace via inspiration
Best suited for | Forensic capture across the entire Windows API surface | Hot-path filtering and aggregation with arbitrary kernel hooks | Cross-platform networking observability | Interactive debugging on Solaris-lineage systems

The asymptotic argument

Two designs can be compared asymptotically. ETW carries N events of average size S; the kernel-to-user wire cost is Omega(NS) -- the unavoidable lower bound for streaming N events. eBPF can reduce that to O(M) where M is the aggregation size, for workloads that aggregate before the events cross the boundary. The bpftrace histogram from section 4 is the concrete example: vfs_read can fire ten million times per second while the user-side bandwidth is zero, because the per-CPU histogram never crosses the boundary until print time.

The asymmetry is the entire reason eBPF makes sense for high-frequency telemetry. It is also why nearly every major cloud-native observability tool built since 2018 runs on eBPF. When the producer rate exceeds the user-space consumption rate, you do not have a choice: you either drop events or aggregate them in-kernel. ETW can drop. Only eBPF can aggregate.

The tail-call attack class

bpf_tail_call(ctx, &prog_array, index) is powerful and its power has structural consequences. From the BPF trampoline v3 cover letter, the kernel team is explicit that the trampoline was designed in part as a replacement for tail-call-based chaining: "In many cases it can be used as a replacement for bpf_tail_call-based program chaining." The motivation is structural -- there are three attack classes implicit in the tail-call mechanism, and the trampoline avoids them.

Branch-target injection on the tail-call dispatcher. Pre-mitigation kernels exposed an indirect branch from kernel mode -- the dispatcher selecting its target from a user-controllable prog_array index. That is exactly the shape of a Spectre-v2 gadget. Mitigation: retpolined dispatcher and the BPF trampoline replacement that avoids the indirect branch entirely. The qualitative reason fentry beats kprobe is not a benchmark; it is the avoidance of a retpoline. The v3 patch cover letter spells this out: "To avoid the high cost of retpoline the attached BPF programs are called directly." Real numbers vary by microarchitecture, retpoline implementation, and the rest of the kernel-build configuration, but the structural reason is the same on every machine.

Recursion-bound bypass. The 33-call cap protects the verifier's termination proof for a single program from being bypassed by chaining, but it is a per-execution counter. A sequence of attached programs at different attach points can still produce arbitrary aggregate work. The mitigation lives in per-event scheduling, not in the verifier.

Speculative type confusion. The verifier proves a single program's register-type invariants. The target of a tail call is selected at runtime from a map, so speculative execution can execute a different program under the calling program's type-state. Mitigation: indirect-call hardening shared with the rest of the kernel.

Tail-call dispatcher as a Spectre-v2 gadget. Mitigation: BPF trampoline plus indirect-call hardening.

The ETW user-mode bypass

ETW has its own structural attack class, mentioned in section 3 and worth restating in the trust-model context. A process that wants to silence its own ETW emissions can patch ntdll!EtwEventWrite to a ret instruction in its own address space. The kernel buffer never sees the event. EDR vendors monitor for this integrity violation out of band, and use the patch itself as a high-confidence detection signal.

Trust models, side by side

ETW trusts manifest registration plus Code Integrity for kernel drivers. The kernel only emits events; the only adversary-controllable surface is the user-mode provider, and the integrity-violation tell catches the obvious attack.

Linux eBPF trusts the verifier plus CAP_BPF and CAP_PERFMON. The verifier is the kernel-mode safety boundary; capabilities gate who can load programs at all. Both have been the source of soundness CVEs and exploitation paths. Defense in depth: unprivileged eBPF off by default since 5.16, hardening of the indirect-call dispatcher, ongoing verifier work.

eBPF for Windows trusts PREVAIL plus HVCI driver signing. The verifier runs in user mode; the kernel only ever sees a signed driver or a JIT-emitted buffer that has already passed the verifier. The composition is strictly more conservative than Linux eBPF, because it stacks the verifier on top of the signing model rather than replacing it. Microsoft is using the Windows kernel-mode trust mechanism and adding the eBPF verifier to it, not choosing between them.

The next layer up from the kernel substrate is the consumer layer -- the agents and SIEM pipelines practitioners actually ship. That production stack is what determines which substrate practitioners reach for first.

8. Production Adoption: The Agent Layer

The substrate matters because the consumer stack does. On Linux, eBPF is the foundation of every serious cloud-native security and observability project. On Windows, ETW is the same. The portable subset is small but real, and it is growing.

The Linux side

Cilium is the dominant eBPF-based networking project, CNCF-graduated and shipping Kubernetes cluster networking, NetworkPolicy enforcement, and a service mesh implementation. Falco, originally created by Sysdig and now CNCF-graduated, provides eBPF-based runtime threat detection driven by a rules engine. Tetragon, a Cilium subproject, attaches eBPF programs to kprobes and LSM hooks for in-kernel enforcement -- not just observation but the ability to block. Tracee from Aqua Security is an eBPF runtime security tool. Pixie, originally Pixie Labs and now under New Relic, uses eBPF for auto-instrumentation of services running in Kubernetes.

Sysmon for Linux is the most architecturally interesting member of the list. Microsoft, the company that built ETW and Sysmon, ported Sysmon to Linux by replacing the ETW back end with eBPF kprobes via the SysinternalsEBPF library. The XML configuration schema and Event IDs are preserved, so SOC analysts see the same channel from either OS. It is the production demonstration that ETW and eBPF can be made surface-equivalent to a consumer.

The Windows side

Sysmon is the canonical ETW consumer reference design, authored by Mark Russinovich and Thomas Garnier and free from Microsoft. Microsoft Defender for Endpoint is the commercial Microsoft EDR product, ETW-driven and cloud-connected. CrowdStrike Falcon, SentinelOne, and Carbon Black are the major third-party EDRs, all built on ETW. krabsetw is Microsoft's C++ ETW consumer library; the Microsoft.Diagnostics.Tracing.TraceEvent package is the .NET equivalent.

The toolchain layer

The eBPF world comes with a toolchain that does not have a direct ETW counterpart. libbpf is the canonical C library for loading and managing eBPF programs. bpftool is the inspection utility. BCC is the older Python-binding toolkit. bpftrace is the DSL inspired by DTrace. cilium/ebpf is the Go library; aya and libbpf-rs are the Rust libraries. The toolchain coverage tells you something about the substrate: a Go developer can write an eBPF program and have it loaded by their existing service binary, because the load-verify-attach lifecycle has a Go binding.

ETW has its own toolchain -- tracerpt.exe, Windows Performance Analyzer, BenchmarkDotNet, krabsetw -- but the toolchain is shaped around consuming events, not around emitting programs into the kernel. The asymmetry of the toolchains mirrors the asymmetry of the substrates.

The decision guide

Which substrate should I pick first?

Windows EDR or building on Microsoft Defender for Endpoint. Use ETW plus Sysmon plus the Microsoft-Windows-Threat-Intelligence provider. eBPF for Windows is not yet a substitute for Defender-grade kernel telemetry; the hook surface is too narrow.

Linux runtime-security or cluster networking. Use eBPF. Pick libbpf or cilium/ebpf for the language binding. Attach LSM hooks for enforcement; fentry for observability. The verifier will fight you; that is expected.

Cross-platform networking observability with one source surface. Use eBPF for Windows and Linux eBPF together, restricted to the XDP, SOCK_ADDR, SOCK_OPS, and BIND hooks. The Linux source compiles unchanged on Windows for this subset.

Forensic capture across the full Windows API surface. Use ETW into .etl files, analyzed in Windows Performance Analyzer. Nothing else covers that breadth on Windows.

The consumer stack has converged at the surface layer: XML configs, Event IDs, EDR vendor APIs. The substrate has not, and the open problems in the next section are what stands in the way.

9. Open Problems and the Frontier

What can we not do yet? Four open problems will shape the next five years of kernel observability.

9.1 Verifier-driven false rejection

Programs that PREVAIL and a human can both prove safe still get rejected by the Linux verifier when they exceed its path-exploration complexity limits. EDR vendors end up fighting the verifier rather than writing the program they want. The workarounds are real and ugly: __attribute__((noinline)) annotations to force the compiler to emit function boundaries the verifier can prune around, explicit bound assertions that re-derive properties the compiler already knows, and bpf_loop() to externalize loops the verifier cannot trace. The HotOS 2023 thesis is exactly that this is not a bug -- it is a property of any heuristic verifier under the soundness-completeness-scalability triangle. The completeness leg is the one the Linux verifier gives up first, every time.

The frontier here is twofold. On one side, the verifier is becoming more capable: bounded loops, bpf_for_each_map_elem, kfuncs, and the trampoline-based attach mechanisms have all expanded what the verifier can prove. On the other side, PREVAIL's polynomial-time abstract-interpretation approach represents an alternative architectural lineage. Neither approach removes the underlying undecidability. Both make the rejection threshold higher.

9.2 Cross-OS eBPF ABI

RFC 9669, produced in the IETF bpf working group and published in October 2024, standardized the instruction set architecture for BPF programs. The RFC describes the 64-bit ISA, the encoding of instructions, the memory model, and the verifier's basic obligations. It is the cleanest cross-OS contract eBPF has ever had.

What the RFC does not standardize: helpers, map types, and hook semantics. Those remain Linux-defined-in-practice. The eBPF-for-Windows helper set is a subset, with extensions for Windows-specific concepts. The FreeBSD and illumos ports have their own subsets. A single observability agent that runs everywhere needs more than a standardized ISA; it needs a standardized helper API and a standardized hook taxonomy. Today, EDR vendors writing cross-OS agents ship two distinct programs that share a build system and not much else.

9.3 ETW evasion at the trust boundary

The user-mode EtwEventWrite patching attack class is roughly 2020-vintage but has not gone away. The kernel-emitted Microsoft-Windows-Threat-Intelligence provider is the current best mitigation: kernel signals cannot be patched from user mode, so an attacker who silences user-mode emissions still trips kernel-only signals on mprotect, image load, and remote thread creation.

The deeper structural question is whether any user-mode primitive can ever be tamper-resistant under hostile user-mode code. The short answer is no, which is why the answer keeps moving the trust boundary into the kernel -- through PPL, through LSM, through signed drivers. On Linux, the same pattern shows up: hostile-user-mode-resistant telemetry must run inside the kernel, which is why the LSM hooks are the part of the eBPF hook surface that matters most for EDR.

9.4 Hot-path overhead at scale

Production environments routinely run Falco, Cilium, and a vendor EDR on the same kernel, each attaching probes to the same hook. The marginal cost of an eBPF kprobe on a five-million-events-per-second syscall is not zero, and the cost compounds non-linearly when three different agents attach to the same hook with three different programs.

The current partial mitigations are real. fentry/fexit plus the BPF trampoline removed the per-attach trap-frame cost. kprobe.multi, added in Linux 5.18, lets a single program attach to multiple functions with one trampoline. BPF-link iteration lets one agent observe what another has attached. But none of these compose perfectly: three different vendors with three different agents end up with three different trampolines on the same function. The structural fix is trampoline sharing, and the implementation is attach-type-specific. The multi-agent attach problem is the eBPF version of a familiar systems issue: when N independent consumers each install their own instrumentation at the same point, the cost is N times the cost of one. Linux has solved this once for kprobes (with kprobe.multi) and is solving it again for the BPF trampoline. Whether the same pattern can be made cheap for fentry attaches across LSM hooks is an open implementation question.

The frontier of kernel observability is not "build a new substrate." It is "make the existing substrates compose under multi-tenant production load."

10. Two Generations

Return to the SOC analyst from section 1. The Sysmon Operational channel looks the same on both hosts. Now you know why -- and also why the similarity is a deliberate engineering choice rather than a coincidence.

ETW is mature, has full Windows coverage, is emission-only. It is a catalog of events. Every Windows subsystem registers a provider, every provider declares a manifest, every event has a stable schema. A consumer that knows the manifest knows what to expect. The trust boundary is the kernel-mode driver signing model. The cost is that aggregation, sampling, and filtering all happen in user space, after the event has crossed the boundary.

eBPF is programmable, has in-kernel filtering and aggregation, and has a verifier. It is a language for asking questions of the kernel, not a catalog of pre-defined answers. The trust boundary is the verifier, which is a research-grade static analyzer running as kernel code. Linux's verifier shipped four widely-disclosed soundness bugs in four years. PREVAIL strengthens the soundness leg at the cost of a more conservative completeness story. The trade-offs are not finished.

eBPF-for-Windows is the convergence experiment. The native mode -- PREVAIL plus bpf2c plus MSVC plus a signed .sys driver -- is the first cross-OS-portable kernel-observability primitive. As of 2026 it covers a networking subset of hooks, not the full Linux surface. That gap is not architectural; it is a list of hooks Microsoft has not yet exposed. The pattern is generalizable: cross-OS observability lives in the verifier, not in the runtime, and each OS lifts verified bytecode into its own trust model.

The generation gap is literal. ETW (2000) is an event bus. eBPF (2014) is a programmable kernel substrate. Both will still ship in 2035. Both will still be the right answer for some workloads. The interesting work for the next decade is in the convergence layer -- helper-API standardization, hook-point taxonomy alignment, verifier completeness -- and in the multi-tenant production engineering that makes ten different agents on one kernel cheaper than ten times one agent.

Kernel observability has matured from event emission to programmable kernel computation. That generation gap is why eBPF-for-Windows -- a small, work-in-progress project -- is one of the more architecturally significant operating-system-telemetry events of the last decade. The portable abstraction is not the runtime. It is the static analyzer.

Frequently asked questions

Is eBPF replacing ETW on Windows?

No. As of 2026, eBPF for Windows covers a networking-heavy subset of hooks -- XDP, BIND, SOCK_OPS, SOCK_ADDR, and process creation and exit -- and is not yet a substitute for Defender-grade kernel telemetry. ETW remains the canonical Windows observability substrate. The convergence between the two is real for the networking subset, and is the work-in-progress for the rest of the surface.

Why does Linux's eBPF verifier have soundness CVEs?

Because it is a heuristic abstract interpreter on a Turing-complete ISA, and Rice's theorem rules out any verifier that is simultaneously sound and complete for a non-trivial semantic property; the practical trilemma adds scalability as a third leg. Real verifiers ship by approximating all three, and the soundness leg fails first when state pruning loses information at a join point. CVE-2023-2163, CVE-2022-23222, CVE-2021-3490, and CVE-2020-8835 are all instances of that pattern.

Can I write one observability agent for Linux and Windows?

For the networking subset (XDP, SOCK_ADDR, SOCK_OPS, BIND), yes -- eBPF for Windows is source-compatible with Linux eBPF for those hooks. For arbitrary kprobes or LSM hooks, no -- those hooks are Linux-internal and eBPF for Windows does not expose equivalents. Cross-platform agents typically ship two binaries that share a build system.

Is unprivileged eBPF safe to leave enabled?

Since Linux 5.16 (January 2022), unprivileged BPF is disabled by default (the kernel.unprivileged_bpf_disabled sysctl). Production EDRs run with CAP_BPF plus CAP_PERFMON or root. Unprivileged eBPF was the entry point for several verifier CVEs, so the conservative default is correct.

What's the difference between kprobe and fentry?

A kprobe is a runtime breakpoint mechanism: the kernel patches a trap instruction at the target address, and the trap handler invokes the attached eBPF program. fentry uses the BPF trampoline -- a small JIT-emitted dispatcher that calls attached BPF programs with a direct call, avoiding the retpoline penalty an indirect dispatch would pay on Spectre-mitigated kernels. Starovoitov's framing: "practically zero overhead" for fentry, relative to the kprobe trap-frame cost.

Does ETW have any programmable filter at all?

Nothing programmable. ETW sessions filter by provider, keyword, and level, and newer releases add limited declarative payload filters -- but there is no way to run computation per event. Any per-event computation -- counting, sampling, stack-trace folding, downsampling -- runs in user mode on the consumer side, after the event has crossed the kernel-user boundary. The lack of an in-kernel filter language is the structural reason eBPF can do things ETW cannot, like aggregate ten million vfs_read calls per second into a histogram without saturating the wire.

How does Sysmon for Linux work without ETW?

Sysmon for Linux replaces the ETW back end with eBPF kprobes via Microsoft's SysinternalsEBPF library. The XML configuration schema, Event IDs, and Operational channel output are preserved, so a SIEM consumer sees identical telemetry from either OS. It is the production demonstration that ETW and eBPF can be made surface-equivalent to a consumer.

Study guide

Key terms

ETW
Event Tracing for Windows. The Windows 2000-onward kernel-mediated event bus, with providers, sessions, consumers, and per-CPU ring buffers.
eBPF
Extended Berkeley Packet Filter. A safe, sandboxed kernel virtual machine introduced in Linux 3.18 (2014) that runs verified user-supplied bytecode at attached hook points.
Verifier
The kernel-side static analyzer that proves termination and memory safety of every eBPF program before load. The Linux verifier uses a heuristic register-state lattice; PREVAIL uses zone-domain abstract interpretation.
BPF Map
A kernel-managed key-value store accessible from inside an eBPF program and from user space. Types include hash, array, per-CPU hash, and ring buffer.
Ringbuf
The BPF ring buffer map type (Linux 5.8). A multi-producer single-consumer transport that preserves cross-CPU event ordering.
HVCI
Hypervisor-enforced Code Integrity. The Windows feature that uses the hypervisor to enforce kernel-mode code signing. Blocks dynamic kernel-mode code generation by default.
PREVAIL
The user-mode eBPF verifier used by eBPF for Windows. Based on numerical abstract interpretation over the zone domain plus intervals, with formal grounding in Gershuni et al. PLDI 2019.
bpf2c
The eBPF-for-Windows transliterator that emits portable C from verified BPF bytecode, one C statement per BPF instruction. The C is compiled by MSVC into a signed .sys driver.

Comprehension questions

  1. Why did performance counters fail for security telemetry?

    Three structural reasons: sampling-rate floor (counters aggregate at the consumer's query rate, hiding individual events), no event identity (a count tells you N happened, not which user did what), and no causal order (two counters sampled in sequence are not causally ordered with respect to the events they describe).

  2. What three properties does the soundness-completeness-scalability triangle say a verifier can't have all of?

    Soundness (never accept an unsafe program), completeness (never reject a safe program), and scalability (run in polynomial time on real programs). Rice's theorem implies no decision procedure for a non-trivial semantic property on a Turing-complete ISA can have all three. Real verifiers must trade off.

  3. How does eBPF for Windows lift verified bytecode into the Windows kernel?

    In native mode, PREVAIL verifies the bytecode in user space. On success, the bpf2c tool transliterates each verified BPF instruction to one C statement, MSVC compiles the C to a signed .sys kernel driver, and the kernel loads the driver through the standard Authenticode / HVCI / KMCI signing pipeline.

  4. Name two structural attack-class implications of bpf_tail_call.

    Branch-target injection on the tail-call dispatcher (an indirect jump from kernel mode selecting its target from a user-controllable map slot is a Spectre-v2 gadget) and speculative type confusion (the verifier proves a single program's register types, but a tail call's target is a runtime-resolved map slot, so speculative execution can run a different program under the wrong type-state).