Parag Mali - tag: azure

Privileged Identity Management: How a Two-State Role Assignment Retired Standing Admin

noreply@paragmali.com (Parag Mali) — Mon, 25 May 2026 00:00:00 GMT

**Standing Global Administrator was never a design choice. It was the only posture a single-state role-assignment object could produce.** Microsoft Entra PIM added one field to that object -- `type: eligible | active` -- and everything downstream (activation policies, audit logs, access reviews, six PIM Alerts, PIM-for-Groups, PIM-for-Azure-Resources, GDAP, Lighthouse, PIM with Conditional Access) is a structural consequence of that single change. The pattern works for human users. The open boundary in 2026 is application identities -- service principals, managed identities, OAuth consent grants -- which route around PIM entirely via the Azure Instance Metadata Service endpoint at `169.254.169.254`, the bypass class Andy Robbins documented in June 2022 and MITRE ATT&CK now maps to T1078.004.

1. The Tenant with Zero Standing Global Administrators

At 14:03:01 on a Tuesday in 2026, alice@contoso.com became Global Administrator of her company's Microsoft Entra tenant. At 15:03:01 the same day, she stopped being one. In between, she restored a deleted user, exported an audit log, and produced a single PIM record: Justification reads "incident MSRC-2026-PIM-12345, ticket SNOW-INC-987654"; Approver reads "bob@contoso.com (decided 14:02:17)"; ActivatedAt and ExpiredAt differ by exactly PT1H. The SOC 2 auditor signed it off without follow-up questions.

The 2015-vintage version of the same tenant looked nothing like this. Twelve standing Global Administrators. No multifactor challenge at privilege use. No approval workflow. No justification field. No audit trail beyond ordinary sign-in logs. A single phish of any one of those twelve identities was tenant takeover. The math required no sophistication: the attack surface for "Global Administrator of contoso.com" equalled the union of twelve personal attack surfaces, indefinitely.

What changed between the two tenants is not a habit, not a policy, not a culture shift. It is a single field on a single object inside Microsoft Entra ID.

Key idea: Standing admin was never a deliberate design decision. It was the only deployment posture a single-state role-assignment object could produce. Once Microsoft made the role-assignment object two-state, JIT admin became expressible -- and standing admin became visibly the anti-pattern it had been since 1975.

To explain that field, and to explain why it took forty-one years to ship, we start where the principle did: a 1975 paper by two MIT researchers who knew what privilege should look like but had no mechanism to enforce it.

2. The Default Wasn't a Decision

Who designed the standing Domain Admin pattern? No one. It was the only assignment category Active Directory shipped with.

A forty-year deployment posture with no author. That is the first thing to internalize. Standing admin is what happens when a data model offers exactly one assignment category and operators still have real work to do. Every later "best practice" was an attempt to talk operators out of the one tool they had been given.

1975: The principle without a mechanism

In September 1975, Jerome Saltzer and Michael Schroeder published The Protection of Information in Computer Systems in the Proceedings of the IEEE [@saltzer-schroeder-1975]. The paper is a survey of secure-systems design, organized around eight named design principles that the authors crystallized from work on Multics and other early protected operating systems. Both authors were affiliated with MIT's Project MAC and the Department of Electrical Engineering and Computer Science [@saltzer-mit-meta].

The sixth principle, named Least Privilege, is the one every later JIT-admin product cites:

"Every program and every user of the system should operate using the least set of privileges necessary to complete the job." -- Saltzer & Schroeder, *The Protection of Information in Computer Systems*, 1975, Design Principle (f), the sixth of eight [@saltzer-schroeder-1975]

Least privilege: Design Principle (f), the sixth of eight, in the 1975 Saltzer and Schroeder paper. The principle is correct, parsimonious, and -- for four decades after publication -- mechanically unenforceable for the temporal case. Static enforcement (ACLs, capability lists, ring boundaries) was tractable in 1975; bounding the time interval during which a privilege is held was not.

Read the principle carefully. It does not say "every user should hold the least set of privileges." It says they should operate using the least set of privileges. The two formulations look identical until you ask what a person does between bursts of administrative work. A user who holds the privilege "permanently active" is operating using it permanently, whether they touch the system or not. The 1975 paper points at the temporal dimension and walks past it. The worked examples cover static mechanisms -- protection rings, access control lists, capability tickets -- not time-bounded ones. The principle was correct. The mechanism did not yet exist.

For the next forty years, every approximation tried to compensate. UNIX sudo (1980) bound elevation to a single command. Kerberos delegation (1988) bound impersonation to a ticket. Windows DACLs and Active Directory groups (1993 and 2000) bound access to a static membership list. None made temporal least privilege a first-class data-model property. None let an operator say "I am eligible to be Domain Admin, but I am not Domain Admin right now."

Microsoft's 2014 *Mitigating Pass-the-Hash v2* whitepaper introduced a three-tier administrative model. Tier 0 is identity-system-critical: domain controllers, ADFS, PKI, anything whose compromise gives forest-wide privilege. Tier 1 is enterprise servers and business-critical applications. Tier 2 is user workstations and end users. The enforcement rule is one sentence: an administrator credential for Tier N must never be exposed to a system at a higher (numerically larger) tier. Microsoft has progressively retired this framing in favour of the Enterprise Access Model, which we revisit in section 6.

2000-2013: Group membership as a boolean

When Active Directory shipped with Windows 2000 on February 17, 2000 [@ms-news-windows-2000-launch], privileged access was structurally a boolean property of the principal. A user was either a member of BUILTIN\Administrators, Domain Admins, Enterprise Admins, or Schema Admins, or they were not. The membership lived in the directory as the member attribute on the group object (and the memberOf back-link on the user). It was set when assignment was made, unset when an administrator manually revoked it. No third state. No attribute could hold one.

Standing admin: A privileged identity whose role assignment is active and permanent. The role's permissions are granted continuously, regardless of whether the principal is currently exercising the privilege. Standing admin is the default state of any pre-PIM tenant and the deployed-reality state of most AD-only environments through 2026.

Kerberos's Privilege Attribute Certificate -- the PAC -- carried the user's group SIDs forward into every Kerberos ticket the user obtained. A ticket's lifetime was bounded; the SID set inside it was not. There was no per-membership TTL anywhere in the system. If you wanted "Alice is Domain Admin between 14:00 and 15:00 today and not otherwise," the directory had no machinery to express it. Alice was Domain Admin permanently, or not at all.

Note: The Privilege Attribute Certificate is the data structure inside a Kerberos ticket that lists the user's group SIDs. Pre-2016 Active Directory had no per-membership TTL metadata in the PAC. There was nowhere in the existing schema to put an expiry timestamp, which is why on-prem JIT membership later required a separate forest rather than an in-directory mechanism.

Twenty years of deployment matched the data model exactly. A typical 2010-vintage enterprise ran ten to thirty standing Domain Administrators across business units, because manually adding and removing membership for each task was untenable at human scale. The data model did not punish standing membership; the operator chose the only category the directory offered.

December 2012: Microsoft names the failure mode

In December 2012, Patrick Jungles, Mark Simos, Aaron Margosis, Roger Grimes, Laura Robinson and the Microsoft Trustworthy Computing team published Mitigating Pass-the-Hash and Other Credential Theft, Version 1 [@pth-download-center], [@berkouwer-pth-2013]. It is the first formal Microsoft acknowledgment that credential-theft propagation through Active Directory was not a software defect to be patched but a structural property of standing admin membership.

The argument is direct. If twelve Domain Admins exist, the attack surface of "Domain Admin of contoso.local" is the union of those twelve people's personal attack surfaces. Any one gets phished, or gets hash-extracted from a Tier-1 server they accidentally signed into, and the attacker has Domain Admin permanently. The MIM PAM documentation later restated the failure in one sentence: "Today, it's too easy for attackers to obtain Domain Admins account credentials, and it's too hard to discover these attacks after the fact" [@ms-learn-mim-pam-overview].
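The union argument can be made numeric with a back-of-envelope sketch. The per-admin compromise probability and the independence assumption below are illustrative choices for this example, not figures from the whitepaper; the function name is hypothetical.

```javascript
// Hypothetical model: each standing admin is compromised independently with
// probability p over some period. Both p and independence are assumptions.
function probabilityAnyCompromised(p, adminCount) {
  if (!(p >= 0 && p <= 1)) {
    throw new RangeError('p must be between 0 and 1');
  }
  if (!Number.isInteger(adminCount) || adminCount < 0) {
    throw new RangeError('adminCount must be a non-negative integer');
  }
  // P(at least one compromised) = 1 - P(none compromised) = 1 - (1 - p)^n
  return 1 - Math.pow(1 - p, adminCount);
}

const perAdminRisk = 0.05; // illustrative value only
for (const admins of [1, 2, 12]) {
  console.log(admins, probabilityAnyCompromised(perAdminRisk, admins));
}
```

Under these assumptions the tenant-level risk grows with every standing admin added, which is the structural point the whitepaper makes without needing any particular value of `p`.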

2014: The tier model arrives, the mechanism does not

The 2014 update -- Mitigating Pass-the-Hash, Version 2 [@pth-download-center] -- generalized the threat model and introduced the Tier-0 / Tier-1 / Tier-2 framing as a structural mitigation. v2 said two things clearly that v1 had only implied. First, standing membership in Tier-0 groups was the root cause, not a downstream defect. Second, the mitigation pattern -- isolate tiers, reduce the standing count, use dedicated Privileged Access Workstations -- was guidance, not a mechanism. Microsoft Trustworthy Computing did not yet have a product that could mechanically time-bound group membership in Active Directory.

v2 named the problem, drew the threat model, and recommended the structural fix. What it could not do was ship a mechanism. The mechanism would come, but on the wrong side of the cloud boundary.

3. The On-Prem Detour: MIM 2016 PAM, Bastion Forests, and Shadow Principals

Microsoft's first mechanical JIT-admin product was not in the cloud. It was on-premises, and it required a separate Active Directory forest.

Stop and re-read that. To bound the duration of a group membership in pre-2016 Active Directory, Microsoft had to build a different directory and inject SIDs from one into the other across a trust. The reason was the data model. The production forest's member attribute had no TTL field. Adding one meant changing the AD schema. Changing the schema meant a Windows Server release. So while the schema change was in flight, Microsoft shipped the on-prem JIT-admin product on a different architecture: ask the operator to stand up a second forest whose only job was to issue time-bounded SIDs into the first.

August 6, 2015: MIM 2016 ships PAM

On August 6, 2015, Microsoft Identity Manager 2016 reached general availability and shipped a new capability named Privileged Access Management [@ms-learn-mim-pam-overview]. The architecture is the interesting part. MIM PAM uses three primitives that, together, give Active Directory a mechanically time-bounded group membership for the first time:

- A bastion forest -- an entirely separate Active Directory forest, sometimes called the "red" forest or "admin" forest, where privileged accounts live.
- A one-way PAM trust from the production forest to the bastion forest, configured for selective authentication.
- Shadow principal objects in the bastion forest, each carrying a SID that names a real privileged group in the production forest.

Bastion forest: A separate Active Directory forest dedicated to housing privileged accounts. In MIM 2016 PAM the bastion forest holds shadow-principal objects whose SIDs point at production-forest privileged groups; a one-way PAM trust lets the production forest accept those SIDs in incoming Kerberos tickets for a bounded duration.

Shadow principal: An Active Directory object (schema class `msDS-ShadowPrincipal`, introduced in Windows Server 2016) that represents a foreign user, group, or computer in the bastion forest and carries an `msDS-ShadowPrincipalSid` attribute populated with the SID of a production-forest privileged group. Membership in a shadow principal results in that production-forest SID being added to the requesting user's Kerberos PAC for the membership TTL.

The activation flow is direct. A user in the bastion forest requests privilege through the MIM Portal. An approver decides. MIM writes a TTL-bounded membership in the appropriate shadow principal, with the TTL enforced by the Windows Server 2016 temporal-group-membership feature [@teal-esae3]. The bastion KDC injects the production-forest SID into the user's Kerberos PAC. The production forest accepts that SID across the PAM trust. After the TTL expires, subsequent ticket renewals exclude the privileged SID, and the user no longer holds the privilege.
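The flow above can be sketched as a small model of TTL-bounded shadow-principal membership. This is an illustration only: the object shapes, the function names `grantTimeBoundMembership` and `sidsForTicket`, the account names, and the SID value are all hypothetical, and nothing here is a MIM or Active Directory API.

```javascript
// Hypothetical in-memory model of one shadow principal in the bastion forest.
const shadowPrincipal = {
  name: 'PRIV\\CORP.DomainAdmins',
  // Placeholder SID standing in for the production forest's Domain Admins group.
  shadowPrincipalSid: 'S-1-5-21-1111111111-2222222222-3333333333-512',
  members: [] // entries of the form { user, expiresAt }
};

// What the approval step writes: a membership that carries its own expiry.
function grantTimeBoundMembership(principal, user, now, ttlMinutes) {
  const expiresAt = new Date(now.getTime() + ttlMinutes * 60 * 1000);
  principal.members.push({ user, expiresAt });
  return expiresAt;
}

// What the bastion KDC would consult when issuing or renewing a ticket:
// only memberships whose TTL has not elapsed contribute a SID to the PAC.
function sidsForTicket(user, principals, now) {
  return principals
    .filter(sp => sp.members.some(m => m.user === user && m.expiresAt > now))
    .map(sp => sp.shadowPrincipalSid);
}

const grantedAt = new Date('2026-01-06T14:00:00Z'); // arbitrary example time
grantTimeBoundMembership(shadowPrincipal, 'PRIV\\alice', grantedAt, 60);

// During the TTL the production SID is injected; after it, it is not.
console.log(sidsForTicket('PRIV\\alice', [shadowPrincipal], new Date('2026-01-06T14:30:00Z')));
console.log(sidsForTicket('PRIV\\alice', [shadowPrincipal], new Date('2026-01-06T15:00:01Z')));
```

The design point the sketch tries to capture is that the expiry lives on the membership itself, so revocation needs no operator action: the next ticket simply omits the SID.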

flowchart LR
    subgraph BASTION["CORP-PRIV bastion forest"]
        A["Privileged user account"]
        SP["Shadow principal (msDS-ShadowPrincipal) carries production SID, TTL"]
        BKDC["Bastion KDC"]
        A -->|"Time-bound membership"| SP
        SP --> BKDC
    end
    subgraph PROD["CORP production forest"]
        DA["Domain Admins"]
        PKDC["Production KDC"]
    end
    BKDC -->|"Kerberos ticket carries injected SID via PAM trust"| PKDC
    PKDC -->|"SID in PAC grants membership for TTL only"| DA

October 15, 2016: Windows Server 2016 makes the mechanism real

For the first fourteen months of MIM 2016's life, the full feature did not work. The temporal-group-membership and shadow-principal schema classes that MIM PAM depends on are AD primitives that arrived only with Windows Server 2016, which reached general availability on October 15, 2016 [@ms-learn-lifecycle-ws2016]. Microsoft Learn states the requirement directly: "With Windows Server 2016, PAM features of time-limited group memberships and shadow principal groups are built into Windows Server Active Directory" [@ms-learn-raise-bastion], and "All domain controllers in the bastion environment for the PRIV forest must be Windows Server 2016 or later" [@ms-learn-raise-bastion].

Note: The PAM trust is technically a forest trust with selective authentication enabled. The selective authentication flag is what prevents the bastion forest's privileged identities from being usable for anything other than the explicit shadow-principal SID injection -- without it, the bastion forest would itself become a sprawling privileged-access surface.

This is the moment AD itself gains a temporal least-privilege primitive, forty-one years after Saltzer and Schroeder published the principle. The mechanism is real, but the operational profile is brutal.

Three reasons it did not generalize

MIM PAM solved exactly one problem and could not be extended to the next. Three structural constraints kept it confined to a niche.

First, it was on-premises only. A bastion forest is an Active Directory artifact. Microsoft Entra ID, Office 365, and Azure RBAC role assignments live in a different identity system, with no concept of a forest, no PAM trust target, and no place to plug a shadow-principal object. MIM PAM had no cloud story, and by 2015 the cloud was already where most new Microsoft privileged-access surfaces were being deployed.

Second, the operational complexity filtered out everyone except the most security-mature shops. A bastion forest is a separate Active Directory forest, with its own domain controllers, replication, backup, disaster recovery, and PKI implications. The deployment also requires MIM Service, MIM Portal, MIM Web Service, and SQL Server. Auditing the PAM trust correctly is itself non-trivial work. Microsoft Learn now positions MIM PAM as appropriate only for isolated, non-Internet-connected deployments [@ms-learn-mim-pam-overview]; the verbatim positioning and the MIM 2016 lifecycle details are in the note below.

Note: Microsoft Learn states MIM PAM is "not recommended for new deployments in Internet-connected environments" and positions it for "isolated AD environments where Internet access is not available" [@ms-learn-mim-pam-overview]. MIM 2016 itself remains in extended support through January 9, 2029 [@ms-learn-mim-2016], and Microsoft has shipped SP3 compatibility updates for SharePoint Subscription Edition, Exchange SE, and SQL Server 2022 -- but the cloud-first Entra PIM path is the canonical answer for new tenants.

Third, the forest-functional-level dependency delayed real deployment by more than a year. Shadow principals were not usable until Windows Server 2016 reached GA in October 2016. MIM 2016 had been generally available since August 2015. For its first fourteen months in market, the headline JIT-admin feature could not be configured at full fidelity. By the time Windows Server 2016 shipped, Microsoft was already operating its cloud PIM in production.

What the on-prem detour reveals about the cloud's shape

MIM PAM mechanically bounds membership in groups via shadow principals in a separate forest. The cloud has no concept of a forest. So the cloud-native mechanical bound must attach to the assignment object directly, not to the group object indirected through a separate forest. The cloud needed a new assignment-category type, not a new forest topology.

The cloud does not have a forest. It has a role-assignment object. What if that object grew a second state?

4. The Breakthrough: A Two-State Role-Assignment Object

By August 2015, the same month MIM 2016 PAM reached general availability for the on-premises case, the Microsoft Identity Division had already put something different into preview for the cloud. They shipped a role-assignment object with one new field. That field changed everything that came after it.

The 2015 preview

Alex Simons's August 27, 2015 capability-update post on the CloudBlogs (now migrated to Microsoft Tech Community) is the first public articulation of what Azure AD PIM was building [@simons-2015-aug]. It introduced four surfaces: an eligible assignment category distinct from active, multifactor authentication required at activation, security alerts that watched for privileged-role anomalies, and what the post called Security Reviews -- the precursor to access reviews. The architecture under those four surfaces is the load-bearing part: a single new field on the role-assignment object.

On September 15, 2016, Azure AD Premium P2 reached general availability and carried the first generally-available cloud-native PIM, attributed to Joy Chik (then Corporate Vice President of the Identity Division) and the Identity engineering team [@techcommunity-p2-ga]. Eligible-versus-active was now a billable, supported, production-grade feature.

The one-function spine

Read this carefully. It is the article's central claim.

Key idea: Standing admin was the default not because anyone thought it was secure, but because the role-assignment object had only one state. PIM's contribution is to add a second state -- eligible -- and to make the transition from eligible to active a gated, audited, time-bounded operation that is by definition mediated by PIM.

The principle was Saltzer and Schroeder, 1975. The recognition that standing admin was the failure mode was Mitigating Pass-the-Hash, 2012 and 2014. The on-premises mechanism was MIM 2016 PAM. The cloud answer is a different shape entirely: not a new directory and a SID-injection trust, but a single field on the assignment object itself.

Microsoft Learn documents the resulting terminology in the PIM overview. A principal -- user, group, service principal, or managed identity -- can be eligible or active for a role, and either assignment can be permanent or time-bound [@ms-learn-pim-configure]. The same page elevates a forty-year-old phrase into a product term: "principle of least privilege access -- A recommended security practice in which every user is provided with only the minimum privileges needed to accomplish the tasks they're authorized to perform" [@ms-learn-pim-configure]. The 1975 sentence is now a glossary entry inside a 2026 product, and the product has a mechanism that makes the sentence enforceable.

The formal tuple

Concretely, a PIM-managed role assignment is a 5-tuple. Let $A = (p, r, s, t, d)$ where $p$ is the principal, $r$ is the role, $s$ is the scope, $t \in \{\text{eligible}, \text{active}\}$, and $d \in \{\text{permanent}, \text{time-bound}[s_0, e_0]\}$. The activation transition is

$$\text{activate}: A_{t=\text{eligible}} \longrightarrow A_{t=\text{active},\ d=\text{time-bound}[\text{now},\ \text{now}+\Delta]}$$

subject to the per-role activation policy. The interesting part is what the tuple makes expressible:

RoleAssignment = {
    principal:  user | group | service principal | managed identity,
    role:       Entra directory role | Azure RBAC role | group membership | group ownership,
    scope:      directory | management-group | subscription | resource-group | resource | group,
    type:       eligible | active,
    duration:   permanent | time-bound[start, end]
}

activate: eligible_assignment -> active_assignment   // PIM-mediated, gated, audited

Eligible assignment: A PIM-managed role assignment that grants no privilege until the principal invokes `activate()`. The eligible assignment is the standing relationship between principal and role; the active assignment is the time-bounded materialization that follows when the activation policy is satisfied [@ms-learn-pim-configure].

Active assignment: A PIM-managed role assignment that grants the role's permissions for the duration of the assignment. Active assignments are either permanent (the legacy pre-PIM posture, or an explicit permanent-active PIM assignment) or time-bound (the result of an `activate()` call on an eligible assignment) [@ms-learn-pim-configure].

flowchart TD
    subgraph Permanent["Permanent duration"]
        PE["Permanent eligible -- standing eligibility, no privilege held"]
        PA["Permanent active -- legacy standing admin"]
    end
    subgraph TimeBound["Time-bound duration"]
        TE["Time-bound eligible -- standing eligibility with end date"]
        TA["Time-bound active -- JIT admin after activate()"]
    end
    PE -->|"activate()"| TA
    TE -->|"activate()"| TA
    TA -->|"expire or deactivate()"| PE
    PA -->|"legacy posture being retired"| PE

The grid has only four cells. Permanent active is the pre-PIM world, the standing-admin posture every later best practice has been trying to retire. Time-bound active is the JIT-admin state, materialized only at the moment of work and expired shortly after. The two eligible states -- permanent or time-bound -- are the standing relationships between a principal and a role that grant no privilege at rest. The expressive change is small. The deployment consequences are total.
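The four cells can be sketched as a pair of small functions over the `RoleAssignment` record. The field names follow that record; the `duration.kind` encoding, the posture labels, the function names `classify` and `grantsPrivilegeAt`, and the example dates are this sketch's own assumptions.

```javascript
// Map an assignment onto one of the four cells of the grid.
function classify(assignment) {
  const { type, duration } = assignment;
  if (type !== 'eligible' && type !== 'active') {
    throw new TypeError(`unknown assignment type: ${type}`);
  }
  if (duration.kind !== 'permanent' && duration.kind !== 'time-bound') {
    throw new TypeError(`unknown duration kind: ${duration.kind}`);
  }
  if (type === 'active') {
    return duration.kind === 'permanent' ? 'standing admin' : 'JIT admin';
  }
  return duration.kind === 'permanent' ? 'standing eligibility' : 'eligibility with end date';
}

// An eligible assignment grants nothing at rest; an active one grants the
// role either always (permanent) or only inside its [start, end) window.
function grantsPrivilegeAt(assignment, at) {
  if (assignment.type !== 'active') return false;
  const { kind, start, end } = assignment.duration;
  if (kind === 'permanent') return true;
  return start <= at && at < end;
}

const jit = {
  principal: 'alice@contoso.com',
  role: 'Global Administrator',
  scope: 'directory',
  type: 'active',
  duration: {
    kind: 'time-bound',
    start: new Date('2026-01-06T14:03:01Z'), // arbitrary example date
    end: new Date('2026-01-06T15:03:01Z')
  }
};
const eligible = { ...jit, type: 'eligible', duration: { kind: 'permanent' } };

const midWindow = new Date('2026-01-06T14:30:00Z');
console.log(classify(eligible), grantsPrivilegeAt(eligible, midWindow));
console.log(classify(jit), grantsPrivilegeAt(jit, midWindow));
```

Note that `grantsPrivilegeAt` takes a time argument at all: that parameter is exactly what a single-state membership model had no place for.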

PIM did not add eight features. It added one field, and everything else is downstream.

This is Aha #1. The reader who came in believing standing admin persisted for forty years because operators lacked discipline now sees it differently. Operator discipline was a fragile workaround for a missing data-model field. The 1975 principle was correct. The 2012-2014 PtH whitepapers were correct. The operators were not the problem. The role-assignment object had one state to be in, and the deployment matched the data model exactly. The fix was a structural change to the data model.

The next nine years of PIM history are about extending that two-state primitive: to Azure RBAC, to security groups, to partner tenants, to the conditional-access plane, and to a detection layer that flags people who try to skip activation entirely. We walk each extension in turn. First, the mechanism itself.

5. Anatomy of an Activation

We have seen what changed. Walk through what happens, end to end, when alice@contoso.com clicks "Activate" on her eligible Global Administrator assignment at 14:00:00 on a Tuesday.

The activation flow, step by step

Six things happen, in order, and each writes audit-log evidence:

1. The eligible assignment already exists. Alice has been a permanent-eligible Global Administrator since she was hired. The PIM directory object records principal alice@contoso.com, role Global Administrator, scope directory, type=eligible, duration=permanent. Today she holds zero of the role's permissions.
2. The activation request lands on PIM. Alice clicks Activate in the Entra admin centre, or fires the equivalent Microsoft Graph call. PIM pulls the activation policy for (role=Global Administrator, scope=directory) and prepares to evaluate the gates [@ms-learn-pim-change-default-settings].
3. The policy gates evaluate. This is the load-bearing part, and the place readers most often misread the docs. The gates are per-role configurable, not universal. Microsoft Learn documents five gates the tenant can independently switch on or off [@ms-learn-pim-change-default-settings]:
   - Multifactor authentication at activation if requires_mfa is set.
   - Approval routing to named approvers or an approver group if requires_approval is set.
   - Justification text capture if requires_justification is set.
   - Ticket number capture, optionally tagged with a ticketing-system identifier, if requires_ticket is set.
   - Activation duration validation against the per-role configurable maximum -- one to twenty-four hours, with one hour the default for the highest-privileged Entra roles such as Global Administrator and Privileged Role Administrator [@ms-learn-pim-change-default-settings].
4. PIM materializes the active assignment. Microsoft Learn states the latency directly: "Microsoft Entra PIM creates active assignment (assigns user to a role) within seconds" [@ms-learn-pim-activate]. A new token Alice obtains after this moment will carry the activated role's claims.
5. The PIM audit log records the entire transaction. A new entry captures the request, the approver's decision and decision time, the justification text, the ticket reference, the activation start, and the planned expiry. The audit log is retained for thirty days by default and can be routed to Azure Monitor for longer retention [@ms-learn-pim-audit-log].
6. Auto-deactivation fires at the duration boundary. At 15:03:01 -- one hour after the assignment became active at 14:03:01 -- PIM deactivates the assignment within seconds [@ms-learn-pim-activate]. Alice can also call deactivate() explicitly to return early.
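For the Microsoft Graph route mentioned in the second step, the sketch below only builds a request body; it sends nothing. The field names follow my reading of the Graph `unifiedRoleAssignmentScheduleRequest` resource (posted to `/roleManagement/directory/roleAssignmentScheduleRequests`) and should be verified against the current reference. The helper name `buildSelfActivationRequest`, the all-zero principal ID, and the `ServiceNow` ticket-system label are hypothetical placeholders.

```javascript
// Build (but do not send) a self-activation request body for an eligible
// directory role. Assumed shape; check the Graph reference before relying on it.
function buildSelfActivationRequest({ principalId, roleDefinitionId, justification,
                                      ticketNumber, ticketSystem, start, hours }) {
  // The article gives a one-to-twenty-four-hour configurable range.
  if (!Number.isInteger(hours) || hours < 1 || hours > 24) {
    throw new RangeError('hours must be an integer between 1 and 24');
  }
  return {
    action: 'selfActivate',
    principalId,
    roleDefinitionId,
    directoryScopeId: '/', // tenant-wide (directory) scope
    justification,
    scheduleInfo: {
      startDateTime: start.toISOString(),
      // ISO 8601 duration, e.g. PT1H for one hour
      expiration: { type: 'afterDuration', duration: `PT${hours}H` }
    },
    ticketInfo: { ticketNumber, ticketSystem }
  };
}

const body = buildSelfActivationRequest({
  principalId: '00000000-0000-0000-0000-000000000000', // placeholder object ID
  roleDefinitionId: '62e90394-69f5-4237-9190-012177145e10', // Global Administrator template ID
  justification: 'incident MSRC-2026-PIM-12345',
  ticketNumber: 'SNOW-INC-987654',
  ticketSystem: 'ServiceNow', // hypothetical label
  start: new Date('2026-01-06T14:00:00Z'), // arbitrary example time
  hours: 1
});
console.log(JSON.stringify(body, null, 2));
```

Sending the body would additionally need an authenticated Graph client with the appropriate PIM permissions, which is out of scope for this sketch.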

sequenceDiagram
    autonumber
    participant User as alice
    participant PIM
    participant MFA
    participant Approver as bob
    participant Graph as Microsoft Graph
    participant Audit as PIM audit log
    User->>PIM: Activate Global Administrator
    PIM->>MFA: Require MFA challenge
    MFA-->>PIM: MFA passed
    PIM->>Approver: Route approval request
    Approver-->>PIM: Approve with justification context
    PIM->>Graph: Materialize active assignment within seconds
    PIM->>Audit: Write request, decision, materialization records
    Note over PIM,Audit: Token issued with activated role claims
    Note over PIM,Graph: One-hour TTL begins
    PIM->>Graph: Auto-deactivate at expiry within seconds
    PIM->>Audit: Write deactivation record

Activation policies are configured, not assumed

Two of the most common misreadings of the documentation concern this configurability. First, MFA at activation is not universally required by PIM. The role's activation policy must be set to require it. Second, the activation maximum is configurable per role per scope inside a one-to-twenty-four-hour range, with the default for Global Administrator and Privileged Role Administrator at one hour [@ms-learn-pim-change-default-settings]. A "PIM tenant" where one role requires MFA and approval and another role requires only justification text is a perfectly valid configuration; both roles are PIM-gated, but their gate sets differ.

Activation policy: A per-role-per-scope configuration of which gates an activation must satisfy: MFA at activation, approval, justification, ticket number, and the activation maximum duration. PIM evaluates the policy at activation time. The gates are independent flags; any combination can be required [@ms-learn-pim-change-default-settings].

Note: PIM's activation maximum duration is configurable per role per scope in the one-to-twenty-four-hour range. The default value for the highest-privileged Entra directory roles -- Global Administrator and Privileged Role Administrator -- is one hour [@ms-learn-pim-change-default-settings]. Other roles default to higher values. Tighten the duration where you can; the activation cost is small, the standing-active surface saving is large.
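As a rough sketch of "configured, not assumed", the snippet below treats each activation policy as a plain record and checks it. The flag names follow the gate list earlier in this section; the helper names `validatePolicy` and `requiredGates`, and the eight-hour value on the second role, are hypothetical and not taken from PIM's own schema.

```javascript
const GATE_FLAGS = ['requires_mfa', 'requires_approval', 'requires_justification', 'requires_ticket'];

// Return a list of problems with a policy record; an empty list means valid.
function validatePolicy(policy) {
  const problems = [];
  for (const flag of GATE_FLAGS) {
    if (typeof policy[flag] !== 'boolean') {
      problems.push(`${flag} must be true or false`);
    }
  }
  const max = policy.max_duration_hours;
  // The negated form also rejects NaN and missing values.
  if (typeof max !== 'number' || !(max >= 1 && max <= 24)) {
    problems.push('max_duration_hours must be between 1 and 24');
  }
  return problems;
}

// The gates a role actually enforces are whichever flags are switched on.
function requiredGates(policy) {
  return GATE_FLAGS.filter(flag => policy[flag] === true);
}

// Two hypothetical role policies with different gate sets; both are valid.
const tenantPolicies = {
  'Global Administrator': {
    requires_mfa: true, requires_approval: true,
    requires_justification: true, requires_ticket: true, max_duration_hours: 1
  },
  'Reports Reader': {
    requires_mfa: false, requires_approval: false,
    requires_justification: true, requires_ticket: false, max_duration_hours: 8
  }
};

for (const [role, rolePolicy] of Object.entries(tenantPolicies)) {
  console.log(role, requiredGates(rolePolicy), validatePolicy(rolePolicy));
}
```

Keeping the gate set as data makes it straightforward to audit which roles are missing a gate you assumed was universal.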

Authentication context: gating activation, not sign-in

Conditional Access has gated sign-in since 2014. Until 2023, it had no way to gate the activation event itself. The integration between PIM and Conditional Access changes that by attaching an authentication context label to the activation, which Conditional Access can target the same way it targets any other authentication. Microsoft Learn includes the activation policy option "On activation, require Microsoft Entra Conditional Access authentication context" [@ms-learn-pim-change-default-settings].

Authentication context: A label that PIM attaches to the activation event so that Conditional Access policies can target the activation itself, not just the sign-in. Policies such as "activation of Global Administrator requires a compliant device and an MFA challenge issued within the last five minutes" become expressible without bolting on a third-party stack [@ms-learn-pim-change-default-settings].
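The example policy in that definition can be sketched as an evaluation function. Everything here is a simplified illustration: the record shapes, the `c1` context ID, and the function name `evaluateActivationContext` are hypothetical and do not reflect the actual Conditional Access policy schema.

```javascript
// Hypothetical policy: applies only to events carrying the targeted context label.
const caPolicy = {
  targetAuthContext: 'c1', // placeholder ID for the "GA activation" context
  requireCompliantDevice: true,
  maxMfaAgeMinutes: 5
};

function evaluateActivationContext(policy, event) {
  // Events without the targeted label are outside this policy's scope.
  if (event.authContext !== policy.targetAuthContext) {
    return { applies: false, granted: true, failures: [] };
  }
  const failures = [];
  if (policy.requireCompliantDevice && !event.deviceCompliant) {
    failures.push('device not compliant');
  }
  // Date subtraction yields milliseconds; a missing MFA time yields NaN and fails.
  const mfaAgeMinutes = (event.at - event.lastMfaAt) / 60000;
  if (!(mfaAgeMinutes >= 0 && mfaAgeMinutes <= policy.maxMfaAgeMinutes)) {
    failures.push(`MFA older than ${policy.maxMfaAgeMinutes} minutes or absent`);
  }
  return { applies: true, granted: failures.length === 0, failures };
}

const activationEvent = {
  authContext: 'c1',
  deviceCompliant: true,
  at: new Date('2026-01-06T14:00:00Z'),       // arbitrary example time
  lastMfaAt: new Date('2026-01-06T13:58:00Z') // MFA two minutes earlier
};
console.log(evaluateActivationContext(caPolicy, activationEvent));
```

The key design point is the first branch: the policy keys on the label attached to the activation, so an ordinary sign-in without that label is never evaluated against it.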

The activation gate, as code

To make the gate-composition idea concrete, here is the activation policy as a small JavaScript function. Edit the policy or the request and re-run it.

{`
function activate(request, policy) {
  // Policy gates are independent; any combination can be required.
  if (policy.requires_mfa && !request.mfa_passed) {
    return { ok: false, reason: 'MFA challenge failed or absent' };
  }
  // Only an explicit approval passes; a missing or denied decision is rejected.
  if (policy.requires_approval && request.approval_decision !== 'approve') {
    return { ok: false, reason: 'Approval pending or denied' };
  }
  if (policy.requires_justification && !request.justification) {
    return { ok: false, reason: 'Justification text missing' };
  }
  if (policy.requires_ticket && !request.ticket_number) {
    return { ok: false, reason: 'Ticket number missing' };
  }
  if (request.duration_hours > policy.max_duration_hours) {
    return { ok: false, reason: 'Requested duration exceeds policy maximum' };
  }
  // Activation succeeds: materialize a time-bound active assignment.
  // Start and end derive from one timestamp so they differ by exactly the duration.
  const start = new Date();
  const end = new Date(start.getTime() + request.duration_hours * 3600 * 1000);
  return {
    ok: true,
    active_assignment: {
      principal: request.principal,
      role: request.role,
      scope: request.scope,
      type: 'active',
      duration: { kind: 'time-bound', start, end }
    }
  };
}

const policy = {
  requires_mfa: true,
  requires_approval: true,
  requires_justification: true,
  requires_ticket: true,
  max_duration_hours: 1
};
const request = {
  principal: 'alice@contoso.com',
  role: 'Global Administrator',
  scope: 'directory',
  mfa_passed: true,
  approval_decision: 'approve',
  justification: 'MSRC-2026-PIM-12345',
  ticket_number: 'SNOW-INC-987654',
  duration_hours: 1
};
console.log(activate(request, policy));
`}

The function is mechanical and short for a reason. Every PIM gate is independently expressible, the policy is a record, the request is a record, and the active-assignment output is itself a record the system can audit. The complexity of PIM, such as it is, lives in the surrounding infrastructure -- the directory, the audit log, Conditional Access, the alert engine -- not in the gate itself.

The Azure-resource five-minute floor

One operational detail belongs here. Azure resource role assignments under PIM-for-Azure-Resources carry an additional latency floor: an Azure resource role assignment cannot be made for a duration of less than five minutes and cannot be removed within five minutes of being created [@ms-learn-pim-resource-roles]. This is the rare place where the cloud control plane exposes a hard minimum-time bound in its assignment-state machine, and it shapes the lower limit of any tightening strategy on Azure RBAC scopes.
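The floor can be expressed as two small checks. This is a sketch under assumptions: the function names are hypothetical, and treating exactly five minutes as allowed in both checks is my reading of the wording rather than documented boundary behaviour.

```javascript
const FLOOR_MS = 5 * 60 * 1000; // the five-minute floor, in milliseconds

// An assignment must last at least five minutes.
function canCreateAssignment(start, end) {
  return end - start >= FLOOR_MS;
}

// An assignment cannot be removed until five minutes after creation
// (assumption: removal at exactly five minutes is allowed).
function canRemoveAssignment(createdAt, now) {
  return now - createdAt >= FLOOR_MS;
}

const created = new Date('2026-01-06T14:00:00Z'); // arbitrary example time
console.log(canCreateAssignment(created, new Date('2026-01-06T14:03:00Z'))); // three-minute request
console.log(canRemoveAssignment(created, new Date('2026-01-06T14:06:00Z'))); // removal after six minutes
```

Any automation that activates an Azure RBAC role for a short task should budget for this floor on both sides of the assignment's life.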

Activation is the per-event control. But what about the standing posture across the tenant -- the eligibility surface, the drift you did not notice, the assignment configuration in places PIM does not reach by default? For that, you need access reviews, and you need to push the eligible/active primitive beyond the original twenty-eight built-in directory roles.

6. Beyond Directory Roles: Extending Eligible and Active Across Four Boundaries

PIM at GA in September 2016 covered roughly twenty-eight built-in Entra directory roles. Everything else -- Azure RBAC, security groups, partner-tenant delegation, the Conditional Access activation event -- was still single-state and permanent-active. The next nine years of PIM history are the story of closing those four boundaries, one at a time.

flowchart TD
    Core["Two-state assignment object, 2016"]
    Core --> Azure["PIM for Azure Resources, 2017-2019, RBAC at four scopes"]
    Core --> Groups["PIM for Groups, GA October 2023, membership and ownership"]
    Core --> Partner["GDAP May 2022 plus Azure Lighthouse eligible authorizations"]
    Core --> CA["PIM with Conditional Access authentication context, GA October 2023"]

Boundary 1: PIM for Azure Resources

Between 2017 and 2019, Microsoft extended the eligible-versus-active model from Entra directory roles to Azure RBAC. The extension covers four scopes -- management group, subscription, resource group, and individual resource -- and supports both built-in roles (Owner, Contributor, User Access Administrator, and the security roles) and custom roles [@ms-learn-pim-resource-roles].

The non-obvious operational property of PIM-for-Azure-Resources is that role settings do not inherit down the RBAC hierarchy. A policy you tighten on Owner at the management-group scope does not automatically flow down to Owner on subscriptions, resource groups, or resources beneath it. Each (role, scope) pair is its own policy slot, and each must be configured.

Note: Configure activation policies per role per scope explicitly across the management-group, subscription, resource-group, and resource hierarchy. A tightening at the management-group scope does not flow to subscriptions beneath it. The most common operational defect in mature PIM tenants is the unconfigured policy at a downstream scope, leaving a wide-open activation surface under what looked like a hardened parent.
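The non-inheritance property can be sketched as a lookup keyed on the exact (role, scope) pair. The scope paths, the policy map, and the function name below are all hypothetical; the point is only that a missing key is a gap, regardless of what the parent scope holds.

```javascript
// Hypothetical inventory: scope paths and the policy map are illustrative only.
const scopes = [
  '/mg/contoso',
  '/mg/contoso/sub/prod',
  '/mg/contoso/sub/prod/rg/payments',
];
const roles = ['Owner', 'User Access Administrator'];

// Policies an operator has explicitly configured, keyed by "role@scope".
const configured = new Map([
  ['Owner@/mg/contoso', { requires_mfa: true, requires_approval: true, max_duration_hours: 1 }],
  ['Owner@/mg/contoso/sub/prod', { requires_mfa: true, requires_approval: true, max_duration_hours: 1 }],
]);

// No inheritance: every (role, scope) pair is looked up on its own key.
function findUnconfiguredSlots(roleList, scopeList, policyMap) {
  const gaps = [];
  for (const role of roleList) {
    for (const scope of scopeList) {
      if (!policyMap.has(`${role}@${scope}`)) gaps.push({ role, scope });
    }
  }
  return gaps;
}

const gaps = findUnconfiguredSlots(roles, scopes, configured);
console.log(gaps);
```

In this made-up inventory, Owner is hardened at the management group and the subscription but not at the resource group, and User Access Administrator is unconfigured everywhere.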

Boundary 2: PIM for Groups

The PIM-for-Groups timeline is three distinct events. In August 2020, Microsoft previewed the feature under its original name, "Privileged Access Groups," and limited the preview scope to role-assignable security groups [@simons-2020-aug]. In January 2023, Microsoft renamed the feature to "Privileged Identity Management for Groups" in the Entra admin centre; the underlying eligible/active model was unchanged [@ms-learn-pim-for-groups]. In October 2023, more than three years after the preview, PIM for Groups reached general availability with a broader scope -- role-assignable security groups (carried forward), non-role-assignable security groups (newly supported), and Microsoft 365 groups (newly supported), with JIT for both membership and ownership [@ms-techcommunity-pim-groups-ca-ga-2023], [@ms-learn-pim-for-groups], [@ms-learn-pim-groups-role-settings].

The three events span more than three years and should not be conflated. August 2020: preview of "Privileged Access Groups," role-assignable security groups only [@simons-2020-aug]. January 2023: rename to "PIM for Groups"; same scope and model [@ms-learn-pim-for-groups]. October 2023: general availability with the broader scope (non-role-assignable security groups plus M365 groups), and JIT for both membership and ownership [@ms-techcommunity-pim-groups-ca-ga-2023]. Two structural exclusions persist throughout: dynamic-membership groups and groups synchronized from on-premises Active Directory [@ms-learn-pim-for-groups].

The scope is broad: any Entra security group and any Microsoft 365 group, except dynamic-membership groups and on-premises-synced groups, can be PIM-enabled [@ms-learn-pim-for-groups].

The interesting design choice is that PIM-for-Groups gates two distinct surfaces per group: membership and ownership. The two surfaces each get their own activation policy [@ms-learn-pim-groups-role-settings].

PIM for Groups: the extension of PIM eligible/active assignment to Entra security groups and Microsoft 365 groups. Originally previewed in August 2020 as "Privileged Access Groups" (role-assignable security groups only) [@simons-2020-aug]; renamed to "PIM for Groups" in January 2023 [@ms-learn-pim-for-groups]; reached general availability in October 2023 with the broader scope (role-assignable security groups, non-role-assignable security groups, and M365 groups), with JIT for both membership and ownership [@ms-techcommunity-pim-groups-ca-ga-2023]. Excludes dynamic-membership groups and groups synchronized from on-premises environments [@ms-learn-pim-for-groups], [@ms-learn-pim-groups-role-settings].

A group owner can add members. A privileged access group whose membership is PIM-gated but whose ownership is permanent-active offers an unmediated elevation path: a compromised owner adds themselves as a member, bypassing the membership gate they would have had to activate. PIM-for-Groups gates both surfaces because gating membership without gating ownership is a one-bypass-step elevation. The two policies are independent; both must be set.
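The ownership bypass can be expressed as a one-line predicate over a group inventory. The record shape and field values below are hypothetical and do not reflect the Microsoft Graph schema; this is a sketch of the check, not a tool.

```javascript
// Hypothetical records: field names are illustrative, not the Graph schema.
const groups = [
  { name: 'tier0-admins', membership: 'eligible', ownership: 'eligible' },
  { name: 'sub-owners', membership: 'eligible', ownership: 'permanent-active' },
];

// A group is exposed when membership is gated but ownership is not:
// a standing owner can add themselves as a member without activating anything.
function hasOwnerBypass(group) {
  return group.membership === 'eligible' && group.ownership === 'permanent-active';
}

const exposed = groups.filter(hasOwnerBypass).map((g) => g.name);
console.log(exposed);
```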

Boundary 3: Partner tenants -- GDAP and Azure Lighthouse

Until 2022, the Microsoft partner channel -- Cloud Solution Providers and Managed Service Providers -- worked through a model called Delegated Admin Privileges (DAP), in which the partner held standing Global Administrator on every customer tenant they touched. The Nobelium supply-chain attack tradition of 2020-2021 made the structural risk of that posture unignorable [@cisa-aa20-352a]: one compromise of one partner credential meant Global Administrator across hundreds or thousands of customer tenants simultaneously.

In May 2022, Microsoft introduced Granular Delegated Admin Privileges (GDAP) [@ms-learn-gdap], [@crayon-gdap]. GDAP replaces the standing-GA pattern with time-bound (one to seven-hundred-thirty days) and role-scoped delegation between partner and customer tenants. Microsoft Learn's framing makes the design explicit: "GDAP is a security feature that provides partners with least-privileged access following the Zero Trust cybersecurity protocol. It lets partners configure granular and time-bound access to their customers' workloads in production and sandbox environments. Customers must explicitly grant the least-privileged access to their partners" [@ms-learn-gdap].

Granular Delegated Admin Privileges (GDAP): the May 2022 Microsoft Partner Center capability that replaces legacy DAP's standing-Global-Administrator-on-every-customer-tenant pattern with time-bound (one to seven-hundred-thirty days) and role-scoped delegation between partner and customer tenants. GDAP is the partner-tenant analogue of PIM eligible assignment [@ms-learn-gdap].
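As a rough illustration of "time-bound and role-scoped", the sketch below validates a proposed delegation against the 1-to-730-day range quoted above and computes its end date. The function, its parameters, and the returned record are hypothetical; the real relationship is created through Partner Center, which this code does not call.

```javascript
// Bounds from the text above (1 to 730 days); everything else is hypothetical.
const GDAP_MIN_DAYS = 1;
const GDAP_MAX_DAYS = 730;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function planDelegation(roles, durationDays, start) {
  if (!Array.isArray(roles) || roles.length === 0) {
    return { ok: false, reason: 'At least one role must be named' };
  }
  if (!Number.isInteger(durationDays) || durationDays < GDAP_MIN_DAYS || durationDays > GDAP_MAX_DAYS) {
    return { ok: false, reason: `Duration must be ${GDAP_MIN_DAYS}-${GDAP_MAX_DAYS} days` };
  }
  // The delegation carries its own expiry; nothing is standing by default.
  const end = new Date(start.getTime() + durationDays * MS_PER_DAY);
  return { ok: true, roles, start, end };
}

console.log(planDelegation(['Helpdesk Administrator'], 90, new Date('2026-05-25T00:00:00Z')));
```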

The Azure plane has a parallel construct. Azure Lighthouse eligible authorizations, introduced alongside GDAP, extend PIM-for-Azure-Resources eligibility across the tenant boundary [@ms-learn-lighthouse-eligible]. The customer (not the partner) controls the PIM policy on the delegated authorization. One important exception: service principals cannot use eligible authorizations, because there is currently no way for a service principal to elevate its access [@ms-learn-lighthouse-eligible]. The application-identity gap we reach in section 9 reaches into Lighthouse too.

Boundary 4: PIM and Conditional Access authentication context

The October 2023 GA wave closed the activation-gate-versus-sign-in-gate gap. Before October 2023, Conditional Access could gate sign-in into the tenant, but it could not gate the activation event itself. After October 2023, an authentication-context-tagged Conditional Access policy can target activation specifically [@ms-techcommunity-pim-groups-ca-ga-2023]. A policy of the form "activation of any control-plane role requires a compliant device and a fresh MFA challenge" becomes expressible without third-party tooling [@ms-learn-pim-change-default-settings].

The retirement of Tier-0, Tier-1, Tier-2

The umbrella framing has also shifted. Microsoft's 2014 Tier-0 / Tier-1 / Tier-2 model is being progressively retired in favour of the Enterprise Access Model (EAM), which uses control plane, management plane, and data/workload plane as the structural divisions [@ms-learn-eam]. EAM is cloud-native where Tier-0/1/2 was on-premises-centric. Microsoft Learn states the mapping: "Tier 0 expands to become the control plane and addresses all aspects of access control", and "what was tier 1 is now split into the following areas: Management plane ... Data/Workload plane" [@ms-learn-eam].

Enterprise Access Model (EAM): the post-2021 Microsoft reference architecture that replaces the Tier-0/Tier-1/Tier-2 administrative model with a plane-based division: control plane, management plane, and data/workload plane. EAM is cloud-native and zero-trust-friendly where Tier-0/1/2 was on-premises-centric [@ms-learn-eam].

Microsoft's RaMP -- the Rapid Modernization Plan -- is the post-2018 deployment roadmap that operationalizes EAM [@ms-docs-github-ramp].

The retirement is partial. The practitioner audience still uses Tier-0/1/2 more often than EAM in day-to-day language. The Microsoft Learn page for Securing Privileged Access explicitly cross-references both [@ms-learn-spa-overview].

Coverage is one half of the story. The other half is detection. What does PIM do when someone in the Privileged Role Administrator role simply assigns Global Administrator to a user directly through Microsoft Graph, bypassing the activation workflow entirely?

7. The Detection Layer: Six PIM Alerts and the Assignment-Bypass Class

PIM gates activation. The first question every adversary thinks of, and every architect should think of next, is: what about the assignment itself? What happens when someone in the Privileged Role Administrator role just creates a permanent-active Global Administrator assignment directly, skipping the eligible-to-active workflow entirely?

The answer is the article's second aha moment, and it is deliberately surprising.

The six PIM Alerts

Microsoft Learn documents seven named alerts in the PIM Alerts surface for Microsoft Entra roles [@ms-learn-pim-alerts]. Six of them are behavioural detections; the seventh is a licensing-precondition alert that fires when the tenant lacks the appropriate license.

The seventh alert, named "The organization doesn't have Microsoft Entra ID P2 or Microsoft Entra ID Governance," is a low-severity licensing-precondition alert. The "six PIM Alerts" framing in this article refers to the six behavioural alerts; the licensing alert is structurally distinct.

The six behavioural alerts, with the canonical names verbatim from the documentation, are:

#	Alert (verbatim)	Severity	What it detects	Configurable threshold
1	There are too many Global Administrators	Low	Tenant exceeds a tunable count and percentage of standing GAs	Minimum count 2-100 and percentage 0-100%
2	Roles are being assigned outside of Privileged Identity Management	High	A privileged role assignment was created via Microsoft Graph or the classic admin centre without going through PIM	None (binary)
3	Roles are being activated too frequently	Low	Post-hoc activation-frequency anomaly	Activation count and time window
4	Administrators aren't using their privileged roles	Low	Staleness on activation; eligible assignment unused	0-100 day threshold
5	Roles don't require multifactor authentication for activation	Low	Configuration drift on the per-role activation policy	None (binary on role policy)
6	Potential stale accounts in a privileged role	Medium	Sign-in staleness on a privileged principal	1-365 day threshold

The second row -- "Roles are being assigned outside of Privileged Identity Management" -- is the load-bearing one. Microsoft Learn rates it High severity because it is the alert that fires when somebody routed around PIM entirely [@ms-learn-pim-alerts]. The verbatim documentation reads: "Privileged role assignments made outside of Privileged Identity Management aren't properly monitored and might indicate an active attack" [@ms-learn-pim-alerts].

Assignment-bypass alert: the High-severity PIM Alert "Roles are being assigned outside of Privileged Identity Management." It fires when a privileged role is assigned via a path other than PIM -- typically via Microsoft Graph, the classic admin centre assignment surface, or PowerShell. The alert is detective. It fires after the assignment is created [@ms-learn-pim-alerts].

Detective, not preventive -- and why

Read the definition again. The alert fires after the assignment is created. PIM does not block direct assignments outside its workflow.

For most architects this lands hard. The reasonable next thought is "if PIM does not block the bypass, what is the point?" Sit with that thought, then read the design rationale.

The Microsoft Graph endpoints that allow direct role assignment are the integration surface every legitimate administrative tool uses. Identity Governance products use them. CI/CD identity provisioning scripts use them. Break-glass automations use them. Microsoft's own admin centres use them in some configurations. The customer-side tools that scan, audit, remediate, and provision against the tenant use them. A preventive block on direct assignment would break every one of those integrations. It would also break PIM itself; the eligible-to-active materialization step is a write to the same assignment surface.

Note: PIM does not block direct role assignments outside its workflow because blocking would break the Microsoft Graph integration surface every legitimate administrative tool uses. The High-severity assignment-bypass alert is detective: it fires after the assignment is created. Customers who need preventive blocking layer a separate Conditional Access policy on the Graph endpoint, an Azure Policy at the management-group scope, or an entitlement-management workflow on top of PIM.
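The detective-versus-preventive distinction is easier to see in code: the sketch below runs over assignments that already exist and emits alerts for the ones that did not come through PIM. The event shape and the `createdVia` field are hypothetical; real audit-log records use different fields.

```javascript
// Hypothetical event shape; this models the alert's logic, not its implementation.
const ALERT_NAME = 'Roles are being assigned outside of Privileged Identity Management';

// Input is a list of assignments that have ALREADY been written.
// Nothing here can stop a write; it can only report on it afterwards.
function detectOutsidePim(createdAssignments) {
  return createdAssignments
    .filter((a) => a.createdVia !== 'pim')
    .map((a) => ({ alert: ALERT_NAME, severity: 'High', principal: a.principal, role: a.role }));
}

const created = [
  { principal: 'alice@contoso.com', role: 'Global Administrator', createdVia: 'pim' },
  { principal: 'mallory@contoso.com', role: 'Global Administrator', createdVia: 'graph' },
];
const alerts = detectOutsidePim(created);
console.log(alerts);
```

A preventive control would have to sit in front of the write path itself, which is the layer the note above assigns to Conditional Access, Azure Policy, or entitlement management.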

This is Aha #2. The reader who walked in expecting PIM to be a "deny direct assignments" product walks out understanding why the design says "alert loudly via High severity, then let the customer layer preventive controls based on their tooling estate." The trade-off is named, not hidden.

The 1000-notification ceiling and the SIEM-side correlation

One operational footnote and one wider observation. The notification fan-out has a hard cap: "The maximum number of notifications sent per one event is 1000. If the number of recipients exceeds 1000, only the first 1000 recipients will receive an email notification" [@ms-learn-pim-alerts]. Very large tenants whose privileged groups exceed the cap should not rely on email-notification fan-out alone.

The detection layer beyond PIM Alerts is Microsoft Sentinel UEBA, which builds dynamic behavioural profiles for users, hosts, IP addresses, applications, and other entities and emits anomaly scores against AuditLogs operations including role-eligibility additions and activations [@ms-learn-sentinel-ueba]. Sentinel UEBA is the closest 2026 Microsoft-shipped activation-anomaly-scoring surface; it is detective SIEM correlation, not synchronous gating.
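The notification cap quoted above behaves like a simple truncation. A minimal sketch, assuming recipients are processed in list order (the documentation says "the first 1000" without defining the ordering, so the order here is an assumption), with a hypothetical function name:

```javascript
const MAX_NOTIFICATIONS_PER_EVENT = 1000; // cap quoted from the documentation above

// Assumption: recipients are taken in list order; the real ordering is not specified here.
function splitRecipients(recipients) {
  return {
    notified: recipients.slice(0, MAX_NOTIFICATIONS_PER_EVENT),
    dropped: recipients.slice(MAX_NOTIFICATIONS_PER_EVENT),
  };
}

const recipients = Array.from({ length: 1200 }, (_, i) => `admin${i}@contoso.com`);
const { notified, dropped } = splitRecipients(recipients);
console.log(notified.length, dropped.length);
```

Anyone in the `dropped` list hears nothing by email, which is why the alert should also be routed to a SIEM rather than relying on the mail fan-out.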

The wider observation is that the PIM detection layer is one piece of a larger pipeline. PIM Alerts give you the High-severity assignment-bypass detection. Microsoft Sentinel UEBA gives you per-user behavioural-anomaly scoring against the audit-log events [@ms-learn-sentinel-ueba]. Entra ID Protection gives you sign-in-risk and user-risk classifications for the principal whose token was used. The mature 2026 deployment correlates all three; the assignment-bypass alert is the floor of that pipeline, not the ceiling.

Microsoft solved the JIT-admin problem with a two-state assignment object, four extension surfaces, and a six-alert detection layer. Did the rest of the industry agree? Look at what AWS and Google bet on, and at the third-party vault market that predates both.

8. Competing Architectures: AWS Sessions, GCP Bindings, and the Vault Model

Microsoft bet on a two-state assignment object. The rest of the industry placed different bets.

AWS bet on the session credential. Google bet on the conditional binding. The third-party PAM market bet on the vault. HashiCorp bet on the ephemeral credential. Each architecture is a different answer to one question: what should be the bounded unit of privilege? PIM bounds the assignment state; AWS bounds the session; GCP bounds the binding; CyberArk and Vault bound the credential. The methods are architecturally distinct, and they coexist in real estates more often than they compete.

AWS: bound the session

AWS IAM Identity Center plus the Security Token Service AssumeRole API bound the session, not the assignment. Permanent role-bindings -- permission sets attached to identities -- are themselves standing. The temporary part is the session that materializes when the identity calls AssumeRole. AWS documents this directly: "Temporary security credentials are short-term, as the name implies. They can be configured to last for anywhere from a few minutes to several hours. After the credentials expire, AWS no longer recognizes them or allows any kind of access from API requests made with them" [@aws-temp-creds].

The session lifecycle is concrete. AssumeRole returns an access key, a secret key, and a session token, with a minimum fifteen-minute and a maximum twelve-hour session duration; the API operation default is one hour [@aws-roles-use]. IAM Identity Center permission sets ship with a one-hour default and a one-to-twelve-hour configurable range [@aws-sessionduration].
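The bounds above can be sketched as a small resolver for the `DurationSeconds` value passed to AssumeRole. The helper itself is hypothetical and does not call AWS; the `roleMaxSeconds` parameter stands in for the role's own maximum-session-duration setting, which can be lower than the twelve-hour API ceiling.

```javascript
// Bounds taken from the paragraph above; the helper itself is hypothetical.
const STS_MIN_SECONDS = 15 * 60;      // 900
const STS_MAX_SECONDS = 12 * 60 * 60; // 43200
const STS_DEFAULT_SECONDS = 60 * 60;  // 3600

function resolveSessionDuration(requestedSeconds, roleMaxSeconds = STS_MAX_SECONDS) {
  // No explicit request: the API operation default applies.
  if (requestedSeconds === undefined) return STS_DEFAULT_SECONDS;
  const ceiling = Math.min(roleMaxSeconds, STS_MAX_SECONDS);
  if (requestedSeconds < STS_MIN_SECONDS || requestedSeconds > ceiling) {
    throw new RangeError(`DurationSeconds must be between ${STS_MIN_SECONDS} and ${ceiling}`);
  }
  return requestedSeconds;
}

console.log(resolveSessionDuration());           // default
console.log(resolveSessionDuration(900));        // minimum
console.log(resolveSessionDuration(7200, 7200)); // role capped at two hours
```

Note what is absent compared with the PIM sketch earlier: there is no approval, justification, or ticket gate. Only the lifetime of the session is bounded.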

STS AssumeRole: the AWS Security Token Service API by which a principal materializes a time-bounded session credential -- access key, secret key, session token -- from a permanent role-binding. The session is the ephemeral artifact; the binding is permanent [@aws-temp-creds], [@aws-roles-use].

The AWS approach has clear strengths in multi-account AWS Organizations and in programmatic access. It is also the natural fit for any workload that needs short-lived credentials. The gaps relative to PIM: no built-in approval workflow, no equivalent of the PIM Alerts surface, and no eligible-versus-active distinction on the role-binding itself. A standing AssumeRole grant is, structurally, standing privilege; what is bounded is the session that consumes it.

Google Cloud: bound the binding

Google Cloud IAM took a different route. IAM Conditional Bindings let an allow policy include a Common Expression Language predicate that is evaluated at request time. The canonical temporal pattern is request.time < timestamp(...), which expires the binding at a wall-clock instant [@gcp-conditions]. There is a practical ceiling of one hundred conditional bindings per allow policy.
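A minimal sketch of what such a binding looks like as data. The `role`, `members`, and `condition` fields with `title`, `description`, and `expression` follow the allow-policy JSON shape; the helper name, the role, and the member are illustrative choices, and the code only builds the object rather than calling any Google Cloud API.

```javascript
// Builds a time-limited role binding in the allow-policy JSON shape.
// Helper name, role, and member are illustrative; no API is called.
function expiringBinding(role, member, expiresAt) {
  const iso = expiresAt.toISOString(); // RFC 3339 UTC timestamp
  return {
    role,
    members: [member],
    condition: {
      title: 'expires',
      description: `Access ends at ${iso}`,
      // CEL predicate evaluated at request time: false once the instant passes.
      expression: `request.time < timestamp("${iso}")`,
    },
  };
}

const binding = expiringBinding(
  'roles/compute.admin',
  'user:alice@example.com',
  new Date('2026-05-26T15:00:00Z'),
);
console.log(JSON.stringify(binding, null, 2));
```

The binding itself stays in the policy after the instant passes; it simply stops evaluating to true, which is a different lifecycle from PIM's active assignment being removed at expiry.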

On top of conditional bindings, Google launched Privileged Access Manager (PAM) in public preview in May 2024 [@gcp-iam-release-notes], [@gcp-pam]. PAM adds the entitlement-and-grant workflow that PIM ships natively: eligible principals, eligible roles, max duration, justification, approvers, and notifications, with grant duration enforced by the underlying conditional binding revocation. Audit-event correlation is documented in a separate page [@gcp-pam-audit].

IAM conditional binding: a Google Cloud IAM role binding that includes a Common Expression Language predicate evaluated at request time. The most common temporal pattern, `request.time < timestamp(...)`, expires the binding at a wall-clock instant; Google Cloud Privileged Access Manager layers an entitlement-and-grant workflow on top [@gcp-conditions], [@gcp-pam].

The GCP approach is the closest hyperscaler analogue to PIM's eligible/active model in architecture, but the PAM productization shipped in preview in May 2024 [@gcp-iam-release-notes] -- nearly a decade after Azure AD PIM's 2016 GA -- and the alert and detection surfaces are correspondingly less mature.

The third-party vault: CyberArk, BeyondTrust, Delinea

The longest-standing answer is the one the third-party PAM market built. CyberArk, BeyondTrust, and Delinea -- all three 2024 Gartner Magic Quadrant Leaders for Privileged Access Management [@cyberark-press-2024], [@beyondtrust-press-2024], [@delinea-press-2024] -- bound the credential, not the assignment or the session. The credential exists permanently in the vault; access to the credential is bounded by session brokering, periodic password rotation, and full session recording.

The vault model has structural strengths PIM's role-assignment-state model cannot match. The vault covers heterogeneous estates that include Windows, Linux, network devices, databases, mainframes, and OT/SCADA appliances -- every system whose credentials cannot be re-architected to a cloud-IAM eligible-active object. Vault-and-broker products provide session recording for SOX and PCI-DSS evidence collection, and they integrate with credential-rotation workflows for legacy vendor appliances whose hard-coded credentials cannot be eliminated.

Most large enterprises run both Entra PIM (for Entra and Azure role assignments) and a third-party PAM product (for SSH, on-premises service accounts, database passwords, network devices). The two markets are complements more than substitutes.

HashiCorp Vault and OpenBao: bound the credential's lifetime

HashiCorp Vault took the credential-bounded idea and made it ephemeral through dynamic secrets: a credential materialized on demand by Vault for a configured backend (a database, a cloud IAM, a PKI), returned with a lease and TTL, and revoked at the backend when the lease expires [@vault-databases]. The OpenBao fork, governed under the Linux Foundation, preserves the same dynamic-credential semantics [@openbao].

OpenBao was created in late 2023 after HashiCorp moved Vault from the open-source MPL to the Business Source License. The Linux Foundation announced on April 30, 2024 that OpenBao would join LF Edge as one of four new projects (alongside EdgeLake, InfiniEdgeAI, and InstantX) at the Open Networking and Edge (ONE) Summit [@lfedge-openbao-2024]. The dynamic-secret primitive -- "create a credential, hand it out, revoke it at lease expiry" -- is preserved on both code lines.

Dynamic secret: a credential materialized by Vault on demand for a configured backend -- database, cloud IAM, or PKI -- returned with a lease ID and TTL; at lease expiry Vault revokes the credential at the backend. The canonical 2026 open-source primitive for replacing hard-coded application credentials [@vault-databases].
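The issue, lease, revoke cycle can be modelled in memory in a few lines. This is a toy model under stated assumptions: the class, its methods, and the lease-ID format are hypothetical, time is passed in explicitly as milliseconds, and nothing here talks to Vault or OpenBao.

```javascript
// In-memory model only; it does not call Vault or OpenBao.
class LeaseTable {
  constructor() {
    this.leases = new Map();
    this.nextId = 1;
  }

  // Issue a credential lease for a backend with a TTL, starting at `now` (ms).
  issue(backend, ttlSeconds, now) {
    const id = `lease-${this.nextId++}`;
    this.leases.set(id, { backend, expiresAt: now + ttlSeconds * 1000 });
    return id;
  }

  // Revoke every lease whose TTL has elapsed; returns the revoked IDs.
  sweep(now) {
    const revoked = [];
    for (const [id, lease] of this.leases) {
      if (lease.expiresAt <= now) {
        this.leases.delete(id); // stands in for revocation at the backend
        revoked.push(id);
      }
    }
    return revoked;
  }

  isValid(id) {
    return this.leases.has(id);
  }
}

const table = new LeaseTable();
const leaseId = table.issue('postgres', 60, 0);
console.log(table.sweep(59000), table.isValid(leaseId)); // nothing revoked yet
console.log(table.sweep(60000), table.isValid(leaseId)); // lease revoked at expiry
```

The unit being bounded is the credential's lifetime: there is no standing secret for an attacker to find after the lease ends.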

The Vault story matters for our purposes because it is the strongest 2026 coverage of the application-identity surface -- dynamic database credentials, Kubernetes service-account tokens, cloud-IAM short-lived credentials. PIM does not cover that surface today; Vault does. This previews the open boundary in section 9.

What is bound, in one comparison table

Method	What is bound	Mechanism	Default duration	Approval workflow	Detection layer	Partner tenant	Application identities	License
Entra PIM	Assignment state	eligible -> active transition with policy gates	1h (Global Admin)	Built-in approver routing	Six behavioural PIM Alerts plus Sentinel UEBA	GDAP + Lighthouse	Not yet (open boundary)	Entra ID P2 or Entra ID Governance
AWS IAM Identity Center + STS	Session credential	AssumeRole returns access/secret/session token	1h	Not built-in	Not equivalent to PIM Alerts	Not directly comparable	Strong (short-lived creds native)	Included in AWS
GCP IAM + PAM	Policy binding	CEL predicate plus entitlement-and-grant	Per entitlement	Built-in via PAM	Audit events plus Cloud Audit Logs	Cross-org via folders	Service-account impersonation	Included in GCP
CyberArk/BeyondTrust/Delinea	Credential knowledge	Vault stores, broker hands out, rotates	Per session policy	Built-in approver routing	Session recording, full SIEM integration	Per-tenant deployment	Coverage via shared accounts	Per-seat commercial
HashiCorp Vault / OpenBao	Credential lifetime	Lease-based revocation, dynamic secrets	Per backend, per lease	Optional plugins	Audit log; lease events	N/A	Strong (dynamic secrets)	Open source / commercial

The five methods occupy four positions on the "what is bound" axis: assignment-state (PIM), session-credential (AWS), policy-binding (GCP), and knowledge-of-credential (CyberArk and Vault). The methods are architecturally distinct, and the right enterprise answer in heterogeneous estates is some composition of more than one.

PIM is the most mature JIT-admin product in the cloud, and it has the most complete coverage of the user-principal surface. The remaining gaps are not about catching up to the competitors; they are about a class of identity the eligible/active model was never designed to gate.

9. What the JIT-Admin Pattern Does NOT Close

For all the architectural elegance of the two-state assignment object, PIM does not close the JIT-admin problem. It closes a sub-problem, very well, and leaves five structural limits an honest treatment must name.

9.1 Standing eligibility is itself standing privilege

PIM bounds the active duration. It does not bound the eligibility duration. A user with a permanent-eligible Global Administrator assignment is one activate() call away from the role's permissions for the next hour. If that user has been phished -- credential plus MFA bypass via a session-cookie capture, say -- the attacker can satisfy the gates. The MFA challenge passes. The justification text is whatever the attacker types. The approval, if required, routes to the legitimate approver, who may approve a legitimate-looking request that actually came from the attacker.

PIM produces an audit-log record of every step. It does not produce a structural impossibility. Eligibility is itself a security-critical property of the identity, and standing eligibility is the modern analogue of standing membership: a long-lived relationship between principal and role that a successful credential compromise can exercise.
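One way to make standing eligibility visible is to list eligible assignments that carry no end date. The record shape below reuses the `type` field from the activate() sketch earlier; the `end: null` convention for "permanent" and the function name are assumptions made for illustration.

```javascript
// Hypothetical records; `end: null` is an assumed marker for a permanent assignment.
const assignments = [
  { principal: 'alice@contoso.com', role: 'Global Administrator', type: 'eligible', end: null },
  { principal: 'bob@contoso.com', role: 'Global Administrator', type: 'eligible', end: '2026-09-01T00:00:00Z' },
  { principal: 'carol@contoso.com', role: 'User Administrator', type: 'active', end: '2026-05-26T15:03:01Z' },
];

// Permanent-eligible assignments are the standing relationship a phish can exercise.
function findStandingEligibility(list) {
  return list.filter((a) => a.type === 'eligible' && a.end === null);
}

const standing = findStandingEligibility(assignments);
console.log(standing.map((a) => `${a.principal} -> ${a.role}`));
```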

9.2 Approver collusion

The approval gate forces an attacker to compromise two people only when the requester and the approver are genuinely independent targets. Collusion -- the requester and the approver are the same adversary, or two adversaries cooperating -- defeats the workflow at the mechanism layer. The usual mitigations raise the bar: named approvers rather than approver groups (which can be compromised at the group level), CA-gated approval actions, and four-eyes alternatives. None close the class.

9.3 The application-identity gap

This is the article's heaviest limit, and it deserves the most space.

PIM's eligible-active state machine is currently defined over principal in (user | group). Service principals, managed identities, and OAuth consent grants do not flow through PIM activation. Their role assignments are permanent and active by default, and there is no eligible category that applies to them. Microsoft Learn's documentation for Workload ID Premium and Conditional Access for workload identities makes this explicit: ID Protection workload-identity risk detections cover service principals in single-tenant, non-Microsoft SaaS, and multitenant apps, but "Managed Identities aren't currently in scope" [@ms-learn-workload-identity-risk]. Conditional Access for workload identities applies similarly only to service principals owned by the organization, and CA policies "assigned to a group that contains a service principal are not enforced for that service principal" [@ms-learn-ca-workload-identity].

Andy Robbins's three-part Managed Identity Attack Paths series, published June 6-8, 2022 on the SpecterOps blog, is the canonical demonstration of how this gap is exploited [@robbins-mip-part1], [@robbins-mip-part2], [@robbins-mip-part3]. The mechanism is direct. An Azure compute resource -- an Automation Account [@robbins-mip-part1], a Logic App [@robbins-mip-part2], or a Function App [@robbins-mip-part3] -- carries an attached managed identity. The managed identity holds standing role assignments at whatever scope the operator granted, often Owner or Contributor on a subscription.

From inside the resource, any code can fetch an OAuth access token for the managed identity by calling the Azure Instance Metadata Service endpoint at http://169.254.169.254/metadata/identity/oauth2/token. No human in the loop. No MFA challenge. No PIM activation. The audit log records a service-principal token issuance, not an alice-clicked-Activate event.
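To show how little the call requires, the sketch below only constructs the request and sends nothing. The `api-version` value and the `Metadata: true` header follow the request shape Microsoft documents for virtual machines; other hosting types may expose the token endpoint differently, and the helper name is hypothetical.

```javascript
// Builds (but does not send) the token request described above.
// Shape follows the documented VM request; helper name is hypothetical.
function buildImdsTokenRequest(resource) {
  const url = new URL('http://169.254.169.254/metadata/identity/oauth2/token');
  url.searchParams.set('api-version', '2018-02-01');
  url.searchParams.set('resource', resource); // audience the token is requested for
  return {
    method: 'GET',
    url: url.toString(),
    headers: { Metadata: 'true' }, // the only required header; no secret is presented
  };
}

const imdsRequest = buildImdsTokenRequest('https://management.azure.com/');
console.log(imdsRequest);
```

Nothing in the request identifies a human, carries an MFA claim, or references an activation. Being able to reach the link-local address from inside the resource is the whole credential.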

"Managed Identity assignments are an extremely effective security control... But Managed Identities introduce a new problem: they can quickly create identity-based attack paths in Azure that may lead to escalation of privilege opportunities." -- Andy Robbins, *Managed Identity Attack Paths, Part 1: Automation Accounts*, June 6, 2022 [@robbins-mip-part1]

Managed identity: an Azure-managed service principal whose credentials are issued and rotated by Azure itself. The underlying Azure resource (a VM, App Service, Function App, Logic App, AKS cluster) retrieves the OAuth access token via the Instance Metadata Service endpoint. Managed identities are not currently in scope for PIM activation; their role assignments are permanent and active [@ms-learn-managed-identities-overview].

IMDS token endpoint: the Azure Instance Metadata Service endpoint at `http://169.254.169.254/metadata/identity/oauth2/token`, a link-local non-routable address reachable only from inside the Azure resource itself, that returns an OAuth 2.0 access token for the attached managed identity. The address is the credential: any process running on the resource can fetch the token without storing or presenting any secret.

sequenceDiagram
    autonumber
    participant Attacker
    participant FunctionApp as Compromised Function App
    participant IMDS as IMDS endpoint 169.254.169.254
    participant ARM as Azure Resource Manager
    participant PIMUnused as PIM activation (unused)
    Attacker->>FunctionApp: Code execution via supply-chain or vuln
    FunctionApp->>IMDS: GET /metadata/identity/oauth2/token
    IMDS-->>FunctionApp: OAuth access token for managed identity
    FunctionApp->>ARM: Action as Owner on subscription
    ARM-->>FunctionApp: Action succeeds
    Note over PIMUnused,Attacker: No human, no MFA, no activation, no PIM audit

MITRE ATT&CK maps the class explicitly. T1078.004 -- Valid Accounts: Cloud Accounts cites Robbins's Part 1 as primary reference for the managed-identity case [@mitre-t1078-004]. The page reads: "In Azure environments, adversaries may target Azure Managed Identities, which allow associated Azure resources to request access tokens. By compromising a resource with an attached Managed Identity, such as an Azure VM, adversaries may be able to Steal Application Access Tokens to move laterally across the cloud environment" [@mitre-t1078-004].

T1548.005 -- Temporary Elevated Cloud Access explicitly names PIM as an instance of the JIT-access pattern adversaries abuse: "Many cloud environments allow administrators to grant user or service accounts permission to request just-in-time access to roles... Just-in-time access is a mechanism for granting additional roles to cloud accounts in a granular, temporary manner" [@mitre-t1548-005].

T1548.005 (Temporary Elevated Cloud Access) lists Microsoft's *Approve just-in-time access requests* documentation as citation [1] of the technique, recognizing PIM as a canonical implementation of the JIT-access pattern adversaries abuse [@mitre-t1548-005]. Being named in the ATT&CK framework is, in the security domain, the most explicit acknowledgement an adversary model can give a defensive product.

Note: Three anchors to walk away with: Andy Robbins's June 2022 Managed Identity Attack Paths series [@robbins-mip-part1], [@robbins-mip-part2], [@robbins-mip-part3]; MITRE ATT&CK T1078.004 citing Robbins as primary [@mitre-t1078-004]; the IMDS endpoint at 169.254.169.254 as the technical mechanism [@ms-learn-managed-identities-overview]. If your tenant has any managed identity with Owner or User Access Administrator at a subscription scope, you have an unmediated bypass path around PIM until that role assignment is tightened.

9.4 The assignment-bypass alert is detective, not preventive

The High-severity assignment-bypass alert documented in §7 is detective by design (see Aha #2). The structural limit it leaves open is that preventive blocking is not the PIM product's default: customers who want it layer a Conditional Access policy on the Microsoft Graph endpoint or an Azure Policy at the management-group scope [@ms-learn-azure-policy], accepting that some legitimate Graph integration may need an exception.

9.5 Customer-owned PIM policy in CSP and Lighthouse scenarios

In the partner-managed case, the customer (not the partner) controls the PIM policy on a delegated authorization [@ms-learn-lighthouse-eligible]. This is the right place to put control, but it is also the place misconfiguration is most common. A customer whose Lighthouse eligible authorization is set with permissive activation policies (no MFA, no approval, large maximum duration) has an unmediated partner activation surface, and the partner cannot tighten the customer-side policy. The MSP-managed case is the operational gotcha most frequently raised at PIM-deployment review boards.

Aha #3: The gap is a data-model problem, not a patchable defect

This is the third aha moment, and it lands differently from the first two.

Key idea: The application-identity gap is not a backlog item. Extending the eligible-active state machine from principal in (user | group) to principal in (user | group | service principal | managed identity | OAuth consent grant) is a data-model extension that would require changes to the role-assignment object schema, the Microsoft Graph role-management endpoints, the PIM evaluation pipeline, the audit-log schema, the Sentinel detection schema, and every downstream IGA tool. The 2024+ Microsoft responses extend some controls to application identities. They do not yet introduce an eligible/active assignment-category type for application principals.

Microsoft has shipped partial responses. Entra Workload ID Premium is a separate three-dollar-per-workload-identity-per-month SKU [@ms-entra-workload-id-product] that unlocks Conditional Access for workload identities [@ms-learn-ca-workload-identity] (with the explicit managed-identity exclusion clause) and ID Protection workload-identity risk detections [@ms-learn-workload-identity-risk]. The PIM page on access reviews documents that "Using Access Reviews for Service Principals requires a Microsoft Entra Workload ID Premium plan in addition to a Microsoft Entra ID P2 or Microsoft Entra ID Governance license" [@ms-learn-pim-access-reviews]. Microsoft's flagship Ignite 2025 announcement was Microsoft Entra Agent ID for AI agents [@ms-entra-ignite-2025]; that announcement provides identity for AI workloads, not an eligible-active type extension for service-principal role assignments.

Robbins's bypass class cannot be closed within the 2026 PIM architecture. Closing it requires a new architecture, not a patch.

None of these limits is a defect. Each is a deliberate design boundary, and naming them is the academic honesty the topic deserves. The interesting question: where is active research happening, and what would closing the gap actually look like?

10. Open Problems: Where Active Research Is Happening

The five limits in section 9 are settled architectural boundaries. The open problems are different. Each is something nobody has shipped a complete solution to as of 2026, but each has named partial results and named anchors.

10.1 JIT-gating application identities

The data-model extension previewed in section 9's Aha #3 is the largest open problem in this space, and the one Microsoft is responding to most publicly.

What has been tried. Entra Workload ID Premium at three dollars per workload identity per month [@ms-entra-workload-id-product]. Conditional Access for workload identities, which lets the tenant block service-principal sign-ins based on IP range, ID-Protection risk score, or authentication context [@ms-learn-ca-workload-identity]. ID Protection workload-identity risk detections that flag suspicious sign-ins, leaked credentials, and admin-confirmed compromise for service principals [@ms-learn-workload-identity-risk]. Service-principal access reviews, gated behind Workload ID Premium plus Entra ID P2 or Governance [@ms-learn-pim-access-reviews]. Microsoft Entra Agent ID, the flagship Ignite 2025 announcement, brings first-class identity to AI agents [@ms-entra-ignite-2025] -- parallel to, but not the same as, an eligible-active type extension on application role assignments.

An identity used by a software workload to authenticate to other services. In Microsoft Entra ID the term encompasses application objects, service principals, and managed identities [@ms-learn-workload-identities-overview]. As of 2026, workload identities are not in scope of the eligible/active assignment-category model. The 2024+ Workload ID Premium SKU extends sign-in-time controls and risk detection to service principals, but does not yet introduce an eligible category for service-principal role assignments.

What is the conjecture? Closing this gap requires extending the role-assignment object's principal axis to include service principals, managed identities, and OAuth consent grants as first-class subjects of the eligible-active state machine. That extension would require a defined activate() semantics for non-human principals -- itself the hard problem, because the canonical user activation flow assumes an interactive MFA challenge.

Microsoft Learn states the difficulty bluntly: workload identities "can't perform multifactor authentication. Often have no formal lifecycle process. Need to store their credentials or secrets somewhere" [@ms-learn-workload-identities-overview]. The non-interactive case requires either programmatic policy gates (request from this caller, from this IP range, against this entitlement) or a delegation model where a human approver supplies the gate-passing event on the workload's behalf.
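As a thought experiment only, the first option (a programmatic policy gate) might look like the sketch below. Nothing here is a shipping Microsoft API: `WorkloadGate`, its fields, and the entitlement string format are all hypothetical names chosen to mirror the three conditions in the text (this caller, this IP range, this entitlement).

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class WorkloadGate:
    """Hypothetical activation gate for a principal that cannot do MFA."""
    allowed_caller: str    # identifier of the one workload allowed to activate
    allowed_network: str   # CIDR range the request must originate from
    entitlement: str       # the single role-at-scope this gate covers

    def permits(self, caller: str, source_ip: str, entitlement: str) -> bool:
        # All three checks must pass; there is no interactive fallback.
        return (
            caller == self.allowed_caller
            and ip_address(source_ip) in ip_network(self.allowed_network)
            and entitlement == self.entitlement
        )


gate = WorkloadGate("sp-deploy-01", "10.20.0.0/16", "Contributor@rg-example")
print(gate.permits("sp-deploy-01", "10.20.4.7", "Contributor@rg-example"))    # True
print(gate.permits("sp-deploy-01", "203.0.113.9", "Contributor@rg-example"))  # False
```

The sketch shows why the problem is hard: every condition is something an attacker with code execution on the workload also satisfies, which is the argument for the delegated human-approver alternative.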

10.2 Real-time activation-anomaly blocking

The PIM Alert "Roles are being activated too frequently" is post-hoc. It fires after the activation has already occurred and after the count crosses a threshold. The phished-but-still-authentic activation -- the attacker who supplies a valid MFA, a plausible justification, and a real ticket number -- is observationally indistinguishable from a legitimate emergency activation at the mechanism layer. The only signal that distinguishes them must come from behavioural telemetry.

What has been tried. Microsoft Defender for Cloud Apps ships an out-of-the-box user-and-entity behavioural analytics (UEBA) and machine-learning anomaly-detection layer; the documented policy weighs more than thirty risk indicators across eight risk-factor groups (risky IP, login failures, admin activity, inactive accounts, location, impossible travel, device and user agent, activity rate), with a seven-day initial learning period and a June 2025 transition to a dynamic threat-detection model [@ms-learn-dfca-anomaly]. Microsoft Sentinel UEBA scores anomalies post-event against AuditLogs operations including role-eligibility additions and activations [@ms-learn-sentinel-ueba]. Microsoft Defender for Identity correlates on-premises and cloud sign-in patterns for behavioural-anomaly detection. Neither Sentinel UEBA nor Defender for Cloud Apps is a synchronous gate. Both are detective layers that fire after the activation event has already created consequences.

The academic upper bound for what character-level and LSTM detectors achieve on adjacent tasks comes from Hendler, Kels, and Rubin's 2019 work on AMSI-based detection of malicious PowerShell code, which reports a true-positive rate of nearly 90% at a false-positive rate of less than 0.1% on the PowerShell-misuse classification problem [@arxiv-hendler-1905]. That is the ceiling a probabilistic activation-anomaly classifier could approach. It is not enough to gate synchronously without false-positive operational pain, which is why the deployed surface is post-hoc UEBA scoring rather than pre-commit blocking.

The conjecture. Synchronous gating on behavioural signal at activation time would require Conditional Access (or its successor) to subscribe to an activation-event hook and consume a risk score from ID Protection, Defender for Cloud Apps, or Sentinel UEBA in the few hundred milliseconds before PIM materializes the active assignment. The architectural primitives exist; the synchronous risk-evaluation hook does not yet ship.

10.3 Hybrid-bridge JIT

A single approval workflow spanning the on-premises (MIM PAM / shadow principals) and cloud (Entra PIM) boundaries is not a shipping product. Microsoft has Entra Cloud Sync and Entra Connect for directory synchronization; neither bridges the activation workflow. MIM 2016 is on extended support through January 9, 2029 [@ms-learn-mim-2016]; Microsoft Learn states the path forward is cloud-first PIM with on-prem AD progressively scoped down to the few resources that cannot move [@ms-learn-mim-pam-overview].

MIM 2016 PAM is in extended support, not active development, and Microsoft Learn explicitly states it is "not recommended for new deployments in Internet-connected environments" [@ms-learn-mim-pam-overview]. SP3 ships compatibility updates for SharePoint SE, Exchange SE, and SQL Server 2022 [@ms-learn-mim-2016], but the product line is in maintenance posture. The on-premises half of a hybrid-bridge JIT story requires a different architectural choice than re-investing in MIM.

10.4 Coverage-as-code

How do you evaluate PIM policy coverage in CI/CD for a tenant with two hundred custom Azure roles and fifty directory roles, and gate every PR that touches the role-management policies?

Best partial results. Microsoft Cloud Security Benchmark v3 Privileged Access controls (PA-1, PA-2, ...) give Boolean per-recommendation pass/fail evaluation [@ms-learn-mcsb-v3-pa] -- close, but per-recommendation Boolean rather than composable policy. The PowerShell cmdlets Get-MgPolicyRoleManagementPolicy and Get-MgPolicyRoleManagementPolicyAssignment read role-management policies via Microsoft Graph; the cmdlets ship in the Microsoft.Graph.Identity.SignIns module, despite the Identity Governance branding [@ms-learn-graph-pim-policy-cmdlet]. The EntraOps Privileged EAM community project on GitHub, maintained by Thomas Naunheim, demonstrates the "track changes and history of privileged principals and their assignments as code" idiom against the Enterprise Access Model classification [@entraops-github]. Azure Policy itself operates on Azure resource configurations and does not directly evaluate PIM role-management policy state [@ms-learn-azure-policy], which is the data-model gap that drives the GitOps-flavoured drift-detection community pattern.

The PIM role-management-policy cmdlets are commonly mis-attributed to the Microsoft.Graph.Identity.Governance PowerShell module because of the Identity Governance branding. They are actually in Microsoft.Graph.Identity.SignIns. The Import-Module line that gets the cmdlets into scope is Import-Module Microsoft.Graph.Identity.SignIns [@ms-learn-graph-pim-policy-cmdlet].

// Take an array of role-management policy assignments
// (the kind Get-MgPolicyRoleManagementPolicyAssignment returns)
// and assert tenant-wide PIM coverage invariants.

The conjecture. A full coverage-as-code primitive needs Azure Policy (or its successor) to evaluate PIM role-management policy state with the same first-class semantics it applies to Azure resource configuration. That extension would let a tenant declare an invariant -- "every role in the control plane has requires_mfa=true and max_duration_hours <= 1" -- and have the platform enforce it continuously across drift, the way Azure Policy already enforces resource invariants.
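Until the platform evaluates such invariants itself, a CI job can approximate the check. The following is a minimal sketch, assuming policy state has already been exported (for example from the Graph cmdlets named earlier) and flattened into a simple record. `RolePolicy`, its `plane` field, and the sample data are illustrative; only `requires_mfa` and `max_duration_hours` come from the invariant quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolePolicy:
    """Simplified, hypothetical view of one role's activation policy."""
    role: str
    plane: str               # "control", "management", or "data"
    requires_mfa: bool
    max_duration_hours: int


def coverage_violations(policies: list[RolePolicy]) -> list[str]:
    """Return one message per control-plane role that breaks the invariant."""
    violations = []
    for p in policies:
        if p.plane != "control":
            continue  # the invariant only constrains control-plane roles
        if not p.requires_mfa:
            violations.append(f"{p.role}: requires_mfa is false")
        if p.max_duration_hours > 1:
            violations.append(
                f"{p.role}: max_duration_hours={p.max_duration_hours} exceeds 1"
            )
    return violations


exported = [
    RolePolicy("Global Administrator", "control", True, 1),
    RolePolicy("Privileged Role Administrator", "control", False, 4),
    RolePolicy("Reports Reader", "data", False, 8),
]
for message in coverage_violations(exported):
    print(message)
```

A pipeline would fail the pull request whenever the returned list is non-empty. That gives drift detection at merge time, not the continuous enforcement the conjecture describes.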

10.5 Adaptive-cadence eligibility reviews

Should eligible membership be access-reviewed at higher cadence than active assignments? Eligible membership is standing privilege; active membership is bounded. The argument for adaptive cadence -- reviewing eligibility more frequently when behavioural signals or organizational events suggest the principal may no longer need the role -- is intuitive but mechanically unshipped.

Best partial result. The 2024+ ML-based access-review recommendations [@ms-learn-review-recommendations] -- inactive-user 30-day Deny, user-to-group-affiliation Deny -- are within-cycle reviewer-assist features. They help reviewers decide during a configured access review. They are not cross-cycle adaptive-cadence triggers that fire a new review off-schedule when conditions warrant.

These are research problems. The practitioner does not have the luxury of waiting for them to be solved. What does Monday morning look like for the architect who has read this far and now has to deploy?

11. Practical Guide: Monday Morning for the 2026 Tenant Architect

You have read ten thousand words. You are responsible for a Microsoft 365 tenant that audits against SOX, SOC 2, and ISO 27001. You have a budget for Entra ID P2 (or Entra ID Governance) per privileged user. What do you do on Monday?

Work in this order. The list is ordered by cost-to-impact, with the cheapest, highest-impact items first.

Step 1: Baseline the Tier-0 surface

Every directory role at "Privileged" classification or above should be PIM-eligible-only. The exceptions are the two emergency-access permanent-active Global Administrator accounts (break-glass), which we return to in Step 4.

Activation requires MFA, approval, justification, and ticket number for control-plane and management-plane roles. Maximum activation duration is one hour for Global Administrator and Privileged Role Administrator, and four hours for less-privileged roles. Configure per role per scope; remember that PIM-for-Azure-Resources policies do not inherit.

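# The role-management directory cmdlets ship in Microsoft.Graph.Identity.Governance.
Import-Module Microsoft.Graph.Identity.Governance
Connect-MgGraph -Scopes 'RoleManagement.Read.Directory','User.Read.All'
# Resolve the role definition ID for Global Administrator.
$gaRoleId = (Get-MgRoleManagementDirectoryRoleDefinition `
    -Filter "displayName eq 'Global Administrator'").Id
# List the current assignments of that role and expand each principal to show its UPN.
Get-MgRoleManagementDirectoryRoleAssignment `
    -Filter "roleDefinitionId eq '$gaRoleId'" `
    -ExpandProperty Principal |
    Select-Object @{n='User';e={$_.Principal.AdditionalProperties.userPrincipalName}}, RoleDefinitionId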

This lists every standing-active Global Administrator in the tenant. Compare against your break-glass roster and your active PIM activations. Anything else is technical debt.
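The comparison itself is a set difference. A minimal sketch, assuming the three lists have been exported as user principal names; the function names and sample addresses are illustrative:

```python
def normalise(upns):
    """Lower-case and trim UPNs so comparisons ignore case and stray spaces."""
    return {u.strip().lower() for u in upns}


def unexplained_standing_admins(active, break_glass, pim_activated):
    """Active Global Administrators that are neither break-glass nor PIM-activated."""
    return sorted(normalise(active) - normalise(break_glass) - normalise(pim_activated))


active = ["bg1@contoso.com", "bg2@contoso.com", "alice@contoso.com", "Carol@contoso.com"]
break_glass = ["bg1@contoso.com", "bg2@contoso.com"]
pim_activated = ["alice@contoso.com"]

print(unexplained_standing_admins(active, break_glass, pim_activated))
# ['carol@contoso.com']
```

Anything the function returns is a standing assignment with no documented reason to exist.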

Step 2: Configure access reviews

Quarterly for Tier-0 and control-plane roles. Semi-annually for Tier-1 and management-plane. Annually for Tier-2 and data/workload-plane [@ms-learn-pim-access-reviews]. Turn on the ML-based review recommendations: the 30-day inactive-user Deny recommendation is the reviewer-assist baseline, and the user-to-group-affiliation Deny recommendation helps reviewers spot principals who are organizationally distant from the rest of the group's membership [@ms-learn-review-recommendations].

Step 3: Turn on every PIM Alert and tune the GA-count threshold

Enable all six behavioural PIM Alerts. Tune the "There are too many Global Administrators" alert to a minimum count of two and a percentage of 50% [@ms-learn-pim-alerts]. The expected steady-state count is "fewer than five standing GAs, most of which are break-glass." The High-severity assignment-bypass alert is non-negotiable; route it to a 24x7 SOC queue with an incident-response runbook.

Microsoft Secure Score's "Limit the number of Global Administrators" recommendation targets fewer than five standing GAs as the canonical baseline.

Step 4: Break-glass discipline

Two emergency-access permanent-active Global Administrator accounts. Not one, not three.

Note: One break-glass account is a single point of failure: if it is locked, lost, or compromised, the tenant has no emergency entry path. Three or more begin to expand the blast radius unnecessarily. Two balances the two failure modes. FIDO2 hardware keys, stored in physical safes, with continuous sign-in alerting.

Note: Conditional Access policies can lock you out. Break-glass accounts must be excluded from every CA policy that could prevent their sign-in. Compensate with continuous sign-in alerting on every break-glass authentication event; alerts are the substitute for the gate you are deliberately removing.

Step 5: Extend PIM to the four boundaries

PIM-for-Groups: gate ownership of every directory-role-assignable group, every privileged-access security group, and every group that grants management-group-level Azure RBAC. Membership alone is insufficient; ownership is a backdoor to membership.

PIM-for-Azure-Resources: gate Owner, User Access Administrator, and Contributor at the management-group scope, then explicitly at every subscription, every resource group, and every resource where the role is assignable. Inheritance does not flow; configure per scope.

GDAP and Lighthouse: every CSP partner authorization must be eligible, not active. Set the customer-side PIM policy explicitly. Audit annually.

PIM with Conditional Access: attach an authentication-context tag to activation policies on the privileged Entra roles. Add a CA policy that requires a compliant device and a fresh MFA challenge on activation. The activation gate becomes structurally tighter than the sign-in gate, which is the correct ordering for high-privilege actions.

Step 6: Continuous detection

Pipe PIM activation events (via Microsoft Graph audit logs, surfaced in the AuditLogs and MicrosoftGraphActivityLogs Azure Monitor tables) to your SIEM. Cross-correlate with Entra ID Protection sign-in risk and Microsoft Sentinel UEBA anomaly signals [@ms-learn-sentinel-ueba]. KQL templates to write: (a) GA activations outside business hours; (b) activations from non-compliant devices; (c) the assignment-bypass alert correlated with the activating principal's recent sign-in risk score; (d) managed-identity token issuance against subscription-scoped Owner.
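Template (a) is easy to prototype offline before committing it to KQL. The sketch below assumes activation events have been exported as dictionaries with `role`, `principal`, and an ISO 8601 `activated_at` field; those field names, and the 08:00 to 18:00 UTC weekday window, are assumptions to adjust for your tenant.

```python
from datetime import datetime, time

BUSINESS_START, BUSINESS_END = time(8, 0), time(18, 0)  # assumed window, in UTC


def outside_business_hours(ts: datetime) -> bool:
    """True on weekends or outside the assumed weekday window."""
    is_weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    return is_weekend or not (BUSINESS_START <= ts.time() < BUSINESS_END)


def off_hours_ga_activations(events):
    """Filter exported activation events down to off-hours GA activations."""
    return [
        e for e in events
        if e["role"] == "Global Administrator"
        and outside_business_hours(datetime.fromisoformat(e["activated_at"]))
    ]


events = [
    {"role": "Global Administrator", "principal": "alice@contoso.com",
     "activated_at": "2026-05-26T14:03:01+00:00"},   # Tuesday afternoon
    {"role": "Global Administrator", "principal": "dave@contoso.com",
     "activated_at": "2026-05-26T02:15:00+00:00"},   # Tuesday, 02:15
    {"role": "User Administrator", "principal": "erin@contoso.com",
     "activated_at": "2026-05-26T03:00:00+00:00"},   # off-hours, but not GA
]
for e in off_hours_ga_activations(events):
    print(e["principal"], e["activated_at"])
```

Only the 02:15 Global Administrator activation is flagged. The production query should also convert timestamps to the principal's local time zone, which this sketch leaves out.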

Step 7: Mind the application-identity surface

This is the longest-running open item. Inventory every managed identity in the tenant. For each, document the role assignment, the scope, and the resource that holds it.

Apply the "Owner and User Access Administrator at subscription scope is dangerous" rule first; tighten those to Contributor or a custom role wherever possible. Where a managed identity must hold a high-privilege role at a high scope, treat the underlying resource (Function App, Logic App, VM, AKS cluster) as a Tier-0 asset for the purposes of patching, network exposure, and code-review process. Until PIM gates application identities natively, the Tier-0-asset framing is the substitute control.
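The first-pass rule can be applied mechanically to the inventory. A minimal sketch, assuming each inventory row is a dictionary with `principal_type`, `role`, `scope`, and `resource` keys. The keys and the `"ManagedIdentity"` label are this sketch's own convention, not a Graph or ARM schema.

```python
HIGH_RISK_ROLES = {"Owner", "User Access Administrator"}


def is_subscription_scope(scope: str) -> bool:
    """True for scopes of the form /subscriptions/<id> with nothing below."""
    parts = [p for p in scope.strip("/").split("/") if p]
    return len(parts) == 2 and parts[0].lower() == "subscriptions"


def risky_managed_identity_assignments(inventory):
    """Managed identities holding a high-risk role directly at subscription scope."""
    return [
        row for row in inventory
        if row["principal_type"] == "ManagedIdentity"
        and row["role"] in HIGH_RISK_ROLES
        and is_subscription_scope(row["scope"])
    ]


inventory = [
    {"principal_type": "ManagedIdentity", "role": "Owner",
     "scope": "/subscriptions/0000-example", "resource": "func-billing"},
    {"principal_type": "ManagedIdentity", "role": "Contributor",
     "scope": "/subscriptions/0000-example", "resource": "logic-intake"},
    {"principal_type": "ManagedIdentity", "role": "Owner",
     "scope": "/subscriptions/0000-example/resourceGroups/rg-app", "resource": "vm-build"},
]
for row in risky_managed_identity_assignments(inventory):
    print(row["resource"], row["role"], row["scope"])
```

Each resource the function returns is a candidate for either a tighter role or the Tier-0-asset treatment described above.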

That is the playbook for the user-principal side of the JIT-admin problem. The application-identity side is still being written. The next iteration of this material will be about the data-model extension that closes Robbins's gap, or the architectural successor that arrives in its place.

12. Frequently Asked Questions and Closing

Three classes of question come up every time this material is taught. The first is conceptual ("what does eligible actually mean?"). The second is operational ("do I need MFA?"). The third is adversarial ("what about managed identities?"). Each appears below.

No. Eligible assignments are permanent in most tenants -- they are the standing relationship between principal and role -- but they grant no privilege until you activate. Only the *active* state is bounded. Your admin rights still exist; they are simply not exercised continuously [@ms-learn-pim-configure].

Only if the role's activation policy is configured to require it. PIM's activation gates -- MFA at activation, approval, justification, ticket number, and activation maximum duration -- are per-role, per-scope flags the tenant sets independently. A role with `requires_mfa=false` and `requires_approval=false` is a valid (if loose) PIM configuration [@ms-learn-pim-change-default-settings].

One hour for the highest-privileged Entra directory roles, including Global Administrator and Privileged Role Administrator. The configurable range is one to twenty-four hours per role per scope [@ms-learn-pim-change-default-settings]. Tighten where you can; the activation cost is small, the standing-active surface saving is large.

No. Conditional Access gates the sign-in event. PIM bounds the assignment state. A compromised CA-gated GA still has GA privileges once they sign in -- the gate that mattered (activation) was never traversed. CA and PIM compose; PIM is not a substitute for CA, and CA is not a substitute for PIM.

No. PIM alerts via the High-severity "Roles are being assigned outside of Privileged Identity Management" alert when a direct assignment happens [@ms-learn-pim-alerts]. The detection is intentional rather than preventive: blocking direct assignment would break the Microsoft Graph integration surface every legitimate administrative tool uses. Preventive controls -- Conditional Access on the Graph endpoint, Azure Policy at the management-group scope, or entitlement-management workflows -- are added separately based on the tenant's tooling estate.

No. PIM's eligible/active state machine is defined over user and group principals. Service principals, managed identities, and OAuth consent grants route around PIM activation entirely. Andy Robbins's June 2022 *Managed Identity Attack Paths* series [@robbins-mip-part1], [@robbins-mip-part2], [@robbins-mip-part3] is the canonical demonstration; MITRE ATT&CK T1078.004 [@mitre-t1078-004] cites Robbins as primary reference. Workload ID Premium plus Conditional Access for workload identities extends sign-in-time controls to service principals (with managed identities still excluded), but does not yet introduce an eligible category for workload-identity role assignments [@ms-learn-ca-workload-identity], [@ms-learn-workload-identity-risk].

Microsoft has shifted the framing to the Enterprise Access Model: control plane, management plane, and data/workload plane [@ms-learn-eam]. The retirement of Tier-0/1/2 is partial; the practitioner community still uses the legacy terms day to day. The underlying principle -- privilege boundaries you do not cross with a single credential -- is preserved across both framings.

Closing

Read the section 1 vignette again. The 2026 tenant where alice@contoso.com is Global Administrator for exactly one hour, with an audit log so complete the SOC 2 auditor signs it without questions, is not a configuration choice. It is the visible behaviour of an identity system whose role-assignment object carries one more field than the 2015 version did. Standing admin did not retire because operators got more disciplined. Standing admin retired because the data model grew a second state.

The forty years between Saltzer and Schroeder's 1975 paper and the 2015 Azure AD PIM Preview were not lost time. UNIX sudo, Kerberos delegation, DACLs, AD groups, MIM PAM, Pass-the-Hash v1 and v2, the Securing Privileged Access roadmap -- each built up the structural understanding that least privilege required a temporal mechanism, not just a static one, and that the temporal mechanism had to live on the assignment object itself, not on the group, the credential, the session, or any indirection through a separate forest. The single new field on the role-assignment object is what those forty years were preparing.

What remains undone is the application-identity boundary. The same role-assignment object Microsoft retrofitted to gate user activation does not yet gate the managed identity attached to a Function App. The IMDS endpoint at 169.254.169.254 is the canonical 2026 bypass path that proves it. Closing that gap, when it comes, will not be a patch to the existing eligible/active state machine. It will be the next chapter -- the one where the state machine learns to apply to a principal that cannot perform an interactive MFA challenge, and the activation semantics are reinvented for the non-interactive case.

The story is not finished. But the first chapter -- the chapter where standing admin became visibly the anti-pattern it had always been -- is.

Hyper-V Enlightenments, VMBus, and the Synthetic Device Model

noreply@paragmali.com (Parag Mali) — Thu, 14 May 2026 00:00:00 GMT

Hyper-V's guest OSes do not see emulated 1990s hardware. They see a published, versioned hypervisor ABI called the **Top-Level Functional Specification**, a transport called **VMBus** that consists of two ring buffers per channel, and a catalogue of synthetic devices whose backends live in the privileged root partition. This design is what makes Windows and Linux equally fast inside Hyper-V, and it is also why the host-side parsers in `vmswitch.sys` keep producing critical CVEs. The 2024 OpenHCL paravisor moves those parsers into the guest's own trust boundary in memory-safe Rust, which is the most consequential change to the Hyper-V device model since 2008.

1. The Type-1 hypervisor foundation

Open Task Manager on a modern Windows 11 desktop, switch to the Performance tab, and look at the line that says "Virtualization: Enabled." That single line hides one of the most consequential design choices in modern operating systems: when Microsoft shipped Hyper-V with Windows Server 2008 in June 2008 [@ms-hyperv-server-overview], they did not bolt a virtualization product on top of Windows. They put a small hypervisor underneath it.

That ordering matters more than it sounds. In the older Microsoft Virtual Server 2005 model, Windows ran on the bare metal and a user-mode service emulated PC hardware for guests inside it. In the Hyper-V architecture documented by Microsoft in 2008 [@ms-hyperv-architecture], the hypervisor boots first and Windows itself becomes a guest of the hypervisor. Microsoft calls this guest the root partition. Every other VM on the box is a child partition.

A hypervisor that runs directly on the physical hardware rather than inside a host operating system. Hyper-V, VMware ESXi, and Xen are Type-1; VirtualBox and the original Microsoft Virtual Server are Type-2 (hosted). In a Type-1 design no general-purpose OS sits between the hypervisor and the silicon, which lets the hypervisor enforce isolation directly using CPU virtualization extensions like Intel VT-x and AMD-V.

The root partition is not just another VM. It is a privileged partition: it owns the physical I/O devices, runs the parent stack of synthetic-device backends, and brokers everything that touches real hardware. Children get virtual processors and a slice of memory, and they communicate with the root over a software bus called VMBus that we will spend most of this article taking apart.

flowchart TD
HW["Physical hardware (CPU, RAM, NICs, NVMe)"]
HV["Hyper-V hypervisor (microkernel)"]
Root["Root partition (Windows Server)"]
VSP["Virtualization Service Providers (VSPs): vmswitch.sys, storvsp.sys, ..."]
C1["Child partition: Windows VM"]
C2["Child partition: Linux VM"]
VSC1["VSCs: netvsc, storvsc, ..."]
VSC2["VSCs: hv_netvsc, hv_storvsc, ..."]
HW --> HV
HV --> Root
HV --> C1
HV --> C2
Root --> VSP
VSP -. "VMBus channel" .-> VSC1
VSP -. "VMBus channel" .-> VSC2
C1 --> VSC1
C2 --> VSC2

The hypervisor itself is small by design. The Hyper-V architecture page on Microsoft Learn [@ms-hyperv-architecture-perf] describes it as a microkernel: it does the minimum a hypervisor must do (CPU scheduling, memory partitioning, interrupt routing, an inter-partition message bus) and pushes everything else, including the device models, out to the root partition. This is the opposite of the early VMware ESX design, where the hypervisor itself contained large device drivers.

The microkernel choice was pragmatic, not ideological. A monolithic hypervisor with built-in NIC and storage drivers would have been a catastrophic certification problem: every NIC firmware update would risk a hypervisor patch. By delegating I/O to the Windows root partition, Microsoft re-used the entire Windows driver stack.

The split also explains why Hyper-V "feels Windows-shaped" even though it is technically not Windows. The root partition is Windows, with all of its drivers, its WMI, its event log, its Get-VM PowerShell cmdlets. The hypervisor underneath is a small, separate binary (hvix64.exe on Intel, hvax64.exe on AMD) that you almost never have a reason to think about. Microsoft's documentation goes further and stresses that all device-model traffic flows through the root: "the management operating system hosts virtual service providers (VSPs) that communicate over the VMBus to handle device access requests from child partitions" (Microsoft Learn: Overview of Hyper-V [@ms-overview-hyper-v]).

This sets up the question the rest of the article answers: if the hypervisor is small, the guest is unmodified Windows or Linux, and the root partition owns the real devices, then how does a guest actually do disk and network I/O at gigabit-or-better speeds without paying enormous costs to traverse all of these boundaries?

The short answer is in three pieces: enlightenments (the guest knows it is virtualized and uses hypercalls), VMBus (the inter-partition transport), and the VSP/VSC pair (split drivers that share memory through VMBus rings). The next section starts with the first of those three.

2. Enlightenments: what "knowing you are virtualized" buys you

In the early 2000s, the dominant intuition was that a hypervisor's job is to fool the guest. A perfectly faithful emulation of an Intel 440BX motherboard, a DEC 21140 NIC, and an IDE controller is what made VMware Workstation a useful product in 1999. It is also what made Microsoft Virtual Server 2005 too slow to saturate gigabit links: every out instruction on a fake NIC port trapped to the hypervisor, was decoded against an in-memory chip model, and produced a synthetic interrupt that itself trapped on the way out. The Microsoft Virtual Server retrospective on Wikipedia [@wikipedia-virtual-server] notes that the architecture had no paravirtualization support and that performance was constrained relative to later hardware-assisted designs.

Hyper-V's answer was to drop the pretence. If the guest knows it is in a VM, it can use a fast path designed for VMs instead of pretending to drive imaginary chips. Microsoft calls this knowledge an enlightenment, and the Hyper-V feature discovery page [@ms-tlfs-feature-discovery] is the contract a guest uses to learn what enlightenments the hypervisor offers.

A modification or feature in a guest operating system that takes advantage of running under a specific hypervisor. An enlightened guest detects the hypervisor (on x86, by reading the `cpuid` leaves at `0x40000000` and above), then opts in to using paravirtual interfaces (hypercalls, synthetic timers, synthetic interrupt controllers, shared TSC pages) instead of trapping on emulated hardware. An unmodified guest would still boot, but slower.
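The detection step can be illustrated without executing `cpuid`. The sketch below decodes the twelve-byte vendor signature that leaf `0x40000000` returns in EBX, ECX, and EDX. The register values are the ones I understand the TLFS to document for Hyper-V ("Microsoft Hv"); the function names are illustrative, and a real guest would obtain the registers from the instruction itself rather than from constants.

```python
import struct

HYPERV_SIGNATURE = "Microsoft Hv"


def hypervisor_vendor(ebx: int, ecx: int, edx: int) -> str:
    """Decode the vendor ID packed little-endian into EBX, ECX, EDX."""
    raw = struct.pack("<III", ebx, ecx, edx)
    return raw.decode("ascii", errors="replace")


def is_hyperv(ebx: int, ecx: int, edx: int) -> bool:
    return hypervisor_vendor(ebx, ecx, edx) == HYPERV_SIGNATURE


# Register contents assumed for CPUID leaf 0x40000000 on a Hyper-V guest.
regs = (0x7263694D, 0x666F736F, 0x76482074)
print(hypervisor_vendor(*regs))   # Microsoft Hv
print(is_hyperv(*regs))           # True
```

A guest that sees any other signature skips the Hyper-V enlightenments and falls back to the emulated-hardware path.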

Detection is the cheap part. The Linux kernel's Hyper-V overview document [@kernel-hyperv-overview] describes four cooperating mechanisms, layered atop one another: implicit traps that the hypervisor handles transparently, explicit hypercalls the guest issues on purpose, synthetic registers exposed as model-specific registers (MSRs) in the architectural CPU register file, and VMBus for high-bandwidth device traffic. Each layer builds on the one below it.

Key idea: The contract between Hyper-V and its guests is published. Microsoft maintains the Top-Level Functional Specification as a public document under the Open Specification Promise. That single decision is why Linux ships an in-tree Hyper-V driver stack and why VMBus is not a black box.

The hypercall page

The first thing an enlightened guest does is set up a hypercall page. The TLFS Hypercall Interface page [@ms-tlfs-hypercall] describes the dance: the guest writes its identity into HV_X64_MSR_GUEST_OS_ID (MSR 0x40000000), then writes a guest-physical address and an enable bit into HV_X64_MSR_HYPERCALL (MSR 0x40000001). The hypervisor responds by populating that page with the right opcode for the current CPU: vmcall on Intel, vmmcall on AMD. From that moment on, "make a hypercall" is a normal call into a known address rather than an opcode the kernel must hand-assemble per CPU vendor.

This trick neatly externalises the vendor-specific calling convention. Microsoft can later swap to a new opcode (say, on ARM64, where the equivalent is an HVC instruction) without any guest code change. The guest just learns the new page contents.
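The value written to the second MSR is a small bit-packing exercise. The sketch below composes it under the layout I understand the TLFS to define (enable flag in bit 0, guest page frame number in bits 63:12). Treat that layout as an assumption to verify against the specification. The helper name is illustrative, and the code only computes the value, since writing an MSR requires kernel mode.

```python
HV_X64_MSR_GUEST_OS_ID = 0x40000000
HV_X64_MSR_HYPERCALL = 0x40000001
PAGE_SHIFT = 12  # 4 KiB pages


def hypercall_msr_value(page_gpa: int, enable: bool = True) -> int:
    """Pack a page-aligned guest-physical address and the enable bit."""
    if page_gpa % (1 << PAGE_SHIFT) != 0:
        raise ValueError("hypercall page must be 4 KiB aligned")
    gpfn = page_gpa >> PAGE_SHIFT          # guest page frame number
    return (gpfn << PAGE_SHIFT) | int(enable)


value = hypercall_msr_value(0x1234000)
print(hex(HV_X64_MSR_HYPERCALL), "<-", hex(value))   # 0x40000001 <- 0x1234001
```

The order matters: per the text, the guest identity goes into `HV_X64_MSR_GUEST_OS_ID` first, and only then is the hypercall page enabled.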

The same TLFS page documents two hypercall classes: simple hypercalls (one operation, returns or faults) and rep (repeated) hypercalls that take a counter and a start index, so a long-running operation can yield mid-flight without losing work. Three calling conventions exist: a memory-based one for large parameter blocks, a register-only fast variant for the very common case of one or two inputs, and an XMM-register variant that lets a guest pass up to 112 bytes of input through SSE registers.
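A rough sketch of how a rep hypercall resumes after yielding. The field positions used here (call code in bits 15:0, fast flag at bit 16, rep count at bits 43:32, rep start index at bits 59:48) and the register set behind the 112-byte figure (two general-purpose registers plus six XMM registers) are my reading of the TLFS rather than something stated in this article; the toy hypervisor and the call code are hypothetical.

```javascript
// Assumed field layout of the 64-bit hypercall input value.
const FAST_BIT = 1n << 16n;
const REP_COUNT_SHIFT = 32n;
const REP_START_SHIFT = 48n;
const FIELD_12_BITS = 0xfffn;

// Assumed register budget of the XMM fast variant: 2 GPRs + 6 XMM registers.
const XMM_FAST_INPUT_BYTES = 2 * 8 + 6 * 16;

function encodeInput({ callCode, fast = false, repCount = 0, repStart = 0 }) {
  let value = BigInt(callCode) & 0xffffn;
  if (fast) value |= FAST_BIT;
  value |= (BigInt(repCount) & FIELD_12_BITS) << REP_COUNT_SHIFT;
  value |= (BigInt(repStart) & FIELD_12_BITS) << REP_START_SHIFT;
  return value;
}

function decodeRep(value) {
  return {
    repCount: Number((value >> REP_COUNT_SHIFT) & FIELD_12_BITS),
    repStart: Number((value >> REP_START_SHIFT) & FIELD_12_BITS),
  };
}

// Toy hypervisor: handles at most `budget` elements per entry, then yields
// and reports how many reps are complete so far.
function makeToyHypervisor(budget) {
  return (input) => {
    const { repCount, repStart } = decodeRep(input);
    return Math.min(repCount, repStart + budget);
  };
}

// Guest-side loop: re-issue the call with the start index advanced until done.
function runRepHypercall(callCode, repCount, hypervisor) {
  let repsDone = 0;
  let entries = 0;
  while (repsDone < repCount) {
    repsDone = hypervisor(encodeInput({ callCode, repCount, repStart: repsDone }));
    entries += 1;
  }
  return entries;
}

const HYPOTHETICAL_CALL_CODE = 0x2; // placeholder, not a documented call code
console.log(runRepHypercall(HYPOTHETICAL_CALL_CODE, 10, makeToyHypervisor(4))); // 3 entries
console.log(XMM_FAST_INPUT_BYTES); // 112
```

The start index is what lets a long operation yield without losing work: each re-entry picks up where the previous one stopped.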

That XMM variant is unusual enough to flag. Most kernel ABIs do not touch SSE in privileged code because saving and restoring the full SSE state is expensive. Hyper-V's hypercall ABI uses XMM precisely because the round-trip cost of a hypercall is dominated by the VMEXIT itself, so squeezing a few more bytes into registers is cheaper than spilling them to memory and reading them back.

Synthetic interrupts and synthetic timers

A guest's virtual processor has its own emulated local APIC by default, but an enlightened guest can also use a Synthetic Interrupt Controller (SynIC), defined in the TLFS. Each virtual processor gets 16 SINT slots, a per-CPU shared message page, and a per-CPU shared event page. SINTs are how VMBus signals events to the guest without going through the legacy LAPIC fast path.

**SINT**: One of 16 logical interrupt sources per virtual processor that the Hyper-V Synthetic Interrupt Controller can signal. SINTs are reachable through MSRs (`HV_X64_MSR_SINT0` through `HV_X64_MSR_SINT15`) and back the doorbell mechanism for VMBus channels and for synthetic timers. They are paravirtual: they would not exist on a bare-metal CPU.

The clock side is even more interesting. The Linux kernel Hyper-V clocks documentation [@kernel-clocks] describes a reference TSC page that the hypervisor maintains in shared memory: it contains a scale factor and an offset such that

$$ \text{guest\_time} = \big( (\text{TSC} \times \text{scale}) \gg 64 \big) + \text{offset} $$

ticks at a constant 10 MHz frequency regardless of the underlying TSC. The guest's clock_gettime and gettimeofday can read TSC, multiply, shift, add, and return, all in user space via vDSO, with no kernel transition and no hypercall.

A web server that calls `clock_gettime` once per request, on a million-requests-per-second box, is a ridiculous workload that real systems run constantly. Without enlightenments, every call would be a `rdmsr` on a virtualised TSC or a trap into the hypervisor. With the reference TSC page, the same call is four arithmetic ops and a memory load. The kernel doc explains that this scale and offset survive live migration: "in the case of a live migration to a host with a different TSC frequency, Hyper-V adjusts the scale and offset values in the shared page so that the 10 MHz frequency is maintained" (Linux kernel: Hyper-V clocks [@kernel-clocks]).
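The arithmetic is small enough to model directly. The sketch below assumes a 64.64 fixed-point scale derived as `(10 MHz << 64) / TSC frequency`, which is my illustration of how such a scale could be chosen, not a statement of how Hyper-V computes it; the TSC frequencies and counter values are made up. It shows the guest-side read and how a new scale and offset keep the clock continuous across a migration to a host with a different TSC frequency.

```javascript
const REFERENCE_HZ = 10000000n; // 10 MHz, i.e. 100 ns units

// Hypothetical derivation of the published scale for a given TSC frequency.
function scaleForTsc(tscHz) {
  return (REFERENCE_HZ << 64n) / tscHz;
}

// The guest-side read: multiply, shift, add.
function referenceTime(tsc, page) {
  return ((tsc * page.scale) >> 64n) + page.offset;
}

// Host A: 3 GHz TSC, one second after the counter started.
const pageA = { scale: scaleForTsc(3000000000n), offset: 0n };
const beforeMigration = referenceTime(3000000000n, pageA);

// Host B: 2 GHz TSC whose raw counter reads an unrelated value at migration time.
// The offset is chosen so the reference time does not jump.
const tscOnHostB = 500000000000n;
const scaleB = scaleForTsc(2000000000n);
const pageB = {
  scale: scaleB,
  offset: beforeMigration - ((tscOnHostB * scaleB) >> 64n),
};

const justAfterMigration = referenceTime(tscOnHostB, pageB);
const oneSecondLater = referenceTime(tscOnHostB + 2000000000n, pageB);

console.log({ beforeMigration, justAfterMigration, elapsed: oneSecondLater - justAfterMigration });
```

Integer truncation means a result can land one tick below the ideal value, so the model is accurate to within a single 100 ns unit rather than exact.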

Synthetic timers complete the picture. Each virtual CPU has four synthetic timers programmable via MSRs; they fire SINTs into the SynIC. The guest does not need to touch an emulated PIT or HPET. Combined, SynIC + synthetic timers + the reference TSC page mean that an enlightened guest can do most of its time-keeping and inter-partition signalling without ever touching the legacy interrupt/timer chip surface.

The TLFS as a contract

All of this is published. The Top-Level Functional Specification [@ms-tlfs] is the document a guest author reads to know which MSRs to write, which cpuid leaves to query, which hypercalls exist, and which features the hypervisor signals via feature flags. Microsoft maintains it under the Open Specification Promise. That promise is a deliberate contractual choice. Without it, Linux could not ship drivers/hv/ in-tree and Microsoft could not credibly claim that Linux is a first-class Hyper-V guest. The TLFS is the artefact that makes the rest of the architecture cooperative rather than reverse-engineered.

The next layer up uses these primitives to build something more ambitious: a general-purpose inter-partition transport.

3. VMBus: the inter-partition transport

If enlightenments are the alphabet, VMBus is the language that synthetic devices speak. The Linux kernel VMBus document [@kernel-vmbus] puts the definition tersely: "VMBus is a software construct provided by Hyper-V to guest VMs. It consists of a control path and common facilities used by synthetic devices that Hyper-V presents to guest VMs. The common facilities include software channels for communicating between the device driver in the guest VM and the synthetic device implementation that is part of Hyper-V, and signaling primitives to allow Hyper-V and the guest to interrupt each other."

There is a lot in that paragraph. Let me unpack it, because this is the architectural core.

**VMBus**: A software-only inter-partition communication bus provided by Hyper-V. It has a control path (channel offer, open, close, rescind), and per-device data channels built on shared memory ring buffers. VMBus is not a real bus in any hardware sense; nothing on the PCIe topology is named VMBus. It is a contract between guest drivers and the hypervisor.

Channels and the offer protocol

Every synthetic device a guest sees corresponds to a VMBus channel. The root partition advertises (OfferChannel) the list of devices a guest is permitted to use. The guest's VMBus driver iterates the offers, matches each to a class GUID (synthetic SCSI is one GUID, synthetic NIC is another, the input-style vmbusrhid device is a third), and binds an in-kernel device driver to each one. The reverse operation, RescindChannel, lets the host revoke a device cleanly, which is what happens during live migration when an SR-IOV virtual function gets pulled out from under a running VM.

sequenceDiagram
    participant Root as Root partition (VSP)
    participant HV as Hyper-V hypervisor
    participant Guest as Guest VM (VSC)
    Root->>HV: OfferChannel(class_guid, instance_guid)
    HV->>Guest: ChannelOffer message via SynIC
    Guest->>HV: OpenChannel(ringbuf_gpa, signal_event)
    HV->>Root: Channel opened
    loop steady-state I/O
        Guest->>Root: write descriptor + payload to ring, signal SINT
        Root->>Guest: write response to ring, signal SINT
    end
    Root->>HV: RescindChannel(instance_guid)
    HV->>Guest: ChannelRescind via SynIC
    Guest->>Root: CloseChannel
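The offer-matching step can be modelled as a lookup table keyed by class GUID. This is a toy sketch: the GUID strings are placeholders rather than the real class GUIDs, and `onChannelOffer` and `onChannelRescind` are hypothetical handler names, not functions from any driver.

```javascript
// Placeholder class GUIDs; the real values are defined by the VMBus protocol.
const NIC_CLASS = '00000000-0000-0000-0000-0000000000a1';
const SCSI_CLASS = '00000000-0000-0000-0000-0000000000a2';

// Which guest driver claims which device class.
const driverTable = new Map([
  [NIC_CLASS, 'netvsc'],
  [SCSI_CLASS, 'storvsc'],
]);

// instance GUID -> bound driver name
const bound = new Map();

// Called once per offer; binds a driver if the class is recognised.
function onChannelOffer({ classGuid, instanceGuid }) {
  const driver = driverTable.get(classGuid);
  if (!driver) return false; // unknown class: the offer is left unclaimed
  bound.set(instanceGuid, driver);
  return true;
}

// Called when the host revokes a device; returns whether anything was unbound.
function onChannelRescind(instanceGuid) {
  return bound.delete(instanceGuid);
}

onChannelOffer({ classGuid: NIC_CLASS, instanceGuid: 'nic-0' });
onChannelOffer({ classGuid: SCSI_CLASS, instanceGuid: 'disk-0' });
onChannelOffer({ classGuid: 'unknown-class', instanceGuid: 'mystery-0' });
onChannelRescind('nic-0');

console.log([...bound.entries()]);
```

Keying the binding on the instance GUID rather than the class is what allows several devices of the same class to coexist and be rescinded independently.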

Two ring buffers, one channel

Each open channel is two unidirectional ring buffers in shared memory: one for guest-to-host messages, one for host-to-guest. Each ring has a 4 KiB header page that holds the read index, the write index, and control flags, plus a power-of-two payload region. The guest tells the hypervisor which guest-physical pages back the ring through an object called a GPA Descriptor List (GPADL), built up via the vmbus_establish_gpadl API.

The kernel doc reveals a small but durable engineering detail. The guest maps the ring into its kernel virtual address space with the payload region mapped twice: header page first, ring contents next, and then the ring contents again, contiguously. Why? Because that lets a copy loop walk past the end of the ring without writing wrap-around code; the next byte after the ring's last byte is the ring's first byte, by virtual-memory arrangement. It is the mirrored-mapping trick sometimes called a "magic ring buffer." It costs a little address space; it saves wrap-around handling on every copy.

The Linux kernel doc is explicit that this double-mapping convenience exists in the guest only. If you are writing a userspace tool that ingests a captured VMBus ring (for forensics or debugging) you must implement wrap-around manually. This is exactly the kind of detail that source code documentation captures and prose articles forget.
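For a tool that only has the raw ring bytes and no double mapping, the manual wrap-around looks roughly like this. It is a generic sketch over a `Uint8Array` payload region; the function name and the tiny 8-byte ring are illustrative, and real VMBus packet framing is not modelled.

```javascript
// Copy `length` bytes out of a circular payload region starting at `readIndex`,
// splitting the copy in two when it crosses the end of the ring.
function readFromRing(ring, readIndex, length) {
  if (length > ring.length) throw new RangeError('read larger than ring');
  const out = new Uint8Array(length);
  const untilEnd = Math.min(length, ring.length - readIndex);
  out.set(ring.subarray(readIndex, readIndex + untilEnd), 0); // tail of the ring
  out.set(ring.subarray(0, length - untilEnd), untilEnd);     // wrapped head, if any
  return { bytes: out, nextReadIndex: (readIndex + length) % ring.length };
}

const ring = Uint8Array.from([0, 1, 2, 3, 4, 5, 6, 7]);
const result = readFromRing(ring, 6, 4);
console.log(Array.from(result.bytes), result.nextReadIndex); // [6, 7, 0, 1] 2
```

The two `set` calls are exactly the branch that the guest's double mapping removes: with the payload mapped twice, a single contiguous copy would produce the same bytes.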

The total amount of GPADL-shared memory a single guest can hold is capped per Windows version. The kernel doc records the numbers: roughly 1280 MiB on Windows Server 2019 and later, roughly 384 MiB on earlier hosts (Linux kernel: VMBus [@kernel-vmbus]). For a guest with 30+ channels (multiple netvsc subchannels, multiple storvsc subchannels, vPCI, KVP, time sync, VSS, balloon, framebuffer), that ceiling is real but not yet limiting at typical ring sizes of 1 to 16 MiB per direction.
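A back-of-the-envelope check of that ceiling, using the approximate limits quoted above. The sketch counts only ring payload memory for a hypothetical 30-channel guest and ignores header pages and any other GPADL-backed buffers, so it is a lower bound on real usage.

```javascript
// Approximate aggregate GPADL limits in MiB, as quoted from the kernel doc.
const GPADL_LIMIT_MIB = { ws2019AndLater: 1280, earlier: 384 };

// Two unidirectional rings per channel.
function ringMemoryMiB(channels, ringMiBPerDirection) {
  return channels * 2 * ringMiBPerDirection;
}

const budget = [1, 16].map((ringMiB) => {
  const usedMiB = ringMemoryMiB(30, ringMiB);
  return {
    ringMiB,
    usedMiB,
    fitsOn2019: usedMiB <= GPADL_LIMIT_MIB.ws2019AndLater,
    fitsOnEarlier: usedMiB <= GPADL_LIMIT_MIB.earlier,
  };
});

console.log(budget);
```

At 1 MiB per direction the guest uses 60 MiB; at 16 MiB per direction it uses 960 MiB, which fits under the newer limit but would not fit under the older one.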

The doorbell

Shared memory alone is not enough. The guest can write into the ring all it wants; the host will not look until it is told to. Conversely, the host can write into the ring; the guest will not check until something signals it. That signal is the doorbell. From host to guest it is an interrupt delivered through the Synthetic Interrupt Controller SINTs introduced in the previous section; from guest to host it is a hypercall that signals the channel's event.

When the guest enqueues a request and the host's read pointer is already chasing it (i.e., the host is still processing the last batch), the guest can suppress the doorbell entirely. Only the first request after the host has caught up triggers a hypercall. This is interrupt coalescing in software, and it is the single most important performance lever on a software data plane: the round-trip cost of a VMEXIT is amortised across many packets.
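The suppression rule can be sketched with a toy channel. This models only the "signal when the consumer had caught up" condition described above; real rings track read and write indices in the shared header and have additional masking state that is not represented here, and `ToyChannel` is a hypothetical name.

```javascript
class ToyChannel {
  constructor() {
    this.queue = [];    // stands in for the guest-to-host ring
    this.doorbells = 0; // number of signalling hypercalls issued
  }

  // Guest side: enqueue, and ring the doorbell only if the host had caught up.
  send(message) {
    const hostHadCaughtUp = this.queue.length === 0;
    this.queue.push(message);
    if (hostHadCaughtUp) this.doorbells += 1;
  }

  // Host side: reap everything that is pending in one pass.
  drain() {
    const batch = this.queue;
    this.queue = [];
    return batch;
  }
}

const channel = new ToyChannel();
for (let i = 0; i < 5; i += 1) channel.send(`pkt-${i}`); // host still busy: 1 doorbell
const firstBatch = channel.drain();
for (let i = 5; i < 8; i += 1) channel.send(`pkt-${i}`); // host caught up: 1 more doorbell
const secondBatch = channel.drain();

console.log({ packets: firstBatch.length + secondBatch.length, doorbells: channel.doorbells });
```

Eight packets cost two doorbells in this run, which is the amortisation the paragraph above describes.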

Note: This same shape, shared memory rings plus an event-channel doorbell, was the central insight of Xen's split-driver paravirtualization model in 2003 [@xen-pv-wiki]. Hyper-V's contribution was not the shape; it was packaging the shape so unmodified Windows guests could use it via in-box drivers, and publishing the protocol so unmodified Linux could too.

VSPs and VSCs

The two endpoints of a channel have specific names. The Virtualization Service Provider (VSP) is the kernel module in the root partition that owns the device backend. The Virtualization Service Client (VSC) is the guest-side driver that talks to the VSP through the channel. Microsoft's own architecture page is precise: "the Hyper-V-specific I/O architecture consists of virtualization service providers (VSPs) in the root partition and virtualization service clients (VSCs) in the child partition. Each service is exposed as a device over VM Bus, which acts as an I/O bus and enables high-performance communication between VMs that use mechanisms such as shared memory" (Microsoft Learn: Hyper-V architecture [@ms-hyperv-architecture-perf]).

**VSP** (Virtualization Service Provider): a kernel module in the root partition that exposes a synthetic device backend to guests over a VMBus channel. Examples: `vmswitch.sys` (synthetic NIC), `storvsp.sys` (synthetic SCSI), the `vmbusrhid` server (synthetic input). **VSC** (Virtualization Service Client): the matching driver in the guest that consumes the channel and presents an OS-native device interface (a NIC, a SCSI controller, a keyboard) to the rest of the kernel.

The split is symmetric in transport (both sides use the same ring) but asymmetric in trust. The VSP runs in the most privileged context on the box, the root partition's kernel. The VSC runs in a normal guest kernel. Every byte that flows from guest to host crosses a trust boundary and gets parsed by code with full system privilege. The next two sections will return to this fact at length, because it is where the security story lives.

Why this works for closed-source guests

The Xen project tried something similar in 2003 with netfront/blkfront rings and event channels, but Xen PV required a paravirtualised guest kernel: the guest had to know it was running on Xen at compile time. Closed-source guests like Windows could not be modified, so Xen's wiki [@xen-pv-wiki] documents PV-on-HVM as the eventual workaround.

Hyper-V finessed this with hardware virtualization. The guest kernel runs unmodified inside VT-x or AMD-V; CPU-level privilege separation handles the privileged instructions. The only thing the guest needs to do to opt into VMBus is load a driver. Every supported Windows version since Windows 7 / Server 2008 R2 ships those drivers in-box. Linux ships them in-tree from kernel 2.6.32 onward. There is no separate "install paravirt drivers" step, which is why Hyper-V "just works" for almost any guest you point at it.

The transport is settled. What rides on it is a catalogue.

4. Synthetic device classes: storage, network, input, video, vPCI

A modern Hyper-V guest, on first boot, sees a small zoo of devices that have nothing to do with PC hardware. There is no IDE controller, no PS/2 keyboard, no Cirrus VGA. There is a synthetic SCSI controller, a synthetic NIC, a synthetic keyboard and mouse, a synthetic framebuffer, and (often) a synthetic PCI passthrough channel. Each is a VSP/VSC pair on top of VMBus.

The Linux kernel VMBus document [@kernel-vmbus] enumerates the catalogue: synthetic SCSI controller (storvsc), synthetic NIC (netvsc), synthetic framebuffer (synthvid), synthetic keyboard, synthetic mouse, PCI passthrough, plus the non-device services: heartbeat, time sync, shutdown, memory balloon, KVP exchange, and online backup (VSS).

flowchart LR
    subgraph Guest
        nv["netvsc (NIC)"]
        st["storvsc (SCSI)"]
        sv["synthvid (framebuffer)"]
        kb["hyperv-keyboard"]
        ms["hyperv-mouse"]
        pc["pci-hyperv (vPCI)"]
        kvp["hv_kvp (KVP)"]
        ts["hv_utils (timesync, shutdown, heartbeat)"]
    end
    subgraph Root
        vsw["vmswitch.sys"]
        sto["storvsp.sys"]
        sfb["synthvid VSP"]
        rhid["vmbusrhid VSP"]
        vpci["vPCI VSP"]
        kvpd["KVP daemon"]
        tsd["IS daemons"]
    end
    nv -- "VMBus channel" --- vsw
    st -- "VMBus channel(s)" --- sto
    sv -- "VMBus channel" --- sfb
    kb -- "VMBus channel" --- rhid
    ms -- "VMBus channel" --- rhid
    pc -- "VMBus channel" --- vpci
    kvp -- "VMBus channel" --- kvpd
    ts -- "VMBus channel" --- tsd

Synthetic SCSI: storvsc

The storvsc VSC presents itself to the guest as a SCSI host bus adapter. Disks attached to the VM appear as SCSI LUNs hanging off that HBA. The wire protocol uses ring buffers carrying SRB (SCSI Request Block) style commands. To scale, storvsc can open multiple sub-channels, one per host CPU, so that I/O completion interrupts and request submission spread across cores rather than serialising on a single VMBus channel.

This is also why Hyper-V's "Generation 2" VMs work. A Generation 2 VM [@ms-gen1-gen2-vms], introduced in Windows Server 2012 R2 in 2013, has no IDE controller in the boot path at all. UEFI loads the OS loader from a synthetic SCSI device, the OS loader hands off to the kernel, and the kernel binds storvsc to the same device. The legacy IDE emulator simply never runs. That removes a lot of attack surface and lets boot volumes grow up to 64 TB on VHDX.

Synthetic NIC: netvsc

netvsc is the synthetic NIC. The wire protocol historically wrapped Microsoft's NDIS-style RNDIS frames around payloads sent through the channel ring, which is why some Linux discussions mention "RNDIS frames over VMBus." The Linux driver lives in drivers/net/hyperv/ and the kernel netvsc documentation [@kernel-netvsc] describes how it can spread receive-side traffic across multiple VMBus subchannels via Receive Side Scaling.

netvsc is also the one device class where Hyper-V composes with hardware passthrough. Section 8 will take this apart in detail; for now, note that the same netvsc VSC can run alongside an SR-IOV virtual function in the guest, with netvsc acting as the slow-path failover and the VF carrying the steady-state traffic.

Synthetic input: vmbusrhid

The synthetic keyboard, the synthetic mouse, and a few related input streams ride on a server in the root partition called vmbusrhid (the name is shorthand for "VMBus relay HID"). It is a small surface in bytes, but architecturally it has the same shape as netvsc: guest-controllable messages parsed in kernel mode in the root partition. Anyone evaluating the trust boundary should treat it the same way as netvsc, even though the data rate is several orders of magnitude lower.

Note: A path that carries 100 keystrokes per second is, on the wire, almost free. As an attack surface, it is identical to a path that carries a million packets per second: both are guest-controlled bytes parsed by privileged code. Section 7 walks through why the security community treats vmbusrhid the way it treats vmswitch.sys.

Synthetic video: synthvid

synthvid is a synthetic framebuffer. It is what lets you connect to a Hyper-V VM through the Virtual Machine Connection client without dragging in an emulated VGA. It is intentionally simple: there is no 3D acceleration in the synthetic path. Workloads that need GPU acceleration use a different mechanism, vPCI / DDA, to assign a real GPU to the VM.

vPCI: synthetic PCI passthrough

The most subtle device class is pci-hyperv, which exposes a virtual PCIe topology to the guest. The Linux kernel vPCI document [@kernel-vpci] describes the trick: a passthrough device is offered to the guest initially over VMBus (the channel carries the device's PCI configuration space and BARs), and once the guest's vPCI driver has constructed a real PCI device object for it, the device dual-identifies as a normal PCIe device. The vendor driver can then load against it.

This is the mechanism behind both Hyper-V's Discrete Device Assignment (DDA) [@ms-dda] and Azure's Accelerated Networking, which we will return to in Section 8. The DDA planning document is explicit that Microsoft formally supports DDA for GPUs and NVMe storage as device classes; other PCIe devices are "likely to work" but require vendor support.

Generation-1 vs Generation-2: a quick decoder

Putting the device classes side by side clarifies why the move from Generation-1 to Generation-2 VMs simplified so much:

| Element | Generation-1 VM (legacy) | Generation-2 VM (since 2013) |
|---|---|---|
| Firmware | BIOS | UEFI with Secure Boot |
| Boot disk | Emulated IDE | Synthetic SCSI (`storvsc`) |
| Network on boot | Emulated DEC 21140 fallback | Synthetic NIC (`netvsc`) |
| Input | Emulated PS/2 + `vmbusrhid` | `vmbusrhid` only |
| Display | Emulated VGA + `synthvid` | `synthvid` only |
| Max boot VHDX | 2 TB | 64 TB |
| Source | Microsoft Learn: Gen 1 vs Gen 2 [@ms-gen1-gen2-vms] | Same |

Generation-2 is what the Hyper-V architecture wanted to be from the beginning: an all-synthetic stack with no fallback to imaginary 1990s chipsets. The two-generation existence was not a design preference; it was the cost of supporting older operating systems whose boot loaders only knew about BIOS and IDE. Today, every modern Windows and modern Linux supports Generation-2; Generation-1 remains for legacy guests.

Counting boundary crossings

The shape of the hot path is now visible. To send one network packet from a guest:

1. The guest writes one descriptor and one payload copy into the netvsc TX ring (one memory copy).
2. The guest possibly fires a doorbell (one hypercall, often suppressed if the host has not caught up).
3. The host's vmswitch.sys reaps the descriptor, parses it, and forwards it through the virtual switch to a real NIC.

A single packet's hot path is at most one hypercall and one memory copy in the guest, plus host-side ring traversal. Section 8's comparison table will quantify how this stacks up against virtio and SR-IOV, but the scale is clear: paravirt I/O on Hyper-V is orders of magnitude cheaper per packet than full PC emulation, and the gap closes only when you go all the way to hardware passthrough.

The catalogue is set. Now, who actually wrote the Linux side of all this?

5. Linux Integration Services: Microsoft writes Linux drivers

In December 2009, Microsoft did something quietly historic. Linux kernel 2.6.32 merged a set of drivers under drivers/staging/hv/, contributed by Microsoft itself, that taught the Linux kernel to be an enlightened Hyper-V guest. The kernel.org Hyper-V index page [@kernel-hyperv-index] is the maintained landing page for that work. Over the next several releases the drivers moved out of staging/, settled at drivers/hv/, drivers/net/hyperv/, drivers/scsi/storvsc_drv.c, and drivers/pci/controller/pci-hyperv.c, and became the default in every mainstream distribution.

That set of drivers is collectively called Linux Integration Services (LIS).

**Linux Integration Services (LIS)**: The set of in-kernel Hyper-V guest drivers that Microsoft contributes to upstream Linux. Includes `hv_vmbus` (the VMBus core), `hv_netvsc` (synthetic NIC), `hv_storvsc` (synthetic SCSI), `hv_utils` (KVP, time sync, shutdown, heartbeat, VSS), `pci-hyperv` (vPCI), and `hv_balloon` (memory ballooning). The same code that Microsoft maintains in the Linux tree powers Linux guests on Hyper-V on Windows Server, on Azure, and on developer Hyper-V on Windows 11.

The reason this matters is bigger than convenience. In 2009, Linux had a long, painful history with Hyper-V's competitors. VMware shipped open-vm-tools but the deepest paravirt drivers (VMXNET3, PVSCSI) lived in vendor packages. Xen's PV drivers existed in-tree but their evolution depended on Citrix and the Xen project. By contributing the full driver stack upstream and committing to keep it there, Microsoft chose a different route: they put the spec (the TLFS) and the implementation (LIS) in the open at the same time.

Microsoft did not just publish a hypervisor specification and hope Linux would adopt it. They wrote the Linux drivers themselves and upstreamed them, and then they kept doing it for fifteen years.

You can see the maintenance pattern in any current kernel. The drivers/hv/ directory has continuous commit activity from Microsoft engineers. Kernel-doc files like the VMBus [@kernel-vmbus], clocks [@kernel-clocks], vPCI [@kernel-vpci], overview [@kernel-hyperv-overview], and CoCo VM [@kernel-coco] pages are written by the same engineers who write the drivers. Several of those documents are the most lucid descriptions of the architecture that exist anywhere in public.

One unexpected consequence: the Linux kernel docs are often easier to read for the architecture than Microsoft's own customer-facing docs. The customer docs answer "how do I configure this?"; the kernel docs answer "what is actually happening?" When researching this article, I found that the cleanest single description of VMBus channel lifecycle is the Linux kernel doc, not the TLFS.

What "in-box" really means

Both major guests now ship VMBus support without any post-install step:

- On Windows, the VMBus client stack is built into every supported Windows version since Windows 7 / Windows Server 2008 R2. The legacy Integration Services package, which once shipped as an ISO you mounted into the VM, is no longer needed on supported Windows.
- On Linux, the drivers are in-tree from kernel 2.6.32 (December 2009) onward and ship in every mainstream distro.

The kernel.org Hyper-V overview document [@kernel-hyperv-overview] explicitly warns against installing legacy LIS packages on top of a kernel that already has the in-tree drivers: it can break MSI-X handling and PCI passthrough. This is the kind of operational footgun that survives precisely because the in-box answer is correct and the LIS package is a holdover from earlier kernels.

A practical smoke test

You can confirm a Linux guest is using its enlightenments without any vendor tooling. The kernel exposes cpuid leaves and Hyper-V detection through dmesg and through /sys. A small script makes it concrete:

    // This logic mirrors what `dmesg | grep -i hyperv` and a peek into
    // /sys/bus/vmbus/devices would tell you on a real Linux Hyper-V guest.
    // The values below are illustrative observations, not live probes.
    const guestObservations = {
      cpuidLeaf: 0x40000000,          // hypervisor CPUID leaf the guest queries
      cpuidVendorSig: 'Microsoft Hv', // vendor signature that leaf returns on Hyper-V
      guestOsIdMsr: 0x40000000,       // HV_X64_MSR_GUEST_OS_ID, written by the guest
      hypercallMsr: 0x40000001,       // HV_X64_MSR_HYPERCALL, holds the hypercall page address + enable bit
      vmbusModuleLoaded: true,
      netvscDevice: '/sys/class/net/eth0/device/driver',
      netvscDriverName: 'hv_netvsc',
      storvscModuleLoaded: true,
    };

    // A guest counts as enlightened only if detection and both data-path drivers check out.
    function isEnlightenedHyperVGuest(o) {
      if (o.cpuidVendorSig !== 'Microsoft Hv') return false;
      if (!o.vmbusModuleLoaded) return false;
      if (o.netvscDriverName !== 'hv_netvsc') return false;
      if (!o.storvscModuleLoaded) return false;
      return true;
    }

    console.log(
      isEnlightenedHyperVGuest(guestObservations)
        ? 'Yes: Hyper-V enlightened, using netvsc + storvsc'
        : 'No: running on emulated PC hardware or non-Hyper-V hypervisor'
    );

The point is not the script itself (anyone can write a few lines of awk against dmesg); it is that the verification surface is public. The CPU vendor signature, the MSRs, the kernel module names, the /sys paths are all documented. There is nothing to reverse-engineer.

Why this earned trust

Two pieces of practical evidence persuaded the Linux community that LIS was not a strategic trap:

- The drivers stayed upstream. From 2009 to the present, Microsoft has maintained the drivers/hv/ tree, responded to maintainer feedback, and shipped patches through the normal kernel process.
- The TLFS stayed accurate. Successive Hyper-V releases either matched what the TLFS said or updated the TLFS. There was no second, secret protocol.

The combination put Microsoft in the unusual position of being the most open hypervisor vendor for Linux guest support. (VirtIO on KVM has a richer cross-vendor story; that comparison is Section 8.) This open posture is also what set up the 2024 OpenVMM open-sourcing as a credible move rather than a stunt.

But before we get to OpenVMM, we need to look at a different way Hyper-V matters: not just as a substrate for VMs, but as a substrate for in-VM security boundaries inside Windows itself.

6. VBS and HVCI: Hyper-V as the trust anchor inside Windows

Up to this point the article has treated Hyper-V as a virtualization product: a thing that hosts VMs. Starting in Windows 10 and Windows Server 2016 [@ms-server-2016], Microsoft began using the same hypervisor for a different job: enforcing security boundaries inside a single OS install. The umbrella name is Virtualization-Based Security (VBS).

The mechanism is simple in description and subtle in consequences. The hypervisor splits a single guest's address space into two Virtual Trust Levels (VTLs). The lower one, VTL0, runs the normal Windows kernel and user mode (this is where explorer.exe and your browser live). The higher one, VTL1, runs a much smaller stack called the Secure Kernel plus a set of isolated user-mode services called trustlets. A compromise of VTL0, even of ntoskrnl.exe, cannot read or write VTL1 memory because the hypervisor enforces that boundary using the same hardware machinery (Intel EPT / AMD NPT, plus Intel VT-d / AMD-Vi for DMA) that it uses to isolate one VM from another.

**Virtual Trust Level (VTL)**: A Hyper-V construct that partitions a single guest's address space into multiple privilege tiers enforced by the hypervisor. VTL0 hosts the normal kernel and user mode; VTL1 hosts the Secure Kernel and trustlets. The hypervisor presents each VTL with its own separate set of memory mappings, system registers, and interrupt state, so code running at VTL0 cannot read VTL1's memory even if it runs with NT AUTHORITY\SYSTEM privilege.

flowchart TD
    HV["Hyper-V hypervisor"]
    subgraph Guest["A single Windows guest"]
        subgraph VTL0["VTL0 (normal world)"]
            User0["User mode: apps"]
            Kernel0["NT kernel"]
        end
        subgraph VTL1["VTL1 (secure world)"]
            SK["Secure Kernel"]
            Trustlets["Trustlets: LSAIso, BioIso, ..."]
        end
    end
    HV --> Guest
    HV -. "EPT + IOMMU enforcement" .-> VTL0
    HV -. "EPT + IOMMU enforcement" .-> VTL1
    Kernel0 -. "VTL switch (hypercall)" .-> SK

What lives in VTL1

The flagship inhabitant of VTL1 is Hypervisor-protected Code Integrity (HVCI), which moves kernel-mode code integrity checks, and the decision about which kernel pages may become executable, into the Secure Kernel. With HVCI on, no VTL0 driver can mark a kernel page as both writable and executable; the Secure Kernel mediates the page tables and refuses the request. The result is that attackers who already have code execution in the NT kernel cannot trivially load arbitrary unsigned kernel code or build new executable JIT pages on the fly.
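The policy the Secure Kernel enforces can be reduced to a toy predicate. This is a sketch of the rule as the paragraph above states it, not of any real interface: the request shape, the `imageVerified` flag, and the function name are all hypothetical.

```javascript
// Decide whether a VTL0 request to change a kernel page's protection is allowed.
// `imageVerified` stands in for "this page belongs to code that passed integrity checks".
function secureKernelAllows({ writable, executable, imageVerified }) {
  if (writable && executable) return false;        // never writable and executable at once
  if (executable && !imageVerified) return false;  // no executing unverified code
  return true;
}

const requests = [
  { name: 'data page', writable: true, executable: false, imageVerified: false },
  { name: 'signed driver text', writable: false, executable: true, imageVerified: true },
  { name: 'JIT-style RWX page', writable: true, executable: true, imageVerified: true },
  { name: 'unsigned shellcode', writable: false, executable: true, imageVerified: false },
];

for (const request of requests) {
  console.log(`${request.name}: ${secureKernelAllows(request) ? 'allowed' : 'refused'}`);
}
```

What makes the rule meaningful is where the predicate runs: in the real system the decision is made in VTL1, so VTL0 code cannot patch it out.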

The other tenants of VTL1 are trustlets. The most familiar is lsaiso.exe (LSA Isolation), which holds the cached domain credentials that historically lived in lsass.exe and were the prime target for tools like Mimikatz. With Credential Guard on, those secrets move to a trustlet whose memory is unreadable from VTL0; even SYSTEM-level malware in the normal world cannot extract them. Other trustlets handle biometric template storage, key isolation for code integrity policy, and similar small, security-sensitive workloads.

Why the hypervisor is the right place for this

Putting these protections inside the hypervisor rather than inside the kernel has a property that no in-kernel mitigation can match: the protected component does not share an address space with the attacker. A defence built inside ntoskrnl.exe (PatchGuard, KASLR, control-flow guard) lives in the same memory the attacker is trying to corrupt. A defence built inside VTL1 lives in memory the attacker cannot touch, because the page tables that map it are themselves invisible from VTL0.

Note: Pre-VBS Windows had decades of memory-safety bugs in the NT kernel. After VBS, exploiting one of those bugs no longer immediately yields the attacker the ability to read LSASS secrets or load arbitrary kernel code. The attacker now needs a second bug, in the much smaller Secure Kernel codebase. The defender's effective budget went up by a large multiplier without rewriting a single line of NT.

How this connects back to VMBus

VBS would not be possible without the work the previous sections described. The Secure Kernel is what runs in VTL1; it needs to communicate with VTL0 for ordinary system services (the lsaiso.exe process must respond to authentication requests from VTL0 callers, the HVCI mediator must answer page-table requests, and so on). The signalling and shared-memory primitives that make those calls cheap are the same SynIC and shared-page primitives that VMBus uses between partitions.

In other words, the architecture Microsoft built in 2008 to give a Windows VM a fast network card became, in 2016, the architecture that gives a single Windows install a security boundary stronger than its own kernel. The same hypervisor, the same trust-mediation primitives, two completely different applications.

Windows Server 2019 [@ms-server-2019] extended this further with Hyper-V isolation for containers, where a container's lightweight VM gets its own kernel inside a tiny VTL0 of its own. The pattern is consistent: every time Windows wanted a stronger isolation primitive, the answer was "use the hypervisor."

This dual-use is the reason a serious Windows security review touches the Hyper-V codebase even on machines that nobody thinks of as virtualization hosts. A Hyper-V escape (a guest-to-host VMBus exploit) is not just "an exploit against Azure"; it is also, on a typical Windows 11 desktop with VBS enabled, an exploit against the boundary that protects LSASS secrets from kernel-mode malware.

That makes the next section's question urgent: how strong is the VMBus boundary, in practice?

7. VMBus security: every message is a parser at the trust boundary

Here is the part of the architecture worth being honest about. The same property that makes VMBus fast, namely that the host-side VSP runs in the root partition's kernel and parses guest-supplied bytes directly, also makes the VSP the most consequential piece of attack surface in the entire stack. Microsoft itself prices it that way: the Hyper-V Bug Bounty Program [@ms-bounty-hyperv] pays up to USD 250,000 specifically for guest-to-host escapes that hit this surface, which is among the highest payouts Microsoft offers for any category of vulnerability.

Key idea: Every byte that crosses a VMBus channel from a guest is a byte that a kernel-mode parser in the most privileged partition on the host has to interpret. The performance argument for a software data plane and the security argument against it are the same argument, looked at from opposite directions.

The historical record

Three CVEs make the pattern concrete:

- CVE-2017-0075 is a Hyper-V guest-to-host escape patched in March 2017. The NVD entry [@nvd-cve-2017-0075] describes it as a Hyper-V flaw that "allows guest OS users to execute arbitrary code on the host OS via a crafted application." The reachable code was in a VMBus message handler on the host side.
- CVE-2021-28476 is the canonical example. The NVD record [@nvd-cve-2021-28476] classifies it as a critical Hyper-V remote code execution vulnerability with a CVSS score of 9.9. The Akamai writeup with Guardicore and SafeBreach [@akamai-cve-2021-28476] traces the bug to vmswitch.sys, the synthetic-NIC VSP, and shows it had been present in production since the August 2019 vmswitch build. The exploit primitive is exactly what the architecture invites: a guest crafts an OID-style RNDIS request, sends it through the netvsc VMBus channel, and the host's kernel parser fails to validate the request before dereferencing a pointer, producing a crash or memory corruption in the most privileged kernel on the box.
- CVE-2024-21407 is a more recent Hyper-V remote code execution vulnerability patched in March 2024 (NVD [@nvd-cve-2024-21407]). Its existence demonstrates that the bug class did not vanish; the same shape (guest-controlled message, host kernel parser, escalation to host code execution) keeps reappearing.

Payouts listed on the MSRC bounty page range from USD 5,000 for low-impact bugs to USD 250,000 for full guest-to-host escapes (Microsoft bounty page [@ms-bounty-hyperv]). That price point is not a marketing number; it is Microsoft signalling what its threat model says these bugs are worth. A defender pricing their own controls should give any VSP code path that parses guest-controlled data the same level of attention as remote internet-facing services.

Why the bug class is structural

The pattern in all three CVEs is the same:

A guest writes carefully crafted bytes into a VMBus channel ring.
The guest fires the doorbell.
The host's VSP, running in the root partition's kernel, dequeues the message.
The VSP parses the message in C or C++ kernel code.
A memory-safety mistake (length confusion, missing bounds check, integer overflow) becomes a write or read primitive in the host kernel.

There is no exotic mechanism here. The exploit surface is "kernel C code parsing untrusted input," which has been the dominant source of remote-code-execution bugs in operating systems since the 1990s. The novelty is the location: the parser runs inside the root partition's kernel, the most privileged supervisor on the box, with access to every other tenant's memory.

sequenceDiagram
  participant Mal as Malicious guest VM
  participant Ring as VMBus ring (shared memory)
  participant SInt as Synthetic Interrupt Controller
  participant VSP as Host VSP (e.g., vmswitch.sys, kernel)
  Mal->>Ring: Write crafted RNDIS-style message
  Mal->>SInt: Hypercall: signal channel event
  SInt-->>VSP: SINT delivered on host CPU
  VSP->>Ring: Read message header
  note over VSP: Length confusion / missing bounds check
  VSP->>VSP: Out-of-bounds write in root partition kernel
  note over VSP: Result: arbitrary code in the most privileged partition
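The length-confusion step in that pattern can be made concrete with a minimal sketch. The message layout below (a 4-byte type and a 4-byte declared payload length, both little-endian, followed by the payload) is hypothetical and is not the RNDIS or VMBus wire format, and `MAX_PAYLOAD` is an assumed limit. The only point is the difference between trusting a guest-declared length and checking it against the bytes that actually arrived.

```python
import struct

HEADER = struct.Struct("<II")  # hypothetical header: msg_type, declared_len
MAX_PAYLOAD = 4096             # assumed per-message limit for this sketch


class MalformedMessage(ValueError):
    """Raised when a guest-supplied message fails validation."""


def parse_trusting(buf: bytes):
    """Parser shape behind the bug class: it believes declared_len.

    In C this length would drive a copy or an index computation. Python
    slicing silently truncates instead, so nothing is corrupted here, but
    the caller still receives less data than the header promised.
    """
    msg_type, declared_len = HEADER.unpack_from(buf, 0)
    return msg_type, buf[HEADER.size:HEADER.size + declared_len]


def parse_validating(buf: bytes):
    """Check every guest-controlled length before using it."""
    if len(buf) < HEADER.size:
        raise MalformedMessage("short header")
    msg_type, declared_len = HEADER.unpack_from(buf, 0)
    if declared_len > MAX_PAYLOAD:
        raise MalformedMessage("declared length exceeds limit")
    if declared_len != len(buf) - HEADER.size:
        raise MalformedMessage("declared length does not match bytes received")
    return msg_type, buf[HEADER.size:]


if __name__ == "__main__":
    good = HEADER.pack(1, 3) + b"abc"
    evil = HEADER.pack(1, 0xFFFF) + b"abc"  # claims 65535 bytes, sends 3
    print(parse_validating(good))
    try:
        parse_validating(evil)
    except MalformedMessage as exc:
        print("rejected:", exc)
```

A memory-safe language turns the trusting parser's mistake into a truncation or an exception rather than a write primitive, which is the distinction Section 9 returns to.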

Mitigations short of a rewrite

Microsoft's first line of defence is the same one every kernel team uses: ASLR, control-flow integrity, kernel hardening, fuzzing the parsers, code review of every new device class, and, on Azure specifically, isolating each tenant's compute hypervisor so a single compromised host does not become a multi-tenant disaster. The MSRC bounty program is partly a procurement mechanism for this same effort: pay researchers to find and report bugs before attackers find them in the wild.

A second line of defence is Generation-2 VMs (Microsoft Learn [@ms-gen1-gen2-vms]), which remove the legacy emulators (IDE, PS/2, PIC) from the host data path entirely. Every emulator removed is one fewer parser in the most privileged kernel.

A third is the "minimise root-partition exposure" guidance on the Microsoft Hyper-V architecture page [@ms-hyperv-architecture-perf]: configure hosts with the smallest set of root-partition services that the workload requires, since every service is potential attack surface.

These all help, but none of them change the structural fact that VSPs parse guest-controlled data in C/C++ kernel code. The next architectural shift, the one that does change that fact, is what Section 9 is about.

Side channels and the Spectre era

VMBus also has to defend against side-channel attacks across the partition boundary. The same Spectre / Meltdown / L1TF mitigations that apply to a multi-tenant hypervisor in general apply to Hyper-V specifically. Microsoft's broader hypervisor mitigation strategy interacts with VMBus mostly indirectly: the SynIC, the hypercall page, and the timer subsystem all needed audit and adjustment when these classes of attacks emerged. The detail is largely outside the scope of an article about the device model, but the takeaway is consistent with the rest of this section: any shared CPU resource between partitions is a potential attack surface, and "shared via the hypervisor's bus" is no exception.

The structural answer to all of this, the one Microsoft itself has been working toward, is to change the languages and the trust boundaries. To set that up, the next section first widens the field by comparing VMBus to its peer in the KVM world, virtio.

8. VMBus vs virtio: two answers to the same question

Hyper-V is not the only hypervisor with a paravirt I/O story. The KVM world evolved its own answer to the same problem at roughly the same time, and it ended up with a different design with different trade-offs. The standard is virtio.

The original virtio paper, Rusty Russell's "virtio: Towards a De-Facto Standard For Virtual I/O Devices" [@rusty-virtio-paper], was published at OLS 2008, the same year Hyper-V shipped. The proposal was explicit in its motivation: every hypervisor was reinventing paravirt drivers, and a single hypervisor-independent specification could let one guest driver work everywhere. OASIS later standardised virtio 1.0 in 2016, then virtio 1.1 in 2019 [@oasis-virtio-1-1], then virtio 1.2 as a Committee Specification in 2022 [@oasis-virtio-1-2].

virtio: A hypervisor-independent paravirtual I/O specification, governed by OASIS. A virtio device is presented to the guest over a transport (PCI, MMIO, or s390 channel I/O) that advertises capability bits. The data plane is a generic ring layout called a **virtqueue**: a ring of descriptors, an `avail` ring (guest-to-host), and a `used` ring (host-to-guest). Each device class (virtio-net, virtio-blk, virtio-scsi, virtio-fs, virtio-gpu) defines its own message format on top of virtqueues.
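To make the three-part virtqueue layout concrete, here is a toy in-memory model, offered as a sketch rather than a conforming implementation. The descriptor fields (`addr`, `len`, `flags`, `next`) and the idea of an available index and a used index follow the split-virtqueue layout in the virtio specification. Everything else is a simplification for illustration: the class and method names are hypothetical, Python lists stand in for guest memory, descriptor chaining is ignored, and 16-bit index wrap-around, memory barriers, and notification suppression are all omitted.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    addr: int = 0   # guest-physical address of the buffer
    len: int = 0    # buffer length in bytes
    flags: int = 0  # unused in this sketch
    next: int = 0   # chaining is not modelled here


class ToyVirtqueue:
    def __init__(self, size):
        self.size = size
        self.desc = [Descriptor() for _ in range(size)]
        self.avail_ring = [0] * size       # guest-to-host: descriptor heads
        self.avail_idx = 0                 # written only by the driver
        self.used_ring = [(0, 0)] * size   # host-to-guest: (head, bytes written)
        self.used_idx = 0                  # written only by the device
        self._free = list(range(size))     # driver-private free list
        self._last_avail = 0               # device-private cursor
        self._last_used = 0                # driver-private cursor

    def driver_add(self, addr, length):
        """Guest side: describe a buffer and publish it in the avail ring."""
        if not self._free:
            raise BufferError("virtqueue full")
        head = self._free.pop()
        self.desc[head] = Descriptor(addr=addr, len=length)
        self.avail_ring[self.avail_idx % self.size] = head
        self.avail_idx += 1
        return head

    def device_poll(self, handler):
        """Host side: consume everything published since the last poll."""
        handled = 0
        while self._last_avail != self.avail_idx:
            head = self.avail_ring[self._last_avail % self.size]
            self._last_avail += 1
            written = handler(self.desc[head])
            self.used_ring[self.used_idx % self.size] = (head, written)
            self.used_idx += 1
            handled += 1
        return handled

    def driver_reclaim(self):
        """Guest side: collect completions and free their descriptors."""
        done = []
        while self._last_used != self.used_idx:
            head, written = self.used_ring[self._last_used % self.size]
            self._last_used += 1
            self._free.append(head)
            done.append((head, written))
        return done
```

One call to `device_poll` after several `driver_add` calls corresponds to a single doorbell covering a batch of requests.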

The same shape, viewed sideways

Architecturally, virtio and VMBus are sibling answers to the same shaped problem.

flowchart LR
  subgraph virtio_pci["virtio over PCI"]
    gv["Guest virtio driver"]
    vq["virtqueue (descriptors + avail + used)"]
    host_be["Host backend (vhost-net, vhost-user, OpenVMM)"]
    gv -- "PIO doorbell write" --> host_be
    gv -- "shared memory" --- vq
    host_be -- "shared memory" --- vq
    host_be -- "MSI-X" --> gv
  end
  subgraph vmbus["Hyper-V VMBus"]
    gv2["Guest VSC"]
    ring["Two ring buffers + GPADL"]
    vsp["Host VSP (kernel)"]
    gv2 -- "Hypercall doorbell" --> vsp
    gv2 -- "shared memory" --- ring
    vsp -- "shared memory" --- ring
    vsp -- "SINT" --> gv2
  end

Both:

Use shared-memory rings for payload.
Use a doorbell for signalling.
Batch many requests per doorbell so the per-message signalling cost amortises.
Have per-class device protocols layered on top of a common transport.

Note: The phrase "shared-memory rings" hides a small subtlety: a ring buffer is a circular buffer with separate read and write indices. With one producer and one consumer, the two can run concurrently as long as each writes only its own index, which is what makes ring buffers a wait-free communication primitive on cache-coherent hardware.
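The index discipline behind a shared-memory ring, and the way one doorbell can cover a batch, can be sketched as follows. This is an illustration of the idea only: `SpscRing` and `send_batch` are hypothetical names, the ring is an ordinary Python list rather than shared memory, and a real implementation would need atomic index updates and memory barriers that this sketch does not model.

```python
class SpscRing:
    """Single-producer, single-consumer ring with free-running indices."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._capacity = capacity
        self._write = 0  # advanced only by the producer
        self._read = 0   # advanced only by the consumer

    def try_push(self, item):
        """Producer side. Returns False when the ring is full."""
        if self._write - self._read == self._capacity:
            return False
        self._slots[self._write % self._capacity] = item
        self._write += 1  # publish only after the slot is filled
        return True

    def try_pop(self):
        """Consumer side. Returns (True, item) or (False, None) if empty."""
        if self._read == self._write:
            return False, None
        item = self._slots[self._read % self._capacity]
        self._read += 1
        return True, item


def send_batch(ring, items, ring_doorbell):
    """Push as many items as fit, then signal the other side once."""
    sent = 0
    for item in items:
        if not ring.try_push(item):
            break
        sent += 1
    if sent:
        ring_doorbell()  # one signal for the whole batch
    return sent


if __name__ == "__main__":
    signals = []
    ring = SpscRing(4)
    print(send_batch(ring, ["a", "b", "c"], lambda: signals.append(1)))
    print(len(signals), ring.try_pop())
```

The producer never writes `_read` and the consumer never writes `_write`, which is the property the note above refers to.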

The differences are where the world bites:

| Dimension | VMBus | virtio (1.2) |
| --- | --- | --- |
| Transport | Software-only "bus", channel offer/open/close | PCI, MMIO, s390 channel I/O |
| Doorbell | Hypercall (`HvCallSignalEvent`) | Write to a notify register in a PCI BAR (PIO or MMIO) |
| Reverse signal | Synthetic interrupt (SINT) | MSI-X |
| Standardisation | Microsoft-owned, Open Specification Promise [@ms-tlfs] | OASIS-ratified, multi-vendor |
| Windows in-box drivers | Yes, every supported version | No; out-of-box signed VirtIO INFs from cloud vendors |
| Device classes beyond I/O | Yes: KVP, time sync, VSS, balloon | Limited; non-I/O often built on virtio-vsock or out-of-band agents |
| Cross-hypervisor portability | Hyper-V only | Universal: KVM, QEMU, Cloud Hypervisor, Firecracker, Xen HVM, OpenVMM |
| Spec governance | Single vendor under OSP | Multi-vendor with formal conformance clauses |
| Source for Linux side | drivers/hv/ [@kernel-hyperv-index] | drivers/virtio in the Linux tree |

Where each design wins

Virtio's strongest claim is portability. The same Linux guest VM image, with the same in-tree virtio drivers, runs on KVM, QEMU, Cloud Hypervisor, AWS Firecracker, and (since 2024) Microsoft's own OpenVMM, which added virtio backend support. A workload that has to move between cloud providers benefits from this directly: the guest does not need a different driver stack per host.

Virtio also has a richer multi-vendor governance story. The spec is OASIS-ratified, with explicit conformance clauses; multiple commercial hypervisors implement it; multiple SmartNIC vendors implement virtio data planes in hardware (the vDPA and VDUSE work, described by Red Hat [@redhat-vdpa] and the Linux kernel VDUSE doc [@kernel-vduse]).

VMBus's strongest claim is integration. Every supported Windows ships with the VSCs in-box; there is nothing for an admin to install. The transport carries not just I/O but a service catalogue: KVP for guest configuration, time sync, VSS for online backup, the heartbeat and shutdown channels. The TLFS, while owned by Microsoft, is published under the Open Specification Promise and is a single document a guest author can read end-to-end.

This is why "VirtIO drivers for Windows" exist as a separate project (the Fedora/Red Hat-signed virtio-win package) for KVM clouds: out of the box, Windows does not know virtio. The Hyper-V world inverts the problem: out of the box, Linux does not need any third-party install because the drivers are upstream.

Where they coexist

The most interesting recent development is that the two camps have stopped being purely competitive. Microsoft's OpenVMM [@github-openvmm] implements both VMBus and virtio backends, so a Linux guest using virtio drivers can run on a Microsoft-developed VMM, and a Windows guest using VMBus drivers can run on the same VMM. This is partially ideological (Microsoft is no longer pretending its way is the only way) and partially pragmatic (a single VMM that supports both transports is simpler than maintaining two).

Beyond the protocol-level comparison, both VMBus and virtio sit inside a larger composition with hardware passthrough, where the transport becomes the slow path and a real PCIe device carries the steady-state traffic.

Hardware passthrough as a complement

The composition that runs almost every modern Azure VM is VMBus + SR-IOV, packaged as Accelerated Networking [@ms-accelerated-networking]. The same VM gets both a synthetic NIC (netvsc over VMBus) and an SR-IOV virtual function. The Linux netvsc driver documentation describes the failover mechanic: "If SR-IOV is enabled in both the vSwitch and the guest configuration, then the Virtual Function (VF) device is passed to the guest as a PCI device. In this case, both a synthetic (netvsc) and VF device are visible in the guest OS and both NIC's have the same MAC address. The VF is enslaved by netvsc device. The netvsc driver will transparently switch the data path to the VF when it is available and up." (Linux kernel: netvsc [@kernel-netvsc]).

When live migration starts, Azure revokes the VF, the data plane falls back to the netvsc/VMBus path, the VM moves, and a new VF on the destination host gets re-attached, all without dropping TCP connections. The VMBus path was never the production hot path, but its existence is what enables migration. The KVM world's analogue is vDPA, which gives a virtio-shaped guest interface backed by a hardware data plane.
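The failover behaviour described above can be modelled as a very small state machine. This is a toy sketch, not the Linux `netvsc` driver: the class and method names are hypothetical, and the only rule it encodes is the one quoted from the kernel documentation, namely that the data path uses the VF when it is available and up and otherwise stays on the synthetic path.

```python
from enum import Enum


class Path(Enum):
    SYNTHETIC = "netvsc over VMBus"
    VF = "SR-IOV virtual function"


class ToyFailoverNic:
    """Toy model of a synthetic NIC paired with an optional VF."""

    def __init__(self):
        self.vf_present = False
        self.vf_up = False

    def attach_vf(self):
        self.vf_present = True

    def set_vf_link(self, up):
        # A link state only means something while a VF is attached.
        self.vf_up = bool(up) and self.vf_present

    def revoke_vf(self):
        self.vf_present = False
        self.vf_up = False

    @property
    def data_path(self):
        if self.vf_present and self.vf_up:
            return Path.VF
        return Path.SYNTHETIC


def migrate(nic):
    """Record the data path before, during, and after a migration."""
    trace = [nic.data_path]
    nic.revoke_vf()                 # source host takes the VF away
    trace.append(nic.data_path)     # traffic continues on the synthetic path
    nic.attach_vf()                 # destination host offers a new VF
    nic.set_vf_link(True)
    trace.append(nic.data_path)
    return trace


if __name__ == "__main__":
    nic = ToyFailoverNic()
    nic.attach_vf()
    nic.set_vf_link(True)
    print([p.name for p in migrate(nic)])
```

The guest-visible interface never changes during the trace; only the path behind it does, which is why connections can survive the move.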

A modern Azure NIC stack is pushing this even further. Azure Boost [@ms-azure-boost] moves both storage and networking data planes off the host CPU into dedicated FPGAs, with a stable Microsoft-engineered NIC interface called MANA [@ms-mana]. Microsoft's documentation reports up to 200 Gbps of network bandwidth and 6.6 million IOPS on local storage with this design, with the host's vmswitch still acting as the live-migration fallback path. The architectural insight is that the VMBus-based slow path is the durable invariant; what changes is whether the steady-state data plane is software, an SR-IOV VF, or a SmartNIC firmware path. Frameworks like DPDK [@dpdk-about] sit on top of whichever data plane the VM exposes.

What none of this changes is the property Section 7 cared about: as long as a host-side VSP exists and parses guest-controlled bytes in kernel C/C++, the bug class is open. The next section is about the architectural move that closes it.

9. OpenVMM and OpenHCL: the 2024 open-source pivot

In 2024, Microsoft did two things that would have been hard to imagine a decade earlier. First, they open-sourced OpenVMM [@github-openvmm], a Rust implementation of the virtualization stack including the VSPs and the VMBus protocol. Second, they introduced OpenHCL [@ms-openhcl-deep-explainer], a "paravisor" configuration of OpenVMM that runs inside a confidential VM as a higher-trust mediator between the workload and the (now-untrusted) host.

Both moves are explained by the same trend the article has been circling: confidential computing fundamentally inverts the trust boundary, and the device model has to follow.

Paravisor: A higher-privileged software layer that runs *inside* a guest VM (not on the host) and mediates the guest's interaction with the hypervisor. In the Hyper-V model, a paravisor lives in VTL2 of the same VM whose workload runs in VTL0; the host hypervisor is outside the VM's trust boundary. The paravisor presents the workload with a familiar VMBus + VSP interface while internally talking to a hardware-isolated confidential VM substrate (AMD SEV-SNP or Intel TDX).

What changed in confidential computing

The classical Hyper-V trust model places the root partition at the apex of trust. The guest trusts the host. Memory the guest writes is, in the worst case, readable by the host. In confidential computing, that is no longer acceptable. A regulated workload (a healthcare database, a financial processor) needs to run in a VM whose contents are protected even from a malicious or compromised hypervisor. AMD's SEV-SNP and Intel's TDX are CPU features that encrypt and integrity-protect VM memory in hardware so that a compromised host cannot read the guest's secrets.

Azure Confidential Computing [@ms-confidential-computing] made these capabilities available as a product starting around 2022. The Azure confidential VM options page [@ms-coco-vm-options] documents the SKUs.

This breaks the old VMBus story. In the classical model, the host's vmswitch.sys reads the guest's network packets out of the VMBus ring. In a confidential VM, you can no longer let the host see those bytes; doing so would defeat the entire point of the protection. So the question becomes: where does the synthetic-device backend live, if not in the host?

The paravisor answer

The Linux kernel's Hyper-V CoCo VMs document [@kernel-coco] describes the design directly: "Paravisor mode. In this mode, a paravisor layer between the guest and the host provides some operations needed to run as a CoCo VM. The guest operating system can have fewer CoCo enlightenments than is required in the fully-enlightened case ... some aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS must be enlightened for other aspects."

OpenHCL is that paravisor. It runs in a higher-trust virtual trust level inside the same confidential VM (VTL2), it has access to the encrypted-memory primitives the CPU provides, and it presents the workload (in VTL0) with the same VMBus + VSP world a non-confidential VM would see. The workload OS does not need to be heavily modified; it sees what looks like Hyper-V, talks to what look like normal VSPs, and never has to know that those VSPs are now inside its own VM rather than on the host.

flowchart TD
  HW["Confidential CPU (SEV-SNP / TDX)"]
  HV["Host hypervisor (untrusted by the workload)"]
  subgraph CoCoVM["Confidential VM (memory encrypted)"]
    VTL2["VTL2: OpenHCL paravisor (Rust VSPs)"]
    VTL0["VTL0: workload OS (Windows or Linux, lightly enlightened)"]
    VTL0 -- "VMBus, looks normal" --- VTL2
  end
  HW --> HV
  HV --> CoCoVM
  HV -. "no access to guest plaintext" .-> CoCoVM

The Rust rewrite

The other half of the story is memory safety. Recall Section 7's CVE list: every headline Hyper-V escape in the past decade involved a parser bug in C/C++ kernel code. OpenVMM's choice to implement the entire VMM, including the VSPs, in Rust is a direct response to that history. Rust's ownership model rules out, by construction, a large class of memory-safety bugs (use-after-free, out-of-bounds access on slices, double-free) that produced those CVEs.

This does not magically eliminate every vulnerability. A logic bug in a state machine, an integer-overflow on a length field, a side-channel timing leak: all of these still exist in Rust. But the categories that produced CVE-2017-0075, CVE-2021-28476, and CVE-2024-21407 are exactly the categories Rust was designed to make hard.

Garbage-collected languages are wrong for a kernel-mode parser: GC pauses are unacceptable in a hypervisor-adjacent fast path, and you cannot afford a runtime that allocates memory during interrupt handling. Rust's compile-time memory safety with no GC is, today, the only mature option that gives you both the safety and the predictability a VSP needs. Microsoft's choice is consistent with the rest of the industry; comparable pieces of low-level systems infrastructure (Cloudflare's `quiche`, Mozilla's `neqo`, the Android Bluetooth stack) have all converged on Rust.

What you can actually look at

OpenVMM is not a press release; it is a public repository that ships:

The full Rust source tree at github.com/microsoft/openvmm [@github-openvmm].
A separate repository for the Linux kernel fork that the paravisor runs on top of, at github.com/microsoft/OHCL-Linux-Kernel [@github-ohcl-linux].
Project documentation centred at openvmm.dev [@openvmm-dev].
Both VMBus and virtio backends, so the same VMM can host Windows guests on VMBus and Linux guests on virtio.
Documentation through the deeper Microsoft Tech Community explainer [@ms-openhcl-deep-explainer] and the original announcement [@ms-openhcl-announce] describing the paravisor's role.

For a security researcher or a regulated-cloud customer, this is a meaningful change. For the first time, the VMBus + VSP stack is auditable end-to-end in source.

If you want to see how a VSP actually consumes a channel, the OpenVMM repository contains the Rust modules that implement the VMBus channel state machine. Cloning the repo and grepping for `Channel::open` and `RingBuffer` shows the same offer/open/close/rescind pattern Section 3 described, expressed in Rust types whose lifetimes the compiler checks. Reading the same logic in Rust after reading the Linux C version in `drivers/hv/channel_mgmt.c` is a useful exercise; the abstraction is identical, and the safety guarantees diverge.
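As a language-neutral illustration of the offer/open/close/rescind pattern (written in Python, and not taken from the OpenVMM or Linux sources), a channel lifecycle can be expressed as an explicit transition table. The states and allowed transitions below are a simplified assumption made for this sketch: a channel that has been offered can be opened, an open channel can be closed and reopened, and a rescind from either state is terminal.

```python
from enum import Enum, auto


class State(Enum):
    OFFERED = auto()
    OPEN = auto()
    RESCINDED = auto()


# Hypothetical, simplified transition table for one channel.
TRANSITIONS = {
    (State.OFFERED, "open"): State.OPEN,
    (State.OPEN, "close"): State.OFFERED,
    (State.OFFERED, "rescind"): State.RESCINDED,
    (State.OPEN, "rescind"): State.RESCINDED,
}


class ProtocolError(Exception):
    """An event arrived that the current state does not permit."""


def step(state, event):
    """Apply one event, rejecting anything not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ProtocolError(
            f"{event!r} is not valid in state {state.name}"
        ) from None


def run(events, state=State.OFFERED):
    """Apply a sequence of events, starting from a freshly offered channel."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    print(run(["open", "close", "open", "rescind"]).name)
    try:
        run(["open", "rescind", "open"])
    except ProtocolError as exc:
        print("rejected:", exc)
```

Writing the table down explicitly is what makes unexpected sequences, such as an open after a rescind, fail loudly instead of reaching code that assumed they could not happen.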

What still has to be solved

The kernel CoCo doc is candid about an open architectural problem that OpenHCL alone cannot solve: "Unfortunately, there is no standardized enumeration of feature/functions that might be provided in the paravisor, and there is no standardized mechanism for a guest OS to query the paravisor for the feature/functions it provides. The understanding of what the paravisor provides is hard-coded in the guest OS." (Linux kernel: CoCo VMs [@kernel-coco]).

In other words, the TLFS gave us a portable contract between guests and Hyper-V hypervisors. The paravisor world does not yet have an equivalent portable contract between guests and paravisors. Today's guests have OpenHCL-specific knowledge baked in. A future "paravisor TLFS" would let any compliant paravisor host any compliant guest, the same way the original TLFS did for the hypervisor. That standard does not exist yet, and writing it is the most consequential open problem in this corner of the architecture.

The architecture is moving. Section 10 takes stock of what that means for engineers building or operating on this stack today.

10. Engineering takeaways and open problems

A working architecture is one where the trade-offs are visible. Hyper-V's enlightenments + VMBus + VSP/VSC stack is a working architecture in exactly that sense: every property it has, including the security ones, is a consequence of design choices a reader can name.

What the design optimises for

Three explicit optimisations:

In-box drivers for closed-source guests. Hardware virtualization handles privileged CPU instructions; the guest only needs to load a VMBus client driver to opt in to the fast path. Every supported Windows ships those drivers in-box. Every modern Linux ships them in-tree. There is no "install paravirt drivers" step, which is a large reason "it just works."
A single transport that carries everything. VMBus carries 12+ device classes plus non-device services (KVP, time sync, VSS, balloon, heartbeat). One protocol, one set of primitives, one debugging surface. This is the engineering equivalent of "everything is a file" applied to inter-partition communication.
Live migration. Because the data plane is software in the root partition, the VM is not bound to a specific host. The VSPs serialise their state during migration without guest cooperation. This is the property that makes VMBus the durable invariant under hardware-passthrough acceleration: SR-IOV gives you throughput; VMBus gives you mobility.

What it pays for those properties

Two costs:

The host CPU is on the data plane. A software ring serviced by vmswitch.sys cannot match a 100 GbE NIC's line rate per host CPU core. Microsoft's answer is hybrid composition with SR-IOV (Accelerated Networking [@ms-accelerated-networking]) and SmartNIC offload (Azure Boost + MANA [@ms-azure-boost]). The KVM analogue is vDPA [@redhat-vdpa]. Both of these accept the structural truth that for the highest throughputs, the host CPU has to leave the data plane.
The host kernel parses guest-controlled bytes. Section 7's CVE record is the catalogue of what that costs. The architectural answer is OpenHCL: move the parser into the guest's own trust boundary and rewrite it in Rust.

A four-property idealisation

It is useful to write down what an idealised paravirt I/O stack would do, so it is clear which properties any real stack today is trading away.

The four idealised properties:

Zero hypercalls per packet in steady state.
Live-migration parity with a software baseline.
Cross-vendor / cross-hypervisor portability of the guest driver.
No host-side memory-unsafe parser of guest-controlled data.

| Approach | (1) Zero hypercall | (2) Live migration | (3) Portability | (4) No unsafe host parser |
| --- | --- | --- | --- | --- |
| VMBus + in-kernel VSP | partial (batched) | yes | no | no |
| virtio + vhost-net | partial (batched) | yes | yes | no |
| SR-IOV / DDA | yes | no | no | yes |
| Accelerated Networking (VMBus + SR-IOV) | yes (steady) | yes | no | no |
| vDPA | yes | partial | yes | no |
| OpenHCL paravisor + VMBus | partial | yes | partial | yes |
| Azure Boost + MANA | yes | yes | no | partial |

No single approach today matches all four properties. The Hyper-V production composition is roughly (VMBus baseline) + (Accelerated Networking for throughput) + (OpenHCL for confidential workloads). The KVM-world composition is (virtio baseline) + (vDPA / SmartNIC for throughput). SmartNIC-based stacks (Azure Boost, AWS Nitro, Google's offload) approach the same four-corner problem from yet another angle.
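The matrix is small enough to encode as data and query directly. The sketch below transcribes the seven rows as given, collapsing the qualifiers "partial (batched)" and "yes (steady)" to plain "partial" and "yes"; the property keys and the `satisfying` helper are names invented for this example.

```python
PROPERTIES = (
    "zero_hypercall",
    "live_migration",
    "portability",
    "no_unsafe_host_parser",
)

# Rows transcribed from the matrix above, in the same column order.
MATRIX = {
    "VMBus + in-kernel VSP": ("partial", "yes", "no", "no"),
    "virtio + vhost-net": ("partial", "yes", "yes", "no"),
    "SR-IOV / DDA": ("yes", "no", "no", "yes"),
    "Accelerated Networking (VMBus + SR-IOV)": ("yes", "yes", "no", "no"),
    "vDPA": ("yes", "partial", "yes", "no"),
    "OpenHCL paravisor + VMBus": ("partial", "yes", "partial", "yes"),
    "Azure Boost + MANA": ("yes", "yes", "no", "partial"),
}

RANK = {"no": 0, "partial": 1, "yes": 2}


def satisfying(required, allow_partial=False):
    """Return the approaches that meet every property in `required`."""
    threshold = RANK["partial"] if allow_partial else RANK["yes"]
    columns = [PROPERTIES.index(name) for name in required]
    return [
        approach
        for approach, row in MATRIX.items()
        if all(RANK[row[col]] >= threshold for col in columns)
    ]


if __name__ == "__main__":
    # No row scores "yes" in all four columns.
    print(satisfying(PROPERTIES))
    # Which approaches give both migration and portability outright?
    print(satisfying(("live_migration", "portability")))
```

Querying pairs of properties this way is a quick check on which trade-off a given composition is actually making.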

This is a synthesis, not a single-source claim: the matrix combines properties documented separately in the Microsoft Accelerated Networking docs [@ms-accelerated-networking], the Linux kernel CoCo doc [@kernel-coco], the Discrete Device Assignment doc [@ms-dda], the SR-IOV overview [@ms-sriov-overview], the Linux netvsc driver doc [@kernel-netvsc], the VDUSE userspace interface [@kernel-vduse], the vPCI doc [@kernel-vpci], and the OpenHCL explainer [@ms-openhcl-deep-explainer]. Each individual cell is sourced; the ranking is the author's reading of those sources.

Practical pitfalls for operators

A few things the customer-facing docs do not always say plainly:

vmbusrhid is not low-risk. The keyboard/mouse channel is a kernel-level RPC surface from guest to root. Treat it the same way you would treat netvsc when modelling threat exposure.
Generation-2 VMs reduce attack surface. Choosing Generation-2 for new workloads removes the legacy IDE/PS/2/PIC emulators from the host data path entirely (Microsoft Learn: Gen 1 vs Gen 2 [@ms-gen1-gen2-vms]).
Mixing in-box and out-of-band Integration Services breaks things. Modern Windows and modern Linux already have the drivers; installing the legacy LIS package on top can break MSI-X handling and PCI passthrough (Linux kernel: overview [@kernel-hyperv-overview]).
DDA is not SR-IOV. Discrete Device Assignment covers any PCIe device passthrough, but Microsoft formally supports only GPUs and NVMe as device classes (Microsoft Learn: DDA planning [@ms-dda]).
Confidential VMs do not have the same device set. Hardware constraints reduce or alter the device classes available; always validate the specific synthetic devices your workload depends on are present in the target SKU (Linux kernel: CoCo [@kernel-coco]).

Note:
1. Confidential VM (SEV-SNP / TDX)? Use the OpenHCL paravisor mode (Azure CoCo VM options [@ms-coco-vm-options]).
2. Need ≥40 Gbps with live migration? Use Accelerated Networking; on Boost-enabled SKUs, Boost adds another tier of offload.
3. Need ≥100 Gbps and accept binding to host? Use Discrete Device Assignment / SR-IOV.
4. Maximum guest portability across hypervisors? Use virtio; for bandwidth-sensitive workloads, vDPA.
5. Default Hyper-V workload, broad device coverage, native migration? VMBus + VSP (the default).
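The decision guide above reads naturally as a first-match rule list, which can be sketched as a small function. The thresholds and recommendations come from the guide; the `Requirements` fields, the function name, and the assumption that the rules are evaluated strictly in the order listed are choices made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    confidential: bool = False          # SEV-SNP / TDX workload
    min_gbps: float = 0.0               # required network bandwidth
    needs_live_migration: bool = True   # False means binding to a host is acceptable
    cross_hypervisor: bool = False      # guest image must run on other hypervisors
    bandwidth_sensitive: bool = False   # only consulted for the portable case


def recommend(req: Requirements) -> str:
    """Apply the five rules in order and return the first match."""
    if req.confidential:
        return "OpenHCL paravisor mode"
    if req.min_gbps >= 40 and req.needs_live_migration:
        return "Accelerated Networking"
    if req.min_gbps >= 100 and not req.needs_live_migration:
        return "Discrete Device Assignment / SR-IOV"
    if req.cross_hypervisor:
        return "vDPA" if req.bandwidth_sensitive else "virtio"
    return "VMBus + VSP (default)"


if __name__ == "__main__":
    print(recommend(Requirements(min_gbps=50)))
    print(recommend(Requirements(min_gbps=100, needs_live_migration=False)))
    print(recommend(Requirements(cross_hypervisor=True)))
```

Because the rules are first-match, a confidential workload is routed to the paravisor regardless of its bandwidth target; reordering the checks would change that outcome.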

Open problems worth watching

The substantive open problems are:

A standardised paravisor feature-enumeration interface. OpenHCL is the first auditable paravisor, but there is no portable contract a guest can use to query "what does this paravisor support." The TLFS gave us this for hypervisors; the paravisor analogue is missing (Linux kernel: CoCo [@kernel-coco]).
Confidential-VM-friendly live migration with paravirt devices. Hardware-attested state cannot be cloned trivially; today's pragmatic answer is to constrain migration in CoCo VMs. A general solution is open.
A formal model of the VMBus offer/rescind state machine. The kernel docs describe it narratively. A model that the VSP code could be checked against would let static analysis rule out the bug class behind the headline CVEs.
Live-migrating stateful SR-IOV VFs without device cooperation. Vendor proposals exist; an industry standard does not.
Erasing memory-unsafety in legacy VSPs. The Rust rewrite path in OpenVMM is correct; the multi-year engineering effort to convert every existing VSP is real. CVE-2024-21407 is recent enough to remind everyone the bug class is still producing fresh entries.

What to remember in five years

The most important sentence in this article is one I have been quietly preparing throughout: the durable architectural invariant in Hyper-V is shared-memory ring + doorbell, with a published guest-side contract. Everything else, including the choice of programming language for the VSP, the question of whether the data plane is software or hardware, and even whether the trust boundary places the VSP on the host or in a paravisor, is implementation. The transport is the invariant. That is the lesson the next decade of CoCo VMs and SmartNIC offload is converging toward: keep the contract stable, and let everything else change.

FAQ

Do I need to install Linux Integration Services (LIS) on a modern Linux guest?

No. Hyper-V driver support has been in the upstream Linux kernel since 2.6.32 in December 2009, and the current drivers (`hv_vmbus`, `hv_netvsc`, `hv_storvsc`, `hv_utils`, `pci-hyperv`, `hv_balloon`) ship in every mainstream distribution. The legacy LIS package is a holdover from the era before in-tree support and can in fact break MSI-X handling and PCI passthrough if installed on top of a modern kernel (Linux kernel: Hyper-V overview [@kernel-hyperv-overview]).

Why are VSP bugs more dangerous than VSC bugs?

Because the trust gradient is asymmetric. The VSP runs in the root partition's kernel, the most privileged context on the box; the VSC runs in a normal guest kernel. Bytes flowing from guest to host get parsed by code with full system privilege. A VSC bug typically harms only the guest; a VSP bug can be a cross-tenant compromise. The pattern is visible in the CVE record: CVE-2017-0075 [@nvd-cve-2017-0075], CVE-2021-28476 [@nvd-cve-2021-28476], and CVE-2024-21407 [@nvd-cve-2024-21407] all hit host-side parsers.

Why does a VM with SR-IOV still keep a synthetic netvsc NIC?

For live migration. SR-IOV gives you near-bare-metal throughput but binds the VM to a specific physical NIC; you cannot migrate that state. Keeping a VMBus-backed `netvsc` device in the same guest gives the hypervisor a software path it can fall back to during migration windows. The Linux kernel netvsc doc describes this failover explicitly: when SR-IOV is enabled, the VF is enslaved by netvsc and the data path switches transparently when the VF is up (Linux kernel: netvsc [@kernel-netvsc]).

What is the difference between OpenVMM and OpenHCL?

OpenHCL is a *configuration* of OpenVMM, not a separate codebase. OpenVMM is the Rust virtualization stack at github.com/microsoft/openvmm [@github-openvmm]; OpenHCL is OpenVMM run as a paravisor inside a confidential VM's higher-trust virtual trust level (VTL2), so that the synthetic-device backends sit inside the guest's own trust boundary rather than on a host the guest cannot trust. The same Rust code can run as a host-side VMM (when paired with a hypervisor on the host) or as an in-guest paravisor (when running inside a SEV-SNP or TDX VM).

Can a virtio guest run on a Microsoft VMM, or a VMBus guest on KVM?

Both directions exist with caveats. OpenVMM, when used as a host VMM, supports both VMBus and virtio backends, so a Linux virtio guest can run on a Microsoft-developed VMM (github.com/microsoft/openvmm [@github-openvmm]). Native Hyper-V on a Windows Server host historically expects VMBus-driven guests; there is no in-box virtio device emulation on a stock Hyper-V Server. KVM hosts can technically present a VMBus-shaped device, but in practice the production answer on KVM is virtio.

Why are Generation-2 VMs considered more secure than Generation-1?

Generation-2 VMs use UEFI with Secure Boot, boot from synthetic SCSI, and have no emulated IDE, PS/2, or PIC in the data path (Microsoft Learn: Gen 1 vs Gen 2 [@ms-gen1-gen2-vms]). Every emulator that is removed is one fewer parser running in the most privileged kernel on the host, so the host-side attack surface is meaningfully smaller. Generation-1 still exists for legacy guests that only know how to boot from BIOS + IDE.

What does virtualization-based security (VBS) have to do with Hyper-V?

VBS uses the Hyper-V hypervisor to split a single Windows install into VTL0 (the normal kernel and apps) and VTL1 (the Secure Kernel and trustlets like `lsaiso.exe`). The hypervisor enforces that VTL0 cannot read or modify VTL1's memory, even with kernel privileges. So an attacker who already has SYSTEM-level code execution in the normal world cannot trivially extract LSASS secrets or load arbitrary unsigned kernel code; the hypervisor stops them. This works on any modern Windows machine with the right CPU features, regardless of whether you ever run a VM yourself (Microsoft Learn: Windows Server 2016 What's New [@ms-server-2016]).

Inside Azure Confidential VMs: SEV-SNP, Intel TDX, and the Paravisor that Makes Them a Cloud Product

noreply@paragmali.com (Parag Mali) — Wed, 13 May 2026 00:00:00 GMT

**Azure Confidential VMs are Windows or Linux guests that the cloud operator's hypervisor cannot read or silently modify.** They are built on two distinct CPU primitives -- AMD SEV-SNP (Reverse Map Table + Virtual Machine Privilege Level + SNP_REPORT) and Intel TDX (Secure Arbitration Mode + the signed TDX Module + RTMR0-3) -- and wrapped on Azure by the open-source Rust paravisor OpenHCL running inside the trust boundary at VMPL0 or the L1 TD seat.

Inside that boundary the paravisor synthesises a vTPM whose quotes chain to the SEV-SNP or TDX hardware report, and Microsoft Azure Attestation runs a customer-defined policy v1.2 file (with JmesPath claim rules) against the evidence to release HSM-backed keys via Secure Key Release.

The Generation-2 integrity rail closes the SEVered and SEVurity ciphertext-remapping class architecturally, but four 2024-era papers (CacheWarp, WeSee, Heckler, Ahoi) demonstrate that side-channel and notification-injection seams remain. Read this if you need to draw the Azure CVM stack from silicon to MAA, decide between SEV-SNP and TDX SKUs, and write an attestation policy that says exactly what you mean.

1. Even the cloud operator must not see your memory

A Windows Server VM is running a SQL query on Azure right now. It is joining a million-row variant table against a patient-genome reference, building an index in RAM, and serving the answer back to a clinician's web portal. The customer who owns that VM has every reason to want the query to succeed and every reason to make sure that nobody else can ever read the index it builds: not the hypervisor it runs on, not the host firmware below it, not the Microsoft engineer holding the on-call pager, not even a court-ordered datacentre raid carried out with full physical access to the rack.

As of 2026, that is not a thought experiment. It is the contract Azure signs when you provision a DCasv5 or DCesv5 confidential VM [@msdocs-overview-products]. And the contract has a shape -- an architecturally enforced shape rooted in two distinct CPU mechanisms, wrapped in an open-source Rust paravisor [@openhcl-blog], verified by a policy-driven attestation service [@msdocs-maa-overview], and dented by four published 2024 attacks that this article will name in order.

The Confidential Computing Consortium defines the contract in one sentence: "Confidential Computing protects data in use by performing computation in a hardware-based, attested Trusted Execution Environment" [@ccc-about]. That sentence finishes a longer thought. Data at rest gets BitLocker and full-disk encryption. Data in transit gets TLS. Data in use -- the gigabytes that sit in DRAM while a process actually computes against them -- has historically been the unencrypted leg of a three-legged stool.

Confidential VM (CVM): A virtual machine whose memory and CPU state are cryptographically protected from the host hypervisor and the cloud operator's infrastructure, and whose configuration is bound to a hardware-rooted attestation report a remote verifier can check. The Confidential Computing Consortium's framing is the canonical one: "These secure and isolated environments prevent unauthorized access or modification of applications and data while in use" [@ccc-about].

Trusted Execution Environment (TEE): A computing environment whose confidentiality, integrity, and attestability are enforced by hardware mechanisms below the level of the operating system. A TEE may be process-scoped (Intel SGX enclaves), VM-scoped (AMD SEV-SNP, Intel TDX), or board-scoped (AWS Nitro Enclaves). The Confidential VM is the VM-scoped specialisation.

Three concrete workloads make the contract operationally legible. A regulated clean room running joint analytics over patient genomes between an academic medical centre and a pharmaceutical sponsor, where the contract literally forbids the sponsor's staff from reading raw genotypes. A multi-party anti-money-laundering analytic between two competing banks who will share encrypted features but not raw transactions. A sovereign-cloud control plane that must not leak to the hyperscaler's host kernel under any subpoena. In each case the threat model treats the cloud operator as semi-trusted at best and adversarial at worst, and in each case the customer wants the cipher engine to live below the operator's reach.

Note: Encryption at rest hides bytes on storage. Encryption in transit hides bytes on the wire. Encryption in use is the missing third leg -- the one that asks the cipher engine to live inline with the memory controller, so that a VM's working set never appears in plaintext to anyone but the VM itself. That is what AMD SEV-SNP and Intel TDX do at the silicon layer, and what Azure productises with the OpenHCL paravisor and Microsoft Azure Attestation [@ccc-about; @msdocs-azure-cvm].

The architecture that makes this contract real takes vocabulary from Internet standards as well as silicon. RFC 9334, published in January 2023, gives us the verifier / evidence / relying party language we will use throughout the article [@rfc9334]. An attester (the guest VM plus the paravisor) generates evidence (a hardware attestation report plus a vTPM quote). A verifier (Microsoft Azure Attestation in Azure's case) checks the evidence against a policy and emits an attestation result (a signed JWT). A relying party (Azure Key Vault, or any customer service) consumes the result and decides whether to release a secret. The article you are reading is, at heart, a tour of how a SEV-SNP or TDX guest, an OpenHCL paravisor, and Microsoft Azure Attestation realise that abstract diagram on commodity silicon.

That leads to the obvious question. How can a CPU enforce that even the hypervisor cannot read RAM? And once it can, why does a single mechanism turn out to be insufficient -- why does the architecture need a separate integrity rail on top? The next two sections trace the wrong answers that came first.

2. Why enclaves were not enough

In August 2016 David Kaplan stood on the USENIX Security stage in Austin and described "two new x86 ISA features developed by AMD" that he called "the first general-purpose memory encryption features to be integrated into the x86 architecture" [@usenix-kaplan-2016]. Kaplan was, in the conference biography's words, the "lead architect for the AMD memory encryption features" [@usenix-kaplan-2016]. His argument was deceptively simple. An enclave that lives inside a single process is the wrong unit of confidential computation for a cloud workload. The workloads customers actually run -- database engines, analytic services, language runtimes -- want gigabytes of working memory, multiple threads, and an unmodified operating system. None of that fits inside a roughly 96-MiB SGX enclave [@costan-devadas-2016].

Two design ancestors set the shape of the problem before either AMD or Intel solved it.

The first ancestor is the Trusted Platform Module. The TCG TPM specification dates back to 2003, when "the first TPM version that was deployed was 1.1b" [@wiki-tpm]. TPM 2.0 was announced on April 9, 2014 [@wiki-tpm] and standardised as ISO/IEC 11889. The TPM contributed three concepts that remain load-bearing two decades later: platform configuration registers (the extend-only PCR digests that a measured-boot chain builds), attestation identity keys, and a quote operation that signs PCR state with a key whose origin a remote verifier can trust. The TPM is not a TEE in the modern sense -- it does not host computation -- but it is the first widely deployed device that lets a remote party gain cryptographic assurance about what a machine is running. Every confidential VM design ships a TPM-shaped attestation surface inside it.
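The extend-only behaviour of a PCR is easier to see in code than in prose. The following is a minimal sketch using Node's built-in `crypto` module, assuming a SHA-256 PCR bank that resets to all zeros; the helper names `extendPcr` and `measureChain` and the component strings are illustrative and not part of any TPM API.

```javascript
const { createHash } = require('crypto');

// Extend-only update: newPcr = SHA-256(oldPcr || SHA-256(component)).
// There is no "set" operation, so a PCR value commits to the whole history.
function extendPcr(pcr, component) {
  const digest = createHash('sha256').update(component).digest();
  return createHash('sha256').update(Buffer.concat([pcr, digest])).digest();
}

// Replay a measured-boot chain starting from the all-zero reset value.
function measureChain(components) {
  return components.reduce(extendPcr, Buffer.alloc(32));
}

const good = measureChain(['firmware', 'bootloader', 'kernel']);
const swapped = measureChain(['firmware', 'kernel', 'bootloader']);

console.log(good.toString('hex'));
console.log(good.equals(swapped)); // the order of measurements changes the final value
```

Because each new value hashes the previous one, a verifier that knows the expected components can recompute the final digest and compare it with the value a quote reports.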

The second ancestor is Intel Software Guard Extensions. First presented at the HASP 2013 workshop and delivered on Skylake in 2015 [@costan-devadas-2016], SGX introduced the enclave: a process-scoped TEE backed by the Enclave Page Cache, a CPU-managed memory region whose contents are decrypted only inside the cache. Programs enter and leave through ENCLU-family instructions; cross-domain calls use a partitioned model called ECALL / OCALL; remote attestation is mediated by Intel through a quoting enclave. SGX worked, in the strict sense that the threat model included even a malicious operating system. But three things kept it from generalising.

Enclave Page Cache (EPC): A CPU-protected DRAM region that holds an SGX enclave's working memory in encrypted, integrity-checked form. On early Skylake / Kaby Lake parts the EPC was capped at approximately 128 MiB of physical memory, of which roughly 93 to 96 MiB was usable once the BIOS reservation and the EPCM metadata were accounted for [@costan-devadas-2016]. Anything beyond the cap went through the encrypted-page-eviction path at a substantial performance cost, which is one of the architectural reasons SGX did not generalise to whole-VM cloud workloads.

The EPC cap was the first. A working set of ~96 MiB is fine for a key-wrapping service or a small ML model, but it is not a cloud-database VM. The second was the partitioned programming model. Real applications had to be split into trusted and untrusted halves with explicit ECALL / OCALL boundaries, which is a refactoring tax that few existing codebases would pay. The third was the side-channel question: Foreshadow [@foreshadow], SgxPectre [@sgxpectre], and SGAxe [@sgaxe] each demonstrated that a determined attacker with microarchitectural access could extract secrets from SGX, often without ever defeating the cipher itself.

Microsoft's response was Haven, an OSDI 2014 project that put a Windows library OS (Drawbridge) inside an SGX enclave to run unmodified Windows binaries. Haven worked as a proof of concept but was effectively obviated by the EPC cap and by the slow pace of SGX silicon delivery in Xeon-class CPUs. The library-OS-in-an-enclave became one of several dead ends on the road to whole-VM TEEs.

Microsoft staked Azure publicly to "data in use" on September 14, 2017, when Mark Russinovich announced Azure confidential computing on the company blog: "Microsoft Azure is the first cloud to offer new data security capabilities with a collection of features and services called Azure confidential computing" [@russinovich-azure-2017]. The same post named the initial backing TEEs. "Initially we support two TEEs, Virtual Secure Mode and Intel SGX. Virtual Secure Mode (VSM) is a software-based TEE that's implemented by Hyper-V in Windows 10 and Windows Server 2016" [@russinovich-azure-2017]. VSM was already the substrate of Credential Guard and HVCI inside the operating system; pulling it up as a "TEE the cloud customer can target" was the bridge between the in-OS Secure Kernel story and the eventually-needed silicon-rooted CVM.

The industry got organised two years later. The Confidential Computing Consortium formed under the Linux Foundation on October 17, 2019. The press release names the founding premiere members verbatim: "Alibaba, Arm, Google Cloud, Huawei, Intel, Microsoft and Red Hat" and the general members "Baidu, ByteDance, decentriq, Fortanix, Kindite, Oasis Labs, Swisscom, Tencent and VMware" [@lf-ccc-press]. An earlier Microsoft Open Source blog post on August 21, 2019, announced the formation with a slightly different membership list (including IBM but not Huawei) [@ms-ccc-blog]; the October press release is the formal founding roster.

Across three load-bearing AMD whitepapers -- SME/SEV (2016), SEV-ES (February 17, 2017), and SEV-SNP (January 9, 2020) -- the PDF cover-page metadata records "David Kaplan" as the named author [@amd-mem-enc-whitepaper; @amd-sev-es-whitepaper; @amd-snp-whitepaper], and the USENIX Security 2016 biography corroborates "lead architect for the AMD memory encryption features" [@usenix-kaplan-2016]. Across the parallel Intel artefacts -- the September 2020 TDX whitepaper and the Architecture Specification doc 344425-001 -- PDF metadata names only "Intel Corporation" as the institutional author and does not enumerate individual architects [@intel-tdx-spec-344425]. We name David Kaplan throughout because the documentary record names him; we deliberately do not name individual Intel architects because the documentary record does not.

flowchart TD Data["Customer data"] --> Rest["At rest -- BitLocker, SED, KMS"] Data --> Transit["In transit -- TLS 1.3, IPsec"] Data --> Use["In use -- ?"] Use --> CVM["Confidential VMs -- SEV-SNP / Intel TDX"] CVM --> Para["Paravisor -- OpenHCL"] Para --> MAA["MAA verifier"]

If a TEE has to be smaller than a single page cache, the unit of confidential computation is wrong. What if the unit were a whole VM, and the cipher engine lived inline with the memory controller? The next section is the first time someone tried.

3. Generation 1 and 1.5: confidentiality without integrity

April 2016. David Kaplan, Jeremy Powell, and Tom Woller publish the AMD whitepaper AMD Memory Encryption [@amd-mem-enc-whitepaper]. The paper introduces two features in a single document. Secure Memory Encryption (SME) is a chassis-wide bulk cipher: a per-boot AES-128 key, managed by the on-die AMD Secure Processor, encrypts main memory transparently to the operating system. Secure Encrypted Virtualization (SEV) takes the same engine and gives each VM its own AES key tagged into an Address Space Identifier (ASID) in the cache, so two co-resident VMs cannot read each other's memory and neither can the hypervisor. The "C-bit" in the guest page table marks which pages are encrypted [@amd-mem-enc-whitepaper]. The first silicon to ship SEV was the first-generation EPYC "Naples" launched June 20, 2017 [@wiki-epyc].

C-bit: A high physical-address bit in an AMD SEV guest's page-table entries that signals to the memory controller "this page is encrypted with my VM's key." The C-bit is the per-page opt-in that lets a SEV guest mix encrypted private memory with explicitly shared bounce buffers in the same address space. Its absence means a page is cleartext to the hypervisor; its presence means the AES engine in the memory controller decrypts on every read and encrypts on every write [@amd-mem-enc-whitepaper].
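As a rough illustration of the per-page opt-in, the sketch below toggles a C-bit on a 64-bit page-table entry modelled as a BigInt. The bit position 51 is an assumption made for the example: real guests discover the position from CPUID (leaf 0x8000001F) rather than hard-coding it, and the helper names are hypothetical.

```javascript
// Assumed C-bit position for this sketch; real guests read it from CPUID.
const C_BIT_POSITION = 51n;
const C_BIT_MASK = 1n << C_BIT_POSITION;

// Private page: the memory controller decrypts reads and encrypts writes.
function markEncrypted(pte) {
  return pte | C_BIT_MASK;
}

// Shared page (for example an I/O bounce buffer): cleartext to the hypervisor.
function markShared(pte) {
  return pte & ~C_BIT_MASK;
}

function isEncrypted(pte) {
  return (pte & C_BIT_MASK) !== 0n;
}

const pte = 0x123456003n;                 // illustrative entry with the C-bit clear
const privatePte = markEncrypted(pte);
const sharedPte = markShared(privatePte);

console.log(isEncrypted(privatePte), isEncrypted(sharedPte));
```

The same address space can therefore hold both kinds of page, and the guest decides which is which one entry at a time.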

The threat model was clear and the architecture was honest about it. The hypervisor sees ciphertext on every encrypted page. What the architecture did not do, and what the original whitepaper did not claim, was integrity. The hypervisor remained authoritative over the nested page tables -- it could remap which host physical page a given guest physical address pointed to, and the cipher engine would happily decrypt whatever blob it found under the same key.

That gap produced the architectural lesson.

SEVered (Morbitzer et al., EuroSec 2018)

In May 2018, four authors from Fraunhofer AISEC -- Mathias Morbitzer, Manuel Huber, Julian Horsch, and Sascha Wessel -- published a paper whose abstract is unambiguous: "We present the design and implementation of SEVered, an attack from a malicious hypervisor capable of extracting the full contents of main memory in plaintext from SEV-encrypted virtual machines" [@severed-arxiv]. The attack did not break the cipher. It exploited the hypervisor's control of the nested page tables. The hypervisor first identified the guest physical page backing a resource that a service inside the guest would hand out on request (say, a page served by a web server), then remapped that guest physical address to the host physical page holding the memory it wanted. The guest's own service decrypted the target page with the VM key and returned its contents in plaintext. Because there was no architectural binding between a guest physical address and the host page that should sit behind it, the hypervisor could read the entire VM by repeating the remap and the request.

We present the design and implementation of SEVered, an attack from a malicious hypervisor capable of extracting the full contents of main memory in plaintext from SEV-encrypted virtual machines. -- Morbitzer, Huber, Horsch, Wessel, EuroSec'18 [@severed-arxiv]

The architectural lesson, stated as bluntly as the paper deserves, is that confidentiality without integrity is not confidentiality.

Key idea: Confidentiality without integrity is not confidentiality. The hypervisor that can move ciphertext between addresses is the hypervisor that can read it. The integrity of the guest-physical-to-host-physical mapping is as load-bearing as the cipher itself.

SEV-ES (February 2017): half a fix

AMD's first hardening step, SEV-ES, predates SEVered: the whitepaper's PDF cover page is dated February 17, 2017 [@amd-sev-es-whitepaper]. SEV-ES introduced register-state encryption on VMEXIT. Before SEV-ES, every VM exit handed the hypervisor a complete dump of guest CPU registers, including pointers into otherwise-encrypted memory. SEV-ES encrypted the saved register state under the guest key, surfaced a new #VC (VMM Communication) exception (vector 29), and required the guest to use a deliberately shared page called the Guest-Hypervisor Communication Block (GHCB) for everything that genuinely needed to cross the boundary -- emulated I/O, MMIO, time, the works.

Guest-Hypervisor Communication Block (GHCB): A page that a SEV-ES (and later SEV-SNP) guest deliberately shares with the hypervisor to communicate about events the hypervisor genuinely needs to handle: emulated I/O, MMIO accesses, certain control-plane operations. The GHCB is the explicit, audited "side channel" through the trust boundary. Everything else stays encrypted [@amd-sev-es-whitepaper].

SEV-ES closed one channel and left the other open. The integrity of the GPA-to-HPA mapping was still the hypervisor's problem to behave on, and the cipher was still XEX-mode AES without any keyed authentication. Two more papers made the architectural pressure unbearable.

ICUP (Buhren et al., CCS 2019) and SEVurity (Wilke et al., S&P 2020)

In August 2019, Robert Buhren, Christian Werling, and Jean-Pierre Seifert published Insecure Until Proven Updated [@icup-arxiv]. The abstract makes the operational point cleanly: "We demonstrate that it is possible to extract critical CPU-specific keys that are fundamental for the security of the remote attestation protocol. This effectively renders the SEV technology on current AMD Epyc CPUs useless when confronted with an untrusted cloud provider" [@icup-arxiv]. The mechanism was a firmware rollback against the AMD-SP that exposed attestation keys.

In May 2020, Wilke, Wichelmann, Morbitzer, and Eisenbarth published SEVurity: No Security Without Integrity at IEEE S&P [@sevurity-uzl]. Their two new methods, the project-page abstract records verbatim, "allow us to inject arbitrary code into SEV-ES secured virtual machines. Due to the lack of proper integrity protection, it is sufficient to reuse existing ciphertext to build a high-speed encryption oracle" [@sevurity-uzl]. The architectural diagnosis was now overdetermined: integrity had to enter the design, not as a side feature, but as a load-bearing rail.

The same Buhren-led group escalated to physical fault injection in August 2021 with One Glitch to Rule Them All, voltage-glitching the AMD Secure Processor on Zen 1 / 2 / 3 to execute custom payloads [@one-glitch-arxiv]. The PSPReverse GitHub artefact contains the supporting tooling [@pspreverse-github]. This is the physical-fault lower bound on the AMD-SP: an adversary with the right glitcher can subvert the security processor itself. The SEV-SNP design assumes a logical adversary; physical-access adversaries remain a known residual that §8 will revisit.

Intel's parallel road: TME and MKTME

Intel's bottom-of-stack cipher engine ran on a parallel track. In December 2017, Intel published Architecture Memory Encryption Technologies Specification, document 336907 rev 1.1 [@intel-mem-enc-spec-336907], introducing Total Memory Encryption (TME). The multi-key successor, MKTME (later TME-MK), surfaced publicly through a September 7, 2018 Linux-kernel RFC by Alison Schofield archived on LWN: "Multi-Key Total Memory Encryption API (MKTME) ... allows multiple encryption domains, each having their own key. While the main use case for the feature is virtual machine isolation" [@lwn-mktme]. TME-MK is the per-keyID memory cipher that the eventual Intel TDX architecture will mount its trust-domain model on top of.

Three papers, two vendors, one architectural verdict: confidentiality without integrity is not confidentiality, and the architecture has to change. What did AMD and Intel actually build in response?

flowchart LR SME["SME (2016) -- Bulk memory cipher"] SEV["SEV (Naples, 2017) -- Per-VM AES key"] ES["SEV-ES (Feb 2017) -- + Register-state cipher"] SNP["SEV-SNP (Jan 2020) -- + Integrity rail"] SME --> SEV SEV -- "SEVered -- (EuroSec 2018)" --> ES ES -- "ICUP (CCS 2019) -- SEVurity (S&P 2020)" --> SNP

4. Generation 2: the integrity rail

January 9, 2020. AMD publishes the 20-page SEV-SNP whitepaper, sole-authored by David Kaplan, with the title Strengthening VM Isolation with Integrity Protection and More [@amd-snp-whitepaper]. Eight months later, in September 2020, Intel publishes the first public TDX whitepaper (document 343961-002US, filename tdx-whitepaper-final9-17.pdf, PDF creation date Thursday September 17, 2020) and the companion Architecture Specification doc 344425-001 dated September 1, 2020 [@intel-tdx-spec-344425]. Two vendors, two different architectural answers, one shared diagnosis: the hypervisor must be excluded from the GPA-to-HPA mapping, not just from the ciphertext.

Wikipedia describes Intel TDX as "proposed by Intel in May 2021" [@wiki-tdx], but the PDF cover-page metadata extracted from both the TDX whitepaper and the Architecture Specification places the public release in September 2020. Where Wikipedia and the Intel-authored PDFs disagree, the PDFs are the primary record.

AMD SEV-SNP: four ingredients

SEV-SNP keeps the per-VM AES cipher from SEV and the register-state encryption from SEV-ES, and adds four new architectural ingredients that together close the integrity gap.

The first is the Reverse Map Table (RMP). The RMP is a system-wide per-page metadata table consulted on every nested page-table walk. Each entry binds a host physical page to the tuple (assigned ASID, expected guest physical address, VMPL, immutable bit, validated bit). If the hypervisor tries to remap a guest physical address to a different host page, the RMP entry will fail to match and the CPU raises an #NPF(rmpfault). AMD's own description reads: "SEV-SNP adds strong memory integrity protection to help prevent malicious hypervisor-based attacks like data replay, memory re-mapping, and more to create an isolated execution environment" [@amd-sev-portal]. This is the integrity rail. It is not a separate keyed MAC over memory; it is a structural binding that turns SEVered-class remappings into faults.

Reverse Map Table (RMP): A system-wide AMD SEV-SNP data structure that records, for every host physical page, the guest ASID it belongs to, the guest physical address it is mapped at, the VMPL ACL, an immutable flag, and a validated flag. Every nested page-table walk consults the RMP; mismatches raise `#NPF(rmpfault)`. The RMP is the architectural answer to SEVered: the hypervisor remains in charge of nested page tables, but the RMP says what each host page is allowed to be used for [@amd-snp-whitepaper; @amd-sev-portal].
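A toy model makes the binding concrete. The sketch below is a simplification under stated assumptions: the RMP and the nested page table are plain `Map` objects, the entry carries only three of the fields listed above, and the page numbers and ASID are invented for the example. It is not how the hardware stores or walks these structures.

```javascript
// Toy RMP: host physical page -> what that page is allowed to back.
const rmp = new Map([
  [0x1000, { asid: 7, gpa: 0x4000, validated: true }], // page serving public data
  [0x2000, { asid: 7, gpa: 0x8000, validated: true }], // page holding a secret
]);

// Nested page table, controlled by the hypervisor: guest physical -> host physical.
const npt = new Map([
  [0x4000, 0x1000],
  [0x8000, 0x2000],
]);

// A guest access succeeds only if the RMP entry for the resolved host page
// names the same ASID and the same guest physical address.
function guestAccess(asid, gpa) {
  const hpa = npt.get(gpa);
  const entry = rmp.get(hpa);
  const ok = entry !== undefined && entry.asid === asid &&
             entry.gpa === gpa && entry.validated;
  return ok ? { ok: true, hpa } : { ok: false, fault: '#NPF(rmpfault)' };
}

const before = guestAccess(7, 0x4000);
npt.set(0x4000, 0x2000);               // SEVered-style remap by the hypervisor
const after = guestAccess(7, 0x4000);

console.log(before, after);
```

The hypervisor is still free to edit the nested page table, but the remapped access no longer reaches the secret page: the recorded guest physical address does not match, so the walk faults.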

The second is the PVALIDATE instruction. A SEV-SNP guest must explicitly validate a page before it uses it for confidential storage. The hypervisor cannot fake validation; if the page has not been validated by the guest, accesses fault. This pushes the responsibility for tracking "is this page really part of my private memory" into the guest, where the hypervisor cannot lie about it.

The third is the Virtual Machine Privilege Level lattice.

Virtual Machine Privilege Level (VMPL): A four-level privilege lattice (VMPL0 most privileged, VMPL3 least) introduced by AMD SEV-SNP. Each RMP entry includes per-VMPL access-control bits, so a single SEV-SNP guest can split itself into nested, ring-like partitions where a more-privileged component (for example, a paravisor at VMPL0) sees pages that a less-privileged component (the customer's kernel at VMPL2) cannot. VMPL appears as a field inside the SNP_REPORT, so a remote verifier can tell which VMPL produced a given quote [@amd-snp-whitepaper].

The fourth is the attestation report. The SNP_REPORT is an ECDSA-P384 signed blob produced by the AMD-SP, carrying fields including the launch measurement, the guest policy, the user-supplied report_data nonce, the issuing vmpl, the unique chip_id, and the tcb_version. The signing key is the Versioned Chip Endorsement Key (VCEK), derived per chip per TCB version from a long-lived endorsement key, and the certificate chain runs VCEK_cert -> ASK -> AMD root [@amd-sev-portal].
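The shape of that chain walk can be sketched with Node's built-in `crypto` module. This is a deliberately simplified model: every key here is a freshly generated ECDSA P-384 key, the "certificates" are just signed public keys instead of X.509 documents, and the report is a small JSON stand-in. The names `askCert`, `vcekCert`, and `verifyEvidence` are hypothetical.

```javascript
const { generateKeyPairSync, createPublicKey, sign, verify } = require('crypto');

// secp384r1 is the OpenSSL name for NIST P-384.
const newKey = () => generateKeyPairSync('ec', { namedCurve: 'secp384r1' });
const pubDer = (key) => key.publicKey.export({ type: 'spki', format: 'der' });
const fromDer = (der) => createPublicKey({ key: der, format: 'der', type: 'spki' });

const root = newKey(); // stand-in for the AMD root
const ask = newKey();  // stand-in for the AMD Signing Key
const vcek = newKey(); // stand-in for the per-chip, per-TCB VCEK

// Toy "certificates": a subject public key signed by the issuer above it.
const askCert = { subject: pubDer(ask), sig: sign('sha384', pubDer(ask), root.privateKey) };
const vcekCert = { subject: pubDer(vcek), sig: sign('sha384', pubDer(vcek), ask.privateKey) };

// Toy report signed by the VCEK.
const report = Buffer.from(JSON.stringify({ measurement: 'abc', vmpl: 0, reportData: 'nonce' }));
const reportSig = sign('sha384', report, vcek.privateKey);

// Walk root -> ASK -> VCEK -> report; any broken link rejects the evidence.
function verifyEvidence(reportBytes, signature, rootPublicKey) {
  return verify('sha384', askCert.subject, rootPublicKey, askCert.sig) &&
         verify('sha384', vcekCert.subject, fromDer(askCert.subject), vcekCert.sig) &&
         verify('sha384', reportBytes, fromDer(vcekCert.subject), signature);
}

console.log(verifyEvidence(report, reportSig, root.publicKey));
```

Only once every link holds does a verifier go on to evaluate policy over the report's fields; a report altered after signing fails the last check.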

Versioned Chip Endorsement Key (VCEK): The AMD SEV-SNP attestation signing key. Derived deterministically from each chip's individual endorsement secret and the current TCB version (firmware level), so a single chip exposes one VCEK per TCB version. The certificate chain anchors back to AMD's root via the AMD Signing Key (ASK). The VCEK is what makes SEV-SNP attestation chain to silicon: the verifier checks the SNP_REPORT signature against a VCEK certificate AMD will only issue for genuine AMD-SP firmware [@amd-snp-whitepaper; @amd-sev-portal].

SEV-SNP adds strong memory integrity protection to help prevent malicious hypervisor-based attacks like data replay, memory re-mapping, and more in order to create an isolated execution environment. -- AMD SEV-SNP whitepaper, January 2020 [@amd-snp-whitepaper]

sequenceDiagram autonumber participant Guest as Guest CPU access participant NPT as Nested Page Walker participant RMP as Reverse Map Table participant AES as AES engine (memory ctrl) Guest->>NPT: Resolve GVA -> GPA -> HPA NPT->>RMP: Lookup (HPA) RMP-->>NPT: ASID, expected GPA, VMPL alt RMP entry matches request NPT->>AES: Decrypt under VM key AES-->>Guest: Plaintext else Mismatch (SEVered-style remap) RMP-->>Guest: #NPF (rmpfault) end

Intel TDX: a different geometry, the same end-state

Intel reached the same architectural conclusion with a different mechanism. Rather than bake integrity into microcode plus the AMD-SP, Intel introduced a new CPU mode and a separately signed software module that runs in it. The Intel TDX overview puts it this way: "A CPU-measured Intel TDX module enables Intel TDX. This software module runs in a new CPU Secure Arbitration Mode (SEAM) as a peer virtual machine manager (VMM) ... hosted in a reserved memory space identified by the SEAM Range Register (SEAMRR)" [@intel-tdx-overview].

The ingredients are seven, not four.

Secure Arbitration Mode (SEAM): A new CPU privilege state introduced by Intel TDX, and the first of the seven ingredients. Code running in SEAM is hosted in a physical-memory range identified by the SEAM Range Register (SEAMRR) that the legacy VMM cannot inspect. Only the signed Intel TDX Module runs in SEAM, and it does so as a peer VMM that mediates every interaction between the legacy hypervisor and a Trust Domain [@intel-tdx-overview].

The Intel TDX Module is the second ingredient: a CPU-measured firmware binary, loaded by the SEAMLDR at boot, that mediates every entry into and exit from a Trust Domain via SEAMCALL and SEAMRET instructions. The Intel-signed intel-tdx-module-1.5-base-spec-348549002.pdf is the canonical specification for the current generation [@intel-tdx-module-base-348549].

The third is the Trust Domain, a VM-shaped container that carries a Shared Bit in the guest physical address. A clear shared bit means the page is private; a set shared bit means the page is deliberately shared with the hypervisor for I/O bounce buffers. The fourth is TME-MK memory encryption, derived from the December 2017 TME spec [@intel-mem-enc-spec-336907] and the September 2018 MKTME Linux-kernel RFC [@lwn-mktme]: AES-128 in XTS mode, with the keyID embedded in the upper physical-address bits, gives one key per Trust Domain.

The fifth ingredient is the structural analogue of AMD's RMP, the Physical-Address-Metadata table (PAMT). The Intel TDX overview enumerates the architectural elements precisely: "Intel TDX uses architectural elements such as SEAM, a shared bit in Guest Physical Address (GPA), secure Extended Page Table (EPT), physical-address-metadata table, Intel Total Memory Encryption -- Multi-Key (Intel TME-MK), and remote attestation" [@intel-tdx-overview].

The sixth ingredient is the measurement registers. The MRTD is the build-time measurement of the initial TD image, similar to a TPM PCR fixed at launch. RTMR0 through RTMR3 are the runtime measurement registers, four PCR-equivalents the TDX Module exposes for runtime measured-boot extensions. These four registers are what a TDX-aware Trusted Boot chain extends.

MRTD and RTMR0-3: The build-time and runtime measurement registers exposed by an Intel TDX Trust Domain. MRTD is hashed by the TDX Module over the initial TD launch image and is the SEAM analogue of an immutable launch PCR. RTMR0-3 are four extendable runtime registers, the SEAM analogue of the runtime-extension TPM PCRs (the same conceptual role as PCRs 8-15 in the canonical static-OS measurement chain), that hold a measured-boot chain of subsequent components (loaders, kernel, initrd, paravisor pages). The canonical TDX-vTPM event-log convention used by Linux IMA and systemd-stub maps RTMR[0] to PCR[1, 7]; RTMR[1] to PCR[2-6]; RTMR[2] to PCR[8-9]; and RTMR[3] to PCR[14, 17-22]. A TD Quote carries all five values; a verifier evaluates them against a customer-defined policy [@intel-tdx-overview; @intel-tdx-spec-344425].
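On the verifier side, the runtime registers are usually checked by replaying an event log and comparing the result with the quoted values. The sketch below assumes SHA-384 registers that start at all zeros and extend as SHA-384(old || digest); the event-log shape, the component names, and the choice of which register each event lands in are all invented for the example.

```javascript
const { createHash } = require('crypto');

const sha384 = (data) => createHash('sha384').update(data).digest();

// Recompute the four runtime registers from an ordered list of events.
function replayEventLog(events) {
  const rtmrs = [0, 1, 2, 3].map(() => Buffer.alloc(48));
  for (const { index, digest } of events) {
    rtmrs[index] = sha384(Buffer.concat([rtmrs[index], digest]));
  }
  return rtmrs;
}

const matches = (a, b) => a.every((reg, i) => reg.equals(b[i]));

// Hypothetical log; the register assignment here is illustrative only.
const eventLog = [
  { index: 1, digest: sha384('bootloader') },
  { index: 1, digest: sha384('kernel') },
  { index: 2, digest: sha384('initrd') },
];

const quoted = replayEventLog(eventLog); // stand-in for the values in a TD Quote
const tamperedLog = [eventLog[0], { index: 1, digest: sha384('evil-kernel') }, eventLog[2]];

console.log(matches(replayEventLog(eventLog), quoted));
console.log(matches(replayEventLog(tamperedLog), quoted));
```

A log that does not reproduce the quoted registers is rejected, which is what lets a policy reason about individual boot components and not just an opaque digest.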

The seventh is the TD Quote. A TD Quote is produced in two stages. The TD guest first issues TDCALL[TDG.MR.REPORT], which lands in the TDX Module (the VMM-to-Module entry is the separate SEAMCALL interface listed in the comparison table below); the TDX Module returns an in-SEAM SEAMREPORT structure, a Report authenticated with a MAC under a key bound to the platform. A host-side SGX Quoting Enclave then converts that Report into a Quote signed with the SGX-resident QE attestation key. The Quote carries MRTD, RTMR0-3, the TD's TCB SVN (a per-component firmware version vector), and a caller nonce. The Intel Trust Authority (or Microsoft Azure Attestation, or Google's verifier) checks the quote [@intel-tdx-overview; @intel-tdx-module-base-348549].
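The two stages differ in who can check the result: a MAC is only verifiable on the platform that holds the MAC key, while a signature is verifiable by anyone with the public key. A minimal sketch of that hand-off follows, assuming HMAC-SHA-256 for the local report and ECDSA P-384 for the quote; the key material, function names, and JSON report body are placeholders, not the TDX data formats.

```javascript
const {
  createHmac, generateKeyPairSync, randomBytes, sign, timingSafeEqual, verify,
} = require('crypto');

const platformMacKey = randomBytes(32); // stand-in for the platform-bound MAC key
const qeKey = generateKeyPairSync('ec', { namedCurve: 'secp384r1' }); // stand-in QE key

const mac = (body) => createHmac('sha256', platformMacKey).update(body).digest();

// Stage 1: the module returns a report that only this platform can authenticate.
function makeReport(claims) {
  const body = Buffer.from(JSON.stringify(claims));
  return { body, mac: mac(body) };
}

// Stage 2: the quoting enclave checks the MAC locally, then signs for remote parties.
function makeQuote(report) {
  if (!timingSafeEqual(mac(report.body), report.mac)) {
    throw new Error('report MAC mismatch');
  }
  return { body: report.body, signature: sign('sha384', report.body, qeKey.privateKey) };
}

// A remote verifier needs only the public half of the quoting key.
const verifyQuote = (quote) => verify('sha384', quote.body, qeKey.publicKey, quote.signature);

const report = makeReport({ mrtd: 'launch-digest', nonce: 'caller-nonce' });
const quote = makeQuote(report);

console.log(verifyQuote(quote));
```

The conversion step is why the remote verifier's trust ends up rooted in the Quoting Enclave's key and certificate chain, not in the platform MAC key it never sees.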

flowchart TB HW["Silicon: TME-MK + SEAMRR -- + Secure EPT + PAMT"] SEAM["Intel TDX Module -- (SEAM mode)"] VMM["Legacy VMM -- (Hyper-V / KVM)"] TD1["Trust Domain 1"] TD2["Trust Domain 2"] HW --> SEAM HW --> VMM VMM -- "SEAMCALL" --> SEAM SEAM -- "SEAMRET" --> VMM SEAM -- "TDENTER / TDEXIT" --> TD1 SEAM -- "TDENTER / TDEXIT" --> TD2

Side by side

The two architectures answer the same question and arrive at the same end-state contract through fundamentally different trust geometries.

Ingredient	AMD SEV-SNP	Intel TDX
Memory cipher	AES-128, per-VM key in memory controller	AES-128-XTS, per-TD key by keyID (TME-MK)
Integrity binding	Reverse Map Table per host page	Physical-Address-Metadata table + Secure EPT
Mediating component	AMD-SP firmware (microcode + on-die security processor)	Signed Intel TDX Module in SEAM mode
Privilege lattice	VMPL0-VMPL3 (four levels)	TD Partitioning L1/L2 (TDX Module 1.5)
Build-time measurement	Launch measurement in SNP_REPORT	MRTD inside the TDX Module
Runtime measurement	None at module level (vTPM provides it)	RTMR0-RTMR3 inside the TDX Module
Attestation signing key	VCEK (ECDSA-P384), per chip per TCB version	SGX-resident Quoting Enclave key
Certificate chain	VCEK -> ASK -> AMD root	Quoting Enclave -> Intel root
Page-validation primitive	`PVALIDATE` (guest-driven)	TDX Module-mediated page acceptance
Shared-page indicator	C-bit (clear = shared, set = encrypted)	Shared bit in GPA (set = shared)
Hypervisor-to-trust-component call	Mediated VMRUN	`SEAMCALL` / `SEAMRET`

{`
// Pseudo-code sketch of how a SEV-SNP guest assembles an SNP_REPORT
// via SNP_GUEST_REQUEST. It does not talk to silicon; the point is
// the shape of the evidence the verifier receives.

// Mock stand-in for the firmware call so the sketch runs anywhere.
// Every value it returns is a placeholder, not real AMD-SP output.
function sp_guest_request(request) {
  return {
    version: 0,
    guestSvn: 0,
    policy: 0,
    familyId: '00'.repeat(16),
    measurement: '00'.repeat(48),
    reportData: request.reportData,
    vmpl: request.vmpl,
    chipId: '00'.repeat(64),
    tcbVersion: { bootLoader: 0, tee: 0, snp: 0, microcode: 0 },
    signature: '<signature placeholder>',
  };
}

function buildSnpReport(nonce32) {
  // Guest builds a request structure with a 32-byte user nonce.
  const request = { reportData: nonce32, vmpl: 0 };

  // Hypercall lands in the AMD-SP, which signs with the VCEK.
  const report = sp_guest_request(request);

  return {
    version: report.version,         // structure version
    guestSvn: report.guestSvn,       // guest firmware SVN
    policy: report.policy,           // SEV policy bits at launch
    familyId: report.familyId,       // 16-byte ID set by launch
    measurement: report.measurement, // 48-byte launch measurement
    reportData: report.reportData,   // echoes user nonce
    vmpl: report.vmpl,               // VMPL of issuing component
    chipId: report.chipId,           // 64-byte unique chip ID
    tcbVersion: report.tcbVersion,   // boot loader / TEE / SNP / microcode SVNs
    signature: report.signature,     // ECDSA P-384 over the report
  };
}

// The verifier walks the certificate chain VCEK -> ASK -> AMD root,
// re-checks the signature, and then evaluates policy on the claims.
console.log(JSON.stringify(buildSnpReport('nonce_from_relying_party'), null, 2));
`}

Key idea: SEV-SNP and TDX answer the same question differently. AMD bakes integrity into microcode plus the AMD-SP, signs with a per-chip per-TCB VCEK, and exposes a four-level VMPL lattice. Intel puts integrity into a separately loaded, separately signed software module running in a new CPU mode, signs with an SGX-resident Quoting Enclave, and exposes L1/L2 partitioning. The trust roots, the breaking surfaces, and the supply chains are different even when the end-state contract is the same.

flowchart LR
  subgraph AMD["AMD SEV-SNP"]
    A1["AMD-SP firmware"]
    A2["Reverse Map Table"]
    A3["VMPL0-3 lattice"]
    A4["SNP_REPORT -- VCEK signed"]
  end
  subgraph INTEL["Intel TDX"]
    I1["Signed TDX Module"]
    I2["PAMT + Secure EPT"]
    I3["L1 / L2 partitioning"]
    I4["TD Quote -- Quoting Enclave"]
  end
  A1 --- I1
  A2 --- I2
  A3 --- I3
  A4 --- I4

Generation 2 makes a confidential VM architecturally possible. But a SEV-SNP guest is not yet a Windows Server VM you can lift and shift onto Azure -- there is a whole productisation problem still to solve. How does Microsoft put a paravisor inside that trust boundary, and what does it deliver?

5. The contract: a cloud-shaped TEE

A confidential VM is two rails, not one. Rail 1 is confidentiality plus integrity of memory and CPU state. Rail 2 is measurement plus attestation. SEV-SNP and TDX each deliver both rails. Anyone who has read the equivalent Secure Boot / Trusted Boot story will recognise the shape: a measurement chain anchored in silicon, terminated in a remote verifier, with a signed result that a relying party can act on.

The Confidential Computing Consortium's framing, repeated here as a contract the architectures actually realise: "Confidential Computing protects data in use by performing computation in a hardware-based, attested Trusted Execution Environment" [@ccc-about]. Hardware-based is rail 1. Attested is rail 2. The two words together are why a TPM-only system, however well-measured, is not a CVM, and why a SEV-only system, however well-encrypted, is not a CVM either.

RFC 9334 names the actors. The attester is the guest plus the paravisor producing evidence. The evidence is the SNP_REPORT or TD Quote, plus optionally a vTPM quote chained to it. The verifier is the entity that checks the evidence against a policy and emits an attestation result. The relying party is the consumer who acts on the result -- typically a key vault releasing a wrapped secret [@rfc9334].

The IETF Remote ATtestation procedureS working group's RFC 9334 (January 2023) fixes the vocabulary the rest of the confidential-computing industry uses: an *attester* produces *evidence*; a *verifier* appraises it against endorsements from an *endorser* and reference values from a *reference value provider*, then emits an *attestation result*; a *relying party* acts on the result. RFC 9334 §5 names two topologies. In the *Passport* model (§5.1), the attester sends evidence directly to the verifier, collects a signed result, and presents that result to the relying party. In the *Background-Check* model (§5.2), the attester sends evidence to the relying party, which forwards it to the verifier and receives the result on the attester's behalf. Microsoft Azure Attestation, Intel Trust Authority, Google's verifier, and AWS KMS attestation all implement variants of this architecture [@rfc9334].

Microsoft Azure Attestation implements the Passport model. The attester -- the CVM, through its in-guest agent -- sends evidence (an SNP_REPORT or TD Quote, plus a vTPM quote) directly to MAA. MAA validates the evidence against the customer-authored policy and returns a signed JWT. The attester then presents that JWT to the relying party. Azure Key Vault authorises Secure Key Release against the MAA-issued claim set, not against raw SNP evidence. The relying party never sees the SNP_REPORT and never calls MAA on the attester's behalf, which is the design signature of Passport rather than Background-Check [@rfc9334; @msdocs-maa-overview].
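The difference between the two RFC 9334 topologies is easiest to see as a message trace. The sketch below is a toy model, not an MAA client: `verifier`, `passportFlow`, and `backgroundCheckFlow` are hypothetical names, and the `signedBy` field is a plain marker standing in for a real JWT signature.

```javascript
// Toy model of the two RFC 9334 topologies. All names are hypothetical;
// "signedBy" is a stand-in marker, not a cryptographic signature.
const verifier = {
  appraise(evidence) {
    // A real verifier checks signatures and policy; here we only read a flag.
    return { ok: evidence.measurementOk === true, signedBy: 'verifier' };
  },
};

// Passport: attester -> verifier -> attester -> relying party.
function passportFlow(evidence) {
  const trace = ['attester->verifier: evidence'];
  const result = verifier.appraise(evidence);
  trace.push('verifier->attester: result');
  trace.push('attester->relyingParty: result');
  return { trace, relyingPartySawEvidence: false, released: result.ok };
}

// Background-Check: attester -> relying party -> verifier -> relying party.
function backgroundCheckFlow(evidence) {
  const trace = ['attester->relyingParty: evidence'];
  trace.push('relyingParty->verifier: evidence');
  const result = verifier.appraise(evidence);
  trace.push('verifier->relyingParty: result');
  return { trace, relyingPartySawEvidence: true, released: result.ok };
}

const evidence = { kind: 'SNP_REPORT', measurementOk: true };
const passport = passportFlow(evidence);
const backgroundCheck = backgroundCheckFlow(evidence);
console.log(passport.trace.join('\n'));
console.log(backgroundCheck.trace.join('\n'));
```

In the Passport trace the relying party only ever handles the result, which is the property the paragraph above attributes to the MAA and Key Vault pairing.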

flowchart LR
  Rail1["Rail 1 -- Confidentiality + Integrity"] --> Mem["Encrypted DRAM -- + RMP / PAMT -- + encrypted register state"]
  Rail2["Rail 2 -- Measurement + Attestation"] --> Ev["Evidence: -- SNP_REPORT / TD Quote -- + vTPM quote"]
  Ev --> Ver["Verifier: -- MAA / Intel Trust Authority"]
  Ver --> Tok["Attestation Result -- (signed JWT)"]
  Tok --> RP["Relying Party -- (Azure Key Vault)"]
  RP --> Secret["Wrapped secret release"]

Key idea: A Confidential VM is not a memory-encryption product. It is a contract: confidentiality with integrity, plus an evidence-bearing attestation chain that a relying party can verify before it releases a secret. Anyone who sells you "confidential" infrastructure without rail 2 is selling you half the product.

If this is the contract, how does Azure actually build a usable Windows-guest CVM on top of it? What lives where, and who signs what?

6. State of the art on Azure: from silicon to MAA

July 20, 2022. Microsoft Azure announces general availability of the DCasv5 and ECasv5 confidential VM SKUs on AMD third-generation EPYC silicon. The Register's coverage captures the framing: "Microsoft is expanding its Azure confidential computing portfolio with virtual machines that use the encryption and memory protection features of AMD's third-gen Epyc processors. ... Customers using them can also use the free Microsoft Azure Attestation (MAA) service to remotely verify the operating environment and integrity of the software binaries running on it" [@theregister-azure-cvm]. That is the moment a confidential VM stops being a research paper and starts being a product the customer can pay for by the hour.

This section walks the Azure stack bottom-up. It is the longest section because it is the article's reason to exist.

The Azure CVM SKU family

Microsoft Learn's confidential-computing products page enumerates the current Azure CVM SKU map. On AMD SEV-SNP: "DCasv5 and ECasv5 enable rehosting of existing workloads" [@msdocs-overview-products]. These are the third-generation EPYC Milan SKUs that went GA in July 2022. The Learn page continues: "DCasv6 and ECasv6 confidential VMs based on fourth-generation AMD EPYC processors are currently in gated preview" [@msdocs-overview-products]. Lenovo Press corroborates that "SEV-SNP is supported on AMD EPYC processors starting with the AMD EPYC 7003 series processors" -- i.e., Milan -- with the third-generation 7003 series being the first SEV-SNP silicon [@lenovo-lp1893].

On Intel TDX: "DCesv5 and ECesv5" are the fourth-generation Xeon Sapphire Rapids SKUs, generally available. SecurityWeek's coverage anchors the Sapphire Rapids launch: "Intel announced on Tuesday that it has added Intel Trust Domain Extensions (TDX) to its confidential computing portfolio with the launch of its new 4th Gen Xeon enterprise processors. ... The feature will be available through cloud providers such as Microsoft, Google, IBM and Alibaba" [@securityweek-tdx]. Wikipedia notes that "TDX is available for 5th generation Intel Xeon processors (codename Emerald Rapids) and Edge Enhanced Compute variants of 4th generation Xeon processors (codename Sapphire Rapids)" [@wiki-tdx]. The fifth-generation Emerald Rapids SKUs DCesv6 and ECesv6 are in preview at the time of writing, per the Learn products page [@msdocs-overview-products].

GPU CVMs anchor on the same CPU-side TEEs and add a GPU TEE. The Learn page describes the NCCadsH100v5 SKU: "NCCadsH100v5 confidential VMs come with a GPU ... use linked CPU and GPU Trusted Execution Environments (TEEs)" [@msdocs-overview-products]. This is the linked-attestation product for confidential AI -- a SEV-SNP host CVM bound by attestation to an NVIDIA H100 in Confidential Compute mode.

March 30, 2026 brings a pricing change customers should plan for. Microsoft Learn states: "From March 30 2026, encrypted OS disks will incur higher costs" [@msdocs-azure-cvm]. Confidential OS-disk encryption remains the recommended configuration where the workload requires it; the change is to the billing line, not to the architecture.

The paravisor: OpenHCL on OpenVMM

The single most important productisation move Azure made is what Microsoft calls a paravisor. The framing from the October 17, 2024 Tech Community announcement is verbatim: "Microsoft developed the first paravisor in the industry, and for years, we have been enhancing the paravisor offered to Azure customers. This effort now culminates in the release of a new, open source paravisor, called OpenHCL" [@openhcl-blog].

Paravisor: A thin operating system running inside the trust boundary of a confidential VM, between the host hypervisor and the customer guest. The paravisor exposes the synthetic devices, the vTPM, and the GPA partitioning that a Windows or Linux guest expects from a Hyper-V environment, without entrusting any of those services to the host below the trust boundary. The paravisor is itself part of the TCB, but on Azure the paravisor binary is open source [@openhcl-blog; @openvmm-repo].

OpenHCL: Microsoft's open-source paravisor, released on October 17, 2024. OpenHCL is built on top of OpenVMM, "a modular, cross-platform Virtual Machine Monitor (VMM), written in Rust" [@openvmm-repo]. On Azure SEV-SNP CVMs OpenHCL runs at VMPL0; on TDX CVMs it runs in the L1 partition seat under TD Partitioning [@openhcl-blog; @openvmm-dev]. It mediates virtual devices, brokers the vTPM, manages GPA partitioning between private and shared pages, and handles diagnostics, all inside the trust boundary.

The OpenVMM repository README puts the focus crisply: "OpenVMM is a modular, cross-platform Virtual Machine Monitor (VMM), written in Rust. Although it can function as a traditional VMM, OpenVMM's development is currently focused on its role in the OpenHCL paravisor" [@openvmm-repo]. The OpenVMM Guide lists the virtualisation APIs OpenVMM supports, including "MSHV (using VSM / TDX / SEV-SNP)" for paravisor mode, WHP for a Windows host, and KVM for a Linux host [@openvmm-dev]. The use cases listed include Azure Boost, Trusted Launch, and Confidential VMs.

Because OpenHCL is in the TCB, customers do not avoid trusting Microsoft by running it -- but they can now read the source. That is a categorical change from earlier closed paravisors. The point about a TCB is not its size but its auditability and reviewability.

The canonical Linux-side analogue is AMD's Secure VM Service Module (SVSM), which runs at VMPL0 inside an SEV-SNP guest and provides the same kind of in-trust-boundary services (virtual TPM, paravirtualised I/O brokering, attestation surface) that OpenHCL provides on Azure [@amd-svsm]. SVSM and OpenHCL solve the same problem with different implementations and different signing chains. The Linux community's reference SVSM is the COCONUT-SVSM open-source project [@coconut-svsm]. A reader who needs a confidential-VM paravisor on a non-Azure Linux host should look at SVSM; a reader who needs it on Azure gets OpenHCL.

The vTPM

Inside the paravisor's protected memory, OpenHCL synthesises a per-VM virtual TPM. Microsoft Learn states: "Azure confidential VMs feature a virtual TPM (vTPM) for Azure VMs. ... Confidential VMs have their own dedicated vTPM instance, which runs in a secure environment outside the reach of any VM" [@msdocs-azure-cvm]. That sentence carries the rest of the architecture. The vTPM's endorsement key is bound at provision time to the SEV-SNP or TDX hardware attestation report, so a vTPM quote can be transitively chained back to silicon: vTPM quote -> EK certificate -> SNP_REPORT or TD Quote -> VCEK or Intel signing root [@msdocs-azure-cvm].

The practical consequence is that a Windows Server CVM runs an unmodified Trusted Boot chain inside the guest. PCR-7 still indexes the Secure Boot signer. Code Integrity policies still extend their own PCRs. BitLocker still seals the Volume Master Key to the TPM. None of those operating-system features need to know that the TPM they are talking to is synthesised by OpenHCL inside an SEV-SNP guest -- and yet every one of those features is now anchored, transitively, to AMD or Intel silicon rather than to a discrete TPM chip on a motherboard the cloud customer cannot inspect.

Microsoft Azure Attestation

The verifier in Azure's confidential-computing stack is Microsoft Azure Attestation. The Learn overview describes it: "Microsoft Azure Attestation is a unified solution for remotely verifying the trustworthiness of a platform and integrity of the binaries running inside it. The service supports attestation of the platforms backed by Trusted Platform Modules (TPMs) alongside the ability to attest to the state of Trusted Execution Environments (TEEs) such as Intel Software Guard Extensions (SGX) enclaves, Virtualization-based Security (VBS) enclaves ... and Azure confidential VMs" [@msdocs-maa-overview].

Microsoft Azure Attestation (MAA): Azure's unified verifier service for confidential platforms. MAA accepts evidence -- an SNP_REPORT or TD Quote, plus a vTPM quote, plus boot measurements -- evaluates it against a customer-defined attestation policy, and returns a signed JWT carrying the issued claims. MAA's role in the RATS architecture is the *verifier*, in *Passport* topology: the attester collects MAA's signed result and presents it to the relying party (Azure Key Vault) [@msdocs-maa-overview; @rfc9334].

The Secure Key Release (SKR) loop is documented in the same overview. "When a CVM boots up, SNP report containing the guest VM firmware measurements are sent to Azure Attestation. The service validates the measurements and issues an attestation token that is used to release keys from Managed-HSM or Azure Key Vault. These keys are used to decrypt the vTPM state of the guest VM, unlock the OS disk and start the CVM" [@msdocs-maa-overview].

Secure Key Release (SKR): The Azure Key Vault / Managed HSM operation that releases a wrapped key only after the requesting party presents a valid Microsoft Azure Attestation token that satisfies the key's release policy. SKR is what closes the loop between rail 1 (memory protection) and rail 2 (attestation) at the customer's perimeter: a key never leaves the HSM unless the attesting CVM has been verified [@msdocs-maa-overview; @msdocs-azure-cvm].
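The release decision can be pictured as a predicate over the claims in the attestation token. The following is a minimal sketch under stated assumptions: the policy shape (`anyOf` / `allOf` / `claim` / `equals`), the claim names and values, and the authority URL are illustrative and should be checked against the Key Vault documentation; `getClaim` and `mayRelease` are hypothetical helpers. A real relying party would also verify the token's signature before reading any claim.

```javascript
// Toy evaluator for a Secure Key Release style policy. The policy shape,
// claim names, and values are assumptions for illustration only.
const releasePolicy = {
  anyOf: [
    {
      authority: 'https://attest.example.invalid', // hypothetical MAA endpoint
      allOf: [
        { claim: 'x-ms-isolation-tee.x-ms-attestation-type', equals: 'sevsnpvm' },
        { claim: 'x-ms-isolation-tee.x-ms-compliance-status', equals: 'azure-compliant-cvm' },
      ],
    },
  ],
};

// Resolve a dotted path such as "a.b" against nested token claims.
function getClaim(claims, path) {
  return path
    .split('.')
    .reduce((node, key) => (node == null ? undefined : node[key]), claims);
}

// Release only if some branch matches the issuer and all of its conditions.
function mayRelease(policy, token) {
  return policy.anyOf.some(
    (branch) =>
      branch.authority === token.iss &&
      branch.allOf.every((cond) => getClaim(token, cond.claim) === cond.equals)
  );
}

const goodToken = {
  iss: 'https://attest.example.invalid',
  'x-ms-isolation-tee': {
    'x-ms-attestation-type': 'sevsnpvm',
    'x-ms-compliance-status': 'azure-compliant-cvm',
  },
};
const badToken = { iss: 'https://attest.example.invalid', 'x-ms-isolation-tee': {} };

console.log(mayRelease(releasePolicy, goodToken)); // true
console.log(mayRelease(releasePolicy, badToken));  // false
```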

MAA policy v1.2

The policy language is the operational surface customers actually interact with. The MAA policy v1.2 grammar has four segments, verbatim from the Microsoft Learn page: "Policy version 1.2 has four segments: version, configurationrules, authorizationrules, issuancerules" [@maa-policy-v12]. The critical operational distinction is between the last two. Authorization rules can fail attestation; issuance rules cannot. The docs are explicit: "authorizationrules: ... These rules can be used to fail attestation. issuancerules: ... These rules can be used to add to the outgoing claim set and the response token. These rules can't be used to fail attestation" [@maa-policy-v12].

Note: The most common bug in hand-authored MAA policies is writing a security gate as an issuance rule. If you want a missing SecureBoot value to reject the attestation, the predicate must live in authorizationrules. Putting it in issuancerules only adds a claim to the resulting JWT; the relying party then has to enforce the gate. The verifier will mint the token either way [@maa-policy-v12].
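The gate-versus-claim distinction can be modelled in a few lines. This is a toy evaluator, not MAA policy syntax: the rule objects, `permitWhen`, `issue`, and `evaluate` are hypothetical names chosen to mirror the two segments.

```javascript
// Toy model of the authorizationrules / issuancerules split.
function evaluate(policy, inputClaims) {
  // Authorization rules are gates: any failing predicate rejects attestation.
  for (const rule of policy.authorizationrules) {
    if (!rule.permitWhen(inputClaims)) {
      return { attested: false, reason: rule.name };
    }
  }
  // Issuance rules only add claims to the token; they can never reject.
  const issued = {};
  for (const rule of policy.issuancerules) {
    Object.assign(issued, rule.issue(inputClaims));
  }
  return { attested: true, token: issued };
}

const secureBootOn = (claims) => claims.secureBoot === true;

// Gate written (incorrectly) as an issuance rule: a token is still minted.
const leakyPolicy = {
  authorizationrules: [],
  issuancerules: [{ issue: (c) => ({ secureBootEnabled: secureBootOn(c) }) }],
};

// Gate written as an authorization rule: attestation fails outright.
const strictPolicy = {
  authorizationrules: [{ name: 'require-secure-boot', permitWhen: secureBootOn }],
  issuancerules: [{ issue: (c) => ({ secureBootEnabled: secureBootOn(c) }) }],
};

const noSecureBoot = { secureBoot: false };
const leakyResult = evaluate(leakyPolicy, noSecureBoot);
const strictResult = evaluate(strictPolicy, noSecureBoot);
console.log(leakyResult);  // attested, with secureBootEnabled: false in the token
console.log(strictResult); // rejected by require-secure-boot
```

With `leakyPolicy`, the relying party receives a valid token and must remember to check `secureBootEnabled` itself; with `strictPolicy`, no token exists to misuse.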

The configuration-rule defaults give you sane behaviour out of the box: require_valid_aik_cert defaults to true and required_pcr_mask defaults to 0xFFFFFF (the first twenty-four PCRs must appear in the quote) [@maa-policy-v12].
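The arithmetic behind the mask is worth a quick check. The sketch below assumes the usual convention that bit n of the mask selects PCR n; `pcrsInMask` and `quoteCoversMask` are hypothetical helpers, not MAA functions.

```javascript
// Expand a PCR selection bitmask into PCR indices (bit n set => PCR n required).
function pcrsInMask(mask) {
  const indices = [];
  for (let bit = 0; bit < 32; bit++) {
    if ((mask >>> bit) & 1) indices.push(bit);
  }
  return indices;
}

// Check that a quote covers every PCR the mask requires.
function quoteCoversMask(quotedPcrs, mask) {
  const quoted = new Set(quotedPcrs);
  return pcrsInMask(mask).every((pcr) => quoted.has(pcr));
}

const required = pcrsInMask(0xffffff);
console.log(required.length);           // 24
console.log(required[0], required[23]); // 0 23
console.log(quoteCoversMask([0, 7], 0xffffff)); // false: most PCRs are missing
```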

Claim extraction uses JmesPath. The Learn page reproduces a Secure Boot detection rule that the verifier can use to flip a secureBootEnabled claim:

{`
// Query strings taken from Microsoft Learn (MAA policy v1.2 Secure Boot
// detection). This is JS-style pseudo-code that walks the rule structure,
// not runnable MAA syntax.

const policyRule = {
  segment: 'issuancerules',
  // "Claim rules" use JmesPath queries against parsed event data.
  step1: {
    when: 'type == "events" && issuer == "AttestationService"',
    add: 'efiConfigVariables',
    via: "Events[?EventTypeString == 'EV_EFI_VARIABLE_DRIVER_CONFIG' " +
         "&& ProcessedData.VariableGuid == '8BE4DF61-93CA-11D2-AA0D-00E098032B8C']"
  },
  // GUID 8BE4DF61-93CA-11D2-AA0D-00E098032B8C is the EFI Global Variable
  // namespace, which is where 'SecureBoot' lives.
  step2: {
    issue: 'secureBootEnabled',
    via: "[?ProcessedData.UnicodeName == 'SecureBoot'] " +
         "| length(@) == 1 && @[0].ProcessedData.VariableData == 'AQ'"
  },
  // 'AQ' is base64('\x01'), i.e. SecureBoot==1.
  fallback: { issue: 'secureBootEnabled', value: false }
};

console.log('Segment :', policyRule.segment); // issuancerules
console.log('Yields :', 'secureBootEnabled claim in JWT');
console.log('Lesson :', 'Add this to authorizationrules to actually fail!');
`}
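To see what the two queries compute, here is a plain-JavaScript rendering that uses `Array.prototype.filter` in place of a JmesPath engine. The sample event objects are invented for illustration, and `secureBootEnabled` is a hypothetical helper; only the field names and the GUID come from the rule shown above.

```javascript
// Plain-JS rendering of the two query steps, run against hand-made events.
const EFI_GLOBAL_VARIABLE = '8BE4DF61-93CA-11D2-AA0D-00E098032B8C';

function secureBootEnabled(events) {
  // Step 1: keep EFI driver-config variables in the global namespace.
  const efiConfigVariables = events.filter(
    (e) =>
      e.EventTypeString === 'EV_EFI_VARIABLE_DRIVER_CONFIG' &&
      e.ProcessedData.VariableGuid === EFI_GLOBAL_VARIABLE
  );
  // Step 2: exactly one SecureBoot variable whose data is base64 'AQ'.
  const secureBoot = efiConfigVariables.filter(
    (e) => e.ProcessedData.UnicodeName === 'SecureBoot'
  );
  return secureBoot.length === 1 && secureBoot[0].ProcessedData.VariableData === 'AQ';
}

// Invented sample log: one SecureBoot variable set to 0x01, one unrelated event.
const sampleEvents = [
  {
    EventTypeString: 'EV_EFI_VARIABLE_DRIVER_CONFIG',
    ProcessedData: { VariableGuid: EFI_GLOBAL_VARIABLE, UnicodeName: 'SecureBoot', VariableData: 'AQ' },
  },
  {
    EventTypeString: 'EV_EFI_VARIABLE_DRIVER_CONFIG',
    ProcessedData: { VariableGuid: EFI_GLOBAL_VARIABLE, UnicodeName: 'PK', VariableData: 'AA' },
  },
];

console.log(secureBootEnabled(sampleEvents));  // true
console.log(Buffer.from('AQ', 'base64')[0]);   // 1
```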

sequenceDiagram
  participant E as Evidence (SNP_REPORT + vTPM)
  participant C as configurationrules
  participant A as authorizationrules
  participant I as issuancerules
  participant J as Signed JWT
  E->>C: parse + defaults -- (require_valid_aik_cert, PCR mask)
  C->>A: typed claim set
  A-->>A: predicate checks
  alt All authorization rules pass
    A->>I: continue
    I->>J: mint claims (secureBootEnabled, x-ms-isolation-tee, ...)
    J-->>E: signed attestation token
  else Any authorization rule fails
    A-->>E: attestation rejected
  end

The two-axis privilege model: VMPL crossed with VTL

A common misconception is that a SEV-SNP CVM makes Virtualization-Based Security inside the guest redundant. The argument goes: "the whole VM is in a TEE, so why do I still need a Secure Kernel?" The architecture answers the question by saying that VMPL and VTL are orthogonal axes.

The VMPL axis addresses the cloud-operator threat model. VMPL0 (the OpenHCL paravisor) sees pages that the customer's kernel at VMPL2 does not, and the host hypervisor below VMPL0 sees none of the encrypted memory at all. VMPL keeps the operator out.

The VTL axis addresses the intra-guest threat model. Inside the guest, VTL1 hosts the Secure Kernel, IUM (Isolated User Mode) trustlets like LSAIso for Credential Guard, and the HVCI code-integrity verifier. VTL0 hosts the normal Windows kernel and user mode. VTL keeps a kernel-mode attacker out of LSA secrets and credential blobs. Without VTL, the customer's own kernel can read its own LSAIso heap; without VMPL, the hypervisor can read the customer's RAM.

VBS-inside-CVM is therefore not a duplication. It closes two different attack classes.
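The orthogonality can be made concrete with a toy access check. This is a deliberate simplification, not the hardware rule set: the actor objects, the `canRead` function, and the rule that a more privileged VMPL can always read a less privileged one's pages are assumptions made for illustration.

```javascript
// Toy access check for the two axes. Lower VMPL number = more privileged;
// higher VTL number = more privileged.
const HOST = { name: 'host hypervisor', insideTee: false };
const PARAVISOR = { name: 'OpenHCL', insideTee: true, vmpl: 0, vtl: 0 };
const SECURE_KERNEL = { name: 'Secure Kernel', insideTee: true, vmpl: 2, vtl: 1 };
const NT_KERNEL = { name: 'Windows kernel', insideTee: true, vmpl: 2, vtl: 0 };

function canRead(actor, page) {
  // VMPL axis: anything outside the TEE sees no private guest memory.
  if (!actor.insideTee) return false;
  // A more privileged VMPL can reach pages owned by a less privileged one.
  if (actor.vmpl < page.vmpl) return true;
  if (actor.vmpl > page.vmpl) return false;
  // Same VMPL: the VTL axis decides. VTL0 cannot read VTL1 pages.
  return actor.vtl >= page.vtl;
}

const lsaIsoHeap = { vmpl: 2, vtl: 1 };    // Credential Guard secrets
const paravisorPage = { vmpl: 0, vtl: 0 }; // paravisor-private memory

for (const actor of [HOST, PARAVISOR, SECURE_KERNEL, NT_KERNEL]) {
  console.log(actor.name, 'reads LSAIso heap:', canRead(actor, lsaIsoHeap));
}
```

The host is stopped by the first check and the VTL0 kernel by the last one, which is why removing either axis reopens one of the two attack classes.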

flowchart TB
  subgraph Host["Host below trust boundary"]
    H["Hyper-V host kernel -- (no access to encrypted RAM)"]
  end
  subgraph Boundary["Inside SEV-SNP / TDX trust boundary"]
    subgraph V0["VMPL0 / L1 TD partition"]
      P["OpenHCL paravisor -- (synthetic devices, vTPM)"]
    end
    subgraph V2["VMPL2 / L2 TD partition (customer guest)"]
      subgraph T1["VTL1 (Secure Kernel)"]
        SK["Secure Kernel -- + IUM trustlets: -- LSAIso, Credential Guard"]
      end
      subgraph T0["VTL0 (normal OS)"]
        W["Windows Server kernel -- + user mode"]
      end
    end
  end
  H -. "blocked by VMPL + -- RMP / PAMT" .-> P
  W -. "blocked by VTL 1 -- VBS / HVCI" .-> SK
  P --> V2

Confidential Containers: three Azure surfaces

Confidential VMs are not the only Azure surface where SEV-SNP attestation can land. Three container surfaces (ACI, AKS, and ARO) sit alongside them, plus a node-granularity CVM option for AKS.

Confidential Containers on Azure Container Instances (ACI), GA. Microsoft Learn: "Confidential containers on Azure Container Instances are deployed in a container group with a Hyper-V isolated TEE, which includes a memory encryption key generated and managed by an AMD SEV-SNP capable processor" [@msdocs-aci-confidential]. ACI Confidential Containers use confidential computing enforcement (CCE) policies generated by the confcom Azure CLI extension, and they expose SNP attestation reports for the SKR sidecar pattern.

Confidential Containers on AKS, preview, sunsetting. The Learn AKS page is explicit: "The Confidential Containers preview is set to sunset in March 2026. After this date, customers with existing Confidential Container node pools should expect to see reduced functionality, and you won't be able to spin up any new nodes with the KataCcIsolation runtime" [@msdocs-aks-confidential-containers]. Microsoft routes customers to four alternatives: Confidential VM AKS node pools, ACI Confidential Containers, ARO Confidential Containers, and the upstream Confidential Containers project [@msdocs-aks-confidential-containers].

Confidential VM AKS worker nodes, GA. A different model -- node-granularity CVM rather than per-pod CVM. Learn: "AKS now supports confidential VM node pools with Azure confidential VMs. These confidential VMs are the generally available DCasv5 and ECasv5 confidential VM-series using 3rd Gen AMD EPYC processors with Secure Encrypted Virtualization-Secure Nested Paging (SEV-SNP) security features" [@msdocs-aks-cvm-nodes]. This is a lift-and-shift path for existing AKS workloads.

Confidential Containers on ARO is the Red Hat OpenShift equivalent, with Kata-isolated per-container SEV-SNP enforcement.

The cross-cloud parallel is the CNCF Confidential Containers project, accepted to CNCF on March 8, 2022 at the Sandbox maturity level [@cncf-coco]. The project documentation describes it as "an open source project that brings confidential computing to Cloud Native environments, using hardware technology to protect complex workloads" [@coco-docs]. Trustee is the canonical attestation broker on the CNCF side. CoCo's substrate is Kata Containers' MicroVM model; the TEE backing is currently Linux-only. The open-source floor under all of this has three parts: Edgeless's Constellation [@constellation], COCONUT-SVSM (the AMD-side reference SVSM running at VMPL0) [@coconut-svsm], and the CoCo Trustee attestation broker. Constellation was historically the canonical confidential-Kubernetes distribution; its upstream repo was archived in 2025-2026, and Edgeless's successor project Contrast [@contrast] now carries the work forward at the workload-confidential-container layer rather than the whole-cluster layer.

NVIDIA H100 CC on NCCadsH100v5

The Azure NCCadsH100v5 SKU pairs an SEV-SNP CVM with an NVIDIA H100 in Confidential Compute mode and links the two attestations together. CPU-side rail 1 is SEV-SNP. GPU-side rail 1 is H100 CC. Rail 2 must compose both: the relying party only releases the workload's key if both attestations check out. Cross-vendor attestation composition is one of the open standards problems §9 will revisit.
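A relying party's view of that composition might look like the sketch below. It is an illustration only: the result objects, the field names, and the idea of binding both results to one challenge nonce are assumptions, not a description of how Azure or NVIDIA link the two attestations.

```javascript
// Toy composition of a CPU-side and a GPU-side attestation result.
function composeAttestation(cpuResult, gpuResult, expectedNonce) {
  const bothPass = cpuResult.ok === true && gpuResult.ok === true;
  // Both results must answer the same challenge; otherwise results from
  // different sessions could be mixed and matched.
  const sameSession =
    cpuResult.nonce === expectedNonce && gpuResult.nonce === expectedNonce;
  return bothPass && sameSession;
}

const nonce = 'challenge-123'; // hypothetical relying-party challenge
console.log(composeAttestation({ ok: true, nonce }, { ok: true, nonce }, nonce));  // true
console.log(composeAttestation({ ok: true, nonce }, { ok: false, nonce }, nonce)); // false
```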

flowchart TB
  subgraph S["Silicon"]
    AMD["AMD-SP firmware -- + SEV-SNP RMP"]
    INTEL["Intel TDX Module -- (SEAM, SEAMRR)"]
  end
  subgraph H["Host"]
    HV["Azure Hyper-V -- (below trust boundary)"]
  end
  subgraph P["Paravisor (in TCB)"]
    OH["OpenHCL on OpenVMM -- VMPL0 / L1 TD seat"]
    VT["vTPM synthesised -- by paravisor"]
  end
  subgraph G["Customer guest"]
    WS["Windows Server CVM -- (VTL0 + VTL1, VBS / HVCI)"]
  end
  subgraph V["Verifier"]
    MAA["Microsoft Azure Attestation -- (policy v1.2)"]
  end
  subgraph R["Relying party"]
    AKV["Azure Key Vault / -- Managed HSM (SKR)"]
    APP["Customer application"]
  end
  AMD --> HV
  INTEL --> HV
  HV --> OH
  OH --> VT
  OH --> WS
  WS -- "SNP_REPORT -- or TD Quote -- + vTPM quote" --> MAA
  MAA -- "Signed JWT" --> AKV
  AKV --> APP

That is the Azure stack. But Azure is not the only design point -- Google and AWS chose different glue, and one of them is on a fundamentally different threat model. How do they compare?

7. Competing approaches

Three competitors share the design space with very different choices. Two are near-peers to Azure; one is a fundamentally different model that customers routinely confuse for the same product.

Google Cloud Confidential VMs

Google Cloud supports the same two CPU TEEs. The GCP Confidential VM docs are explicit: "AMD Secure Encrypted Virtualization-Secure Nested Paging (SEV-SNP) expands on SEV, adding hardware-based security to help prevent malicious hypervisor-based attacks like data replay and memory remapping. Attestation reports can be requested at any time directly from the AMD Secure Processor" [@gcp-cvm-overview]. And on the Intel side: "Intel Trust Domain Extensions (TDX) creates an isolated trust domain (TD) within a VM, and uses hardware extensions for managing and encrypting memory" [@gcp-cvm-overview].

GCP's machine-type mapping is direct. AMD SEV / SEV-SNP runs on N2D and C3D; Intel TDX runs on C3 Confidential VMs. The Confidential Computing product hub lists "Confidential VMs on the C3 machine series brings hardware-level protection to your AI models and data" and "Confidential VMs on the accelerator-optimized A3 machine series with NVIDIA H100 GPUs" as the parallel GPU-CC product [@gcp-confidential-overview]. There is a Confidential Space product on top for multi-party analytics, plus Confidential GKE Nodes and Confidential Dataflow.

The verifier-of-record is Google's own attestation service, with the guest's vTPM as the default trust root. Intel Trust Authority is supported as a plug-in alternative for TDX evidence.

The GCP Confidential VM docs make a claim Azure does not match: "AMD SEV machines that use the N2D and C3D machine types support live migration" [@gcp-cvm-overview]. Live migration of a confidential VM is genuinely hard: the encrypted state has to be re-keyed under the destination host's per-VM key, and the integrity-rail structures (RMP entries) have to be coherently re-established without ever exposing the plaintext to either host. AMD's SEV migration helper is the underlying mechanism. Azure does not currently expose live migration on its confidential VM SKUs. This is the most operationally consequential cross-cloud difference today.

A small correction to a widely repeated framing. It is sometimes said that GCP's confidential offerings are "also SEV-SNP". Per the GCP docs, GCP supports both SEV-SNP and TDX [@gcp-cvm-overview]. If you are picking a CVM cloud for a multi-vendor strategy, treat GCP as a near-peer to Azure on the CPU dimension and differentiate on the verifier, the SKU mapping, and the live-migration story instead.

AWS Nitro Enclaves: a genuinely different model

The most common confusion in this design space is the assumption that AWS Nitro Enclaves is "AWS's confidential VM product." It is not. It is a different model on a different threat boundary.

The Nitro Enclaves user guide is unambiguous about the threat model. "AWS Nitro Enclaves is an Amazon EC2 feature that allows you to create isolated execution environments ... Enclaves are separate, hardened, and highly-constrained virtual machines. They provide only secure local socket connectivity with their parent instance. They have no persistent storage, interactive access, or external networking" [@aws-nitro-enclaves]. The same page continues: "Nitro Enclaves is processor agnostic and it is supported on most Intel, AMD, and AWS Graviton-based Amazon EC2 instance types built on the AWS Nitro System" [@aws-nitro-enclaves]. And: "Nitro Enclaves use the same Nitro Hypervisor technology that provides CPU and memory isolation for Amazon EC2 instances" [@aws-nitro-enclaves].

Three differences matter.

First, there is no CPU memory cipher. Isolation is enforced by the Nitro hypervisor on a dedicated Nitro System card, not by SEV-SNP or TDX. Memory is in the clear in DRAM, just architecturally walled off by the hypervisor and the hardware root of trust below it.

Second, attestation signs through the Nitro hypervisor and integrates with AWS KMS. There is no VCEK or TDX Quoting Enclave.

Third, the threat model is parent-instance and co-tenant isolation, not cloud-operator isolation. Amazon is in the TCB by design. A subpoena served on the operator, or a compromised operator insider, is within the threat model of Azure / GCP CVMs and outside the threat model of Nitro Enclaves.

Note: If your threat model includes a malicious or compelled cloud operator, AWS Nitro Enclaves does not protect you. The Nitro hypervisor enforces the enclave boundary; it is software AWS owns and operates. Use Nitro Enclaves for what it is good at -- a hardened compartment for key material against your own parent instance and your own application bugs. Use SEV-SNP / TDX on Azure or GCP if you need cryptographic protection against the operator's hypervisor [@aws-nitro-enclaves].

Nitro Enclaves still has a role: it is excellent at isolating a long-lived signing service from a more loosely audited application instance, and four enclaves per parent EC2 instance is a generous concurrency budget for that pattern.

Confidential Containers and NVIDIA H100 CC

The Confidential Containers project crosses cloud boundaries. CNCF accepted it in March 2022 [@cncf-coco]. The project docs describe it as "an open source project that brings confidential computing to Cloud Native environments, using hardware technology to protect complex workloads" [@coco-docs]. The Azure surfaces (ACI, AKS, ARO) were covered in §6; the equivalent on AWS is the Kata Containers + Confidential Containers combination on top of bare-metal Nitro hosts, and on GCP it lands on Confidential GKE Nodes.

The NVIDIA H100 CC story is roughly cross-cloud parity. Azure NCCadsH100v5 pairs SEV-SNP with H100 CC; Google's A3 series pairs SEV-SNP and TDX with H100 CC. Cross-vendor attestation composition is the open standards problem on which the relying party experience still depends. On the silicon side, Arm's Confidential Compute Architecture (CCA, with the Realm Management Extension) is the Arm-side analogue of SEV-SNP/TDX, and Apple's Secure Enclave Processor is a board-scoped TEE with a different form factor; both are adjacent VM-scoped or board-scoped TEE designs but out of scope for the cloud-CVM body of this article.

The head-to-head matrix

Dimension	Azure CVM	GCP CVM	AWS Nitro Enclaves	Confidential Containers
CPU TEE	SEV-SNP, Intel TDX	SEV / SEV-SNP, Intel TDX	None (Nitro hypervisor)	SEV-SNP, TDX (varies by host)
Memory cipher	AES (per-VM, per-TD)	AES (per-VM, per-TD)	None (host RAM)	Inherited from host TEE
Integrity rail	RMP (AMD), PAMT (Intel)	RMP, PAMT	Nitro hypervisor isolation	Inherited from host TEE
Attestation evidence	SNP_REPORT, TD Quote, vTPM quote	SNP_REPORT, TD Quote, vTPM	Nitro attestation document	TEE evidence + container measurement
Verifier	Microsoft Azure Attestation	Google attestation, Intel Trust Authority	AWS KMS	Trustee (CNCF)
Operator threat model	Yes (operator excluded)	Yes (operator excluded)	No (Nitro in TCB)	Yes (operator excluded)
Lift-and-shift Windows	Yes	Yes	No (custom enclave format)	Linux containers only
Live migration of CVM	No	Yes (SEV on N2D / C3D)	N/A	No
2024-era CVE exposure	CacheWarp, WeSee, Heckler (SEV-SNP); Heckler (TDX)	Same upstream CVEs	Distinct (Nitro hypervisor)	Inherited from host TEE
Granularity	Whole VM, container	Whole VM	Per enclave (up to 4 per parent instance)	Per pod / per container
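Three rows of the matrix are effectively yes/no filters, so a shortlist falls out mechanically. The sketch below encodes only the "Operator threat model", "Lift-and-shift Windows", and "Live migration of CVM" rows as transcribed from the table; the property names and the `shortlist` helper are hypothetical.

```javascript
// Three matrix rows reduced to booleans. null stands for "N/A".
const platforms = [
  { name: 'Azure CVM', operatorExcluded: true, liftAndShiftWindows: true, liveMigration: false },
  { name: 'GCP CVM', operatorExcluded: true, liftAndShiftWindows: true, liveMigration: true },
  { name: 'AWS Nitro Enclaves', operatorExcluded: false, liftAndShiftWindows: false, liveMigration: null },
  { name: 'Confidential Containers', operatorExcluded: true, liftAndShiftWindows: false, liveMigration: false },
];

// Return the platform names that satisfy every required property.
function shortlist(requirements) {
  return platforms
    .filter((p) => requirements.every((key) => p[key] === true))
    .map((p) => p.name);
}

console.log(shortlist(['operatorExcluded', 'liftAndShiftWindows'])); // Azure CVM, GCP CVM
console.log(shortlist(['operatorExcluded', 'liveMigration']));       // GCP CVM
```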

flowchart LR
  Nitro["AWS Nitro Enclaves -- (parent-instance threat model)"]
  Azure["Azure / GCP CVMs -- (cloud-operator threat model, -- whole VM)"]
  CoCo["Confidential Containers -- (per pod / per container)"]
  H100["NVIDIA H100 CC -- (CPU + GPU linked TEE)"]
  Nitro --- Azure
  Azure --- CoCo
  CoCo --- H100

If the contract is settled and the products ship, what is still wrong with this picture? Why do four published papers in 2024 demonstrate extracting secrets from a fully-patched SEV-SNP CVM?

8. Theoretical limits and the 2024 attack class

May 2, 2024. ETH Zurich's ZISC group publishes the Ahoi family of attacks. The lab's announcement is brisk: "Researchers from the SECTRS group have now discovered a new class of attacks, dubbed Ahoi attacks, that exploit vulnerabilities in the notification framework in Intel TDX and AMD SEV-SNP. ... the vulnerabilities are tracked under 2 CVEs: CVE-2024-25744, CVE-2024-25743" [@eth-ahoi-news] (with CVE-2024-25742 covering WeSee). WeSee won the Distinguished Paper Award at IEEE S&P 2024 [@ahoi-wesee]. Heckler appeared at USENIX Security 2024 [@heckler-usenix]. CISPA's CacheWarp, also at USENIX Security 2024, cross-cut both [@cachewarp-usenix].

Four 2024-era papers attacking shipping confidential VMs, and a key observation: none of them broke the Generation-2 integrity rail itself. They all exploit seams around it.

Trusted Computing Base accounting

The irreducible silicon-vendor trust root is non-zero by design. On SEV-SNP the customer must trust AMD-SP firmware and the ECDSA-P384 VCEK chain rooted at AMD. On TDX the customer must trust the signed TDX Module binary and the SGX-resident Quoting Enclave's signing root rooted at Intel. On Azure the customer additionally trusts Microsoft's signed OpenHCL binary -- with the consolation that OpenHCL is open source and reviewable [@openhcl-blog; @openvmm-repo]. The verifier (MAA, Intel Trust Authority, Google's verifier) is a separate trust component the relying party must extend.

Trusted Computing Base (TCB): The set of hardware, firmware, and software components whose correct operation is necessary for a system to enforce its security properties. For an Azure SEV-SNP CVM the TCB is the AMD silicon, the AMD-SP firmware, the OpenHCL paravisor binary, and Microsoft Azure Attestation acting as the verifier. The TCB cannot be empty; the goal is to make it small, auditable, and named [@amd-snp-whitepaper; @openhcl-blog].

The lower bound on TCB is at least one signing root the customer cannot independently rebuild from public artefacts. Reproducible-build transparency over the AMD-SP firmware and the Intel TDX Module is one of the open standards problems on the 2026 frontier. The Google-Intel joint TDX security review from April 2023 is the best public substitute for a reproducible build of the TDX Module today [@gcp-tdx-review].

The 2024 attack class, in order of architectural depth

CacheWarp (USENIX Security 2024; CVE-2023-20592; AMD-SB-3005). A software fault injection. The mechanism, in NVD's verbatim language: "Improper or unexpected behavior of the INVD instruction in some AMD CPUs may allow an attacker with a malicious hypervisor to affect cache line write-back behavior of the CPU leading to a potential loss of guest virtual machine (VM) memory integrity" [@nvd-cve-2023-20592]. The project page is plain: "CacheWarp is a new software fault attack on AMD SEV-ES and SEV-SNP. It allows attackers to hijack control flow, break into encrypted VMs, and perform privilege escalation inside the VM" [@cachewarp-site]. The CacheWarp authors -- Ruiyi Zhang, Lukas Gerlach, Daniel Weber, Lorenz Hetterich (CISPA), Youheng Lü (Independent), Andreas Kogler (Graz), Michael Schwarz (CISPA) -- demonstrated full RSA key recovery from Intel IPP, passwordless OpenSSH login, and sudo-to-root escalation [@cachewarp-usenix]. SEV-SNP is affected; the fix is the AMD microcode update tracked by AMD-SB-3005 [@amd-sb-3005].

WeSee (IEEE S&P 2024 Distinguished Paper; CVE-2024-25742). A malicious #VC injection. The hypervisor coerces the guest's #VC handler into doing the wrong thing by injecting a #VC at a moment the guest does not expect one. The arXiv abstract is verbatim: "We present WeSee attack, where the hypervisor injects malicious #VC into a victim VM's CPU to compromise the security guarantees of AMD SEV-SNP. ... WeSee can leak sensitive VM information (kTLS keys for NGINX), corrupt kernel data (firewall rules), and inject arbitrary code (launch a root shell from the kernel space)" [@wesee-arxiv]. SEV-SNP only.

The arXiv citation_author metadata for 2404.03526 enumerates the WeSee co-authors as Schlueter, Sridhara, Bertschi, Shinde [@wesee-arxiv]. Earlier writeups, including some upstream pipeline stages of this article, listed the third co-author as "Wilke." This was an inadvertent crossover from the SEVurity author list. The canonical author list, retrieved by querying the arXiv abstract page's citation_author meta tags, names Andrin Bertschi (ETH Zurich), which matches the project page on ahoi-attacks.github.io/wesee/ [@ahoi-wesee]. This article reflects the corrected attribution.

Heckler (USENIX Security 2024; CVE-2024-25743, CVE-2024-25744). A malicious non-timer interrupt injection. The hypervisor injects int 0x80 or a signal-mapped exception into the guest at a moment that breaks an invariant. The Ahoi Heckler page captures the scope: "All Intel TDX and AMD SEV-SNP processors are vulnerable to Heckler" [@ahoi-heckler]. The arXiv extended version demonstrates "Heckler on OpenSSH and sudo to bypass authentication. On AMD SEV-SNP we break execution integrity of C, Java, and Julia applications that perform statistical and text analysis" [@heckler-arxiv]. Mitigations are kernel-side interrupt filtering plus AMD's protected interrupt delivery feature.

Ahoi Attacks (umbrella). The family page describes scope: "Ahoi Attacks is a family of attacks on Hardware-based Trusted Execution Environments (TEEs) to break AMD SEV-SNP, Intel TDX and Intel SGX" [@ahoi-site]. The ZISC news framing names the SECTRS group at ETH Zurich (Shweta Shinde's lab) as the locus [@eth-ahoi-news].

One Glitch to Rule Them All (CCS 2021). The physical-fault lower bound established in §3, included here for completeness. Buhren et al. voltage-glitched the AMD-SP on Zen 1 / 2 / 3 to execute custom payloads and to "reverse-engineer the Versioned Chip Endorsement Key (VCEK) mechanism introduced with SEV Secure Nested Paging (SEV-SNP)" [@one-glitch-arxiv]. Supplemental tooling is available in the PSPReverse GitHub artefact [@pspreverse-github]. With physical access and the right glitcher, the AMD-SP is breakable.

"SEV cannot adequately protect confidential data in cloud environments from insider attackers, such as rogue administrators, on currently available CPUs." -- Buhren, Jacob, Krachenfels, Seifert, *One Glitch to Rule Them All*, 2021 [@one-glitch-arxiv]

flowchart TB INTG["Generation-2 integrity rail -- (RMP / PAMT)"] INVD["CacheWarp -- CVE-2023-20592 -- INVD seam -- (SEV-ES, SEV-SNP)"] VC["WeSee -- CVE-2024-25742 -- #VC handler seam -- (SEV-SNP)"] INT["Heckler -- CVE-2024-25743/4 -- Interrupt-injection seam -- (SEV-SNP, TDX)"] GLITCH["One Glitch -- Physical voltage-fault -- (AMD-SP firmware)"] INTG -. "intact" .-> INVD INTG -. "intact" .-> VC INTG -. "intact" .-> INT INTG -. "intact" .-> GLITCH

Composition limits and operational corollaries

Can the verifier itself be a CVM? Can SKR survive a verifier compromise? These are open standards questions; the Confidential Computing Consortium is iterating on them and there is no settled answer. What does exist is operational guidance.

Note: Every 2024-era SEV-SNP and TDX attack has a corresponding microcode or firmware update with a higher TCB SVN. Policies that accept "any TCB SVN at or above the floor of last year's launch" leave the door open to CacheWarp-class CPUs. Bind your MAA policy to tcb_version >= latest_advisory and update the floor when AMD or Intel publishes a new security bulletin [@amd-sb-3005; @nvd-cve-2023-20592].

Confidential VMs do not promise side-channel resistance. They promise that the hypervisor cannot directly read memory and that an integrity-broken page cannot be silently substituted. The current equilibrium against the 2024 attack class is patch-after-disclosure plus attestation-policy hygiene. That equilibrium is itself an architectural statement.

Key idea: The 2024 attacks do not break the SEV-SNP or TDX integrity rail. They exploit seams around the rail: the INVD instruction, the #VC handler, the interrupt-injection path, and the physical AMD-SP. The architecture is settled. The residuals are the work.

The architecture is settled; the residuals are open. What is the 2026 research frontier actually working on?

9. Open problems

Six open problems shape the 2026 confidential-VM research frontier.

OP1. Nested CVMs. Intel TDX Module 1.5 ships TD Partitioning, where an L1 TD can host L2 TDs of its own [@intel-tdx-td-partitioning-354807]. AMD's analogue is the VMPL0 / VMPL2 layout that Azure OpenHCL already exploits. The portable cross-vendor formulation -- nested-CVM evidence that composes both vendors' attestation reports into a single relying-party-checkable artefact -- is not yet standardised. Customers who want a verifier-inside-a-CVM design must build the composition themselves.

OP2. Cross-vendor attestation composition for CPU+GPU CVMs. Azure NCCadsH100v5 and GCP A3 already compose AMD or Intel CPU attestation with NVIDIA H100 GPU attestation in production. The relying party today consumes two separate evidence packages and runs two separate policy evaluations. The RATS working group's RFC 9711 (The Entity Attestation Token, EAT) [@rfc9711] is the canonical wire-format vocabulary -- a JWT- or CWT-encoded attested claims set -- that a Passport-topology verifier such as Microsoft Azure Attestation produces. EAT is the path to a single composed evidence package, but the cross-vendor standards work is unsettled.

OP3. Transparency and reproducible builds of the AMD-SP firmware and the Intel TDX Module. Both are signed binaries customers trust but do not build. Google's April 2023 joint security review of TDX, authored by Erdem Aktas, Cfir Cohen, Josh Eads (Google Cloud Security), James Forshaw, and Felix Wilhelm (Google Project Zero), enumerated specific vulnerabilities including "Non-Persistent SEAM Loader, Exit Path Interrupt Hijacking, Unsafe Performance Monitoring VMCS Configuration" [@gcp-tdx-review]. That review is the closest thing to public auditability the TDX Module has today. A reproducible build with binary transparency log (rekor-style) would close the residual auditability gap that even open-source OpenHCL leaves on the table for the silicon vendor's firmware.

OP4. Post-quantum attestation signatures. SNP_REPORT signs with ECDSA-P384. TD Quotes are Intel-signed with RSA / ECDSA. The NIST FIPS 204 (ML-DSA) and FIPS 205 (SLH-DSA) standards are final, but vendor-side migration of the CVM signing roots has not been announced for either AMD or Intel. The deployment-feasible path is dual-signing: the SNP_REPORT or TD Quote carries both an ECDSA signature and an ML-DSA signature, the verifier accepts either, and the relying party gates on whichever signing root it trusts most. The transition is non-trivial because the VCEK derivation itself uses a classical KDF chain rooted in classical entropy.

OP5. Side-channel-resistant CVMs at deployment scale. The CacheWarp, WeSee, Heckler, and Ahoi family is the active frontier. The current operational equilibrium is policy-pinning to the latest TCB SVN plus microcode-update discipline. There is no production CVM architecture that promises constant-time execution across the integrity rail or that closes the cache-side and notification-injection seams at the silicon layer. The 2026 frontier is what architectural mitigations look like, not what microcode patches catch up to.

OP6. Confidential container portability after AKS KataCcIsolation sunset (March 2026). The Azure CoCo surface fragments into ACI per-pod CVM, ARO per-container CVM, AKS Confidential VM node pools at node granularity, and the upstream CoCo project [@msdocs-aks-confidential-containers]. Customers picking a confidential-containers strategy today need to plan for one of those four routes; the CoCo project itself is Linux-only as of 2026-05. Windows confidential containers remain out of scope on every shipping cloud.

This article does not deep-cover Intel SGX (the sibling enclave article handles that), ARM Confidential Compute Architecture (CCA) or Apple's Secure Enclave Processor (different threat models and form factors), the full text of the TDX Module Architecture Specification (it is 285 pages [@intel-tdx-spec-344425]; this article cites the load-bearing parts), the regulatory and sovereign-cloud framing of CVMs (a separate topic), or the application-level patterns for designing a customer service to be SKR-aware (an operations topic for a future post).

flowchart LR OP1["OP1 -- Nested CVMs -- (TD Part. / VMPL)"] OP2["OP2 -- Cross-vendor -- attestation composition"] OP3["OP3 -- Firmware transparency -- + reproducible build"] OP4["OP4 -- PQ signatures -- (ML-DSA / SLH-DSA)"] OP5["OP5 -- Side-channel- -- resistant CVMs"] OP6["OP6 -- CoCo portability -- (post-March-2026)"] OP1 --- OP2 OP3 --- OP4 OP5 --- OP6

If you are deploying today, what should you do this quarter? The next section is a practical walk-through that ties the architecture to a runnable workflow.

10. Practical guide: VBS-inside-CVM end-to-end

Six steps move you from a credit-card swipe to a Windows Server CVM that runs an attested workload with HSM-backed key release. Treat the list as a checklist; each step is a place where the architecture from the previous sections becomes operational.

Step 1. Provision the CVM. Pick an SEV-SNP SKU (DCasv5 or DCasv6 preview), a supported Windows Server image (2019, 2022, or 2025), and turn on Confidential OS-disk encryption with a customer-managed key in Azure Key Vault or Managed HSM. Bind the key to an MAA-aware release policy. The Learn CVM overview describes the SKU family and the OS-image support [@msdocs-azure-cvm]. Plan for the March 30, 2026 encrypted-OS-disk pricing change [@msdocs-azure-cvm].

Step 2. Confirm VBS inside the CVM. A common misconception is that turning on SEV-SNP makes Virtualization-Based Security redundant. It does not -- VMPL and VTL are orthogonal. From an elevated PowerShell session:

Note: Get-CimInstance -Namespace Root\Microsoft\Windows\DeviceGuard -ClassName Win32_DeviceGuard should return VirtualizationBasedSecurityStatus = 2 (running) and a non-empty SecurityServicesRunning array that includes Credential Guard and HVCI. This proves that VTL1 / VTL0 separation is intact inside the SEV-SNP trust boundary -- the cloud operator is excluded by VMPL, and the customer's own user mode and ring-0 are excluded from the Secure Kernel by VTL.
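The check in the note above is easy to script once the `Win32_DeviceGuard` result has been exported. A minimal sketch in Python, assuming the result has already been turned into a dictionary (for example with `ConvertTo-Json` on the PowerShell side). The function name and the dictionary shape are illustrative assumptions; the numeric codes follow Microsoft's documented values for this class (status 2 = running, service 1 = Credential Guard, service 2 = HVCI).

```python
# Hypothetical helper: interprets a Win32_DeviceGuard result that has already
# been exported to a dict (e.g. from PowerShell via ConvertTo-Json).
VBS_RUNNING = 2        # VirtualizationBasedSecurityStatus: 2 = running
CREDENTIAL_GUARD = 1   # SecurityServicesRunning code for Credential Guard
HVCI = 2               # SecurityServicesRunning code for HVCI


def vbs_is_healthy(device_guard: dict) -> bool:
    """Return True only if VBS is running with Credential Guard and HVCI."""
    status = device_guard.get("VirtualizationBasedSecurityStatus")
    running = set(device_guard.get("SecurityServicesRunning") or [])
    return status == VBS_RUNNING and {CREDENTIAL_GUARD, HVCI} <= running


if __name__ == "__main__":
    sample = {
        "VirtualizationBasedSecurityStatus": 2,
        "SecurityServicesRunning": [1, 2],
    }
    print(vbs_is_healthy(sample))
```

Treating a missing or empty `SecurityServicesRunning` array as a failure keeps the check fail-closed, which is the behaviour you want in a provisioning gate.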

Step 3. Capture an attestation token and walk it by hand. Use the Azure Attestation client (Microsoft.Azure.Attestation) to send the guest's SNP_REPORT and vTPM quote to the regional MAA endpoint. Inspect the returned JWT. The decoded claim set will include x-ms-isolation-tee describing the TEE (SEV-SNP or TDX), x-ms-runtime describing the guest configuration, the boot measurements, and any custom claims your policy mints. Verify the JWT signature against the region's MAA signing certificate -- not against an arbitrary trusted root; this is the verifier-identity hygiene that closes the SKR loop.

A valid MAA JWT will contain `x-ms-attestation-type = sevsnpvm` (or `tdxvm`) and an `x-ms-compliance-status = azure-compliant-cvm` claim. If either is missing or has a different value, the policy did not gate on the TEE and the relying party is about to release a key against unattested evidence.
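To walk a captured token by hand, it helps to decode the payload and test those two claims directly. The sketch below uses only the Python standard library and deliberately does NOT verify the signature, so it is an inspection aid and not a relying-party check. It assumes both claims sit at the top level of the payload, as the text describes; if your provider nests them, adjust the lookup. The function names are illustrative.

```python
import base64
import json

# Claim values taken from the text above.
EXPECTED_TYPES = {"sevsnpvm", "tdxvm"}
EXPECTED_COMPLIANCE = "azure-compliant-cvm"


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a compact JWT WITHOUT verifying it."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT (expected three segments)")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def tee_claims_present(claims: dict) -> bool:
    """True if the token names a known TEE and reports CVM compliance."""
    return (
        claims.get("x-ms-attestation-type") in EXPECTED_TYPES
        and claims.get("x-ms-compliance-status") == EXPECTED_COMPLIANCE
    )
```

In production the signature check against the regional MAA signing certificate has to come first; claims from an unverified token prove nothing.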

Step 4. Author the policy. Write an MAA policy v1.2 file with four pieces. First, a configuration-rules block that keeps the defaults: require_valid_aik_cert=true and required_pcr_mask=0xFFFFFF [@maa-policy-v12]. Second, an authorization-rules block that requires (a) x-ms-attestation-type == "sevsnpvm", (b) the SNP_REPORT measurement matches a known reference value for the customer's golden image, (c) the vTPM PCR-7 matches a known Secure Boot signer baseline, and (d) the VBS-enabled claim is true. Third, an issuance-rules block that mints a customer-workload-tier claim from the SNP_REPORT's tcb_version. Fourth, a version declaration of 1.2. Then bind your HSM key's release policy to require the issuance-rule claim plus the authorization-rule pass.
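Before committing the four authorization conditions to policy syntax, it can help to state them as plain logic and run captured claims through them. The following Python sketch mirrors conditions (a) through (d); it is not MAA policy grammar. Only `x-ms-attestation-type` is a claim name from the text; `snp_measurement`, `pcr7`, and `vbs_enabled` are hypothetical keys standing in for whatever your evidence actually exposes.

```python
from dataclasses import dataclass
import hmac


@dataclass(frozen=True)
class Baseline:
    golden_measurement: str  # hex launch measurement of the golden image
    secure_boot_pcr7: str    # hex PCR-7 value for the Secure Boot signer


def _same(value, expected: str) -> bool:
    """Case-insensitive, constant-time comparison of two hex strings."""
    if not isinstance(value, str):
        return False
    return hmac.compare_digest(value.lower().encode(), expected.lower().encode())


def authorize(claims: dict, baseline: Baseline) -> list[str]:
    """Return the failed rules; an empty list means every rule passed."""
    failures = []
    if claims.get("x-ms-attestation-type") != "sevsnpvm":
        failures.append("a: attestation type")
    if not _same(claims.get("snp_measurement"), baseline.golden_measurement):
        failures.append("b: launch measurement")
    if not _same(claims.get("pcr7"), baseline.secure_boot_pcr7):
        failures.append("c: PCR-7")
    if claims.get("vbs_enabled") is not True:
        failures.append("d: VBS")
    return failures
```

Returning the list of failed rules, and not a bare boolean, makes pre-production replay more useful: you see which condition rejected the evidence.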

Note: Use az attestation policy set to upload the policy to a non-production attestation provider and replay captured evidence through attestationProvider REST endpoints. This lets you iterate on JmesPath claim rules without rebooting CVMs. Pre-production failures here are cheap; failures after SKR binding are expensive [@maa-policy-v12].

Step 5. Repeat on a TDX SKU. Provision a DCesv5 or DCesv6 (preview) CVM. The attestation evidence shape changes: TDX evidence carries MRTD plus RTMR0-3 instead of a single SNP measurement, and the claims JSON shape differs. The JmesPath rules in your policy must be parameterised on productId to handle both TEEs from one policy file, or split into two policy files keyed by attestation provider region and TEE type [@intel-tdx-overview; @maa-policy-v12].
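The difference in evidence shape is easiest to see as a dispatch on TEE type. A minimal Python sketch, assuming the evidence has already been parsed into a dictionary; the keys `measurement`, `mrtd`, and `rtmr`, and the reference values themselves, are placeholders for illustration and do not reflect the actual MAA claims JSON.

```python
SNP = "sevsnpvm"
TDX = "tdxvm"

# Placeholder reference values (48-byte digests as hex); real ones come from
# your golden-image build pipeline.
REFERENCE = {
    SNP: {"measurement": "aa" * 48},
    TDX: {
        "mrtd": "bb" * 48,
        "rtmr": ["c0" * 48, "c1" * 48, "c2" * 48, "c3" * 48],  # RTMR0-3
    },
}


def measurements_match(tee: str, evidence: dict, reference: dict = REFERENCE) -> bool:
    """Compare evidence against reference values for the given TEE type."""
    ref = reference.get(tee)
    if ref is None:
        return False  # unknown TEE type: fail closed
    if tee == SNP:
        # SEV-SNP: a single launch measurement
        return evidence.get("measurement") == ref["measurement"]
    if tee == TDX:
        # TDX: build-time MRTD plus the four runtime registers, in order
        return (
            evidence.get("mrtd") == ref["mrtd"]
            and evidence.get("rtmr") == ref["rtmr"]
        )
    return False
```

Whether you keep this branching in one policy file or split it across two, the reference-value table has to be maintained per TEE; a golden image produces different measurements on SEV-SNP and TDX.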

Step 6. Plan TCB SVN hygiene. Treat the TCB SVN floor in your policy as a moving target, not a one-time configuration. Subscribe to the AMD security bulletins and the Intel TDX security advisories. When CacheWarp's microcode shipped via AMD-SB-3005 [@amd-sb-3005], the appropriate operational response was to raise the policy's TCB SVN floor to the new microcode level, not to leave the floor at the launch baseline. This is the single most important operational habit a CVM customer can adopt.

Note: A policy that accepts the launch-baseline TCB SVN forever is a policy that grandfathers in every known CVE the silicon vendor has shipped a microcode patch for. The 2024 attack class makes this a load-bearing operational discipline, not a footnote [@nvd-cve-2023-20592; @amd-sb-3005].
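The floor comparison itself deserves care, because the SNP `tcb_version` is a packed field and not a single counter. The Python sketch below assumes the 64-bit layout from AMD's SEV-SNP firmware ABI for pre-Turin parts (boot loader in bits 7:0, TEE in 15:8, SNP in 55:48, microcode in 63:56); confirm the layout against the ABI revision for your CPU generation before relying on it. The type and function names are illustrative.

```python
from typing import NamedTuple


class TcbVersion(NamedTuple):
    boot_loader: int
    tee: int
    snp: int
    microcode: int


def parse_tcb(raw: int) -> TcbVersion:
    """Unpack a 64-bit tcb_version (layout is an assumption, see lead-in)."""
    return TcbVersion(
        boot_loader=raw & 0xFF,
        tee=(raw >> 8) & 0xFF,
        snp=(raw >> 48) & 0xFF,
        microcode=(raw >> 56) & 0xFF,
    )


def meets_floor(reported: TcbVersion, floor: TcbVersion) -> bool:
    """Every component must be at or above the corresponding floor value."""
    return all(r >= f for r, f in zip(reported, floor))


if __name__ == "__main__":
    floor = TcbVersion(boot_loader=3, tee=0, snp=0x14, microcode=0xD3)
    print(meets_floor(parse_tcb(0xD315000000000004), floor))
```

The comparison is component-wise on purpose. Comparing the raw 64-bit integers would let a high microcode byte mask an SNP firmware SVN that is below the floor.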

You can build it today. The FAQ below answers the questions readers most often ask after they have built it.

11. FAQ and closing

Architecturally, the host hypervisor cannot read your encrypted RAM and cannot silently remap pages without triggering an RMP or PAMT fault [@amd-sev-portal; @intel-tdx-overview]. Operationally, the verifier (Microsoft Azure Attestation) is run by Microsoft, the paravisor (OpenHCL) is built by Microsoft, and the silicon is signed by AMD or Intel. You must still trust those components. The lower bound on TCB is at least the silicon vendor's signing root plus at least one verifier; you can shrink the *verifier* trust by using a third party (Intel Trust Authority for TDX, or your own deployment of an attestation broker), but you cannot shrink the silicon-vendor root [@msdocs-maa-overview].

No. VMPL (the SEV-SNP privilege axis) and VTL (the in-guest Virtualization-Based Security axis) are orthogonal -- VMPL gates the *operator*; VTL gates the *guest kernel*. See §6 for the full two-axis treatment; a Windows Server CVM should run with VBS, HVCI, and Credential Guard enabled inside the guest exactly as it would outside a CVM [@msdocs-azure-cvm].

No. The Nitro hypervisor enforces the enclave boundary in software AWS owns and operates; there is no CPU-level memory cipher, and the threat model is parent-instance isolation rather than cloud-operator isolation. See §7 for the three architectural differences and the operator-trustless callout [@aws-nitro-enclaves].

Yes, with limits. The attestation surface changes: the SNP_REPORT measurement (or MRTD plus RTMR extensions on TDX) now reflects your custom image. Your MAA policy must whitelist the new measurement values or use issuance-rule projection to bind to attributes you control. You cannot bypass the paravisor without abandoning the OpenHCL-mediated vTPM, which removes the chained vTPM-quote to silicon path most customers depend on [@msdocs-azure-cvm; @openhcl-blog].

Yes -- transitively, through the paravisor. See §6 for the full `vTPM quote -> EK certificate -> SNP_REPORT or TD Quote -> VCEK or Intel signing root` chain, and read it end-to-end before you accept a vTPM quote as silicon-bound [@msdocs-azure-cvm].

Node-granularity CVM versus per-pod CVM. Confidential VM AKS node pools put each worker node inside an SEV-SNP CVM; all pods on that node share the trust boundary [@msdocs-aks-cvm-nodes]. Confidential Containers on AKS used the `KataCcIsolation` runtime to put each pod inside its own SEV-SNP-backed Kata MicroVM; that preview is sunsetting in March 2026 [@msdocs-aks-confidential-containers]. Different SKUs, different runtimes, different sunset timelines. Pick node-granularity for lift-and-shift; pick per-pod when you need stricter blast-radius isolation between pods on the same hardware.

No. See §8 for the architectural finding (the Generation-2 integrity rail remains intact under all four 2024 papers; each attack exploits a seam *around* the rail) and §10 Step 6 for the TCB-SVN-pinning operational habit that translates the finding into deployment policy [@cachewarp-site; @ahoi-heckler; @amd-sb-3005].

Imagine drawing the architecture from memory. Start at the bottom with AMD silicon plus the AMD-SP firmware, or Intel silicon plus the SEAM Range Register and the signed TDX Module. Above that, the Azure Hyper-V host -- below the trust boundary, blind to encrypted RAM. Above that, the OpenHCL paravisor at VMPL0 or the L1 TD seat, mediating synthetic devices and the vTPM. Above that, the Windows Server guest at VMPL2 or the L2 TD, still running VBS, HVCI, and Credential Guard inside. Then evidence flows up: SNP_REPORT or TD Quote plus vTPM quote into Microsoft Azure Attestation, which evaluates policy v1.2 against the evidence and emits a signed JWT, which Azure Key Vault checks before releasing the wrapped OS-disk key. If you can draw it on a napkin in two minutes, you have understood the article. If you can write the MAA policy that says exactly what you mean by "this VM is one of mine," you can build with it.

<StudyGuide slug="confidential-vms-on-azure" keyTerms={[ { term: "Reverse Map Table (RMP)", definition: "AMD SEV-SNP per-page metadata table enforcing GPA-to-HPA binding; mismatched mappings raise #NPF(rmpfault)." }, { term: "Virtual Machine Privilege Level (VMPL)", definition: "AMD SEV-SNP four-level privilege lattice; OpenHCL paravisor at VMPL0, customer kernel at VMPL2." }, { term: "SNP_REPORT", definition: "ECDSA-P384 signed attestation report from the AMD-SP, carrying measurement, policy, report_data, vmpl, chip_id, tcb_version." }, { term: "Secure Arbitration Mode (SEAM)", definition: "Intel CPU privilege state in which the signed TDX Module executes, hosted in the SEAMRR memory range." }, { term: "Intel TDX Module", definition: "Signed Intel firmware running in SEAM that mediates entry, exit, and measurement for Trust Domains." }, { term: "MRTD", definition: "Build-time TDX measurement of the initial TD image; SEAM analogue of an immutable launch PCR." }, { term: "RTMR0-3", definition: "Runtime extendable measurement registers exposed by the TDX Module; SEAM analogue of the runtime-extension TPM PCRs. Canonical TDX-vTPM mapping: RTMR[0]<->PCR[1,7], RTMR[1]<->PCR[2-6], RTMR[2]<->PCR[8-9], RTMR[3]<->PCR[14,17-22]." }, { term: "OpenHCL paravisor", definition: "Microsoft's open-source Rust paravisor on OpenVMM, running inside the CVM trust boundary at VMPL0 or the L1 TD seat." }, { term: "Microsoft Azure Attestation (MAA)", definition: "Azure's RATS verifier; evaluates customer policy v1.2 against SNP_REPORT or TD Quote plus vTPM evidence and returns a signed JWT." }, { term: "Secure Key Release (SKR)", definition: "Azure Key Vault / Managed HSM operation gating wrapped-key release on a valid MAA attestation token." }, { term: "Versioned Chip Endorsement Key (VCEK)", definition: "AMD per-chip per-TCB-version ECDSA-P384 signing key for SNP_REPORTs; certificate chain anchors to AMD root via the ASK." } ]} />