Parag Mali - tag: virtualization

Hyper-V Enlightenments, VMBus, and the Synthetic Device Model

noreply@paragmali.com (Parag Mali) — Thu, 14 May 2026 00:00:00 GMT

Hyper-V's guest OSes do not see emulated 1990s hardware. They see a published, versioned hypervisor ABI called the **Top-Level Functional Specification**, a transport called **VMBus** that consists of two ring buffers per channel, and a catalogue of synthetic devices whose backends live in the privileged root partition. This design is what makes Windows and Linux equally fast inside Hyper-V, and it is also why the host-side parsers in `vmswitch.sys` keep producing critical CVEs. The 2024 OpenHCL paravisor moves those parsers into the guest's own trust boundary in memory-safe Rust, which is the most consequential change to the Hyper-V device model since 2008.

1. The Type-1 hypervisor foundation

Open Task Manager on a modern Windows 11 desktop, switch to the Performance tab, and look at the line that says "Virtualization: Enabled." That single line hides one of the most consequential design choices in modern operating systems: when Microsoft shipped Hyper-V with Windows Server 2008 in June 2008 [@ms-hyperv-server-overview], they did not bolt a virtualization product on top of Windows. They put a small hypervisor underneath it.

That ordering matters more than it sounds. In the older Microsoft Virtual Server 2005 model, Windows ran on the bare metal and a user-mode service emulated PC hardware for guests inside it. In the Hyper-V architecture documented by Microsoft in 2008 [@ms-hyperv-architecture], the hypervisor boots first and Windows itself becomes a guest of the hypervisor. Microsoft calls this guest the root partition. Every other VM on the box is a child partition.

**Type-1 hypervisor**: a hypervisor that runs directly on the physical hardware rather than inside a host operating system. Hyper-V, VMware ESXi, and Xen are Type-1; VirtualBox and the original Microsoft Virtual Server are Type-2 (hosted). In a Type-1 design no general-purpose OS sits between the hypervisor and the silicon, which lets the hypervisor enforce isolation directly using CPU virtualization extensions like Intel VT-x and AMD-V.

The root partition is not just another VM. It is a privileged partition: it owns the physical I/O devices, runs the parent stack of synthetic-device backends, and brokers everything that touches real hardware. Children get virtual processors and a slice of memory, and they communicate with the root over a software bus called VMBus that we will spend most of this article taking apart.

    flowchart TD
        HW["Physical hardware (CPU, RAM, NICs, NVMe)"]
        HV["Hyper-V hypervisor (microkernel)"]
        Root["Root partition (Windows Server)"]
        VSP["Virtualization Service Providers (VSPs): vmswitch.sys, storvsp.sys, ..."]
        C1["Child partition: Windows VM"]
        C2["Child partition: Linux VM"]
        VSC1["VSCs: netvsc, storvsc, ..."]
        VSC2["VSCs: hv_netvsc, hv_storvsc, ..."]
        HW --> HV
        HV --> Root
        HV --> C1
        HV --> C2
        Root --> VSP
        VSP -. "VMBus channel" .-> VSC1
        VSP -. "VMBus channel" .-> VSC2
        C1 --> VSC1
        C2 --> VSC2

The hypervisor itself is small by design. The Hyper-V architecture page on Microsoft Learn [@ms-hyperv-architecture-perf] describes it as a microkernel: it does the minimum a hypervisor must do (CPU scheduling, memory partitioning, interrupt routing, an inter-partition message bus) and pushes everything else, including the device models, out to the root partition. This is the opposite of the early VMware ESX design, where the hypervisor itself contained large device drivers.

The microkernel choice was pragmatic, not ideological. A monolithic hypervisor with built-in NIC and storage drivers would have been a catastrophic certification problem: every NIC firmware update would risk a hypervisor patch. By delegating I/O to the Windows root partition, Microsoft re-used the entire Windows driver stack.

The split also explains why Hyper-V "feels Windows-shaped" even though it is technically not Windows. The root partition is Windows, with all of its drivers, its WMI, its event log, its Get-VM PowerShell cmdlets. The hypervisor underneath is a small, separate binary (hvix64.exe on Intel, hvax64.exe on AMD) that you almost never have a reason to think about. Microsoft itself goes further: in the same architecture document, it stresses that all device-model traffic flows through the root: "the management operating system hosts virtual service providers (VSPs) that communicate over the VMBus to handle device access requests from child partitions" (Microsoft Learn: Overview of Hyper-V [@ms-overview-hyper-v]).

This sets up the question the rest of the article answers: if the hypervisor is small, the guest is unmodified Windows or Linux, and the root partition owns the real devices, then how does a guest actually do disk and network I/O at gigabit-or-better speeds without paying enormous costs to traverse all of these boundaries?

The short answer is in three pieces: enlightenments (the guest knows it is virtualized and uses hypercalls), VMBus (the inter-partition transport), and the VSP/VSC pair (split drivers that share memory through VMBus rings). The next section starts with the first of those three.

2. Enlightenments: what "knowing you are virtualized" buys you

In the early 2000s, the dominant intuition was that a hypervisor's job is to fool the guest. A perfectly faithful emulation of an Intel 440BX motherboard, a DEC 21140 NIC, and an IDE controller is what made VMware Workstation a useful product in 1999. It is also what made Microsoft Virtual Server 2005 too slow to saturate gigabit links: every out instruction on a fake NIC port trapped to the hypervisor, was decoded against an in-memory chip model, and produced a synthetic interrupt that itself trapped on the way out. The Microsoft Virtual Server retrospective on Wikipedia [@wikipedia-virtual-server] notes that the architecture had no paravirtualization support and that performance was constrained relative to later hardware-assisted designs.

Hyper-V's answer was to drop the pretence. If the guest knows it is in a VM, it can use a fast path designed for VMs instead of pretending to drive imaginary chips. Microsoft calls this knowledge an enlightenment, and the Hyper-V feature discovery page [@ms-tlfs-feature-discovery] is the contract a guest uses to learn what enlightenments the hypervisor offers.

**Enlightenment**: a modification or feature in a guest operating system that takes advantage of running under a specific hypervisor. An enlightened guest detects the hypervisor (on x86, by reading the `cpuid` leaves at `0x40000000` and above), then opts in to using paravirtual interfaces (hypercalls, synthetic timers, synthetic interrupt controllers, shared TSC pages) instead of trapping on emulated hardware. An unenlightened guest would still boot, but it would run slower.
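As a rough illustration of the detection step, the sketch below decodes the 12-byte vendor string that `cpuid` leaf `0x40000000` returns in EBX, ECX, and EDX. JavaScript cannot execute `cpuid`, so the register values are hard-coded sample inputs. The string "Microsoft Hv" is the signature Hyper-V is documented to report, and the function names are illustrative rather than taken from any driver.

```javascript
// Decode the hypervisor vendor signature from CPUID leaf 0x40000000.
// Each of EBX, ECX, EDX holds four ASCII bytes, least-significant byte first.
function decodeVendorSignature(ebx, ecx, edx) {
  const bytes = [];
  for (const reg of [ebx, ecx, edx]) {
    for (let shift = 0; shift < 32; shift += 8) {
      bytes.push((reg >>> shift) & 0xff);
    }
  }
  return String.fromCharCode(...bytes);
}

// Sample register values as a Hyper-V guest would observe them (assumed input).
const sample = { ebx: 0x7263694d, ecx: 0x666f736f, edx: 0x76482074 };

function looksLikeHyperV(regs) {
  return decodeVendorSignature(regs.ebx, regs.ecx, regs.edx) === 'Microsoft Hv';
}

console.log(decodeVendorSignature(sample.ebx, sample.ecx, sample.edx));
console.log(looksLikeHyperV(sample));
```

A real guest would go on to read the higher leaves to learn which enlightenments are actually offered; this sketch stops at the signature check.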

Detection is the cheap part. The Linux kernel's Hyper-V overview document [@kernel-hyperv-overview] describes four cooperating mechanisms, layered atop one another: implicit traps that the hypervisor handles transparently, explicit hypercalls the guest issues on purpose, synthetic registers exposed as model-specific registers (MSRs) in the architectural CPU register file, and VMBus for high-bandwidth device traffic. Each layer builds on the one below it.

Key idea: The contract between Hyper-V and its guests is published. Microsoft maintains the Top-Level Functional Specification as a public document under the Open Specification Promise. That single decision is why Linux ships an in-tree Hyper-V driver stack and why VMBus is not a black box.

The hypercall page

The first thing an enlightened guest does is set up a hypercall page. The TLFS Hypercall Interface page [@ms-tlfs-hypercall] describes the dance: the guest writes its identity into HV_X64_MSR_GUEST_OS_ID (MSR 0x40000000), then writes a guest-physical address and an enable bit into HV_X64_MSR_HYPERCALL (MSR 0x40000001). The hypervisor responds by populating that page with the right opcode for the current CPU: vmcall on Intel, vmmcall on AMD. From that moment on, "make a hypercall" is a normal call into a known address rather than an opcode the kernel must hand-assemble per CPU vendor.

This trick neatly externalises the vendor-specific calling convention. Microsoft can later swap to a new opcode (say, on ARM64, where the equivalent is an HVC instruction) without any guest code change. The guest just learns the new page contents.
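The two MSR writes described above can be modelled in a few lines. This is a minimal sketch, not driver code: `wrmsr` is a stand-in that only records writes, the guest-ID value is a placeholder, and the bit layout (enable flag in bit 0, page frame number from bit 12 upward) is stated here as an assumption to check against the TLFS.

```javascript
// Model of the two MSR writes that set up the hypercall page.
const HV_X64_MSR_GUEST_OS_ID = 0x40000000;
const HV_X64_MSR_HYPERCALL = 0x40000001;
const PAGE_SHIFT = 12n;

// Build the value for HV_X64_MSR_HYPERCALL: page frame number in the upper
// bits, enable flag in bit 0. The guest-physical address must be page-aligned.
function encodeHypercallMsr(pageGpa) {
  if (pageGpa % (1n << PAGE_SHIFT) !== 0n) {
    throw new RangeError('hypercall page must be 4 KiB aligned');
  }
  const gpfn = pageGpa >> PAGE_SHIFT;
  return (gpfn << PAGE_SHIFT) | 1n;
}

// Stand-in for the privileged wrmsr instruction: record writes in order.
const msrWrites = [];
function wrmsr(msr, value) {
  msrWrites.push({ msr, value });
}

const HYPOTHETICAL_GUEST_ID = 0x1234n; // placeholder, not a real encoding
wrmsr(HV_X64_MSR_GUEST_OS_ID, HYPOTHETICAL_GUEST_ID); // identity first
wrmsr(HV_X64_MSR_HYPERCALL, encodeHypercallMsr(0x1f000n)); // then the page

console.log(msrWrites.map((w) => `${w.msr.toString(16)} <- 0x${w.value.toString(16)}`));
```

The ordering is the point: the guest identifies itself before it asks for the hypercall page.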

The same TLFS page documents two hypercall classes: simple hypercalls (one operation, returns or faults) and rep (repeated) hypercalls that take a counter and a start index, so a long-running operation can yield mid-flight without losing work. Three calling conventions exist: a memory-based one for large parameter blocks, a register-only fast variant for the very common case of one or two inputs, and an XMM-register variant that lets a guest pass up to 112 bytes of input through SSE registers.
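To make the rep mechanism concrete, here is a toy model of a guest resuming a repeated hypercall. The field positions (rep count at bit 32, rep start index at bit 48) are assumptions to verify against the TLFS. `fakeRepHypercall` is a made-up stand-in for the hypervisor that handles only a few elements per entry, and the call code is hypothetical.

```javascript
// Field positions in the 64-bit hypercall input value (assumed, see lead-in).
const REP_COUNT_SHIFT = 32n;
const REP_START_SHIFT = 48n;

function encodeRepInput(callCode, repCount, repStart) {
  return BigInt(callCode) |
    (BigInt(repCount) << REP_COUNT_SHIFT) |
    (BigInt(repStart) << REP_START_SHIFT);
}

// Toy hypervisor: handles at most `budget` elements per entry, then returns
// the total number of reps completed so far so the caller can resume.
function fakeRepHypercall(input, budget) {
  const repCount = Number((input >> REP_COUNT_SHIFT) & 0xfffn);
  const repStart = Number((input >> REP_START_SHIFT) & 0xfffn);
  return Math.min(repCount, repStart + budget);
}

// Guest-side loop: re-issue with an advanced start index until all reps finish.
function runRepHypercall(callCode, repCount, budget) {
  if (budget < 1) throw new RangeError('budget must be at least 1');
  let done = 0;
  let entries = 0;
  while (done < repCount) {
    done = fakeRepHypercall(encodeRepInput(callCode, repCount, done), budget);
    entries += 1;
  }
  return { done, entries };
}

console.log(runRepHypercall(0x0002, 10, 4)); // hypothetical call code
```

With ten elements and a budget of four per entry, the loop enters the toy hypervisor three times and never repeats work, because the start index carries the progress.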

That XMM variant is unusual enough to flag. Most kernel ABIs do not touch SSE in privileged code because saving and restoring the full SSE state is expensive. Hyper-V's hypercall ABI uses XMM precisely because the round-trip cost of a hypercall is dominated by the VMEXIT itself, so squeezing a few more bytes into registers is cheaper than spilling them to memory and reading them back.

Synthetic interrupts and synthetic timers

A guest's virtual processor has its own emulated local APIC by default, but an enlightened guest can also use a Synthetic Interrupt Controller (SynIC), defined in the TLFS. Each virtual processor gets 16 SINT slots, a per-CPU shared message page, and a per-CPU shared event page. SINTs are how VMBus signals events to the guest without going through the legacy LAPIC fast path.

**SINT** (synthetic interrupt source): one of 16 logical interrupt sources per virtual processor that the Hyper-V Synthetic Interrupt Controller can signal. SINTs are reachable through MSRs (`HV_X64_MSR_SINT0` through `HV_X64_MSR_SINT15`) and back the doorbell mechanism for VMBus channels and for synthetic timers. They are paravirtual: they would not exist on a bare-metal CPU.

The clock side is even more interesting. The Linux kernel Hyper-V clocks documentation [@kernel-clocks] describes a reference TSC page that the hypervisor maintains in shared memory: it contains a scale factor and an offset such that

$$ \text{guest\_time} = \bigl((\text{TSC} \times \text{scale}) \gg 64\bigr) + \text{offset} $$

ticks at a constant 10 MHz frequency regardless of the underlying TSC. The guest's clock_gettime and gettimeofday can read TSC, multiply, shift, add, and return, all in user space via vDSO, with no kernel transition and no hypercall.
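The formula is easy to check with arbitrary-precision integers. The sketch below assumes a 2.56 GHz TSC, chosen only because it makes the scale factor an exact power of two. The function name and the sample values are illustrative, and a real reader of the page would also have to handle its sequence counter, which is omitted here.

```javascript
// Reference-time computation from the shared TSC page, using BigInt because
// the intermediate product can be up to 128 bits wide.
function referenceTime(tsc, scale, offset) {
  return ((tsc * scale) >> 64n) + offset;
}

// Assumed example: a 2.56 GHz TSC. For the result to tick at 10 MHz the scale
// must equal 2^64 * 10_000_000 / tscHz, which is exactly 2^56 here.
const tscHz = 2_560_000_000n;
const scale = ((1n << 64n) * 10_000_000n) / tscHz;
const offset = 0n;

const oneSecondOfTsc = tscHz; // TSC ticks elapsed in one second
console.log(referenceTime(oneSecondOfTsc, scale, offset)); // 10000000n
```

One second of TSC ticks maps to exactly ten million reference ticks, that is, 10 MHz in 100 ns units. After a migration to a host with a different TSC frequency, only `scale` and `offset` would change.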

A web server that calls `clock_gettime` once per request, on a million-requests-per-second box, is a ridiculous workload that real systems run constantly. Without enlightenments, every call would be a `rdmsr` on a virtualised TSC or a trap into the hypervisor. With the reference TSC page, the same call is four arithmetic ops and a memory load. The kernel doc explains that this scale and offset survive live migration: "in the case of a live migration to a host with a different TSC frequency, Hyper-V adjusts the scale and offset values in the shared page so that the 10 MHz frequency is maintained" (Linux kernel: Hyper-V clocks [@kernel-clocks]).

Synthetic timers complete the picture. Each virtual CPU has four synthetic timers programmable via MSRs; they fire SINTs into the SynIC. The guest does not need to touch an emulated PIT or HPET. Combined, SynIC + synthetic timers + the reference TSC page mean that an enlightened guest can do most of its time-keeping and inter-partition signalling without ever touching the legacy interrupt/timer chip surface.

The TLFS as a contract

All of this is published. The Top-Level Functional Specification [@ms-tlfs] is the document a guest author reads to know which MSRs to write, which cpuid leaves to query, which hypercalls exist, and which features the hypervisor signals via feature flags. Microsoft maintains it under the Open Specification Promise. That promise is a deliberate contractual choice. Without it, Linux could not ship drivers/hv/ in-tree and Microsoft could not credibly claim that Linux is a first-class Hyper-V guest. The TLFS is the artefact that makes the rest of the architecture cooperative rather than reverse-engineered.

The next layer up uses these primitives to build something more ambitious: a general-purpose inter-partition transport.

3. VMBus: the inter-partition transport

If enlightenments are the alphabet, VMBus is the language that synthetic devices speak. The Linux kernel VMBus document [@kernel-vmbus] puts the definition tersely: "VMBus is a software construct provided by Hyper-V to guest VMs. It consists of a control path and common facilities used by synthetic devices that Hyper-V presents to guest VMs. The common facilities include software channels for communicating between the device driver in the guest VM and the synthetic device implementation that is part of Hyper-V, and signaling primitives to allow Hyper-V and the guest to interrupt each other."

There is a lot in that paragraph. Let me unpack it, because this is the architectural core.

**VMBus**: a software-only inter-partition communication bus provided by Hyper-V. It has a control path (channel offer, open, close, rescind) and per-device data channels built on shared memory ring buffers. VMBus is not a real bus in any hardware sense; nothing on the PCIe topology is named VMBus. It is a contract between guest drivers and the hypervisor.

Channels and the offer protocol

Every synthetic device a guest sees corresponds to a VMBus channel. The root partition advertises (OfferChannel) the list of devices a guest is permitted to use. The guest's VMBus driver iterates the offers, matches each to a class GUID (synthetic SCSI is one GUID, synthetic NIC is another, the input-style vmbusrhid device is a third), and binds an in-kernel device driver to each one. The reverse operation, RescindChannel, lets the host revoke a device cleanly, which is what happens during live migration when an SR-IOV virtual function gets pulled out from under a running VM.

    sequenceDiagram
        participant Root as Root partition (VSP)
        participant HV as Hyper-V hypervisor
        participant Guest as Guest VM (VSC)
        Root->>HV: OfferChannel(class_guid, instance_guid)
        HV->>Guest: ChannelOffer message via SynIC
        Guest->>HV: OpenChannel(ringbuf_gpa, signal_event)
        HV->>Root: Channel opened
        loop steady-state I/O
            Guest->>Root: write descriptor + payload to ring, signal SINT
            Root->>Guest: write response to ring, signal SINT
        end
        Root->>HV: RescindChannel(instance_guid)
        HV->>Guest: ChannelRescind via SynIC
        Guest->>Root: CloseChannel
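The offer and rescind lifecycle can be sketched as a small dispatch table keyed by class GUID. Everything named here is hypothetical: the GUIDs are placeholders rather than the real class identifiers, and `GuestVmbus` is an illustrative model, not the structure of any actual VMBus driver.

```javascript
// Hypothetical class GUIDs; the real values are defined by Hyper-V.
const drivers = new Map([
  ['00000000-0000-0000-0000-000000000001', 'storvsc'],
  ['00000000-0000-0000-0000-000000000002', 'netvsc'],
]);

class GuestVmbus {
  constructor() {
    this.channels = new Map(); // instanceGuid -> { driver, state }
  }

  // Match an offer to a driver by class GUID and open the channel.
  onOffer({ classGuid, instanceGuid }) {
    const driver = drivers.get(classGuid);
    if (!driver) return null; // no VSC for this device class: ignore the offer
    this.channels.set(instanceGuid, { driver, state: 'open' });
    return driver;
  }

  // The host revoked the device: tear down and forget the channel.
  onRescind(instanceGuid) {
    const channel = this.channels.get(instanceGuid);
    if (!channel) return false;
    channel.state = 'closed';
    this.channels.delete(instanceGuid);
    return true;
  }
}

const bus = new GuestVmbus();
console.log(bus.onOffer({
  classGuid: '00000000-0000-0000-0000-000000000002',
  instanceGuid: 'nic-0',
}));
console.log(bus.onRescind('nic-0'));
```

The class GUID selects the driver, while the instance GUID identifies one particular device, which is why a rescind names an instance rather than a class.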

Two ring buffers, one channel

Each open channel is two unidirectional ring buffers in shared memory: one for guest-to-host messages, one for host-to-guest. Each ring has a 4 KiB header page that holds the read index, the write index, and control flags, plus a power-of-two payload region. The guest tells the hypervisor which guest-physical pages back the ring through an object called a GPA Descriptor List (GPADL), built up via the vmbus_establish_gpadl API.

The kernel doc reveals a small but durable engineering detail. The Linux guest maps the ring contents twice in its kernel virtual address space: header page first, ring contents next, and then the ring contents again, contiguously. Why? Because that lets a copy loop walk past the end of the ring without writing wrap-around code; the next byte after the ring's last byte is the ring's first byte, by virtual-memory arrangement. It is the same trick used inside the Linux page cache for fbdev and inside DPDK's mempool. It costs a little address space; it saves a branch on every payload byte.

The Linux kernel doc is explicit that this double-mapping convenience exists in the guest only. If you are writing a userspace tool that ingests a captured VMBus ring (for forensics or debugging) you must implement wrap-around manually. This is exactly the kind of detail that source code documentation captures and prose articles forget.
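For the userspace case, manual wrap-around amounts to a two-part copy. The sketch below assumes the capture has already been split into a header and a data region. `readFromRing` is an illustrative helper that works on the data region only and does not check the write index, which a real tool would need to do.

```javascript
// Copy `length` bytes out of a captured ring's data region starting at
// `readIndex`, wrapping manually because a capture has no double mapping.
function readFromRing(ringData, readIndex, length) {
  const size = ringData.length;
  if (length > size) throw new RangeError('read larger than ring');
  const out = new Uint8Array(length);
  const firstPart = Math.min(length, size - readIndex);
  out.set(ringData.subarray(readIndex, readIndex + firstPart), 0);
  out.set(ringData.subarray(0, length - firstPart), firstPart);
  return { bytes: out, nextReadIndex: (readIndex + length) % size };
}

const ring = Uint8Array.from([10, 11, 12, 13, 14, 15, 16, 17]);
const { bytes, nextReadIndex } = readFromRing(ring, 6, 4);
console.log(Array.from(bytes), nextReadIndex); // bytes 16, 17, 10, 11; next index 2
```

The `Math.min` and the second `set` call are the cost the in-guest double mapping avoids.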

The total amount of GPADL-shared memory a single guest can hold is capped per Windows version. The kernel doc records the numbers: roughly 1280 MiB on Windows Server 2019 and later, roughly 384 MiB on earlier hosts (Linux kernel: VMBus [@kernel-vmbus]). For a guest with 30+ channels (multiple netvsc subchannels, multiple storvsc subchannels, vPCI, KVP, time sync, VSS, balloon, framebuffer), that ceiling is real but not yet limiting at typical ring sizes of 1 to 16 MiB per direction.
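A back-of-the-envelope check of that ceiling, counting only the two rings per channel. It ignores header pages and any other GPADL-backed buffers a driver may allocate, so treat the result as a lower bound. The helper names are illustrative.

```javascript
const MIB = 1024 * 1024;

// Total GPADL-shared bytes for `channels` channels, each with two rings of
// `ringMiB` MiB (one per direction).
function gpadlUsage(channels, ringMiB) {
  return channels * 2 * ringMiB * MIB;
}

function fitsUnderCap(channels, ringMiB, capMiB) {
  return gpadlUsage(channels, ringMiB) <= capMiB * MIB;
}

console.log(gpadlUsage(30, 16) / MIB);   // 960
console.log(fitsUnderCap(30, 16, 1280)); // true
console.log(fitsUnderCap(30, 16, 384));  // false
```

Thirty channels at the largest typical ring size fit under the newer cap with room to spare, but would not fit under the older 384 MiB limit.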

The doorbell

Shared memory alone is not enough. The guest can write into the ring all it wants; the host will not look until it is told to. Conversely, the host can write into the ring; the guest will not check until something signals it. That signal is the doorbell, and it is implemented via the Synthetic Interrupt Controller SINTs introduced in the previous section.

When the guest enqueues a request and the host's read pointer is already chasing it (i.e., the host is still processing the last batch), the guest can suppress the doorbell entirely. Only the first request after the host has caught up triggers a hypercall. This is interrupt coalescing in software, and it is the single most important performance lever on a software data plane: the round-trip cost of a VMEXIT is amortised across many packets.
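The suppression rule can be demonstrated with a toy ring that counts doorbells. This is a deliberately simplified model: it treats "the consumer has caught up" as "the queue was empty before this write", and `DoorbellRing` is a hypothetical class, not a rendering of the real ring-buffer code.

```javascript
// Toy one-directional ring that counts doorbells. The producer signals only
// when the ring was empty before the write, i.e. the consumer had caught up.
class DoorbellRing {
  constructor() {
    this.queue = [];
    this.doorbells = 0;
  }

  enqueue(request) {
    const consumerWasIdle = this.queue.length === 0;
    this.queue.push(request);
    if (consumerWasIdle) this.doorbells += 1; // stands in for the signalling hypercall
  }

  // Consumer drains everything currently queued in one pass.
  drain() {
    const batch = this.queue;
    this.queue = [];
    return batch;
  }
}

const ring = new DoorbellRing();
for (let i = 0; i < 100; i++) ring.enqueue(i); // burst while the host is busy
ring.drain();
ring.enqueue(100); // host has caught up, so this write signals again
console.log(ring.doorbells); // 2
```

One hundred and one requests cost two doorbells in this model, which is the amortisation the paragraph above describes.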

Note: This same shape, shared memory rings plus an event-channel doorbell, was the central insight of Xen's split-driver paravirtualization model in 2003 [@xen-pv-wiki]. Hyper-V's contribution was not the shape; it was packaging the shape so unmodified Windows guests could use it via in-box drivers, and publishing the protocol so unmodified Linux could too.

VSPs and VSCs

The two endpoints of a channel have specific names. The Virtualization Service Provider (VSP) is the kernel module in the root partition that owns the device backend. The Virtualization Service Client (VSC) is the guest-side driver that talks to the VSP through the channel. Microsoft's own architecture page is precise: "the Hyper-V-specific I/O architecture consists of virtualization service providers (VSPs) in the root partition and virtualization service clients (VSCs) in the child partition. Each service is exposed as a device over VM Bus, which acts as an I/O bus and enables high-performance communication between VMs that use mechanisms such as shared memory" (Microsoft Learn: Hyper-V architecture [@ms-hyperv-architecture-perf]).

**VSP** (Virtualization Service Provider): a kernel module in the root partition that exposes a synthetic device backend to guests over a VMBus channel. Examples: `vmswitch.sys` (synthetic NIC), `storvsp.sys` (synthetic SCSI), the `vmbusrhid` server (synthetic input). **VSC** (Virtualization Service Client): the matching driver in the guest that consumes the channel and presents an OS-native device interface (a NIC, a SCSI controller, a keyboard) to the rest of the kernel.

The split is symmetric in transport (both sides use the same ring) but asymmetric in trust. The VSP runs in the most privileged context on the box, the root partition's kernel. The VSC runs in a normal guest kernel. Every byte that flows from guest to host crosses a trust boundary and gets parsed by code with full system privilege. The next two sections will return to this fact at length, because it is where the security story lives.

Why this works for closed-source guests

The Xen project tried something similar in 2003 with netfront/blkfront rings and event channels, but Xen PV required a paravirtualised guest kernel: the guest had to know it was running on Xen at compile time. Closed-source guests like Windows could not be modified, so the Xen wiki [@xen-pv-wiki] eventually came to document PV-on-HVM as a workaround.

Hyper-V finessed this with hardware virtualization. The guest kernel runs unmodified inside VT-x or AMD-V; CPU-level privilege separation handles the privileged instructions. The only thing the guest needs to do to opt into VMBus is load a driver. Every supported Windows version since Windows 7 / Server 2008 R2 ships those drivers in-box. Linux ships them in-tree from kernel 2.6.32 onward. There is no separate "install paravirt drivers" step, which is why Hyper-V "just works" for almost any guest you point at it.

The transport is settled. What rides on it is a catalogue.

4. Synthetic device classes: storage, network, input, video, vPCI

A modern Hyper-V guest, on first boot, sees a small zoo of devices that have nothing to do with PC hardware. There is no IDE controller, no PS/2 keyboard, no Cirrus VGA. There is a synthetic SCSI controller, a synthetic NIC, a synthetic keyboard and mouse, a synthetic framebuffer, and (often) a synthetic PCI passthrough channel. Each is a VSP/VSC pair on top of VMBus.

The Linux kernel VMBus document [@kernel-vmbus] enumerates the catalogue: synthetic SCSI controller (storvsc), synthetic NIC (netvsc), synthetic framebuffer (synthvid), synthetic keyboard, synthetic mouse, PCI passthrough, plus the non-device services: heartbeat, time sync, shutdown, memory balloon, KVP exchange, and online backup (VSS).

    flowchart LR
        subgraph Guest
            nv["netvsc (NIC)"]
            st["storvsc (SCSI)"]
            sv["synthvid (framebuffer)"]
            kb["hyperv-keyboard"]
            ms["hyperv-mouse"]
            pc["pci-hyperv (vPCI)"]
            kvp["hv_kvp (KVP)"]
            ts["hv_utils (timesync, shutdown, heartbeat)"]
        end
        subgraph Root
            vsw["vmswitch.sys"]
            sto["storvsp.sys"]
            sfb["synthvid VSP"]
            rhid["vmbusrhid VSP"]
            vpci["vPCI VSP"]
            kvpd["KVP daemon"]
            tsd["IS daemons"]
        end
        nv -- "VMBus channel" --- vsw
        st -- "VMBus channel(s)" --- sto
        sv -- "VMBus channel" --- sfb
        kb -- "VMBus channel" --- rhid
        ms -- "VMBus channel" --- rhid
        pc -- "VMBus channel" --- vpci
        kvp -- "VMBus channel" --- kvpd
        ts -- "VMBus channel" --- tsd

Synthetic SCSI: storvsc

The storvsc VSC presents itself to the guest as a SCSI host bus adapter. Disks attached to the VM appear as SCSI LUNs hanging off that HBA. The wire protocol uses ring buffers carrying SRB (SCSI Request Block) style commands. To scale, storvsc can open multiple sub-channels, one per host CPU, so that I/O completion interrupts and request submission spread across cores rather than serialising on a single VMBus channel.

This is also why Hyper-V's "Generation 2" VMs work. A Generation 2 VM [@ms-gen1-gen2-vms], introduced in Windows Server 2012 R2 in 2013, has no IDE controller in the boot path at all. UEFI loads the OS loader from a synthetic SCSI device, the OS loader hands off to the kernel, and the kernel binds storvsc to the same device. The legacy IDE emulator simply never runs. That removes a lot of attack surface and lets boot volumes grow up to 64 TB on VHDX.

Synthetic NIC: netvsc

netvsc is the synthetic NIC. The wire protocol historically wrapped Microsoft's NDIS-style RNDIS frames around payloads sent through the channel ring, which is why some Linux discussions mention "RNDIS frames over VMBus." The Linux driver lives in drivers/net/hyperv/ and the kernel netvsc documentation [@kernel-netvsc] describes how it can spread receive-side traffic across multiple VMBus subchannels via Receive Side Scaling.

netvsc is also the one device class where Hyper-V composes with hardware passthrough. Section 8 will take this apart in detail; for now, note that the same netvsc VSC can run alongside an SR-IOV virtual function in the guest, with netvsc acting as the slow-path failover and the VF carrying the steady-state traffic.

Synthetic input: vmbusrhid

The synthetic keyboard, the synthetic mouse, and a few related input streams ride on a server in the root partition called vmbusrhid (the name is shorthand for "VMBus relay HID"). It is a small surface in bytes, but architecturally it has the same shape as netvsc: guest-controllable messages parsed in kernel mode in the root partition. Anyone evaluating the trust boundary should treat it the same way as netvsc, even though the data rate is six orders of magnitude lower.

Note: A path that carries 100 keystrokes per second is, on the wire, almost free. As an attack surface, it is identical to a path that carries a million packets per second: both are guest-controlled bytes parsed by privileged code. Section 7 walks through why the security community treats vmbusrhid the way it treats vmswitch.sys.

Synthetic video: synthvid

synthvid is a synthetic framebuffer. It is what lets you connect to a Hyper-V VM through the Virtual Machine Connection client without dragging in an emulated VGA. It is intentionally simple: there is no 3D acceleration in the synthetic path. Workloads that need GPU acceleration use a different mechanism, vPCI / DDA, to assign a real GPU to the VM.

vPCI: synthetic PCI passthrough

The most subtle device class is pci-hyperv, which exposes a virtual PCIe topology to the guest. The Linux kernel vPCI document [@kernel-vpci] describes the trick: a passthrough device is offered to the guest initially over VMBus (the channel carries the device's PCI configuration space and BARs), and once the guest's vPCI driver has constructed a real PCI device object for it, the device has a dual identity: it is still a VMBus device, and it also appears as a normal PCIe device. The vendor driver can then load against it.

This is the mechanism behind both Hyper-V's Discrete Device Assignment (DDA) [@ms-dda] and Azure's Accelerated Networking, which we will return to in Section 8. The DDA planning document is explicit that Microsoft formally supports DDA for GPUs and NVMe storage as device classes; other PCIe devices are "likely to work" but require vendor support.

Generation-1 vs Generation-2: a quick decoder

Putting the device classes side by side clarifies why the move from Generation-1 to Generation-2 VMs simplified so much:

| Element | Generation-1 VM (legacy) | Generation-2 VM (since 2013) |
| --- | --- | --- |
| Firmware | BIOS | UEFI with Secure Boot |
| Boot disk | Emulated IDE | Synthetic SCSI (`storvsc`) |
| Network on boot | Emulated DEC 21140 fallback | Synthetic NIC (`netvsc`) |
| Input | Emulated PS/2 + `vmbusrhid` | `vmbusrhid` only |
| Display | Emulated VGA + `synthvid` | `synthvid` only |
| Max boot VHDX | 2 TB | 64 TB |
| Source | Microsoft Learn: Gen 1 vs Gen 2 [@ms-gen1-gen2-vms] | Same |

Generation-2 is what the Hyper-V architecture wanted to be from the beginning: an all-synthetic stack with no fallback to imaginary 1990s chipsets. The two-generation existence was not a design preference; it was the cost of supporting older operating systems whose boot loaders only knew about BIOS and IDE. Today, every modern Windows and modern Linux supports Generation-2; Generation-1 remains for legacy guests.

Counting boundary crossings

The shape of the hot path is now visible. To send one network packet from a guest:

The guest writes one descriptor and one payload copy into the netvsc TX ring (one memory copy).
The guest possibly fires a doorbell (one hypercall, often suppressed if the host has not caught up).
The host's vmswitch.sys reaps the descriptor, parses it, and forwards it through the virtual switch to a real NIC.

A single packet's hot path is at most one hypercall and one memory copy in the guest, plus host-side ring traversal. Section 8's comparison table will quantify how this stacks up against virtio and SR-IOV, but the scale is clear: paravirt I/O on Hyper-V is orders of magnitude cheaper per packet than full PC emulation, and the gap closes only when you go all the way to hardware passthrough.

The catalogue is set. Now, who actually wrote the Linux side of all this?

5. Linux Integration Services: Microsoft writes Linux drivers

In December 2009, Microsoft did something quietly historic. Linux kernel 2.6.32 merged a set of drivers under drivers/staging/hv/, contributed by Microsoft itself, that taught the Linux kernel to be an enlightened Hyper-V guest. The kernel.org Hyper-V index page [@kernel-hyperv-index] is the maintained landing page for that work. Over the next several releases the drivers moved out of staging/, settled at drivers/hv/, drivers/net/hyperv/, drivers/scsi/storvsc_drv.c, and drivers/pci/controller/pci-hyperv.c, and became the default in every mainstream distribution.

That set of drivers is collectively called Linux Integration Services (LIS).

**Linux Integration Services (LIS)**: the set of in-kernel Hyper-V guest drivers that Microsoft contributes to upstream Linux. It includes `hv_vmbus` (the VMBus core), `hv_netvsc` (synthetic NIC), `hv_storvsc` (synthetic SCSI), `hv_utils` (KVP, time sync, shutdown, heartbeat, VSS), `pci-hyperv` (vPCI), and `hv_balloon` (memory ballooning). The same code that Microsoft maintains in the Linux tree powers Linux guests on Hyper-V on Windows Server, on Azure, and on developer Hyper-V on Windows 11.

The reason this matters is bigger than convenience. In 2009, Linux had a long, painful history with Hyper-V's competitors. VMware shipped open-vm-tools but the deepest paravirt drivers (VMXNET3, PVSCSI) lived in vendor packages. Xen's PV drivers existed in-tree but their evolution depended on Citrix and the Xen project. By contributing the full driver stack upstream and committing to keep it there, Microsoft chose a different route: they put the spec (the TLFS) and the implementation (LIS) in the open at the same time.

Microsoft did not just publish a hypervisor specification and hope Linux would adopt it. They wrote the Linux drivers themselves and upstreamed them, and then they kept doing it for fifteen years.

You can see the maintenance pattern in any current kernel. The drivers/hv/ directory has continuous commit activity from Microsoft engineers. Kernel-doc files like the VMBus [@kernel-vmbus], clocks [@kernel-clocks], vPCI [@kernel-vpci], overview [@kernel-hyperv-overview], and CoCo VM [@kernel-coco] pages are written by the same engineers who write the drivers. Several of those documents are the most lucid descriptions of the architecture that exist anywhere in public.

One unexpected consequence: the Linux kernel docs are often easier to read for the architecture than Microsoft's own customer-facing docs. The customer docs answer "how do I configure this?"; the kernel docs answer "what is actually happening?" When researching this article, I found that the cleanest single description of VMBus channel lifecycle is the Linux kernel doc, not the TLFS.

What "in-box" really means

Both major guests now ship VMBus support without any post-install step:

On Windows, the VMBus client stack is built into every supported Windows version since Windows 7 / Windows Server 2008 R2. The legacy Integration Services package, which once shipped as an ISO you mounted into the VM, is no longer needed on supported Windows.
On Linux, the drivers are in-tree from kernel 2.6.32 (December 2009) onward and ship in every mainstream distro.

The kernel.org Hyper-V overview document [@kernel-hyperv-overview] explicitly warns against installing legacy LIS packages on top of a kernel that already has the in-tree drivers: it can break MSI-X handling and PCI passthrough. This is the kind of operational footgun that survives precisely because the in-box answer is correct and the LIS package is a holdover from earlier kernels.

A practical smoke test

You can confirm a Linux guest is using its enlightenments without any vendor tooling. The kernel exposes cpuid leaves and Hyper-V detection through dmesg and through /sys. A small script makes it concrete:

    // This logic mirrors what `dmesg | grep -i hyperv` and a look at
    // /sys/bus/vmbus/devices would tell you on a real Linux Hyper-V guest.
    // The values below are hard-coded observations, not live probes.

    const guestObservations = {
      cpuidSig: '0x40000000',      // CPUID leaf that reports the hypervisor vendor signature
      guestOsIdMsr: 0x40000000,    // HV_X64_MSR_GUEST_OS_ID, written by the guest
      hypercallMsr: 0x40000001,    // HV_X64_MSR_HYPERCALL, locates and enables the hypercall page
      vmbusModuleLoaded: true,
      netvscDevice: '/sys/class/net/eth0/device/driver',
      netvscDriverName: 'hv_netvsc',
      storvscModuleLoaded: true,
    };

    function isEnlightenedHyperVGuest(o) {
      if (o.cpuidSig !== '0x40000000') return false;
      if (!o.vmbusModuleLoaded) return false;
      if (o.netvscDriverName !== 'hv_netvsc') return false;
      if (!o.storvscModuleLoaded) return false; // the success message claims storvsc too
      return true;
    }

    console.log(
      isEnlightenedHyperVGuest(guestObservations)
        ? 'Yes: Hyper-V enlightened, using netvsc + storvsc'
        : 'No: running on emulated PC hardware or non-Hyper-V hypervisor'
    );

The point is not the script itself (anyone can write a few lines of awk against dmesg); it is that the verification surface is public. The hypervisor's CPUID vendor signature, the MSRs, the kernel module names, and the /sys paths are all documented. There is nothing to reverse-engineer.

Why this earned trust

Two pieces of practical evidence persuaded the Linux community that LIS was not a strategic trap:

The drivers stayed upstream. From 2009 to the present, Microsoft has maintained the drivers/hv/ tree, responded to maintainer feedback, and shipped patches through the normal kernel process.
The TLFS stayed accurate. Successive Hyper-V releases either matched what the TLFS said or updated the TLFS. There was no second, secret protocol.

The combination put Microsoft in the unusual position of being the most open hypervisor vendor for Linux guest support. (VirtIO on KVM has a richer cross-vendor story; that comparison is Section 8.) This open posture is also what set up the 2024 OpenVMM open-sourcing as a credible move rather than a stunt.

But before we get to OpenVMM, we need to look at a different way Hyper-V matters: not just as a substrate for VMs, but as a substrate for in-VM security boundaries inside Windows itself.

6. VBS and HVCI: Hyper-V as the trust anchor inside Windows

Up to this point the article has treated Hyper-V as a virtualization product: a thing that hosts VMs. Starting in Windows 10 and Windows Server 2016 [@ms-server-2016], Microsoft began using the same hypervisor for a different job: enforcing security boundaries inside a single OS install. The umbrella name is Virtualization-Based Security (VBS).

The mechanism is simple in description and subtle in consequences. The hypervisor splits a single guest's address space into two Virtual Trust Levels (VTLs). The lower one, VTL0, runs the normal Windows kernel and user mode (this is where explorer.exe and your browser live). The higher one, VTL1, runs a much smaller stack called the Secure Kernel plus a set of isolated user-mode services called trustlets. A compromise of VTL0, even of ntoskrnl.exe, cannot read or write VTL1 memory because the hypervisor enforces that boundary using the same hardware machinery (Intel EPT / AMD NPT, plus Intel VT-d / AMD-Vi for DMA) that it uses to isolate one VM from another.

A Hyper-V construct that partitions a single guest's address space into multiple privilege tiers enforced by the hypervisor. VTL0 hosts the normal kernel and user mode; VTL1 hosts the Secure Kernel and trustlets. The hypervisor presents each VTL with its own separate set of memory mappings, system registers, and interrupt state, so code running at VTL0 cannot read VTL1's memory even if it runs with NT AUTHORITY\SYSTEM privileges.

```mermaid
flowchart TD
    HV["Hyper-V hypervisor"]
    subgraph Guest["A single Windows guest"]
        subgraph VTL0["VTL0 (normal world)"]
            User0["User mode: apps"]
            Kernel0["NT kernel"]
        end
        subgraph VTL1["VTL1 (secure world)"]
            SK["Secure Kernel"]
            Trustlets["Trustlets: LSAIso, BIOiso, ..."]
        end
    end
    HV --> Guest
    HV -. "EPT + IOMMU enforcement" .-> VTL0
    HV -. "EPT + IOMMU enforcement" .-> VTL1
    Kernel0 -. "VTL switch (hypercall)" .-> SK
```
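To make the per-VTL view of memory concrete, here is a minimal sketch in JavaScript. It is a toy model, not the hypervisor's data structure: the `pagePermissions` map, the page numbers, and the `canAccess` helper are all hypothetical names chosen for illustration. The only idea it encodes is the one stated above, that each page carries a separate permission set per VTL and that VTL0's view of VTL1 memory is empty.

```javascript
// Toy model: each guest-physical page carries one permission string per VTL.
// All names and page numbers are hypothetical, for illustration only.
const pagePermissions = new Map([
  [0x1000, { vtl0: 'rw', vtl1: 'rw' }], // ordinary NT kernel data
  [0x2000, { vtl0: '', vtl1: 'rw' }],   // Secure Kernel / trustlet memory
]);

// kind is 'r', 'w', or 'x'; vtl is 0 or 1.
function canAccess(vtl, page, kind) {
  const entry = pagePermissions.get(page);
  if (!entry) return false; // unmapped page: always denied
  const perms = vtl === 1 ? entry.vtl1 : entry.vtl0;
  return perms.includes(kind);
}

console.log('VTL0 reads its own data:', canAccess(0, 0x1000, 'r'));
console.log('VTL0 reads VTL1 memory:', canAccess(0, 0x2000, 'r'));
console.log('VTL1 writes its own memory:', canAccess(1, 0x2000, 'w'));
```

In the real system the check is not a function call at all: the second-level page tables the hypervisor installs for VTL0 simply do not grant access to those pages.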

What lives in VTL1

The flagship inhabitant of VTL1 is Hypervisor-protected Code Integrity (HVCI), which moves kernel-mode code integrity enforcement into the Secure Kernel. With HVCI on, no VTL0 driver can mark a kernel page as both writable and executable; the Secure Kernel mediates the relevant page permissions and refuses the request. The result is that attackers who already have code execution in the NT kernel cannot trivially load arbitrary unsigned kernel code or build new executable JIT pages on the fly.
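The rule HVCI enforces can be written down as a small policy function. The sketch below is an assumption-laden illustration: `reviewProtectionRequest` and its `writable`, `executable`, and `signedCode` fields are hypothetical names, and the real Secure Kernel interface is not shaped like this. It only captures the two outcomes described above: writable-and-executable is refused, and executable pages must hold validated code.

```javascript
// Hypothetical mediator: VTL0 asks for page protections, the policy decides.
// This is not the Secure Kernel's real interface.
function reviewProtectionRequest({ writable, executable, signedCode }) {
  if (writable && executable) {
    return { granted: false, reason: 'writable + executable is refused' };
  }
  if (executable && !signedCode) {
    return { granted: false, reason: 'executable pages must hold validated code' };
  }
  return { granted: true, reason: 'ok' };
}

// A JIT-style request from a compromised driver is rejected.
console.log(reviewProtectionRequest({ writable: true, executable: true, signedCode: false }));
// An ordinary data page is fine.
console.log(reviewProtectionRequest({ writable: true, executable: false, signedCode: false }));
```

The important design property is where the function runs: because the decision is made in VTL1, code in VTL0 cannot patch the policy out.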

The other tenants of VTL1 are trustlets. The most familiar is lsaiso.exe (LSA Isolation), which holds the cached domain credentials that historically lived in lsass.exe and were the prime target for tools like Mimikatz. With Credential Guard on, those secrets move to a trustlet whose memory is unreadable from VTL0; even SYSTEM-level malware in the normal world cannot extract them. Other trustlets handle biometric template storage, key isolation for code integrity policy, and similar small, security-sensitive workloads.

Why the hypervisor is the right place for this

Putting these protections inside the hypervisor rather than inside the kernel has a property that no in-kernel mitigation can match: the protected component does not share an address space with the attacker. A defence built inside ntoskrnl.exe (PatchGuard, KASLR, control-flow guard) lives in the same memory the attacker is trying to corrupt. A defence built inside VTL1 lives in memory the attacker cannot touch, because the page tables that map it are themselves invisible from VTL0.

Note: Pre-VBS Windows had decades of memory-safety bugs in the NT kernel. After VBS, exploiting one of those bugs no longer immediately yields the attacker the ability to read LSASS secrets or load arbitrary kernel code. The attacker now needs a second bug, in the much smaller Secure Kernel codebase. The defender's effective budget went up by a large multiplier without rewriting a single line of NT.

How this connects back to VMBus

VBS would not be possible without the work the previous sections described. The Secure Kernel is what runs in VTL1; it needs to communicate with VTL0 for ordinary system services (the lsaiso.exe process must respond to authentication requests from VTL0 callers, the HVCI mediator must answer page-table requests, and so on). The signalling and shared-memory primitives that make those calls cheap are the same SynIC and shared-page primitives that VMBus uses between partitions.

In other words, the architecture Microsoft built in 2008 to give a Windows VM a fast network card became, in 2016, the architecture that gives a single Windows install a security boundary stronger than its own kernel. The same hypervisor, the same trust-mediation primitives, two completely different applications.

Windows Server 2019 [@ms-server-2019] extended this further with Hyper-V isolation for containers, where a container's lightweight VM gets its own kernel inside a tiny VTL0 of its own. The pattern is consistent: every time Windows wanted a stronger isolation primitive, the answer was "use the hypervisor."

This dual-use is the reason a serious Windows security review touches the Hyper-V codebase even on machines that nobody thinks of as virtualization hosts. A Hyper-V escape (a guest-to-host VMBus exploit) is not just "an exploit against Azure"; it is also, on a typical Windows 11 desktop with VBS enabled, an exploit against the boundary that protects LSASS secrets from kernel-mode malware.

That makes the next section's question urgent: how strong is the VMBus boundary, in practice?

7. VMBus security: every message is a parser at the trust boundary

Here is the part of the architecture worth being honest about. The same property that makes VMBus fast, namely that the host-side VSP runs in the root partition's kernel and parses guest-supplied bytes directly, also makes the VSP the most consequential piece of attack surface in the entire stack. Microsoft itself prices it that way: the Hyper-V Bug Bounty Program [@ms-bounty-hyperv] pays up to USD 250,000 specifically for guest-to-host escapes that hit this surface, which is among the highest payouts Microsoft offers for any category of vulnerability.

Key idea: Every byte that crosses a VMBus channel from a guest is a byte that a kernel-mode parser in the most privileged partition on the host has to interpret. The performance argument for a software data plane and the security argument against it are the same argument, looked at from opposite directions.

The historical record

Three CVEs make the pattern concrete:

CVE-2017-0075 is the Hyper-V escape that the Qihoo 360 Vulcan Team demonstrated at Pwn2Own 2017. The NVD entry [@nvd-cve-2017-0075] describes it as a Hyper-V flaw that "allows guest OS users to execute arbitrary code on the host OS via a crafted application." The reachable code was in a VMBus message handler on the host side.
CVE-2021-28476 is the canonical example. The NVD record [@nvd-cve-2021-28476] classifies it as a critical Hyper-V remote code execution vulnerability with a CVSS score of 9.9. The Akamai writeup with Guardicore and SafeBreach [@akamai-cve-2021-28476] traces the bug to vmswitch.sys, the synthetic-NIC VSP, and shows it had been present in production since the August 2019 vmswitch build. The exploit primitive is exactly what the architecture invites: a guest crafts an OID-style RNDIS request, sends it through the netvsc VMBus channel, and the host's kernel parser misvalidates a length, producing memory corruption in the most privileged kernel on the box.
CVE-2024-21407 is a more recent Hyper-V remote code execution vulnerability patched in March 2024 (NVD [@nvd-cve-2024-21407]). Its existence demonstrates that the bug class did not vanish; the same shape (guest-controlled message, host kernel parser, escalation to host code execution) keeps reappearing.

Awards listed on the MSRC bounty page range from USD 5,000 for low-impact bugs to USD 250,000 for full guest-to-host escapes (Microsoft bounty page [@ms-bounty-hyperv]). That price point is not a marketing number; it is Microsoft signalling what its threat model says these bugs are worth. A defender pricing their own controls should give any VSP code path that parses guest-controlled data the same level of attention as a remote, internet-facing service.

Why the bug class is structural

The pattern in all three CVEs is the same:

1. A guest writes carefully crafted bytes into a VMBus channel ring.
2. The guest fires the doorbell.
3. The host's VSP, running in the root partition's kernel, dequeues the message.
4. The VSP parses the message in C or C++ kernel code.
5. A memory-safety mistake (length confusion, missing bounds check, integer overflow) becomes a write or read primitive in the host kernel.

There is no exotic mechanism here. The exploit surface is "kernel C code parsing untrusted input," which has been the dominant source of remote-code-execution bugs in operating systems since the 1990s. The novelty is the location: the parser sits inside the kernel of the most privileged partition on the box, with access to every other tenant's memory.
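The check that goes missing in this bug class is small enough to show in full. The following sketch assumes a made-up wire format (`[u32 type][u32 payloadLength][payload]`, little-endian) and a made-up `MAX_PAYLOAD` cap; neither is the RNDIS or VMBus packet layout. What it illustrates is the discipline: the length field is guest-controlled, so it is compared against the bytes that actually arrived before anything is copied.

```javascript
// Hypothetical wire format: [u32 type][u32 payloadLength][payload], little-endian.
const HEADER_SIZE = 8;
const MAX_PAYLOAD = 4096; // illustrative cap, not a real protocol constant

// bytes: Uint8Array holding exactly what was dequeued from the ring.
function parseGuestMessage(bytes) {
  if (bytes.byteLength < HEADER_SIZE) {
    throw new RangeError('message shorter than header');
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const type = view.getUint32(0, true);
  const claimedLength = view.getUint32(4, true);
  const available = bytes.byteLength - HEADER_SIZE;
  // The guest controls claimedLength, so check it against what is really there.
  if (claimedLength > MAX_PAYLOAD || claimedLength > available) {
    throw new RangeError(`claimed ${claimedLength} bytes, only ${available} present`);
  }
  // slice() copies, so later guest writes to the ring cannot change the payload.
  const payload = bytes.slice(HEADER_SIZE, HEADER_SIZE + claimedLength);
  return { type, payload };
}

const wellFormed = new Uint8Array([1, 0, 0, 0, 4, 0, 0, 0, 9, 8, 7, 6]);
console.log(parseGuestMessage(wellFormed));
```

In JavaScript a skipped check would at worst yield a short payload or an exception. In kernel C the same omission turns `claimedLength` into the size of a copy past the end of a buffer, which is step 5 of the pattern above.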

```mermaid
sequenceDiagram
    participant Mal as Malicious guest VM
    participant Ring as VMBus ring (shared memory)
    participant SInt as Synthetic Interrupt Controller
    participant VSP as Host VSP (e.g., vmswitch.sys, kernel)
    Mal->>Ring: Write crafted RNDIS-style message
    Mal->>SInt: Hypercall: signal channel event
    SInt-->>VSP: SINT delivered on host CPU
    VSP->>Ring: Read message header
    note over VSP: Length confusion / missing bounds check
    VSP->>VSP: Out-of-bounds write in root partition kernel
    note over VSP: Result: arbitrary code in the most privileged partition
```

Mitigations short of a rewrite

Microsoft's first line of defence is the same one every kernel team uses: ASLR, control-flow integrity, kernel hardening, fuzzing the parsers, code review of every new device class, and, on Azure specifically, isolating each tenant's compute hypervisor so a single compromised host does not become a multi-tenant disaster. The MSRC bounty program is partly a procurement mechanism for this same effort: pay researchers to find and report bugs before attackers find them in the wild.

A second line of defence is Generation-2 VMs (Microsoft Learn [@ms-gen1-gen2-vms]), which remove the legacy emulators (IDE, PS/2, PIC) from the host data path entirely. Every emulator removed is one fewer parser in the most privileged kernel.

A third is the "minimise root-partition exposure" guidance on the Microsoft Hyper-V architecture page [@ms-hyperv-architecture-perf]: configure hosts with the smallest set of root-partition services that the workload requires, since every service is potential attack surface.

These all help, but none of them change the structural fact that VSPs parse guest-controlled data in C/C++ kernel code. The next architectural shift, the one that does change that fact, is what Section 9 is about.

Side channels and the Spectre era

VMBus also has to defend against side-channel attacks across the partition boundary. The same Spectre / Meltdown / L1TF mitigations that apply to a multi-tenant hypervisor in general apply to Hyper-V specifically. Microsoft's broader hypervisor mitigation strategy interacts with VMBus mostly indirectly: the SynIC, the hypercall page, and the timer subsystem all needed audit and adjustment when these classes of attacks emerged. The detail is largely outside the scope of an article about the device model, but the takeaway is consistent with the rest of this section: any shared CPU resource between partitions is a potential attack surface, and "shared via the hypervisor's bus" is no exception.

The structural answer to all of this, the one Microsoft itself has been working toward, is to change the languages and the trust boundaries. To set that up, the next section first widens the field by comparing VMBus to its peer in the KVM world, virtio.

8. VMBus vs virtio: two answers to the same question

Hyper-V is not the only hypervisor with a paravirt I/O story. The KVM world evolved its own answer to the same problem at roughly the same time, and it ended up with a different design with different trade-offs. The standard is virtio.

The original virtio paper, Rusty Russell's "virtio: Towards a De-Facto Standard For Virtual I/O Devices" [@rusty-virtio-paper], was published in ACM SIGOPS Operating Systems Review in 2008, the same year Hyper-V shipped. The proposal was explicit in its motivation: every hypervisor was reinventing paravirt drivers, and a single hypervisor-independent specification could let one guest driver work everywhere. OASIS later standardised virtio 1.0 in 2016, then published virtio 1.1 in 2019 [@oasis-virtio-1-1] and virtio 1.2 as a Committee Specification in 2022 [@oasis-virtio-1-2].

A hypervisor-independent paravirtual I/O specification, governed by OASIS. A virtio device is presented to the guest over a transport (PCI, MMIO, or s390 channel I/O) that advertises capability bits. The data plane is a generic ring layout called a **virtqueue**: a ring of descriptors, an `avail` ring (guest-to-host), and a `used` ring (host-to-guest). Each device class (virtio-net, virtio-blk, virtio-scsi, virtio-fs, virtio-gpu) defines its own message format on top of virtqueues.

The same shape, viewed sideways

Architecturally, virtio and VMBus are sibling answers to the same problem.

```mermaid
flowchart LR
    subgraph virtio_pci["virtio over PCI"]
        gv["Guest virtio driver"]
        vq["virtqueue (descriptors + avail + used)"]
        host_be["Host backend (vhost-net, vhost-user, OpenVMM)"]
        gv -- "PIO doorbell write" --> host_be
        gv -- "shared memory" --- vq
        host_be -- "shared memory" --- vq
        host_be -- "MSI-X" --> gv
    end
    subgraph vmbus["Hyper-V VMBus"]
        gv2["Guest VSC"]
        ring["Two ring buffers + GPADL"]
        vsp["Host VSP (kernel)"]
        gv2 -- "Hypercall doorbell" --> vsp
        gv2 -- "shared memory" --- ring
        vsp -- "shared memory" --- ring
        vsp -- "SINT" --> gv2
    end
```

Both:

Use shared-memory rings for payload. (The phrase "shared-memory rings" hides a small subtlety: a ring buffer is a circular buffer with separate read and write indices. Producer and consumer can run concurrently as long as they only touch their own index, which is what makes ring buffers a wait-free communication primitive on cache-coherent hardware.)
Use a doorbell for signalling.
Batch many requests per doorbell so per-message hypercall cost amortises.
Have per-class device protocols layered on top of a common transport.
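The ring-plus-doorbell shape that both transports share can be sketched in a few lines. This is a deliberately generic model: the `Ring` class, its field names, and the `ringDoorbell` counter are hypothetical and match neither the VMBus ring layout nor a virtqueue. It shows two things from the list above: the producer and consumer each advance only their own index, and one doorbell can cover a whole batch of messages.

```javascript
// Minimal single-producer/single-consumer ring. Names are illustrative and
// do not correspond to the VMBus or virtqueue memory layouts.
class Ring {
  constructor(capacity) {
    this.slots = new Array(capacity);
    this.capacity = capacity;
    this.writeIndex = 0; // advanced only by the producer
    this.readIndex = 0;  // advanced only by the consumer
    this.doorbells = 0;  // how many times the producer signalled
  }

  // Producer side: returns false when the ring is full.
  push(msg) {
    if (this.writeIndex - this.readIndex === this.capacity) return false;
    this.slots[this.writeIndex % this.capacity] = msg;
    this.writeIndex += 1;
    return true;
  }

  // Producer side: one signal covers everything queued so far.
  ringDoorbell() {
    this.doorbells += 1;
  }

  // Consumer side: drain every message currently visible.
  drain() {
    const out = [];
    while (this.readIndex < this.writeIndex) {
      out.push(this.slots[this.readIndex % this.capacity]);
      this.readIndex += 1;
    }
    return out;
  }
}

const ring = new Ring(8);
for (let i = 0; i < 5; i += 1) ring.push(`packet-${i}`);
ring.ringDoorbell(); // a single signal for the whole batch
const received = ring.drain();
console.log(received.length, 'messages for', ring.doorbells, 'doorbell');
```

JavaScript here is single-threaded, so the sketch cannot demonstrate real concurrency; on real hardware the two indices live in shared memory and need the appropriate memory barriers.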

The differences are where the world bites:

| Dimension | VMBus | virtio (1.2) |
| --- | --- | --- |
| Transport | Software-only "bus", channel offer/open/close | PCI, MMIO, s390 channel I/O |
| Doorbell | Hypercall (`HvCallSignalEvent`) | Write to the notify region in a device BAR |
| Reverse signal | Synthetic interrupt (SINT) | MSI-X |
| Standardisation | Microsoft-owned, Open Specification Promise [@ms-tlfs] | OASIS-ratified, multi-vendor |
| Windows in-box drivers | Yes, every supported version | No; out-of-box signed VirtIO INFs from cloud vendors |
| Device classes beyond I/O | Yes: KVP, time sync, VSS, balloon | Limited; non-I/O often built on virtio-vsock or out-of-band agents |
| Cross-hypervisor portability | Hyper-V only | Universal: KVM, QEMU, Cloud Hypervisor, Firecracker, Xen HVM, OpenVMM |
| Spec governance | Single vendor under OSP | Multi-vendor with formal conformance clauses |
| Source for Linux side | drivers/hv/ [@kernel-hyperv-index] | drivers/virtio in the Linux tree |

Where each design wins

Virtio's strongest claim is portability. The same Linux guest VM image, with the same in-tree virtio drivers, runs on KVM, QEMU, Cloud Hypervisor, AWS Firecracker, and (since 2024) Microsoft's own OpenVMM, which added virtio backend support. A workload that has to move between cloud providers benefits from this directly: the guest does not need a different driver stack per host.

Virtio also has a richer multi-vendor governance story. The spec is OASIS-ratified, with explicit conformance clauses; multiple commercial hypervisors implement it; multiple SmartNIC vendors implement virtio data planes in hardware (the vDPA and VDUSE work, described by Red Hat [@redhat-vdpa] and the Linux kernel VDUSE doc [@kernel-vduse]).

VMBus's strongest claim is integration. Every supported Windows ships with the VSCs in-box; there is nothing for an admin to install. The transport carries not just I/O but a service catalogue: KVP for guest configuration, time sync, VSS for online backup, the heartbeat and shutdown channels. The TLFS, while owned by Microsoft, is published under the Open Specification Promise and is a single document a guest author can read end-to-end.

This is why "VirtIO drivers for Windows" exist as a separate project (the Fedora/Red Hat-signed virtio-win package) for KVM clouds: out of the box, Windows does not know virtio. The Hyper-V world inverts the problem: out of the box, Linux does not need any third-party install because the drivers are upstream.

Where they coexist

The most interesting recent development is that the two camps have stopped being purely competitive. Microsoft's OpenVMM [@github-openvmm] implements both VMBus and virtio backends, so a Linux guest using virtio drivers can run on a Microsoft-developed VMM, and a Windows guest using VMBus drivers can run on the same VMM. This is partially ideological (Microsoft is no longer pretending its way is the only way) and partially pragmatic (a single VMM that supports both transports is simpler than maintaining two).

Beyond the protocol-level comparison, both VMBus and virtio sit inside a larger composition with hardware passthrough, where the transport becomes the slow path and a real PCIe device carries the steady-state traffic.

Hardware passthrough as a complement

The composition that runs almost every modern Azure VM is VMBus + SR-IOV, packaged as Accelerated Networking [@ms-accelerated-networking]. The same VM gets both a synthetic NIC (netvsc over VMBus) and an SR-IOV virtual function. The Linux netvsc driver documentation describes the failover mechanic: "If SR-IOV is enabled in both the vSwitch and the guest configuration, then the Virtual Function (VF) device is passed to the guest as a PCI device. In this case, both a synthetic (netvsc) and VF device are visible in the guest OS and both NIC's have the same MAC address. The VF is enslaved by netvsc device. The netvsc driver will transparently switch the data path to the VF when it is available and up." (Linux kernel: netvsc [@kernel-netvsc]).

When live migration starts, Azure revokes the VF, the data plane falls back to the netvsc/VMBus path, the VM moves, and a new VF on the destination host gets re-attached, all without dropping TCP connections. The VMBus path was never the production hot path, but its existence is what enables migration. The KVM world's analogue is vDPA, which gives a virtio-shaped guest interface backed by a hardware data plane.
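The failover behaviour can be modelled as a two-state choice. The sketch below is a hypothetical model, not the netvsc driver's code: `SyntheticNic`, `attachVf`, `revokeVf`, and `activePath` are invented names. It follows the sequence just described: the VF carries traffic while it is present and up, and the VMBus path takes over whenever the VF is revoked.

```javascript
// Hypothetical model of the netvsc/VF data-path choice; not the driver's code.
class SyntheticNic {
  constructor() {
    this.vf = null; // SR-IOV virtual function, when one is attached
  }

  attachVf(vf) {
    this.vf = vf;
  }

  revokeVf() {
    this.vf = null;
  }

  // Transmit over the VF when it is present and up, otherwise over VMBus.
  activePath() {
    return this.vf && this.vf.up ? 'vf' : 'vmbus';
  }
}

const nic = new SyntheticNic();
const pathLog = [nic.activePath()]; // before a VF is offered
nic.attachVf({ up: true });
pathLog.push(nic.activePath());     // steady state
nic.revokeVf();                     // live migration begins
pathLog.push(nic.activePath());
nic.attachVf({ up: true });         // new VF on the destination host
pathLog.push(nic.activePath());
console.log(pathLog.join(' -> '));
```

The guest-visible interface never changes across these transitions, which is why established connections can survive the move.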

A modern Azure NIC stack is pushing this even further. Azure Boost [@ms-azure-boost] moves both storage and networking data planes off the host CPU into dedicated FPGAs, with a stable Microsoft-engineered NIC interface called MANA [@ms-mana]. Microsoft's documentation reports up to 200 Gbps of network bandwidth and 6.6 million IOPS on local storage with this design, with the host's vmswitch still acting as the live-migration fallback path. The architectural insight is that the VMBus-based slow path is the durable invariant; what changes is whether the steady-state data plane is software, an SR-IOV VF, or a SmartNIC firmware path. Frameworks like DPDK [@dpdk-about] sit on top of whichever data plane the VM exposes.

What none of this changes is the property Section 7 cared about: as long as a host-side VSP exists and parses guest-controlled bytes in kernel C/C++, the bug class is open. The next section is about the architectural move that closes it.

9. OpenVMM and OpenHCL: the 2024 open-source pivot

In 2024, Microsoft did two things that would have been hard to imagine a decade earlier. First, they open-sourced OpenVMM [@github-openvmm], a Rust implementation of the virtualization stack including the VSPs and the VMBus protocol. Second, they introduced OpenHCL [@ms-openhcl-deep-explainer], a "paravisor" configuration of OpenVMM that runs inside a confidential VM as a higher-trust mediator between the workload and the (now-untrusted) host.

Both moves are explained by the same trend the article has been circling: confidential computing fundamentally inverts the trust boundary, and the device model has to follow.

A higher-privileged software layer that runs *inside* a guest VM (not on the host) and mediates the guest's interaction with the hypervisor. In the Hyper-V model, a paravisor lives in VTL2 of the same VM whose workload runs in VTL0; the host hypervisor is outside the VM's trust boundary. The paravisor presents the workload with a familiar VMBus + VSP interface while internally talking to a hardware-isolated confidential VM substrate (AMD SEV-SNP or Intel TDX).

What changed in confidential computing

The classical Hyper-V trust model places the root partition at the apex of trust. The guest trusts the host. Memory the guest writes is, in the worst case, readable by the host. In confidential computing, that is no longer acceptable. A regulated workload (a healthcare database, a financial processor) needs to run in a VM whose contents are protected even from a malicious or compromised hypervisor. AMD's SEV-SNP and Intel's TDX are CPU features that encrypt and integrity-protect VM memory in hardware so that a compromised host cannot read the guest's secrets.

Azure Confidential Computing [@ms-confidential-computing] made these capabilities available as a product starting around 2022. The Azure confidential VM options page [@ms-coco-vm-options] documents the SKUs.

This breaks the old VMBus story. In the classical model, the host's vmswitch.sys reads the guest's network packets out of the VMBus ring. In a confidential VM you can no longer let the host see those bytes; doing so would defeat the entire point. So the question becomes: where does the synthetic-device backend live, if not in the host?

The paravisor answer

The Linux kernel's Hyper-V CoCo VMs document [@kernel-coco] describes the design directly: "Paravisor mode. In this mode, a paravisor layer between the guest and the host provides some operations needed to run as a CoCo VM. The guest operating system can have fewer CoCo enlightenments than is required in the fully-enlightened case ... some aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS must be enlightened for other aspects."

OpenHCL is that paravisor. It runs in a higher-trust virtual trust level inside the same confidential VM (VTL2), it has access to the encrypted-memory primitives the CPU provides, and it presents the workload (in VTL0) with the same VMBus + VSP world a non-confidential VM would see. The workload OS does not need to be heavily modified; it sees what looks like Hyper-V, talks to what look like normal VSPs, and never has to know that those VSPs are now inside its own VM rather than on the host.

```mermaid
flowchart TD
    HW["Confidential CPU (SEV-SNP / TDX)"]
    HV["Host hypervisor (untrusted by the workload)"]
    subgraph CoCoVM["Confidential VM (memory encrypted)"]
        VTL2["VTL2: OpenHCL paravisor (Rust VSPs)"]
        VTL0["VTL0: workload OS (Windows or Linux, lightly enlightened)"]
        VTL0 -- "VMBus, looks normal" --- VTL2
    end
    HW --> HV
    HV --> CoCoVM
    HV -. "no access to guest plaintext" .-> CoCoVM
```

The Rust rewrite

The other half of the story is memory safety. Recall Section 7's CVE list: the headline Hyper-V escapes listed there trace back to C/C++ kernel code handling guest-controlled input. OpenVMM's choice to implement the entire VMM, including the VSPs, in Rust is a direct response to that history. Safe Rust's ownership model and bounds checks rule out, by construction, a large class of memory-safety bugs (use-after-free, out-of-bounds access on slices, double-free) of the kind behind those CVEs.

This does not magically eliminate every vulnerability. A logic bug in a state machine, an integer-overflow on a length field, a side-channel timing leak: all of these still exist in Rust. But the categories that produced CVE-2017-0075, CVE-2021-28476, and CVE-2024-21407 are exactly the categories Rust was designed to make hard.

Garbage-collected languages are wrong for a kernel-mode parser: GC pauses are unacceptable in a hypervisor-adjacent fast path, and you cannot afford a runtime that allocates memory during interrupt handling. Rust's compile-time memory safety with no GC is, today, the only mature option that gives you both the safety and the predictability a VSP needs. Microsoft's choice is consistent with the rest of the industry; comparable low-level systems infrastructure (Cloudflare's `quiche`, Mozilla's `neqo`, the Android Bluetooth stack) has converged on Rust.

What you can actually look at

OpenVMM is not a press release; it is a public repository that ships:

The full Rust source tree at github.com/microsoft/openvmm [@github-openvmm].
A separate repository for the Linux kernel fork that the paravisor runs on top of, at github.com/microsoft/OHCL-Linux-Kernel [@github-ohcl-linux].
Project documentation centred at openvmm.dev [@openvmm-dev].
Both VMBus and virtio backends, so the same VMM can host Windows guests on VMBus and Linux guests on virtio.
Documentation through the deeper Microsoft Tech Community explainer [@ms-openhcl-deep-explainer] and the original announcement [@ms-openhcl-announce] describing the paravisor's role.

For a security researcher or a regulated-cloud customer, this is a meaningful change. For the first time, the VMBus + VSP stack is auditable end-to-end in source.

If you want to see how a VSP actually consumes a channel, the OpenVMM repository contains the Rust modules that implement the VMBus channel state machine. Cloning the repo and grepping for `Channel::open` and `RingBuffer` shows the same offer/open/close/rescind pattern Section 3 described, expressed in Rust types whose lifetimes the compiler checks. Reading the same logic in Rust after reading the Linux C version in `drivers/hv/channel_mgmt.c` is a useful exercise; the abstraction is identical, and the safety guarantees diverge.

What still has to be solved

The kernel CoCo doc is candid about an open architectural problem that OpenHCL alone cannot solve: "Unfortunately, there is no standardized enumeration of feature/functions that might be provided in the paravisor, and there is no standardized mechanism for a guest OS to query the paravisor for the feature/functions it provides. The understanding of what the paravisor provides is hard-coded in the guest OS." (Linux kernel: CoCo VMs [@kernel-coco]).

In other words, the TLFS gave us a portable contract between guests and Hyper-V hypervisors. The paravisor world does not yet have an equivalent portable contract between guests and paravisors. Today's guests have OpenHCL-specific knowledge baked in. A future "paravisor TLFS" would let any compliant paravisor host any compliant guest, the same way the original TLFS did for the hypervisor. That standard does not exist yet, and writing it is the most consequential open problem in this corner of the architecture.

The architecture is moving. Section 10 takes stock of what that means for engineers building or operating on this stack today.

10. Engineering takeaways and open problems

A working architecture is one where the trade-offs are visible. Hyper-V's enlightenments + VMBus + VSP/VSC stack is a working architecture in exactly that sense: every property it has, including the security ones, is a consequence of design choices a reader can name.

What the design optimises for

Three explicit optimisations:

In-box drivers for closed-source guests. Hardware virtualization handles privileged CPU instructions; the guest only needs to load a VMBus client driver to opt in to the fast path. Every supported Windows ships those drivers in-box. Every modern Linux ships them in-tree. There is no "install paravirt drivers" step, which is a large reason "it just works."
A single transport that carries everything. VMBus carries 12+ device classes plus non-device services (KVP, time sync, VSS, balloon, heartbeat). One protocol, one set of primitives, one debugging surface. This is the engineering equivalent of "everything is a file" applied to inter-partition communication.
Live migration. Because the data plane is software in the root partition, the VM is not bound to a specific host. The VSPs serialise their state during migration without guest cooperation. This is the property that makes VMBus the durable invariant under hardware-passthrough acceleration: SR-IOV gives you throughput; VMBus gives you mobility.

What it pays for those properties

Two costs:

The host CPU is on the data plane. A software ring serviced by vmswitch.sys cannot sustain 100 GbE line rate on a single host CPU core. Microsoft's answer is hybrid composition with SR-IOV (Accelerated Networking [@ms-accelerated-networking]) and SmartNIC offload (Azure Boost + MANA [@ms-azure-boost]). The KVM analogue is vDPA [@redhat-vdpa]. Both camps accept the structural truth that for the highest throughputs, the host CPU has to leave the data plane.
The host kernel parses guest-controlled bytes. Section 7's CVE record is the catalogue of what that costs. The architectural answer is OpenHCL: move the parser into the guest's own trust boundary and rewrite it in Rust.

A four-property idealisation

It is useful to write down what an idealised paravirt I/O stack would do, so it is clear which properties any real stack today is trading away.

The four idealised properties:

1. Zero hypercalls per packet in steady state.
2. Live-migration parity with a software baseline.
3. Cross-vendor / cross-hypervisor portability of the guest driver.
4. No host-side memory-unsafe parser of guest-controlled data.

| Approach | (1) Zero hypercall | (2) Live migration | (3) Portability | (4) No unsafe host parser |
| --- | --- | --- | --- | --- |
| VMBus + in-kernel VSP | partial (batched) | yes | no | no |
| virtio + vhost-net | partial (batched) | yes | yes | no |
| SR-IOV / DDA | yes | no | no | yes |
| Accelerated Networking (VMBus + SR-IOV) | yes (steady) | yes | no | no |
| vDPA | yes | partial | yes | no |
| OpenHCL paravisor + VMBus | partial | yes | partial | yes |
| Azure Boost + MANA | yes | yes | no | partial |

No single approach today matches all four properties. The Hyper-V production composition is roughly (VMBus baseline) + (Accelerated Networking for throughput) + (OpenHCL for confidential workloads). The KVM-world composition is (virtio baseline) + (vDPA / SmartNIC for throughput). SmartNIC-based stacks (Azure Boost, AWS Nitro, Google's offload) approach the same four-corner problem from yet another angle.
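The "no single approach" claim can be checked mechanically against the matrix. The sketch below transcribes the table's cells into a plain object and counts, per approach, how many properties are an outright "yes". The scoring rule is an assumption made for this illustration: qualified answers such as "yes (steady)" count as yes, while "partial" does not.

```javascript
// Cell values transcribed from the table above, in column order (1) to (4).
const matrix = {
  'VMBus + in-kernel VSP': ['partial (batched)', 'yes', 'no', 'no'],
  'virtio + vhost-net': ['partial (batched)', 'yes', 'yes', 'no'],
  'SR-IOV / DDA': ['yes', 'no', 'no', 'yes'],
  'Accelerated Networking (VMBus + SR-IOV)': ['yes (steady)', 'yes', 'no', 'no'],
  'vDPA': ['yes', 'partial', 'yes', 'no'],
  'OpenHCL paravisor + VMBus': ['partial', 'yes', 'partial', 'yes'],
  'Azure Boost + MANA': ['yes', 'yes', 'no', 'partial'],
};

// Assumption: anything starting with "yes" counts; "partial" and "no" do not.
const isYes = (cell) => cell.startsWith('yes');

const scores = Object.fromEntries(
  Object.entries(matrix).map(([name, cells]) => [name, cells.filter(isYes).length])
);

const meetsAll = Object.keys(matrix).filter((name) => scores[name] === 4);

console.log(scores);
console.log('Approaches meeting all four properties:', meetsAll.length);
```

Under that scoring rule no row reaches four, which is the point of the paragraph above: production stacks compose several rows rather than picking one.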

This is a synthesis, not a single-source claim: the matrix combines properties documented separately in the Microsoft Accelerated Networking docs [@ms-accelerated-networking], the Linux kernel CoCo doc [@kernel-coco], the Discrete Device Assignment doc [@ms-dda], the SR-IOV overview [@ms-sriov-overview], the Linux netvsc driver doc [@kernel-netvsc], the VDUSE userspace interface [@kernel-vduse], the vPCI doc [@kernel-vpci], and the OpenHCL explainer [@ms-openhcl-deep-explainer]. Each individual cell is sourced; the ranking is the author's reading of those sources.

Practical pitfalls for operators

A few things the customer-facing docs do not always say plainly:

vmbusrhid is not low-risk. The keyboard/mouse channel is a kernel-level RPC surface from guest to root. Treat it the same way you would treat netvsc when modelling threat exposure.
Generation-2 VMs reduce attack surface. Choosing Generation-2 for new workloads removes the legacy IDE/PS/2/PIC emulators from the host data path entirely (Microsoft Learn: Gen 1 vs Gen 2 [@ms-gen1-gen2-vms]).
Mixing in-box and out-of-band Integration Services breaks things. Modern Windows and modern Linux already have the drivers; installing the legacy LIS package on top can break MSI-X handling and PCI passthrough (Linux kernel: overview [@kernel-hyperv-overview]).
DDA is not SR-IOV. Discrete Device Assignment covers any PCIe device passthrough, but Microsoft formally supports only GPUs and NVMe as device classes (Microsoft Learn: DDA planning [@ms-dda]).
Confidential VMs do not have the same device set. Hardware constraints reduce or alter the device classes available; always validate that the specific synthetic devices your workload depends on are present in the target SKU (Linux kernel: CoCo [@kernel-coco]).

Note:
1. Confidential VM (SEV-SNP / TDX)? Use the OpenHCL paravisor mode (Azure CoCo VM options [@ms-coco-vm-options]).
2. Need ≥40 Gbps with live migration? Use Accelerated Networking; on Boost-enabled SKUs, Boost adds another tier of offload.
3. Need ≥100 Gbps and accept binding to host? Use Discrete Device Assignment / SR-IOV.
4. Maximum guest portability across hypervisors? Use virtio; for bandwidth-sensitive workloads, vDPA.
5. Default Hyper-V workload, broad device coverage, native migration? VMBus + VSP (the default).
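The decision list in the note can be read as a first-match-wins rule chain. The function below is a minimal sketch of that reading; `choose_io_path` and its parameter names are hypothetical, and the strict interpretation of each condition (for example, that rule 3 applies only when live migration is not needed) is an assumption rather than anything the cited docs state.

```python
def choose_io_path(confidential=False, gbps=0, needs_live_migration=True,
                   needs_portability=False, bandwidth_sensitive=False):
    """Walk the five rules from the note in order; the first match wins."""
    # Rule 1: confidential VMs use the paravisor path.
    if confidential:
        return "OpenHCL paravisor"
    # Rule 2: high throughput while keeping live migration.
    if gbps >= 40 and needs_live_migration:
        return "Accelerated Networking"
    # Rule 3: very high throughput, and binding to the host is acceptable.
    if gbps >= 100 and not needs_live_migration:
        return "Discrete Device Assignment / SR-IOV"
    # Rule 4: portability across hypervisors.
    if needs_portability:
        return "virtio + vDPA" if bandwidth_sensitive else "virtio"
    # Rule 5: the default.
    return "VMBus + VSP"
```

For example, `choose_io_path(gbps=50)` returns "Accelerated Networking", while a call with no arguments falls through to "VMBus + VSP".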

Open problems worth watching

The substantive open problems are:

A standardised paravisor feature-enumeration interface. OpenHCL is the first auditable paravisor, but there is no portable contract a guest can use to query "what does this paravisor support." The TLFS gave us this for hypervisors; the paravisor analogue is missing (Linux kernel: CoCo [@kernel-coco]).
Confidential-VM-friendly live migration with paravirt devices. Hardware-attested state cannot be cloned trivially; today's pragmatic answer is to constrain migration in CoCo VMs. A general solution is open.
A formal model of the VMBus offer/rescind state machine. The kernel docs describe it narratively. A model that the VSP code could be checked against would let static analysis rule out the bug class behind the headline CVEs.
Live-migrating stateful SR-IOV VFs without device cooperation. Vendor proposals exist; an industry standard does not.
Erasing memory-unsafety in legacy VSPs. The Rust rewrite path in OpenVMM is correct; the multi-year engineering effort to convert every existing VSP is real. CVE-2024-21407 is recent enough to remind everyone the bug class is still producing fresh entries.

What to remember in five years

The most important sentence in this article is one I have been quietly preparing throughout: the durable architectural invariant in Hyper-V is shared-memory ring + doorbell, with a published guest-side contract. Everything else, including the choice of programming language for the VSP, the question of whether the data plane is software or hardware, and even whether the trust boundary places the VSP on the host or in a paravisor, is implementation. The transport is the invariant. That is the lesson the next decade of CoCo VMs and SmartNIC offload is converging toward: keep the contract stable, and let everything else change.
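To make "shared-memory ring + doorbell" concrete, here is a toy single-producer, single-consumer ring in Python. It is not the VMBus wire format: the class name `Ring`, the 4-byte length prefix, and the rule of ringing the doorbell only on the empty-to-non-empty transition are all assumptions chosen for illustration. Two things carry over to real stacks in spirit: batching means most sends need no signal, and the consumer must validate every length the producer wrote.

```python
class Ring:
    """Toy byte ring shared by one producer and one consumer."""

    def __init__(self, size, doorbell):
        self.buf = bytearray(size)
        self.size = size
        self.read = 0             # index advanced only by the consumer
        self.write = 0            # index advanced only by the producer
        self.doorbell = doorbell  # callable standing in for the signal

    def used(self):
        return (self.write - self.read) % self.size

    def free(self):
        # One byte stays unused so that full and empty are distinguishable.
        return self.size - 1 - self.used()

    def push(self, payload):
        """Producer side: enqueue one length-prefixed message."""
        frame = len(payload).to_bytes(4, "little") + payload
        if len(frame) > self.free():
            return False
        was_empty = self.used() == 0
        for byte in frame:
            self.buf[self.write] = byte
            self.write = (self.write + 1) % self.size
        if was_empty:
            # Signal only when the consumer may be idle; later pushes batch.
            self.doorbell()
        return True

    def _peek(self, offset):
        return self.buf[(self.read + offset) % self.size]

    def pop(self):
        """Consumer side: dequeue one message, validating its length."""
        available = self.used()
        if available == 0:
            return None
        length = int.from_bytes(bytes(self._peek(i) for i in range(4)), "little")
        # The length field is producer-controlled, so never trust it blindly.
        if length > available - 4:
            raise ValueError("length field exceeds data in ring")
        data = bytes(self._peek(4 + i) for i in range(length))
        self.read = (self.read + 4 + length) % self.size
        return data
```

Pushing two messages back to back rings the doorbell once, which is the sense in which batched rings pay for only part of the per-packet signalling cost.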

FAQ

No. The drivers (`hv_vmbus`, `hv_netvsc`, `hv_storvsc`, `hv_utils`, `pci-hyperv`, `hv_balloon`) have been in the upstream Linux kernel since 2.6.32 in December 2009 and ship in every mainstream distribution. The legacy LIS package is a holdover from the era before in-tree support and can in fact break MSI-X handling and PCI passthrough if installed on top of a modern kernel (Linux kernel: Hyper-V overview [@kernel-hyperv-overview]).

Because the trust gradient is asymmetric. The VSP runs in the root partition's kernel, the most privileged context on the box; the VSC runs in a normal guest kernel. Bytes flowing from guest to host get parsed by code with full system privilege. A VSC bug typically harms only the guest; a VSP bug can be a cross-tenant compromise. The pattern is visible in the CVE record: CVE-2017-0075 [@nvd-cve-2017-0075], CVE-2021-28476 [@nvd-cve-2021-28476], and CVE-2024-21407 [@nvd-cve-2024-21407] all hit host-side parsers.

For live migration. SR-IOV gives you near-bare-metal throughput but binds the VM to a specific physical NIC; you cannot migrate that state. Keeping a VMBus-backed `netvsc` device in the same guest gives the hypervisor a software path it can fall back to during migration windows. The Linux kernel netvsc doc describes this failover explicitly: when SR-IOV is enabled, the VF is enslaved by netvsc and the data path switches transparently when the VF is up (Linux kernel: netvsc [@kernel-netvsc]).

OpenHCL is a *configuration* of OpenVMM, not a separate codebase. OpenVMM is the Rust virtualization stack at github.com/microsoft/openvmm [@github-openvmm]; OpenHCL is OpenVMM run as a paravisor inside a confidential VM's higher-trust virtual trust level (VTL2), so that the synthetic-device backends sit inside the guest's own trust boundary rather than on a host the guest cannot trust. The same Rust code can run as a host-side VMM (when paired with a hypervisor on the host) or as an in-guest paravisor (when running inside a SEV-SNP or TDX VM).

Both directions exist with caveats. OpenVMM, when used as a host VMM, supports both VMBus and virtio backends, so a Linux virtio guest can run on a Microsoft-developed VMM (github.com/microsoft/openvmm [@github-openvmm]). Native Hyper-V on a Windows Server host historically expects VMBus-driven guests; there is no in-box virtio device emulation on a stock Hyper-V Server. KVM hosts can technically present a VMBus-shaped device, but in practice the production answer on KVM is virtio.

Generation-2 VMs use UEFI with Secure Boot, boot from synthetic SCSI, and have no emulated IDE, PS/2, or PIC in the data path (Microsoft Learn: Gen 1 vs Gen 2 [@ms-gen1-gen2-vms]). Every emulator that is removed is one fewer parser running in the most privileged kernel on the host, so the host-side attack surface is meaningfully smaller. Generation-1 still exists for legacy guests that only know how to boot from BIOS + IDE.

VBS uses the Hyper-V hypervisor to split a single Windows install into VTL0 (the normal kernel and apps) and VTL1 (the Secure Kernel and trustlets like `lsaiso.exe`). The hypervisor enforces that VTL0 cannot read or modify VTL1's memory, even with kernel privileges. So an attacker who already has SYSTEM-level code execution in the normal world cannot trivially extract LSASS secrets or load arbitrary unsigned kernel code; the hypervisor stops them. This works on any modern Windows machine with the right CPU features, regardless of whether you ever run a VM yourself (Microsoft Learn: Windows Server 2016 What's New [@ms-server-2016]).

Above Ring Zero: How the Windows Hypervisor Became a Security Primitive

noreply@paragmali.com (Parag Mali) — Sun, 10 May 2026 00:00:00 GMT

**The Windows hypervisor is the program that loaded before Windows did.** It runs at a privilege level the Windows kernel cannot reach and owns the page tables that decide which memory the Windows kernel may even see. Virtualization-Based Security, Credential Guard, HVCI (Memory Integrity in Windows Security), Application Control, VBS Enclaves, and System Guard Secure Launch are all built by composing five primitives the hypervisor exposes -- partitions, hypercalls, intercepts, SynIC, and per-VTL SLAT. The substrate is real, alive, and producing two to four public CVEs per year; the residual attack surface (firmware below, side channels above, IOMMU bypass beside, hypervisor rollback) is where Windows security still earns its hardest miles.

1. Above Ring Zero

On a Windows 11 machine with VBS turned on, a kernel-mode driver running with full Ring-0 privilege cannot read a single byte of the LSASS process's credential cache. It cannot load an unsigned driver. It cannot patch ntoskrnl.exe. It cannot disable HVCI without a reboot. None of this is enforced by Windows. It is enforced by a different program -- one that loaded before Windows did, that runs at a privilege level the Windows kernel cannot reach, and that owns the page tables that say which memory the Windows kernel may even see. That program is the Windows hypervisor [@ms-hyperv-architecture, @ms-tlfs-vsm].

The intuition this fact violates is older than most readers' careers. "SYSTEM owns the box." Every introductory security course teaches it. Local administrator escalates to SYSTEM, SYSTEM loads a driver, the driver runs in the kernel, and the kernel can do anything to the machine. That model is correct for a Windows installation running without Virtualization-Based Security. It is wrong, in three specific and load-bearing ways, for a Windows installation that has VBS turned on.

Virtualization-Based Security (VBS). A Windows security architecture that uses the Hyper-V hypervisor to create a small, isolated execution environment alongside the normal Windows operating system. The hypervisor allocates a portion of memory, configures its second-level page tables to make that memory unreadable and unwritable from normal kernel mode, and runs Microsoft-signed code there -- the Secure Kernel and isolated user-mode trustlets -- that the regular NT kernel cannot reach. Credential Guard, HVCI, Application Control, and System Guard all sit on top of this primitive [@ms-tlfs-vsm].

The binary in question is named hvix64.exe on Intel hosts and hvax64.exe on AMD hosts. It is loaded by hvloader.efi before winload.exe ever runs. By the time the Windows boot manager hands control to the NT kernel, the hypervisor has already configured the CPU's virtualization extensions, allocated its own private memory, taken ownership of the IOMMU, and set up the per-partition second-level page tables that decide which physical pages each partition can see [@ms-tlfs-pdf]. From the NT kernel's point of view, the machine starts up already inside a guest partition. There is no escape upward.

Loose security writing sometimes calls the hypervisor's privilege level "Ring -1." That phrase is colloquial. Intel's manuals say "VMX root operation"; AMD's manuals say "SVM host mode." Both terms denote a CPU operating mode that sits architecturally outside the four-ring privilege stack the guest OS sees, not a fifth ring inside it.

This article is about the program that loaded first. The siblings in this series -- on the Secure Kernel, on Credential Guard and NTLMless, on Secure Boot, and on Adminless -- all assume what this article explains. Each of them describes a policy: the Secure Kernel enforces code integrity; Credential Guard isolates LSASS; Adminless raises the bar on local administrator. None of those policies would be enforceable without a piece of software running at a privilege level the policy's adversary cannot reach. The hypervisor is that piece of software, and "security primitive" is how Microsoft, the security research community, and the bug-bounty market all describe its current role.

By the end of this article you will know five things. First, why the hypervisor became a security primitive -- the architectural failure of Ring-0 defenses that Microsoft fought for a decade and finally gave up on in 2015. Second, how it became one, in three steps: Popek and Goldberg's 1974 virtualizability theorem; Intel VT-x and AMD-V in 2005-2006; and David Hepkin and Arun Kishan's 2013 patent on hierarchical Virtual Trust Levels [@us9430642b2-patent]. Third, what it enforces, feature by feature, with the hypervisor primitive that backs each: HVCI rides on per-VTL SLAT; Credential Guard rides on SynIC plus the secure-call ABI; System Guard Secure Launch rides on DRTM [@ms-system-guard-secure-launch]. Fourth, where it has actually failed in public -- six worked CVEs across three distinct attack classes, all narrowly localized. Fifth, what is structurally outside its mandate: firmware below the hypervisor, microarchitectural side channels above it, IOMMU bypass beside it, and hypervisor rollback through the update pipeline.

The story is half engineering and half conceptual inversion. How did a server-consolidation hypervisor that shipped in 2008 with Windows Server 2008 -- a product whose original marketing pitch was "run more VMs per box" -- become the architectural substrate that protects every load-bearing Windows security boundary in 2026? The answer begins in 1974, with a paper that defined what a hypervisor even is. But the political and engineering thread begins five years before Hyper-V shipped, in San Mateo, California.

2. Origins -- Connectix to Viridian to Hyper-V

Microsoft entered the virtualization market three years late and by acquisition. On February 19, 2003, the company bought Connectix, a small San Mateo software house founded in 1988 that had built Virtual PC for Macintosh and, later, Virtual PC for Windows. The Connectix engineers became the nucleus of what Microsoft would internally call the Windows Server Virtualization team. The acquired products shipped as Microsoft Virtual PC 2004 and Microsoft Virtual Server 2005. Both were Type-2 hypervisors -- user-mode applications that ran on top of Windows, using software techniques rather than CPU virtualization extensions, because the CPU virtualization extensions did not yet exist on shipping x86 hardware.

Type-1 hypervisor. A hypervisor that runs directly on hardware rather than as an application on top of a host operating system. The hypervisor owns the CPU, the second-level page tables, and (in the security-relevant case) the IOMMU; guest operating systems run at a lower privilege level, in partitions or virtual machines that the hypervisor schedules and isolates. IBM's CP-67/CMS in 1968 is the genre's origin; VMware ESX, Xen, and the Microsoft hypervisor (`hvix64.exe`/`hvax64.exe`) are the modern examples [@wp-hypervisor].

In 2005, the team began a new project under the codename "Viridian." The goal was a Type-1 micro-kernelized hypervisor for x86-64 -- a fresh build, not a derivative of Virtual Server -- that required hardware virtualization extensions at install time. Intel's VT-x had shipped in November 2005 with the Pentium 4 662/672; AMD-V had shipped on May 23, 2006 with the Socket AM2 platform, initially available across Athlon 64 X2 and Athlon 64 FX and select Athlon 64 models. Both were now broadly enough deployed that Microsoft could make hardware virtualization a system requirement rather than a configuration option. Three years later, on June 26, 2008 (Wikipedia's body text gives this date; the infobox states June 28), Hyper-V reached RTM and was delivered as a Windows Server 2008 feature through Windows Update [@wp-hyperv].

Microsoft ships two hypervisor binaries: hvix64.exe for Intel hosts (using VT-x) and hvax64.exe for AMD hosts (using AMD-V). The instruction-set-architecture divergence is real -- Intel uses vmcall to enter the hypervisor; AMD uses vmmcall -- but the hypercall ABI surface above that single instruction is identical, so the rest of the Microsoft hypervisor codebase is shared between the two binaries.

The 2008 design choices are worth naming individually because the ones that mattered for server consolidation turned out, twelve years later, to also be the ones that mattered for security. Three deserve flagging:

Micro-kernelized architecture. The hypervisor binary contains only the minimum machinery needed to virtualize the CPU, schedule VMs, and enforce memory isolation. It does not contain device drivers. It does not contain a network stack. It does not contain a filesystem.
Root partition plus child partitions. From the Microsoft architecture documentation: "The Microsoft hypervisor must have at least one parent, or root, partition, running Windows. The virtualization management stack runs in the parent partition and has direct access to hardware devices. The root partition then creates the child partitions which host the guest operating systems" [@ms-hyperv-architecture]. The root partition is a full Windows install; the child partitions are guest VMs.
VMBus, VSP, and VSC. Inter-partition I/O happens over the VMBus -- a paravirtualized message channel. A Virtualization Service Provider (VSP) runs in the root partition and owns the real device; a Virtualization Service Client (VSC) runs in each child partition and talks to the VSP over VMBus. Device emulation lives in the root partition's user-mode and kernel-mode code, not in the hypervisor binary itself. This is the choice that, twelve years later, kept the hypervisor's Trusted Computing Base small enough to be defensible.

flowchart TD
    subgraph Root["Root partition (Windows Server)"]
        RD["Real device drivers"]
        VSP["Virtualization Service Providers"]
        VMM["VM Worker Processes (vmwp.exe)"]
    end
    subgraph Child1["Child partition 1 (guest OS)"]
        VSC1["Virtualization Service Clients"]
        Guest1["Guest kernel + apps"]
    end
    subgraph Child2["Child partition 2 (guest OS)"]
        VSC2["Virtualization Service Clients"]
        Guest2["Guest kernel + apps"]
    end
    HV["Microsoft Hypervisor (hvix64.exe / hvax64.exe)"]
    HW["Hardware (CPU, RAM, NIC, disk)"]
    Root -. VMBus .- Child1
    Root -. VMBus .- Child2
    Root --> HV
    Child1 --> HV
    Child2 --> HV
    HV --> HW

The micro-kernel, root-plus-child, and VMBus choices were defensible server engineering. The rationale was that emulating a NIC, a SCSI controller, or a graphics adapter inside the hypervisor binary would balloon the binary's size, tie its code-review cycle to that of every device the company shipped, and force the same security-critical code that scheduled CPUs to also parse Ethernet frames. Putting device emulation in a normal Windows process inside the root partition -- the VM Worker Process vmwp.exe -- meant the hypervisor binary could stay small enough to reason about.

The 2008 design goal was, again, server consolidation. Microsoft's positioning materials at the time named "run more VMs per box, get better hardware use" as the customer pitch. Nothing in the 2008 Hyper-V documentation describes the hypervisor as a security primitive for the host OS. The security re-purposing -- the moment Hyper-V's hardware-privilege isolation became the way Windows itself protected its own kernel from itself -- did not arrive until 2015. To understand why it arrived at all, we have to back up thirty-four years to a 1974 paper that defined what virtualization formally requires.

3. The Theoretical Anchor -- Popek, Goldberg, and SLAT

Before Microsoft could build a hypervisor that ran security-critical code at a higher privilege than the Windows kernel, two unrelated decisions had to land. One was made in 1974, by two researchers who would never see Windows. The other was made in 2005, by Intel.

In July 1974, Gerald Popek of UCLA and Robert Goldberg of Harvard published "Formal Requirements for Virtualizable Third Generation Architectures" in Communications of the ACM. The paper laid down three properties any "true" virtual machine monitor must satisfy:

Equivalence. Programs run on the VMM exhibit behavior essentially identical to behavior on the bare machine, except for differences due to timing and resource availability.
Resource control. The VMM, not the guest, controls the system resources -- CPU time slices, memory, devices.
Efficiency. A statistically dominant subset of the instruction stream executes directly on hardware, without VMM intervention.

The theorem that gave the paper its lasting reputation followed from those properties. Let a sensitive instruction be one that either reads or modifies privileged state (the processor's mode bits, page-table base register, interrupt mask). Let a privileged instruction be one that traps when executed in user mode. Then a sufficient condition for an ISA to be virtualizable is that every sensitive instruction is privileged. The intuition is simple: the VMM must get a chance to see -- and to handle -- every guest action that touches the machine's privileged state. If the CPU silently lets the guest do something privileged-feeling without trapping, the VMM cannot maintain equivalence and control simultaneously.

Popek-Goldberg virtualizability. A property of a processor architecture: every sensitive instruction in the instruction set is privileged. An architecture with this property can be virtualized "classically" -- with a thin trap-and-emulate hypervisor whose only entry points are the traps the CPU raises on privileged-instruction violations. An architecture without this property requires software workarounds (binary translation, paravirtualization) or hardware extensions (VT-x, AMD-V) before a Popek-Goldberg-style VMM can be built.
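The condition reduces to a subset test: the set of sensitive instructions must be contained in the set of privileged ones. The Python sketch below states that test directly. The function names are illustrative, and the two instruction sets are a small hand-picked sample of pre-VT-x x86 behavior at user level, not a complete classification of the ISA.

```python
def virtualization_holes(sensitive, privileged):
    """Sensitive instructions that do NOT trap in user mode."""
    return sorted(set(sensitive) - set(privileged))


def is_classically_virtualizable(sensitive, privileged):
    """Sufficient condition: every sensitive instruction is privileged."""
    return not virtualization_holes(sensitive, privileged)


# Illustrative sample only (pre-VT-x x86 behavior at user level).
X86_SENSITIVE = {"MOV_TO_CR3", "LGDT", "SGDT", "SIDT", "SMSW", "POPF"}
X86_PRIVILEGED = {"MOV_TO_CR3", "LGDT", "HLT"}
```

With this sample, `virtualization_holes(X86_SENSITIVE, X86_PRIVILEGED)` returns the four instructions a trap-and-emulate VMM would never get to see. Note that `HLT` being privileged without appearing in the sensitive sample is harmless: the condition only runs in one direction.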

For three decades, x86 was famously not virtualizable in the Popek-Goldberg sense. John Robin and Cynthia Irvine enumerated the problem in their 2000 USENIX Security paper: seventeen protected-mode instructions on the IA-32 architecture either read or modified privileged state without trapping from user mode. VMware Workstation, released in 1999 by VMware Inc. (which had been founded the year prior by Mendel Rosenblum, Diane Greene, Scott Devine, Ellen Wang, and Edouard Bugnion), worked around the problem with binary translation: it dynamically rewrote each protected-mode guest instruction stream to substitute or trap the seventeen offenders. The technique imposed double-digit overhead, made debugging miserable, and was a security liability in its own right -- the binary translator itself was a parser of arbitrary attacker-controlled code.

The Robin and Irvine enumeration includes instructions like SGDT (store global descriptor table register), SIDT (store interrupt descriptor table register), SLDT (store local descriptor table register), SMSW (store machine status word), and PUSHF/POPF (push/pop flags including IOPL). Each of these silently returned or accepted privileged state from user mode without raising a fault. The aggregate effect was that no classical Popek-Goldberg VMM could correctly virtualize an unmodified x86 guest -- every one of those seventeen instructions was a hole the VMM could not see through.

Intel and AMD ended the problem in hardware. Intel VT-x (codename Vanderpool, November 2005) and AMD-V (codename Pacifica, May 2006) added a new CPU mode -- VMX root operation for Intel, SVM host mode for AMD -- and a new instruction-emulation mechanism. A VM exit could be configured to fire on every sensitive instruction the hypervisor wished to intercept, transferring control to the host with a structured exit reason and an opaque, host-controlled snapshot of guest state. After 2006, x86-64 became Popek-Goldberg-virtualizable in hardware [@wp-x86-virtualization].

sequenceDiagram
    participant Guest as Guest OS (VMX non-root)
    participant CPU as CPU hardware
    participant HV as Hypervisor (VMX root)
    Guest->>CPU: MOV CR3, rax (sensitive instr)
    CPU->>HV: VM-EXIT (reason 28: CR access)
    HV->>HV: Read VMCS exit-qualification
    HV->>HV: Validate, emulate, update SLAT
    HV->>CPU: VMRESUME
    CPU->>Guest: Continue guest at next instruction
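The exit-and-resume sequence above can be mimicked with a toy interpreter. In this Python sketch, ordinary instructions run without involving the hypervisor, and only the sensitive one triggers an exit that the hypervisor validates and emulates. The opcode names, the `ToyVcpu` class, and the `allowed_roots` policy are hypothetical; the only value taken from real hardware is Intel's basic exit reason 28 for control-register accesses.

```python
EXIT_REASON_CR_ACCESS = 28  # Intel basic exit reason: control-register access


class ToyVcpu:
    """Minimal guest register state plus a log of handled VM exits."""

    def __init__(self):
        self.regs = {"rax": 0, "cr3": 0}
        self.exit_log = []


def handle_cr_access(vcpu, new_root, allowed_roots):
    """Hypervisor side: validate the request, then emulate the CR3 write."""
    if new_root not in allowed_roots:
        raise PermissionError(f"unexpected CR3 value {new_root:#x}")
    vcpu.regs["cr3"] = new_root


def run_guest(vcpu, program, allowed_roots):
    """Run (opcode, operand) pairs; only sensitive opcodes cause an exit."""
    for opcode, operand in program:
        if opcode == "mov_rax_imm":        # innocuous: executes directly
            vcpu.regs["rax"] = operand
        elif opcode == "add_rax_imm":      # innocuous: executes directly
            vcpu.regs["rax"] += operand
        elif opcode == "mov_cr3_rax":      # sensitive: traps to the hypervisor
            vcpu.exit_log.append(EXIT_REASON_CR_ACCESS)
            handle_cr_access(vcpu, vcpu.regs["rax"], allowed_roots)
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
```

A three-instruction program that loads a value, adjusts it, and writes it to CR3 produces exactly one entry in `exit_log`, which is the efficiency property in miniature: most of the instruction stream never reaches the hypervisor.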

One architectural element more was needed before any of this could be a security primitive rather than just a virtualization primitive. Classical x86 paging maps a guest virtual address to a physical address through a single CPU-walked page table. In a virtualized system that single table cannot be enough, because the guest needs its own virtual-to-physical map and the host needs to remap the guest's "physical" address to a real machine-physical address. The first generations of VT-x simulated this two-level mapping in software through shadow page tables, which the hypervisor had to maintain alongside the guest's tables on every page-table edit. Shadow paging was correct but slow, and it gave the hypervisor no clean way to enforce a different memory map for different parts of the same guest.

Second-Level Address Translation (SLAT) -- Intel's Extended Page Tables (EPT, shipped with Nehalem in November 2008) and AMD's Nested Page Tables (NPT, shipped with the Barcelona-generation Opteron on September 10, 2007) -- solved both problems in hardware. The guest walks its own page table from virtual to "guest physical"; the CPU then walks a second, hypervisor-owned page table from "guest physical" to "system physical." Two key properties follow. First, the hypervisor has exclusive control of the second-level mapping; the guest cannot read, write, or even know that it exists. Second, because the second-level mapping is per-partition, the hypervisor can give two partitions different views of the same machine physical memory -- the same page can be readable in one partition and entirely absent in another.

Second-Level Address Translation (SLAT). A hardware feature on Intel (EPT) and AMD (NPT) CPUs that lets the hypervisor maintain a second page table mapping guest-physical addresses to system-physical addresses. The CPU walks the guest's own page table for the virtual-to-guest-physical mapping, then walks the hypervisor's table for the guest-physical-to-system-physical mapping. Because the second table is hypervisor-controlled and per-partition, the hypervisor can give different partitions -- and, in VBS, different Virtual Trust Levels inside the same partition -- different views of physical memory. SLAT is the bedrock of VTL memory protection [@ms-tlfs-pdf].
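The two-stage walk is easy to model with two dictionaries. In the Python sketch below, real multi-level page tables are flattened into single-level maps from page number to page number, and permission bits are reduced to "present or absent"; `translate`, `TranslationFault`, and the example page numbers are illustrative assumptions, not hardware structures.

```python
PAGE_SIZE = 4096


class TranslationFault(Exception):
    """Raised when either translation stage has no mapping."""


def translate(virtual_addr, guest_page_table, slat):
    """Guest-virtual -> guest-physical -> system-physical."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn not in guest_page_table:
        raise TranslationFault("guest page fault (stage 1)")
    gpn = guest_page_table[vpn]          # stage 1: guest-owned table
    if gpn not in slat:
        raise TranslationFault("SLAT violation (stage 2)")
    spn = slat[gpn]                      # stage 2: hypervisor-owned table
    return spn * PAGE_SIZE + offset


# One guest page table, two hypervisor-owned second-level views.
GUEST_PT = {0x10: 0x2, 0x11: 0x3}
SLAT_NORMAL = {0x2: 0x80}                # guest-physical page 0x3 is absent
SLAT_SECURE = {0x2: 0x80, 0x3: 0x81}     # same page is visible here
```

The guest's own mapping for virtual page `0x11` is identical in both cases; whether the access succeeds depends only on which second-level view the hypervisor has installed, which is the property per-VTL protection builds on.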

Hyper-V required VT-x or AMD-V at install time from day one. SLAT became mandatory with Windows Server 2016 and Windows 10 1607 [@ms-hyperv-architecture].

Popek and Goldberg gave us the property. Intel and AMD gave us the hardware. Microsoft used both to build a server hypervisor in 2008. But for the first seven years of Hyper-V's life, none of that machinery protected Windows from itself. Microsoft hadn't yet noticed the architectural problem that made it necessary -- or rather, they had noticed the problem (PatchGuard's bypass record was public) and had not yet conceded that the problem was structural. The concession came in 2015. What forced it was the same-privilege paradox.

4. The Same-Privilege Paradox -- Why PatchGuard Was Never Enough

PatchGuard, which Microsoft shipped in 2005 with Windows Server 2003 SP1 x64, ran inside ntoskrnl.exe at Ring 0 and scanned a curated list of kernel structures -- the system service dispatch table, the interrupt descriptor table, the kernel image's .text section -- at randomized intervals to detect tampering. It was bypassed within months by Skywing's Uninformed writeups. Microsoft kept shipping it. Researchers kept bypassing it. The pattern lasted a decade. The reason is not that PatchGuard's authors were sloppy [@wp-kpp]. The reason is structural, and naming it correctly is the first of the three insights this article is built around.

Key idea: Any defense reachable by mov from Ring 0 is defeasible by mov from Ring 0.

The intuition is simple. PatchGuard is a piece of code. It lives in the kernel's virtual address space at some page. It owns a timer that re-runs it periodically. It maintains a randomization seed for which structures it checks next. It has a callback path into KeBugCheckEx if it detects tampering. Every one of those four assets -- the code page, the timer callback, the randomization seed, the bug-check path -- is a kernel data structure or a kernel virtual address. An attacker with Ring-0 code execution can locate each of them by searching the same kernel address space PatchGuard searches. They can patch the callback so the timer no-ops. They can patch the seed so the randomization is predictable. They can patch the bug-check path so it reports success. They can do all of this with a sequence of plain mov instructions. PatchGuard cannot defend against this, because PatchGuard's defenses live in the same place its attacker's writes do.

PatchGuard and its attacker are colleagues, not adversaries. They share an office. The office is `ntoskrnl.exe`'s virtual address space, and there is no key on the door.

This is the same-privilege paradox. It is not an implementation bug. It does not yield to better obfuscation, more randomization, or harder-to-find timers. It is an architectural ceiling. A defense at privilege level $P$ cannot be enforced against an attacker who also runs at privilege level $P$, because the defender's state lives in the attacker's address space. The defender can be made expensive to find; it cannot be made impossible to find, because the attacker has the same instructions, the same address-space view, and the same MMU privileges as the defender.

Note: The same-privilege paradox is a property of where the defense lives, not of how clever the defense is. PatchGuard's authors did add randomization. They did add multiple decoy callbacks. They did add cryptographically derived integrity checks. None of those reductions changes the basic fact that the attacker, holding the same Ring-0 privilege, can locate and edit each of them. The architectural fix is not better PatchGuard. The architectural fix is moving the defender to a privilege level the attacker cannot reach.
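The paradox can be stated as a toy access-control model. In the Python sketch below, every piece of defender state carries the minimum privilege level required to write it; the level names, the `ProtectedMemory` class, and the cell names are hypothetical and stand in for kernel data structures and hardware-enforced page permissions.

```python
# Higher number means more privileged.
LEVELS = {"ring3": 0, "ring0": 1, "hypervisor": 2}


class ProtectedMemory:
    """Named cells, each guarded by a minimum write privilege."""

    def __init__(self):
        self._cells = {}
        self._guard = {}

    def define(self, name, value, write_level):
        self._cells[name] = value
        self._guard[name] = LEVELS[write_level]

    def read(self, name):
        return self._cells[name]

    def write(self, name, value, actor_level):
        if LEVELS[actor_level] < self._guard[name]:
            raise PermissionError(f"{actor_level} cannot write {name}")
        self._cells[name] = value


mem = ProtectedMemory()
# A PatchGuard-style defender keeps its state where Ring 0 can write it.
mem.define("integrity_timer_armed", True, write_level="ring0")
# A hypervisor-backed defender keeps its state out of Ring 0's reach.
mem.define("code_integrity_policy", True, write_level="hypervisor")
```

A Ring-0 attacker calling `mem.write("integrity_timer_armed", False, "ring0")` succeeds no matter how well the cell is hidden, while the same write against `code_integrity_policy` raises `PermissionError`. Obfuscation changes how long the first write takes to find; only the guard level changes whether it is possible.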

Once the paradox is named, the defender's choice is binary. Either give up on having a defense at all -- treat Ring 0 as a free-fire zone where any malware that gets there has won -- or move the defender to a privilege level above Ring 0, at a hardware boundary the attacker's mov instructions cannot cross. Microsoft picked the second. It is the only architecturally honest choice.

To make it work, Microsoft needed three things. The first was a hypervisor already deployed on every Windows install. They had that since 2008. The second was a way to put a piece of Windows itself -- code, data, secrets -- inside the hypervisor's protection without spawning a separate VM, because spawning a separate VM doubles the system's resource cost and forces every Windows process to choose between living on the normal side or the secure side. That required an architectural idea that did not yet exist in 2010: a way to split a single partition into two privilege levels, each with its own SLAT mapping and its own register state. The third was a way to ensure the hypervisor itself could not be silently replaced or rolled back beneath the OS. That required a hardware-rooted measurement -- a DRTM event -- that the OS could attest to.

The architectural idea is the subject of section 6. The DRTM measurement is the subject of section 11. Both of them required a decade-long conversation about whether the hypervisor itself could be trusted at all -- a conversation that ran in parallel during the same years and that briefly seemed to argue the opposite case. We turn to that conversation next.

5. The Hyperjacking Era -- SubVirt, Blue Pill, and CloudBurst

While Microsoft was finishing Hyper-V, the security community was establishing that a hypervisor was not just a defense -- it was also the most powerful possible attacker against the OS sitting above it. Three demonstrations in three years made the point unmistakable.

SubVirt. In May 2006, Samuel King and Peter Chen at the University of Michigan, joined by Yi-Min Wang, Chad Verbowski, Helen Wang, and Jacob Lorch at Microsoft Research, presented "SubVirt: Implementing Malware with Virtual Machines" at IEEE S&P [@king-subvirt-2006]. Their construction was a Virtual Machine Based Rootkit (VMBR). A privileged installer running inside a legitimate OS installed a malicious VMM at boot time; on the next reboot, the malicious VMM ran first, brought up the original OS as a guest on top of it, and gained the privileged position of seeing every CPU instruction, every memory access, and every I/O the OS performed. The original OS had no architectural way to tell it was no longer the most-privileged software on the box. SubVirt was demonstrated against Windows XP (using Microsoft Virtual PC as the malicious VMM substrate) and against Linux (using VMware Workstation), specifically to show that the technique was not tied to any one operating system or any one hypervisor product.

Blue Pill. Three months later, at Black Hat USA 2006, Joanna Rutkowska of COSEINC demonstrated "Subverting Vista Kernel for Fun and Profit" [@wp-blue-pill]. Her tool, codenamed Blue Pill, took a step beyond SubVirt by doing the VMM insertion at runtime rather than at boot. The technique: a Ring-0 driver, running inside an already-booted Windows install on an AMD-V capable host, executed VMRUN against an attacker-controlled Virtual Machine Control Block (VMCB) whose initial state matched the current physical CPU. The CPU dropped out of SVM root mode and re-entered as a guest under the attacker's VMM. The OS continued running normally, with no boot-loader modification and no reboot.

In 2007, Rutkowska and Alexander Tereshkin returned to Black Hat USA with the more polished "IsGameOver() Anyone?" presentation, refining the technique and addressing the early critics' detection ideas [@wp-blue-pill].

Rutkowska's marketing claim that Blue Pill was "100% undetectable" attracted a public counter-effort: in 2007, Edgar Barbosa, Nate Lawson, Peter Ferrie, and Tom Ptacek all proposed detection techniques relying on side channels (timing artifacts of trapped instructions, TSC skew, structural differences in how RDTSC behaves under VT-x). The claim softened in subsequent publications, but the underlying point survived: a hostile thin hypervisor below a victim OS can be made arbitrarily difficult to detect from inside that OS, and the only architecturally clean way to know what you are running under is to measure the boot chain before the OS starts.

CloudBurst. At Black Hat USA 2009, Kostya Kortchinsky of Immunity Inc. presented CLOUDBURST. It was the first publicly demonstrated arbitrary-code-execution guest-to-host escape against a commercial hypervisor: a heap overflow in VMware's emulated SVGA-II graphics adapter, tracked as CVE-2009-1244 [@nvd-cve-2009-1244]. A guest VM, executing entirely inside a VMware-managed user-mode process on the host, could overflow a buffer in that process and gain host code execution. CloudBurst's lasting operational lesson was not the specific bug but the attack surface: device emulation -- not the trap-and-emulate core of the hypervisor -- is the largest piece of guest-attacker-controlled code in any commercial VMM. Every Hyper-V guest-to-host escape Microsoft has shipped a patch for since 2018 lands in either this device-emulation surface or the hypercall input-validation surface that mediates the same kinds of structured guest-controlled input.

    flowchart TD
        subgraph Before["Before hyperjacking"]
            OS1["Victim OS"]
            FW1["Firmware (UEFI)"]
            HW1["Hardware"]
            OS1 --> FW1
            FW1 --> HW1
        end
        subgraph After["After hyperjacking"]
            OS2["Victim OS (now a guest)"]
            VMM["Hostile VMM (SubVirt / Blue Pill)"]
            FW2["Firmware (UEFI)"]
            HW2["Hardware"]
            OS2 --> VMM
            VMM --> FW2
            FW2 --> HW2
        end

The three demonstrations established a difficult dual truth. The hypervisor is the most powerful defender against an OS-level attacker, and it is the most powerful attacker against an OS-level defender. The same primitive can play either role; which role it plays in any given system depends only on whose hypervisor it is and whether the OS above it can prove that. SubVirt-style attacks did not require Microsoft to invent anything new -- they only had to be a possibility -- to force Microsoft into a design constraint: any "hypervisor as security primitive" architecture has to start by being the only hypervisor on the box, with a measurement of the hypervisor binary recorded in a TPM platform configuration register so that any malicious VMBR underneath could be detected at attestation time. This is the role that System Guard Secure Launch (DRTM) plays in the architecture, and we will return to it in section 11.

Blue Pill (offense) and VBS (defense) are architecturally identical. Each is a thin Type-1 hypervisor that interposes between firmware and OS. Each owns the CPU's virtualization mode, the second-level page tables, and the IOMMU. Each is invisible to the OS unless the OS can prove what is underneath it. The only differences between them are whose hypervisor it is, whether it was measured at load time, and what it does with its privilege. The defense is the offense, run by the right people, in the right order, and attested to.

By 2010 the security community had agreed: the hypervisor is the most powerful primitive in the system, and whoever owns the SLAT page tables owns the box. Joanna Rutkowska's Invisible Things Lab launched Qubes OS, an explicitly hypervisor-rooted security OS, on April 7, 2010 [@qubes-introducing-2010]. Microsoft owned the SLAT page tables. They had a hypervisor on every Windows install. They had a server-consolidation product. What they did not yet have was a reason to re-purpose any of it for security. The reason arrived three years later, in a filing at the United States Patent and Trademark Office with a priority date of September 17, 2013.

6. The Pivot -- VSM, VTLs, and the Hepkin-Kishan Patent

On September 17, 2013, David Hepkin and Arun Kishan filed United States patent application 14/186,415, which would issue on August 30, 2016 as US Patent 9,430,642 B2 [@us9430642b2-patent]. The patent's title, "Providing virtual secure mode with different virtual trust levels," reads like marketing now because the words it introduced -- "Virtual Trust Level," "VTL," "Virtual Secure Mode" -- became Microsoft's own canonical terminology. In 2013 the words did not exist. The patent describes, in 2013, exactly what Microsoft shipped twenty-two months later in Windows 10 build 10240 [@ms-tlfs-vsm].

The patent's claim language is unusually specific. It teaches a virtual-machine manager that makes "multiple different virtual trust levels available to virtual processors of a virtual machine"; it teaches that "different memory access protections (such as the ability to read, write, and/or execute memory) can be associated with different portions of memory (e.g., memory pages) for each virtual trust level"; and it teaches that "the virtual trust levels are organized as a hierarchy with a higher level virtual trust level being more privileged than a lower virtual trust level." Each of those phrases is now a feature of the shipping Microsoft hypervisor.

Virtual Trust Level (VTL). A hypervisor-managed privilege level inside a single partition. Each VTL has its own SLAT mapping (so the same machine page can be readable in one VTL and absent in another), its own virtual-processor register state (so a VTL transition is a context switch, not a procedure call), and its own interrupt subsystem (so interrupts targeted at one VTL do not preempt code running in another). VTLs are hierarchical: a higher VTL can read all of a lower VTL's memory, but not vice versa. The shipping Microsoft hypervisor implements two VTLs (VTL0 = Normal world, VTL1 = Secure world); the architecture admits up to sixteen [@ms-tlfs-vsm].

Windows 10, released on July 29, 2015, and Windows Server 2016 both shipped VBS atop the existing Hyper-V hypervisor [@wp-windows-10]. The architectural innovation -- the thing the patent was for -- was that VTL0 (Normal world, containing the NT kernel, user mode, and LSASS) and VTL1 (Secure world, containing the Secure Kernel and Isolated User Mode trustlets) ran inside the same partition rather than in two separate partitions. VBS is not a second VM. It is a per-VTL SLAT split inside the root partition, plus a per-VTL register-state snapshot, plus a per-VTL interrupt delivery surface. The hypervisor switches SLAT contexts on VTL transitions, exactly as it would switch SLAT contexts on a partition switch -- but the switch happens inside a single partition's address space, so there is no extra VM scheduling and no extra OS image to manage.

    flowchart TD
        subgraph Root["Root partition"]
            subgraph VTL0["VTL0 -- Normal world"]
                NT["NT kernel (ntoskrnl.exe)"]
                User["User mode (lsass.exe, applications)"]
            end
            subgraph VTL1["VTL1 -- Secure world"]
                SK["Secure Kernel (securekernel.exe)"]
                IUM["Isolated User Mode trustlets"]
                LSAISO["LSAISO.EXE"]
                VTPM["vTPM trustlet"]
                IUM --- LSAISO
                IUM --- VTPM
            end
        end
        HV["Microsoft Hypervisor (hvix64 / hvax64)"]
        HW["Hardware (CPU, RAM, IOMMU, TPM)"]
        VTL0 -. "Secure call (hypercall + SynIC)" .-> VTL1
        VTL1 --> HV
        VTL0 --> HV
        HV --> HW

The Hyper-V Top-Level Functional Specification, chapter 15, names the architectural facts verbatim. "VSM achieves and maintains isolation through Virtual Trust Levels (VTLs). VTLs are enabled and managed on both a per-partition and per-virtual processor basis." "Virtual Trust Levels are hierarchical, with higher levels being more privileged than lower levels." "Architecturally, up to 16 levels of VTLs are supported; however a hypervisor may choose to implement fewer than 16 VTL's. Currently, only two VTLs are implemented." The C-level definition #define HV_NUM_VTLS 2 is published in the same specification [@ms-tlfs-vsm]. Two VTLs are what ships; the architecture has room for more.

VSM enables operating system software in the root and guest partitions to create isolated regions of memory for storage and processing of system security assets. Access to these isolated regions is controlled and granted solely through the hypervisor, which is a highly privileged, highly trusted part of the system's Trusted Compute Base (TCB). -- Microsoft, *Hyper-V Top-Level Functional Specification*, chapter 15 [@ms-tlfs-vsm]

This is the second insight the article is built around: VBS is not a re-architecture. It is a re-purposing. The hypervisor was already on every Windows install for unrelated reasons. The 2015 pivot did not require new hardware, new VMs, or new CPUs. It required a new way to organize what was already there -- two SLAT mappings instead of one, two register snapshots instead of one, a secure-call ABI on top of the SynIC -- and a Windows-side Secure Kernel binary to run inside the new VTL1 view. The patent gave the design its formal expression; the engineering had been waiting since 2008 for the right architectural insight.

David Hepkin spent over a decade on the NT kernel architecture team before the VSM design; Arun Kishan was an NT kernel architect and is now Microsoft's Corporate Vice President for the Operating Systems Platform group. Neither is a virtualization specialist by background. Their patent is, in retrospect, a kernel-team idea about how to put a piece of the kernel itself behind a hardware boundary the kernel cannot cross -- exactly the kind of design that an architect who had lived inside ntoskrnl.exe for years would invent.

Alex Ionescu's Black Hat USA 2015 deck "Battle of SKM and IUM: How Windows 10 Rewrites OS Architecture" reverse-engineered the entire VSM stack within four weeks of Windows 10 RTM [@ionescu-bh-2015]. The vocabulary Ionescu introduced has become the canonical research language for talking about VBS: VTL as "synthetic ring level managed by the hypervisor"; trustlets for the user-mode processes that run inside VTL1's Isolated User Mode; Signature Level 12 plus the IUM EKU 1.3.6.1.4.1.311.10.3.37 as the loader's signing requirement. Microsoft's own developer documentation now uses the same terms [@ms-iso-user-mode-trustlets].

The pivot, then, was not a sudden re-architecture. It was the cash-out of a deliberate multi-year engineering plan that began at least twenty-two months before Windows 10 RTM. To see what VBS actually enforces -- and which hypervisor primitive backs each piece of that enforcement -- we need to walk the hypervisor's public surface. There are five surfaces. They are the architectural body of the article.

7. Architecture Tour -- The Hypervisor's Public Surface

What does the Windows hypervisor actually look like as a piece of software? It is a small kernel, on the order of one to two hundred thousand lines of C and C++ by community estimate; Microsoft has not published a primary line count. It has five externally visible surfaces, all of which are documented in the Hyper-V Top-Level Functional Specification (TLFS) v6.0b [@ms-tlfs-pdf]. We walk them in turn.

7.1 Partitions, VMBus, and the VSP/VSC pair

A partition is the hypervisor's unit of isolation. From the Microsoft architecture page: "The Microsoft hypervisor must have at least one parent, or root, partition, running Windows. The virtualization management stack runs in the parent partition and has direct access to hardware devices. The root partition then creates the child partitions which host the guest operating systems" [@ms-hyperv-architecture]. The root partition is a full Windows install with privileged hypercalls and direct access to hardware; each child partition is a guest VM with only the hardware the root has chosen to expose.

A guest VM does I/O over the VMBus. A network packet, for example, travels from the guest application down to the guest's Windows NDIS stack; through the synthetic NIC miniport driver (the VSC) in the guest's kernel; over the VMBus message channel; into the network VSP in the root partition; into the root's real NDIS stack; into the physical NIC driver; out the wire. The hypervisor's role in this chain is structural: it owns the VMBus message channel, the SynIC interrupts that notify the VSP and VSC of new traffic, and the per-partition SLAT mappings that decide which bytes either side can read.

The architectural implication is that device emulation lives in the root partition, not in the hypervisor binary. The TCB the hypervisor binary itself has to protect is narrow. The TCB the root partition's drivers have to protect is much wider -- but those drivers live in normal Windows kernel mode, where Microsoft has thirty years of tooling. This is why almost every public Hyper-V CVE since 2018 has landed in vmswitch.sys, storvsp.sys, or the NT Kernel Integration VSP, rather than in hvix64.exe itself.

Note: Putting device emulation in the root partition means the hypervisor binary does not need to parse Ethernet frames, SCSI commands, USB descriptors, or graphics-adapter command rings. The trade-off is that the root partition becomes part of the TCB -- a root-partition kernel-mode bug is a hypervisor-equivalent break -- but the small hypervisor binary itself can be reviewed, fuzzed, and reasoned about as a single piece of code.

7.2 The hypercall ABI

Hypercalls are how partitions request services from the hypervisor. The TLFS documents two flavors. A fast hypercall passes its parameters inline in CPU registers: on x64, rcx carries a 64-bit hypercall input value (the low 16 bits are the call code; the upper 48 bits are a control word with fields for the Fast flag, variable-header size, Rep Count, and Rep Start Index), rdx carries the first input parameter, and r8 carries the second. A slow hypercall instead passes the GPA (guest physical address) of an input-parameter page in rdx, and the GPA of an output-parameter page in r8; the actual parameter content lives in those pages. The instruction that triggers the hypercall is vmcall on Intel and vmmcall on AMD; the hypervisor maps both onto the same internal entry point [@ms-tlfs-pdf].

Hypercall. A guest-to-hypervisor call. The guest issues `vmcall` (Intel) or `vmmcall` (AMD); the CPU traps via VM-EXIT into the hypervisor in VMX root mode; the hypervisor reads the call code from `rcx`, reads the inputs from registers (fast) or from a GPA-pointed page (slow), services the request, writes outputs back, and returns via VM-ENTRY. Hypercalls are the only legitimate way for a partition to invoke hypervisor services [@ms-tlfs-pdf].

    // A JavaScript model of the rcx hypercall input value layout.
    // In a real hypercall the guest sets rcx, rdx, r8 and issues vmcall / vmmcall.
    function packHypercallInput({ callCode, fastFlag, varHeaderSize, isNested, repCount, repStartIdx }) {
      // rcx layout (TLFS section 3 "Hypercall Interface", verbatim bit map)
      //   bits  0..15  Call Code
      //   bit   16     Fast (1 = inline params in rdx/r8)
      //   bits 17..26  Variable header size (in QWORDs)
      //   bits 27..30  RsvdZ
      //   bit   31     Is Nested
      //   bits 32..43  Rep Count
      //   bits 44..47  RsvdZ
      //   bits 48..59  Rep Start Index
      //   bits 60..63  RsvdZ
      let rcx = 0n;
      rcx |= BigInt(callCode) & 0xFFFFn;
      if (fastFlag) rcx |= 1n << 16n;
      rcx |= (BigInt(varHeaderSize) & 0x3FFn) << 17n;
      if (isNested) rcx |= 1n << 31n;
      rcx |= (BigInt(repCount) & 0xFFFn) << 32n;
      rcx |= (BigInt(repStartIdx) & 0xFFFn) << 48n;
      return rcx;
    }

    // HvCallSignalEvent = 0x005D, issued as a fast hypercall (TLFS section 11).
    // Its neighbor HvCallPostMessage (0x005C) passes its message through a
    // memory-based input page, so it is not a good fast-hypercall example.
    const rcx = packHypercallInput({
      callCode: 0x005D,
      fastFlag: 1,
      varHeaderSize: 0,
      isNested: 0,
      repCount: 0,
      repStartIdx: 0,
    });
    console.log('rcx = 0x' + rcx.toString(16).padStart(16, '0'));
    // Output: rcx = 0x000000000001005d
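Going the other way is a handy sanity check when reading a trace. The sketch below unpacks a 64-bit hypercall input value into the fields named in the bit map above. `unpackHypercallInput` is a hypothetical helper name, not a TLFS identifier, and the sample value is arbitrary and chosen only for illustration.

```javascript
// Hypothetical helper (not a TLFS name): decode a 64-bit hypercall input value.
function unpackHypercallInput(rcx) {
  const value = BigInt.asUintN(64, BigInt(rcx));
  // Extract `width` bits starting at bit `shift`.
  const field = (shift, width) =>
    Number((value >> BigInt(shift)) & ((1n << BigInt(width)) - 1n));
  return {
    callCode: field(0, 16),
    fast: field(16, 1) === 1,
    varHeaderSize: field(17, 10),
    isNested: field(31, 1) === 1,
    repCount: field(32, 12),
    repStartIdx: field(48, 12),
  };
}

// Arbitrary illustrative value: call code 0x0003, Rep Count 4, Fast bit clear.
const decoded = unpackHypercallInput(0x0000000400000003n);
console.log(decoded.callCode, decoded.fast, decoded.repCount); // 3 false 4
```

The reserved (RsvdZ) ranges are deliberately ignored here; a stricter decoder could reject any value that sets them.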

The call-code space is small and well-documented: a few hundred codes, each one a structured request with typed inputs and outputs. The hypercall path is also where the most consequential 2024 Hyper-V CVE lived. CVE-2024-21407 was a use-after-free in hvix64.exe's handling of a specific file-operation hypercall, the rare case where the bug was in the hypervisor binary itself rather than in a root-partition driver [@nvd-cve-2024-21407].

7.3 Intercepts

Intercepts are how the hypervisor virtualizes guest behavior. The TLFS distinguishes four categories: instruction intercepts (CPUID, MSR reads/writes, I/O-port instructions), exception intercepts (page faults, general protection faults), memory-access intercepts (a guest tries to read or write a specific guest-physical-address region), and partition-state intercepts (a guest hits a state that the hypervisor wants to be notified about). Each is configured per-partition through the Intel VMCS execution-control bits or the AMD VMCB control fields [@ms-tlfs-pdf].

Intercept. A configurable hypervisor notification on a specific guest event. The hypervisor programs the VMCS or VMCB to fire a VM-EXIT when the guest issues a particular instruction, raises a particular exception, accesses a particular memory region, or transitions to a particular state. Intercepts are the policy mechanism that lets the hypervisor implement device emulation, security checks, and VTL transitions [@ms-tlfs-pdf].

For VBS, the load-bearing intercept is the memory-access intercept. When VTL0 code tries to access a region whose VTL0 SLAT mapping is unreadable or unwritable, the access traps to the hypervisor with the offending GPA; the hypervisor can deliver the intercept to the VTL1 Secure Kernel as a secure call, letting VTL1 see what VTL0 was trying to do and decide whether to allow it. This is how HVCI's W^X enforcement is wired: a VTL0 page that is marked writable in VTL0's SLAT is marked non-executable in the same SLAT; an attempt to switch the same page to executable becomes a memory-access intercept that VTL1 must approve.
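The W^X flow described above can be modelled in a few lines. This is a toy sketch, not the hypervisor's implementation: `makeHvciModel`, its methods, and the `isApprovedCode` callback (standing in for VTL1's code-integrity decision) are all hypothetical names, and page contents are reduced to a string.

```javascript
// Toy model of VTL0's SLAT permissions with VTL1 gating every +X request.
function makeHvciModel(isApprovedCode) {
  const slat = new Map(); // page number -> { writable, executable }
  return {
    // New VTL0 pages start out writable and non-executable.
    mapData(page) {
      slat.set(page, { writable: true, executable: false });
    },
    // A VTL0 request for execute permission is modelled as an intercept
    // that the VTL1 policy callback must approve.
    requestExecute(page, contents) {
      const entry = slat.get(page);
      if (!entry) return 'denied: unmapped';
      if (!isApprovedCode(contents)) return 'denied: not approved by VTL1';
      entry.writable = false; // drop W before granting X
      entry.executable = true;
      return 'granted';
    },
    canWrite(page) {
      const entry = slat.get(page);
      return Boolean(entry && entry.writable);
    },
    // W^X invariant: no page is both writable and executable.
    invariantHolds() {
      return [...slat.values()].every((e) => !(e.writable && e.executable));
    },
  };
}

const hvci = makeHvciModel((contents) => contents === 'signed-driver-code');
hvci.mapData(0x1000);
hvci.mapData(0x2000);
console.log(hvci.requestExecute(0x1000, 'shellcode'));          // denied: not approved by VTL1
console.log(hvci.requestExecute(0x2000, 'signed-driver-code')); // granted
console.log(hvci.canWrite(0x2000), hvci.invariantHolds());      // false true
```

The design point the model tries to capture is that the permission table and the approval callback both live outside the caller's reach; VTL0 can only ask.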

7.4 The Synthetic Interrupt Controller (SynIC)

The Synthetic Interrupt Controller, SynIC, is the hypervisor's per-virtual-processor event delivery surface. Each VP has 16 Synthetic Interrupt Source (SINT) lines, a message page (where the hypervisor places message-shaped events), an event-flag page (where it places bit-flag events), and a set of synthetic timers. SynIC is the bus on which VMBus traffic between VSP and VSC moves; it is also the bus on which VTL transitions between VTL0 and VTL1 are delivered inside the root partition [@ms-tlfs-pdf].

Synthetic Interrupt Controller (SynIC). A hypervisor-emulated interrupt controller, parallel to the hardware APIC, that delivers hypervisor-originated events to a virtual processor. Each VP has 16 SINT lines, a message page, an event-flag page, and synthetic timers. VMBus signaling rides on SynIC; secure-call delivery between VTL0 and VTL1 rides on SynIC; vTPM, virtual-PCI, and other paravirtualized device events ride on SynIC [@ms-tlfs-pdf].

For VBS, the secure-call ABI -- the way VTL0 code asks VTL1 to do something -- is built on SynIC. A VTL0 caller writes a request into a shared message page, signals a SINT, and yields the CPU; the hypervisor switches SLAT context to VTL1, delivers the message, and lets VTL1 read the request. When VTL1 finishes, it signals a SINT back to VTL0 and the hypervisor switches contexts again. Credential Guard's whole communication path between VTL0 LSASS and VTL1 LSAISO is one of these secure-call channels.
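The round trip just described (write a request, signal, switch to VTL1, respond, switch back) can be sketched as a small state machine. Everything here is a simplified assumption for illustration: `ToyHypervisor`, `registerVtl1Service`, and `secureCall` are hypothetical names, the "message page" is a plain object, and SINT signalling is collapsed into a direct method call.

```javascript
// Toy model of a secure call: the hypervisor owns the VTL switch in both directions.
class ToyHypervisor {
  constructor() {
    this.activeVtl = 0;
    this.trace = [];           // record of VTL transitions
    this.services = new Map(); // VTL1 service id -> handler
  }

  registerVtl1Service(id, handler) {
    this.services.set(id, handler);
  }

  switchTo(vtl) {
    this.trace.push(`VTL${this.activeVtl}->VTL${vtl}`);
    this.activeVtl = vtl;
  }

  // Called from VTL0: request goes in, VTL1 handles it, response comes back.
  secureCall(id, request) {
    const messagePage = { id, request, response: null }; // shared page stand-in
    this.switchTo(1);
    const handler = this.services.get(id);
    messagePage.response = handler
      ? handler(messagePage.request)
      : { error: 'unknown service' };
    this.switchTo(0);
    return messagePage.response;
  }
}

const hv = new ToyHypervisor();
hv.registerVtl1Service('echo-length', (req) => ({ length: req.length }));
console.log(hv.secureCall('echo-length', 'hello')); // { length: 5 }
console.log(hv.trace.join(', '));                   // VTL0->VTL1, VTL1->VTL0
```

Note that the caller never touches `activeVtl` itself; in the model, as in the description above, only the hypervisor performs the switch.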

7.5 Memory and per-VTL SLAT

The last surface is also the most important: memory. Guest physical addresses (GPAs) are translated to system physical addresses (SPAs) by per-partition SLAT page tables. The hypervisor has exclusive control of these tables; no partition, including the root, can read or modify them directly. For VBS specifically, the hypervisor maintains two SLAT mappings per partition -- one for VTL0 and one for VTL1 -- and switches between them on VTL transitions.

This is the architectural reason VTL0 kernel mode, even with full Ring-0 code execution, cannot read or execute VTL1 memory. The VTL0 page-table walker on a load from a VTL1-only page does not see the page at all; the SLAT walker on the host returns no mapping; the hardware MMU raises an EPT/NPT violation; the hypervisor handles the violation according to the VTL0 partition's intercept policy. In the security-relevant case, the hypervisor delivers an access-denied result to VTL0 and continues. There is no kernel-mode mov instruction sequence that can defeat this, because the gating happens in hardware page-table walks that VTL0 kernel mode cannot influence.
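A minimal sketch of the per-VTL lookup makes the asymmetry concrete. This is a model under stated assumptions, not the real page-table format: `makeToyPartition`, `map`, `access`, and `SlatViolation` are hypothetical names, pages are identified by small integers, and each VTL's SLAT is a flat map rather than a multi-level table.

```javascript
// Raised when a VTL touches a GPA page its SLAT view does not permit.
class SlatViolation extends Error {}

// One partition, two SLAT views: index 0 for VTL0, index 1 for VTL1.
function makeToyPartition() {
  const slat = [new Map(), new Map()]; // gpaPage -> { spaPage, r, w, x }
  return {
    map(vtl, gpaPage, spaPage, perms) {
      slat[vtl].set(gpaPage, { spaPage, ...perms });
    },
    // kind is 'r', 'w', or 'x'; returns the SPA page or throws.
    access(vtl, gpaPage, kind) {
      const entry = slat[vtl].get(gpaPage);
      if (!entry || !entry[kind]) {
        throw new SlatViolation(`VTL${vtl} ${kind}-access to GPA page ${gpaPage}`);
      }
      return entry.spaPage;
    },
  };
}

const partition = makeToyPartition();
partition.map(0, 0x10, 0xA0, { r: true, w: true, x: false }); // normal page, VTL0 view
partition.map(1, 0x10, 0xA0, { r: true, w: true, x: false }); // same page, visible to VTL1
partition.map(1, 0x20, 0xB0, { r: true, w: true, x: false }); // secret page, VTL1 only

console.log(partition.access(1, 0x20, 'r') === 0xB0); // true
try {
  partition.access(0, 0x20, 'r'); // VTL0 has no mapping for this page
} catch (e) {
  console.log(e instanceof SlatViolation); // true
}
```

The lookup table is keyed by VTL, and nothing VTL0 can execute changes which table is consulted; that is the property the paragraph above attributes to the hardware walk.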

Five surfaces. Two of them -- the hypercall ABI and the device-emulation paths that surface over VMBus -- are where every public Hyper-V escape since 2018 has lived. The other three (intercepts, SynIC, per-VTL SLAT) are the substrate on which VBS, HVCI, Credential Guard, and System Guard Secure Launch are built. We turn to those next.

8. How the Hypervisor Enforces Each VBS Feature

The hypervisor itself does not know anything about credentials, code signing, application allowlisting, or DMA protection. It knows about partitions, VTLs, intercepts, SLAT entries, and hypercalls. Each Windows security feature is built by composing those primitives in a specific way. The mapping is precise and worth walking, because it is what makes the substrate a security primitive rather than just a virtualization product [@ms-hardware-root-of-trust].

HVCI / Memory Integrity. Hypervisor-protected Code Integrity is the most consequential VBS feature on a per-byte basis: it changes Windows from a system that lets the kernel execute any signed driver to one where the kernel cannot execute any page until VTL1 has approved it. VTL1's code-integrity service inspects every kernel-mode page mapping change request before the SLAT entry that would make the page executable in VTL0 is granted. The W^X invariant -- a single page can be writable or executable, but never both -- is enforced not by NT kernel cooperation but by the per-VTL SLAT, exactly as described in section 7.5. An NT-kernel attempt to mark a writable page executable becomes a memory-access intercept that VTL1's CI service evaluates [@ms-enable-vbs-hvci]. The hypervisor primitives composed: per-VTL SLAT + memory-access intercepts + secure-call ABI.

Trustlet. A user-mode process that runs inside VTL1's Isolated User Mode (IUM). Trustlets must be signed with the Windows System Component Verification certificate (Signature Level 12) and carry the IUM EKU `1.3.6.1.4.1.311.10.3.37`. The shipping inbox trustlets include `LSAISO.EXE` (Credential Guard), `VMSP.EXE` (host side of virtual TPM), and the vTPM provisioning trustlet [@ms-iso-user-mode-trustlets, @ionescu-bh-2015].

Credential Guard. LSAISO.EXE -- the LSA-Isolated trustlet -- runs in VTL1 Isolated User Mode. NTLM password hashes and Kerberos Ticket-Granting Tickets that LSASS used to keep in normal VTL0 memory are moved to VTL1 memory that VTL0 cannot read. VTL0 LSASS performs credential operations by sending a request to LSAISO over a secure-call channel mediated by the hypervisor's SynIC; LSAISO does the cryptographic work and returns a result. The plaintext of the credential never leaves VTL1. This is why a Ring-0 attacker on a Credential Guard-enabled Windows install cannot dump LSASS hashes -- they aren't in LSASS [@ms-iso-user-mode-trustlets]. The hypervisor primitives composed: per-VTL SLAT (to hide LSAISO's memory) + SynIC (to deliver secure calls) + intercepts (to catch VTL0 attempts to access LSAISO memory). See the sibling Credential Guard / NTLMless article for VTL1 internals.
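The key property, that callers get results of credential operations but never the credential, can be illustrated with a closure. This is a loose analogy only: `createIsolatedCredentialStore` and `respondToChallenge` are hypothetical names, the closure stands in for VTL1 memory, and HMAC-SHA256 is an assumed placeholder operation rather than the actual NTLM or Kerberos computation.

```javascript
const nodeCrypto = require('crypto');

// Stand-in for the isolated side: the secret is captured in a closure and
// no method ever returns it.
function createIsolatedCredentialStore(secret) {
  const key = Buffer.from(secret); // analogous to memory VTL0 cannot map
  return Object.freeze({
    // The only exposed operation: compute a response over a caller-supplied challenge.
    respondToChallenge(challenge) {
      return nodeCrypto.createHmac('sha256', key).update(challenge).digest('hex');
    },
  });
}

// Stand-in for the VTL0 side: it can request operations but cannot enumerate the key.
const isolated = createIsolatedCredentialStore('correct horse battery staple');
const response = isolated.respondToChallenge('server-nonce-1234');
console.log(response.length);        // 64 hex characters
console.log(Object.keys(isolated));  // [ 'respondToChallenge' ]
```

A JavaScript closure is of course not a security boundary; the point is the shape of the interface, where the isolated side exposes operations and keeps the material.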

Secure call. The VTL0-to-VTL1 calling convention. A VTL0 caller fills in a shared parameter page, signals a SynIC interrupt configured for VTL transition, and yields. The hypervisor switches SLAT context to VTL1, delivers the message, and lets the Secure Kernel dispatch it via `IumInvokeSecureService` to a registered VTL1 service. On return, the hypervisor switches contexts back. The whole round-trip is mediated by hypervisor primitives the calling VTL cannot bypass [@ionescu-bh-2015].

Application Control (WDAC). The same VTL1 code-integrity service that backs HVCI also evaluates user-mode policy. When VTL0 user mode tries to load a binary that is restricted by WDAC policy, the load becomes a secure call into VTL1; VTL1's policy engine evaluates the signature, the certificate chain, and the configured policy; the secure call returns approval or denial. WDAC policy lives in VTL1, the policy database lives in VTL1, and a VTL0 administrator who has been compromised cannot edit either. The hypervisor primitives composed: same as HVCI, plus a richer secure-call API for policy evaluation.

VBS Enclaves. A third-party application can load native code into a VTL1 IUM enclave. The enclave executes in VTL1, with its memory hidden from VTL0; the application talks to the enclave through a secure-call ABI exposed by the Secure Kernel. Architecturally parallel to Credential Guard but available to ordinary application developers. The hypervisor primitives composed: per-VTL SLAT (to hide enclave memory) + secure-call ABI (to invoke enclave code) + a Secure Kernel API for enclave creation, attestation, and destruction.

System Guard Secure Launch (DRTM). Intel TXT's SENTER instruction (and AMD's SKINIT on AMD platforms) executes a hardware-rooted dynamic measurement of the hypervisor and the Secure Kernel into TPM PCRs 17-22 after firmware initialization [@ms-system-guard-secure-launch]. This re-establishes the trust root post-firmware: a pre-boot firmware compromise that survived UEFI Secure Boot cannot silently poison the hypervisor's launch state without showing up as an unexpected measurement in a PCR that VTL1 can read. The hypervisor primitives composed: DRTM event registration with the hardware + TPM PCR extension + a VTL1-side attestation API. See the sibling Secure Boot article for the static-RTM half of the same story.
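Why a tampered component "shows up" in a PCR follows from how extension works: a PCR is never written directly, only replaced by a hash of its old value concatenated with the new measurement. The sketch below models that with SHA-256. The component strings, the all-zero starting value, and the helper names `extendPcr` and `measureLaunch` are assumptions for illustration, not the actual Secure Launch measurement list.

```javascript
const nodeCrypto = require('crypto');

const sha256 = (data) => nodeCrypto.createHash('sha256').update(data).digest();

// TPM-style extend: new PCR value = H(old PCR value || digest of the new measurement).
function extendPcr(pcr, component) {
  return sha256(Buffer.concat([pcr, sha256(component)]));
}

// Fold a launch sequence into one PCR value.
// Assumption: the PCR starts at all zeroes in this toy model.
function measureLaunch(components) {
  return components.reduce(extendPcr, Buffer.alloc(32)).toString('hex');
}

const expected = measureLaunch(['hypervisor image', 'secure kernel image']);
const tampered = measureLaunch(['hypervisor image (modified)', 'secure kernel image']);
const reordered = measureLaunch(['secure kernel image', 'hypervisor image']);

console.log(expected === tampered);  // false
console.log(expected === reordered); // false
```

Because each step hashes over the previous value, both the content and the order of the measured components are bound into the final value, which is what lets a verifier compare it against a known-good reference.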

Kernel DMA Protection. External devices over Thunderbolt, USB4, or hot-plug PCIe can issue DMA to arbitrary physical addresses, bypassing the CPU's MMU entirely. The hypervisor configures the IOMMU (Intel VT-d / AMD-Vi) to deny DMA from externally-attached devices outside of explicitly-authorized memory regions, and to refuse DMA from any device before its kernel-mode driver has been loaded under a trusted policy [@ms-kernel-dma-protection]. The hypervisor primitives composed: hypervisor-owned IOMMU configuration + memory-access intercepts on the IOMMU configuration MMIO region.
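The policy described above is default-deny with explicit windows, which can be sketched as a range check per device. This is a toy model: `makeToyIommu`, `authorize`, and `allowsDma` are hypothetical names, addresses are plain numbers, and real IOMMUs use per-device page tables rather than a list of ranges.

```javascript
// Toy IOMMU: a device may only DMA into ranges explicitly authorized for it.
function makeToyIommu() {
  const windows = new Map(); // deviceId -> array of { start, end } with end exclusive
  return {
    authorize(deviceId, start, length) {
      if (!windows.has(deviceId)) windows.set(deviceId, []);
      windows.get(deviceId).push({ start, end: start + length });
    },
    // True only if the whole transfer fits inside one authorized window.
    allowsDma(deviceId, addr, length) {
      const ranges = windows.get(deviceId) || [];
      return ranges.some((r) => addr >= r.start && addr + length <= r.end);
    },
  };
}

const iommu = makeToyIommu();
iommu.authorize('thunderbolt-dock', 0x100000, 0x4000);

console.log(iommu.allowsDma('thunderbolt-dock', 0x100000, 0x1000)); // true
console.log(iommu.allowsDma('thunderbolt-dock', 0x103000, 0x2000)); // false (runs past the window)
console.log(iommu.allowsDma('unknown-device', 0x100000, 0x10));     // false (never authorized)
```

A device with no entry gets an empty range list, so the "refuse DMA before the driver is loaded" behavior falls out of the default rather than needing a separate rule.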

The shape of the table is the point.

| Feature | Composed primitives | Verbatim hypervisor mechanism |
| --- | --- | --- |
| HVCI | per-VTL SLAT + memory-access intercepts + secure-call ABI | VTL1 vets each VTL0 page-mapping change before granting +X |
| Credential Guard | per-VTL SLAT + SynIC + intercepts | LSAISO trustlet memory absent from VTL0 SLAT mapping |
| WDAC (AppControl) | secure-call ABI + VTL1 policy engine | VTL0 binary load = secure call into VTL1 CI service |
| VBS Enclaves | per-VTL SLAT + secure-call ABI | Third-party VTL1 IUM enclave invoked over secure call |
| System Guard Secure Launch | hardware DRTM (TXT/SKINIT) + TPM PCR extension | `SENTER` / `SKINIT` measures hypervisor into PCRs 17-22 |
| Kernel DMA Protection | hypervisor-owned IOMMU + MMIO intercepts | VT-d/AMD-Vi denies DMA outside authorized regions |

The hypervisor knows nothing about NTLM hashes, Kerberos tickets, code-signing certificates, WDAC policy XML, or DMA-region authorization. All of that policy lives in VTL1 -- in the Secure Kernel, in LSAISO, in the WDAC service. The hypervisor only provides the *mechanism* for one piece of policy to evaluate a request from another piece of policy in isolation. This is the architectural separation that lets the hypervisor binary stay small and the Windows-side security feature set keep growing.

The pattern: each feature is a different composition of the same five primitives (partitions, hypercalls, intercepts, SynIC, per-VTL SLAT). The hypervisor is genuinely a primitive in the formal sense -- a small set of mechanisms that compose into many security policies. If the hypervisor is the mechanism, the boundary the hypervisor enforces is the contract. Microsoft commits to servicing certain attacks against that boundary and explicitly excludes others. To know what we are getting, we need to read the contract.

9. The Security Boundary Microsoft Commits To

The Microsoft Security Servicing Criteria for Windows is a public document. It enumerates which classes of attack Microsoft will issue a CVE and a security update for, and which it will not. For the hypervisor, the document is unusually specific [@ms-msrc-servicing-criteria].

The two relevant boundaries:

Hypervisor / virtualization boundary. An L1-guest-to-host or guest-to-guest break is a serviced boundary. If a guest VM can execute code in the root partition or in another guest's address space, Microsoft will issue a CVE.
Virtual Secure Mode (VBS) boundary. VTL0 kernel-mode code reading or writing VTL1 memory, or executing VTL1 code, is a serviced break. If a Ring-0 attacker in VTL0 can defeat the per-VTL SLAT, Microsoft will issue a CVE.

What the servicing criteria does not commit to is also worth naming. A same-VTL elevation of privilege inside a guest (a guest user becoming guest SYSTEM) is not a hypervisor break -- it is a Windows EoP, serviced under the Windows kernel boundary, not the hypervisor boundary. A denial-of-service of the host from a guest is generally not a serviced hypervisor break unless it produces a memory corruption that an attacker can ride to RCE. An administrator in the root partition reading guest memory is not a break at all -- the root partition is part of the hypervisor's TCB by definition, and root-partition admin is hypervisor-admin in the threat model.

The dollar figures for these boundaries are documented in the Microsoft Hyper-V Bounty Program [@ms-msrc-bounty-hyperv]. The program ranges from $5,000 for the lowest-impact qualifying submission up to $250,000 for the highest. The eligibility language is verbatim:

An eligible submission includes a Remote Code Execution (RCE) vulnerability in Microsoft Hyper-V that enables a L1 guest virtual machine to compromise the hypervisor, escape from the guest virtual machine to the host, or escape to another L1 guest virtual machine. -- Microsoft Hyper-V Bounty Program [@ms-msrc-bounty-hyperv]

$250,000 is the highest standing Hyper-V bounty in the industry. Comparable programs from the other major hypervisor vendors do not publish the same calibration. KVM is a community project with no vendor-paid bounty pool of equivalent size. Xen is a Linux Foundation project that runs a bug bounty through HackerOne but does not publicly attach a $250,000 figure to a guest-to-host RCE. ESXi (Broadcom) does not publish a standing bounty program with a per-bug ceiling; bounty payments for ESXi RCEs typically flow through Pwn2Own and similar marketplaces, where Trend Micro's Zero Day Initiative sets the prize for any given competition.

The bounty calibration is itself a data point. If $250,000 were too high, Microsoft would be drowning in submissions; if it were too low, the public CVE record would show more hypervisor breaks reported through Pwn2Own than directly to MSRC. The current equilibrium -- two to four Microsoft-direct Hyper-V CVEs per year, and no Hyper-V guest-to-host escape demonstrated at Pwn2Own up to and including Pwn2Own Berlin 2025 [@zdi-pwn2own-day3] -- is consistent with the bounty being calibrated roughly correctly relative to the cost of finding a real bug.

Vendor	Hypervisor	Published bounty	Ceiling	Servicing-criteria boundary published
Microsoft	Hyper-V / `hvix64.exe`	Yes	$250,000	Yes, verbatim language
Xen Project	Xen	Yes (HackerOne)	Lower, varies	Yes, security policy
KVM	KVM (community)	No standing program	--	No vendor-published criteria
Broadcom/VMware	ESXi	No standing public bounty	--	Vendor advisories per CVE
seL4 Project	seL4	No (proof-rooted argument)	--	Functional-correctness proof [@sel4-whitepaper]

The seL4 row is included because seL4 is the only hypervisor in the table whose claim to a security boundary is mathematical rather than operational. seL4 ships approximately ten thousand lines of C and assembly with a machine-checked proof of functional correctness against a higher-level specification. The proof took roughly twenty-five person-years and covers a microkernel that does not by itself ship the full surface area of Hyper-V. The Microsoft hypervisor is unverified, and at the line count estimated in section 7 it is an order of magnitude larger; its security argument is operational (a small TCB, heavy fuzzing, a standing bounty, public servicing) rather than mathematical.

A serviced boundary is a contract. Contracts are not promises; they are obligations that come due when an attacker finds a way around them. To see what the contract has actually had to pay out, we read the public CVE record.

10. The Public Track Record -- Six Worked CVEs Across Three Classes

We do not need an exhaustive Hyper-V CVE catalog to understand the boundary's real shape. Six worked examples, drawn from three distinct attack classes, cover every public failure mode the boundary has produced since 2018. We walk them in order.

Class A: Device emulation in the root partition

CVE-2021-28476 (vmswitch.sys, May 2021, CVSS 9.9). Discovered by Ophir Harpaz at Guardicore Labs and Peleg Hadar at SafeBreach Labs using Guardicore's hAFL1 hypervisor fuzzer, this was a guest-controlled OID_SWITCH_NIC_REQUEST OID parameter passed to the host-side vmswitch.sys driver. The driver dereferenced an attacker-influenced object pointer; the host kernel performed an arbitrary pointer dereference; the guest gained RCE in the root partition's kernel mode. The CVSS 9.9 score (AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:H) reflects guest-to-host RCE with Azure-scale blast radius: the bug was reachable from the vmswitch driver shipped in Windows builds well before the May 2021 patch, per the Guardicore Labs technical analysis [@nvd-cve-2021-28476]. The bug is the canonical anchor for "device emulation in the root partition is the largest Hyper-V attack surface."

CVE-2025-21333 (NT Kernel Integration VSP, January 2025, CWE-122). The first publicly-acknowledged in-the-wild exploited Hyper-V CVE. The "Hyper-V NT Kernel Integration VSP" is a relatively new component that ties the Windows kernel-mode container architecture to Hyper-V's VSP/VSC pattern. A guest-controlled input triggered a heap-based buffer overflow on the host side of the integration; the host's address space was corruptible from a guest [@nvd-cve-2025-21333]. The operational pattern matches the vmswitch family: a host-side component receives structured, attacker-shaped input from a guest, and the host-side component overflows.

Class B: The hypercall input-validation path

CVE-2024-21407 (Hyper-V hypercall UAF, March 2024, CVSS 8.1, CWE-416). The rare case where the bug is in hvix64.exe / hvax64.exe itself, not in a root-partition driver. A guest crafted specially-formed file-operation hypercalls; the hypervisor dereferenced freed memory; the guest gained arbitrary host code execution [@nvd-cve-2024-21407].

CVE-2024-30092 (Hyper-V RCE, October 2024, CWE-20 + CWE-829). A Hyper-V remote code execution that combined improper input validation with inclusion of functionality from an untrusted control sphere -- another hypercall-path-class bug [@nvd-cve-2024-30092].

CVE-2024-49117 (Hyper-V RCE, December 2024, CVSS 8.8). A third 2024 Hyper-V RCE; the December Patch Tuesday entry rounded out a year in which three publicly-disclosed Hyper-V RCEs landed in twelve months, the most since the 2018 vmswitch family [@nvd-cve-2024-49117].

Class C: VTL0-to-VTL1 (the VBS break, not the hypervisor break)

CVE-2020-0917 and CVE-2020-0918 -- Amar and King, Black Hat USA 2020. Saar Amar and Daniel King's "Breaking VSM by Attacking SecureKernel" disclosed two paired vulnerabilities discovered with their Hyperseed hypercall fuzzer retargeted at securekernel!IumInvokeSecureService, the secure-call entry point. Vulnerability #1 -- which maps to CVE-2020-0917 -- is an out-of-bounds write in securekernel!SkmmObtainHotPatchUndoTable, the function that parses the hot-patch undo table at secure-call invocation time.

The Black Hat USA 2020 deck (verified via pdftotext at the canonical MSRC-Security-Research GitHub URL) explicitly labels Vulnerability #1 as OOB Write, in slides titled "The Vulnerable Function" and "The OOB" in the "Hardening SK" section [@amar-king-bh-2020]. Several secondary writeups across the web have transcribed the bug class as "OOB read," which is incorrect; the deck itself is the primary source and says write. The functions involved are also commonly conflated: IumInvokeSecureService is the secure-call dispatcher Hyperseed retargets to reach the buggy code; the actual bug is in SkmmObtainHotPatchUndoTable. The NVD entries for both CVEs are tracked as CWE-269 (Improper Privilege Management).

Vulnerability #2 -- CVE-2020-0918 -- is a design flaw in SkmmUnmapMdl that lets VTL0 pass a fully attacker-controlled Memory Descriptor List to SkmiReleaseUnknownPTEs.

The Microsoft response is documented end-to-end in the same deck: the Secure Kernel pool was migrated to segment heap in mid-2019, four W+X regions were reduced to +X only, and SkpgContext -- a HyperGuard equivalent for Secure Kernel -- was introduced.

This is a different failure class than vmswitch RCE: not guest-to-host, but VTL0-to-VTL1 -- a Secure Kernel break reached through the hypervisor's secure-call dispatch from a privileged VTL0 attacker. Microsoft services it under the VBS / VSM boundary in the servicing criteria document, even though no guest VM is involved.

Key idea: Every public Hyper-V CVE since 2018 lives in one of three narrow code paths -- device emulation, hypercall input validation, or VTL0-to-VTL1 secure-call dispatch. The TLFS-visible primitives (intercepts, SynIC, per-VTL SLAT) have produced none.

The Pwn2Own dimension

Through Pwn2Own Berlin 2025, no public live Hyper-V guest-to-host escape has been demonstrated at Pwn2Own. The cross-vendor analogue -- and the industry's best calibration of how hard a hypervisor escape is to find when a researcher has a public dollar incentive and a deadline -- is the first-ever ESXi escape in Pwn2Own history, executed by Nguyen Hoang Thach of STAR Labs SG on Day Two (May 16, 2025) using a single integer overflow vulnerability in the hypervisor's DMA-handling path. The award was $150,000 plus 15 Master of Pwn points; STAR Labs went on to win overall Master of Pwn for the competition with $320,000 across three days [@zdi-pwn2own-day3].

The technique class is a TOCTOU on a length field read twice during a DMA operation: the first read validates the length, the second read uses it; race the second read and you write past a fixed-size buffer on the host heap. The exploit class is structurally the same as the vmswitch family, just landed in a different vendor's device-emulation path.
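The double-fetch pattern described in the paragraph above can be sketched as a deterministic simulation. This is not ESXi or Hyper-V code: `makeRacingDescriptor`, `vulnerableCopy`, `safeCopy`, and `HOST_BUFFER_SIZE` are hypothetical names, and the race is modelled by a getter that returns a different length on its second read rather than by a real second vCPU.

```javascript
// Illustrative simulation of a double-fetch (TOCTOU) on a length field.
// All names are hypothetical; this is a sketch of the bug class only.
const HOST_BUFFER_SIZE = 16;

// Models guest-shared memory whose length field changes between two reads,
// as if another guest vCPU rewrote it right after the validation read.
function makeRacingDescriptor(firstLength, secondLength) {
  let reads = 0;
  return {
    get length() {
      reads += 1;
      return reads === 1 ? firstLength : secondLength;
    },
  };
}

// Vulnerable shape: validate one read of desc.length, then use a second read.
function vulnerableCopy(desc) {
  if (desc.length > HOST_BUFFER_SIZE) {
    throw new RangeError('length rejected by validation');
  }
  const bytesCopied = desc.length; // second fetch: may differ from the checked value
  return { bytesCopied, overflow: bytesCopied > HOST_BUFFER_SIZE };
}

// Safer shape: fetch once into host-private state, then validate and use that copy.
function safeCopy(desc) {
  const length = desc.length; // single fetch
  if (length > HOST_BUFFER_SIZE) {
    throw new RangeError('length rejected by validation');
  }
  return { bytesCopied: length, overflow: false };
}

const vulnerableResult = vulnerableCopy(makeRacingDescriptor(8, 64));
const safeResult = safeCopy(makeRacingDescriptor(8, 64));

console.log(vulnerableResult); // 64 bytes would land in a 16-byte buffer
console.log(safeResult);       // the validated value (8) is the value used
```

The fix in the sketch is the usual one for this class: copy guest-controlled fields into host-private memory once, and never re-read them from shared memory after validation.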

CVE	Class	Year	CVSS	Location	Source
CVE-2021-28476	A: device emulation	2021	9.9	`vmswitch.sys` (root partition)	[@nvd-cve-2021-28476]
CVE-2025-21333	A: device emulation	2025	7.8	NT Kernel Integration VSP (root partition)	[@nvd-cve-2025-21333]
CVE-2024-21407	B: hypercall path	2024	8.1	`hvix64.exe` / `hvax64.exe` (hypervisor binary)	[@nvd-cve-2024-21407]
CVE-2024-30092	B: hypercall path	2024	7.5	Hyper-V hypercall validation	[@nvd-cve-2024-30092]
CVE-2024-49117	B: hypercall path	2024	8.8	Hyper-V hypercall validation	[@nvd-cve-2024-49117]
CVE-2020-0917/0918	C: VTL0-to-VTL1	2020	6.8 (per MSRC)	`securekernel.exe` (VTL1, reached via secure call)	[@amar-king-bh-2020]

flowchart LR
  subgraph CA["Class A: device emulation (root partition)"]
    Vmswitch["vmswitch.sys -- CVE-2021-28476"]
    Vsp["NT Kernel Integration VSP -- CVE-2025-21333"]
  end
  subgraph CB["Class B: hypercall input validation (hypervisor binary)"]
    UAF["CVE-2024-21407 (UAF)"]
    Input["CVE-2024-30092"]
    Hpcall["CVE-2024-49117"]
  end
  subgraph CC["Class C: VTL0-to-VTL1 (secure call dispatch)"]
    Oob["CVE-2020-0917 (OOB write)"]
    Mdl["CVE-2020-0918 (SkmmUnmapMdl)"]
  end
  Guest["Guest VM"] --> CA
  Guest --> CB
  Vtl0["Privileged VTL0 (kernel)"] --> CC

This is the third insight the article is built around. The reader's prior model may have been "hypervisors fail in mysterious, deep ways; the boundary is fragile in unknown places." The new model is "every public Hyper-V escape since 2018 lives in one of three narrow code paths, and the TLFS-visible primitives have produced none." The narrowness of the failure space is itself a security argument. The hypervisor's micro-kernelized design has held; what has not always held are the components Microsoft chose to put next to the hypervisor, in the root partition's user mode and kernel mode, by deliberate architectural choice in 2008.

Six worked examples; three classes; one boundary; an unflinching public record. The boundary is alive and producing CVEs at roughly two to four per year. But every CVE so far has lived somewhere the hypervisor itself controls. The interesting question is what lives in places it does not control.

11. The Residual Attack Surface -- Beneath, Beside, and Around

The hypervisor enforces a clean boundary against everything above it -- the NT kernel, user mode, even other guest VMs. It cannot, by construction, enforce anything against what lives below or beside it. Four structural classes of residual attack matter. We walk each.

11.1 Firmware below the hypervisor

System Management Mode (SMM), the UEFI runtime, the Intel Management Engine (ME), and the AMD Platform Security Processor (PSP) all run at higher privilege than the hypervisor for parts of boot and runtime. SMM in particular is a CPU mode that is invoked through System Management Interrupts (SMI) and has unrestricted access to all of physical memory, including the hypervisor's own pages. If the OEM-supplied SMM handler contains an exploitable bug, an SMI can run attacker code in a privilege mode strictly above the hypervisor's.

The threat is not hypothetical. The Binarly research team's 2023 LogoFAIL disclosures showed entire classes of image-parser bugs in UEFI firmware reachable from a privileged OS context; BootHole (CVE-2020-10713, a buffer overflow in GRUB2's grub.cfg parser) and BlackLotus (CVE-2022-21894, a UEFI Secure Boot bypass) showed that pre-boot bugs in widely-deployed bootloaders could ride past Secure Boot. None of these is a hypervisor bug; all of them are residual attack surface from the hypervisor's point of view.

Microsoft's mitigation is the dynamic root of trust for measurement -- System Guard Secure Launch -- which we touched on in section 8. After UEFI Secure Boot has done its static-RTM job, Intel TXT's SENTER (or AMD's SKINIT) executes a CPU-hardware-rooted late launch: the CPU resets to a known state, runs an Intel- or AMD-signed Authenticated Code Module (ACM), and measures the hypervisor binary into TPM PCRs 17-22 before transferring control to it. The result is that even if pre-boot firmware is compromised, the post-DRTM PCR values reflect the actual hypervisor binary; a compromised UEFI cannot silently substitute a different hypervisor without changing the attestation [@ms-system-guard-secure-launch, @ms-hardware-root-of-trust]. The residual after DRTM: OEMs that don't ship Secure Launch on their motherboards, or that ship buggy SMM handlers that can be invoked after launch.
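The reason a substituted hypervisor cannot hide from attestation follows from how a PCR is extended: the new value is a hash of the old value concatenated with the digest of the measured component. The sketch below illustrates that rule with SHA-256 under stated assumptions: the PCR starts from all zeroes after the late launch, and the strings stand in for real binaries. `extendPcr` and `measureChain` are hypothetical helper names, not a TPM API.

```javascript
// Sketch of TPM-style PCR extension:
//   newPCR = SHA-256(oldPCR || SHA-256(component))
// The component strings below are placeholders for real binaries.
const { createHash } = require('node:crypto');

const sha256 = (data) => createHash('sha256').update(data).digest();

function extendPcr(pcr, component) {
  return sha256(Buffer.concat([pcr, sha256(component)]));
}

function measureChain(components) {
  // Assumption: the DRTM PCR starts from all zeroes after the late launch.
  let pcr = Buffer.alloc(32, 0);
  for (const component of components) {
    pcr = extendPcr(pcr, Buffer.from(component));
  }
  return pcr.toString('hex');
}

const expected = measureChain(['acm', 'hypervisor build N+1']);
const substituted = measureChain(['acm', 'hypervisor build N']);
const reordered = measureChain(['hypervisor build N+1', 'acm']);
const repeated = measureChain(['acm', 'hypervisor build N+1']);

console.log(expected === substituted); // false: a different binary changes the PCR
console.log(expected === reordered);   // false: extension is order-sensitive
console.log(expected === repeated);    // true: same chain, same value
```

Because a PCR can only be extended, never set, a verifier that knows the expected chain can detect any substitution by comparing the final value.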

11.2 Hardware side channels

Microarchitectural side-channel attacks cross the VTL boundary at the level of CPU implementation, not at the level of architectural specification. The 2018 Spectre and Meltdown disclosures -- followed by the L1TF, MDS, Retbleed, and CacheWarp families in the years since -- showed that speculatively-executed code on a CPU can leak microarchitectural state across privilege boundaries that the architectural ISA promises to protect.

Microsoft's mitigation cadence has been in-tree and aggressive: Kernel Virtual Address Shadow (the Windows equivalent of KPTI) for Meltdown; IBRS, STIBP, and retpolines for Spectre v2; HyperClear for L1TF on Hyper-V hosts. Microarchitectural mitigations have shipped regularly through Patch Tuesday since 2018; cumulatively the cost has been measurable but bounded.

Note: The microarchitectural ceiling is hardware, not software. Intel TDX and AMD SEV-SNP -- the two confidential-computing architectures that move the trust root from the hypervisor to per-VM hardware encryption -- both explicitly disclaim resistance to this class. If the CPU leaks across a Spectre-class side channel, no software-level isolation primitive (VTL, partition, SEAM, SEV-SNP) can fully recover the property. The mitigation is hardware that doesn't leak, and that mitigation arrives one CPU generation at a time.

11.3 IOMMU and DMA bypass

The IOMMU -- Intel VT-d, AMD-Vi -- is the hardware that gates DMA from peripheral devices to physical memory. If the IOMMU is configured correctly, a Thunderbolt-attached device cannot read or write arbitrary memory; it can only DMA to regions the OS has explicitly mapped for it. If the IOMMU is disabled, configured permissively, or has firmware bugs of its own, DMA becomes an end-run around every architectural protection above it -- including the hypervisor's.
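The gating rule in the paragraph above can be modelled as a per-device table of permitted DMA windows. This is a toy model rather than a description of VT-d or AMD-Vi internals: `ToyIommu`, the device names, and the addresses are all made up for illustration.

```javascript
// Toy model of an IOMMU: each device has a list of DMA windows it may touch.
// Class name, device IDs, and addresses are hypothetical.
class ToyIommu {
  constructor({ enabled = true } = {}) {
    this.enabled = enabled;
    this.windows = new Map(); // deviceId -> [{ base, size, writable }]
  }

  // The OS explicitly maps a region for one device.
  map(deviceId, base, size, writable = false) {
    const list = this.windows.get(deviceId) ?? [];
    list.push({ base, size, writable });
    this.windows.set(deviceId, list);
  }

  // Returns true if the DMA request is allowed to reach physical memory.
  check(deviceId, address, length, isWrite) {
    if (!this.enabled) return true; // IOMMU off: every DMA passes through
    const list = this.windows.get(deviceId) ?? [];
    return list.some((w) =>
      address >= w.base &&
      address + length <= w.base + w.size &&
      (!isWrite || w.writable));
  }
}

const iommu = new ToyIommu();
iommu.map('thunderbolt-nic', 0x10000, 0x1000, true);

const inWindow = iommu.check('thunderbolt-nic', 0x10000, 0x200, true);
const outsideWindow = iommu.check('thunderbolt-nic', 0x200000, 0x200, false);
const unknownDevice = iommu.check('rogue-device', 0x10000, 0x10, false);
const iommuDisabled = new ToyIommu({ enabled: false })
  .check('rogue-device', 0x200000, 0x10, true);

console.log({ inWindow, outsideWindow, unknownDevice, iommuDisabled });
```

The last case is the one that matters for the threat model: with the IOMMU disabled, the check degenerates to "allow everything," which is why a disabled or permissive configuration bypasses every protection above it.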

The threat is again not hypothetical. Bjorn Ruytenberg's Thunderspy disclosure in 2020 documented seven DMA-class vulnerabilities in Thunderbolt 3 firmware, demonstrating that an attacker with physical access could read or modify arbitrary memory on a powered-on system through a malicious peripheral [@thunderspy]. The Microsoft mitigation is Kernel DMA Protection (Windows 10 1803 and later): the hypervisor configures the IOMMU at boot to deny DMA from externally-attached devices outside of explicitly authorized regions, and DMA from any peripheral whose driver has not been loaded under a trusted policy is refused at the IOMMU [@ms-kernel-dma-protection]. The structural residual: pre-boot DMA, before Windows has finished configuring the IOMMU; client motherboards that still ship with VT-d or AMD-Vi disabled in BIOS; OEMs that disable Kernel DMA Protection by default.

11.4 Hypervisor downgrade and rollback

Alon Leviev's "Windows Downdate" at Black Hat USA 2024 disclosed a class of attack that the prior three sections do not cover: rollback of the hypervisor binary itself to a previously-vulnerable, but still validly-signed, build [@nvd-cve-2024-21302].

The structural argument: UEFI Secure Boot prevents loading an unsigned hvix64.exe. It does not prevent loading an older hvix64.exe that is still validly signed and has not been revoked. If Microsoft fixes a Secure Kernel bug in build N+1 and a VTL0 attacker can convince the system to load build N at the next reboot, the patched bug is alive again. CVE-2024-21302 demonstrated exactly this rollback against both the hypervisor and the Secure Kernel through manipulation of the Windows Update servicing pipeline. The mitigation is mandatory-update servicing combined with proactive revocation list (dbx) hygiene -- once an older binary's hash is in the UEFI revocation list, Secure Boot will refuse to load it -- and Microsoft completed mitigations across Windows 10 1507 through Windows Server 2019 in the July 8, 2025 update wave [@nvd-cve-2024-21302].
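The gap between "signed" and "current" can be made concrete with a toy load decision. This is a simplified model, not the actual Secure Boot verification flow: `canLoad`, the build objects, and the placeholder hash strings are hypothetical, and a boolean stands in for real signature verification.

```javascript
// Toy model of the load decision: a binary loads if its signature verifies
// and its hash is not on the revocation list (dbx). Hashes are placeholders.
function canLoad(binary, dbx) {
  if (!binary.signatureValid) {
    return { allowed: false, reason: 'unsigned or bad signature' };
  }
  if (dbx.has(binary.hash)) {
    return { allowed: false, reason: 'hash revoked in dbx' };
  }
  return { allowed: true, reason: 'signed and not revoked' };
}

const buildN = { name: 'build N (vulnerable)', hash: 'hash-N', signatureValid: true };
const buildN1 = { name: 'build N+1 (patched)', hash: 'hash-N+1', signatureValid: true };
const tampered = { name: 'modified build', hash: 'hash-X', signatureValid: false };

const staleDbx = new Set();             // revocation list never updated
const updatedDbx = new Set(['hash-N']); // old build's hash has been revoked

const rollbackBeforeRevocation = canLoad(buildN, staleDbx);
const rollbackAfterRevocation = canLoad(buildN, updatedDbx);
const patchedBuild = canLoad(buildN1, updatedDbx);
const tamperedBuild = canLoad(tampered, staleDbx);

console.log({ rollbackBeforeRevocation, rollbackAfterRevocation, patchedBuild, tamperedBuild });
```

The first result is the downgrade case: the signature check alone cannot tell build N from build N+1, so only the revocation entry closes the gap.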

flowchart TD
  HW["Hardware (CPU, RAM, IOMMU, TPM)"]
  SM["System Management Mode (Ring -2) -- residual: SMM handler bugs"]
  FW["UEFI firmware -- residual: LogoFAIL, BootHole, BlackLotus"]
  DR["DRTM ACM (Intel TXT / AMD SKINIT)"]
  HV["Microsoft Hypervisor (hvix64 / hvax64)"]
  Iommu["IOMMU (VT-d / AMD-Vi) -- residual: Thunderspy, pre-boot DMA"]
  Vtl1["VTL1 (Secure Kernel + trustlets)"]
  Vtl0["VTL0 (NT kernel + user mode)"]
  Side["Microarchitectural side channels -- Spectre / Meltdown / MDS / Retbleed"]
  Update["Windows Update servicing -- residual: hypervisor rollback (CVE-2024-21302)"]
  HW --> SM
  SM --> FW
  FW --> DR
  DR --> HV
  HV --> Iommu
  HV --> Vtl1
  HV --> Vtl0
  Side -.->|"cross all boundaries"| HV
  Update -.->|"can roll hypervisor back"| HV

The hypervisor is necessary but not sufficient. The firmware-Secure-Boot-DRTM substrate beneath it, the microarchitectural ceiling above it, the IOMMU configuration beside it, and the Windows Update pipeline that decides which hypervisor build runs next are co-equal members of the same boundary. None of them is the hypervisor; all of them have to do their job for the hypervisor's guarantees to hold. The substrate is real, but the boundary is the combination of the substrate and what holds it up.

Necessary, not sufficient. That phrase is the article's honest answer to the question "how good is the substrate?" The answer is that the substrate is genuine, the boundary is published, the bounty calibration is the highest in the industry, the public CVE record is alive and narrow, and the residual attack surface lives in places the hypervisor cannot by construction control. The substrate is what we have explored in detail; what holds it up is what we have just sketched. The last section turns from theory to practice.

12. Practical Guide, FAQ, and Closing

If you have read this far, the natural next question is "is this on, on my machine, and how do I check?" The practical answer is short.

12.1 Enabling and verifying VBS

VBS is configurable through several paths: Group Policy (Computer Configuration > Administrative Templates > System > Device Guard), Intune, MDM CSPs (DeviceGuard/EnableVirtualizationBasedSecurity, DeviceGuard/ConfigureSystemGuardLaunch), the Windows Security UI, or directly via bcdedit /set hypervisorlaunchtype Auto. Verification is best done with three small commands.

msinfo32 -> the Device Guard / Virtualization-based Security row. "Services Configured" lists what policy has requested; "Services Running" lists what is actually active. Kernel DMA Protection and Secure Launch each appear as their own row.
Get-CimInstance -ClassName Win32_DeviceGuard -> VirtualizationBasedSecurityStatus (0 = off, 1 = enabled but not running, 2 = running); SecurityServicesRunning array (HVCI, Credential Guard, etc.); RequiredSecurityProperties (the policy floor).
bcdedit /enum -> hypervisorlaunchtype Auto is the default; loadoptions DISABLE_VBS_* is how an administrator can opt out (you should not see these flags on a properly-configured machine).

```javascript
// Given a parsed Win32_DeviceGuard object, compute whether VBS is healthy.
// The actual Win32_DeviceGuard schema is on Microsoft Learn; this is the
// decision logic an operator would write against it.
function checkVbsHealth(dg) {
  const result = { ok: false, reasons: [] };

  // Treat a missing array as empty so a partial object does not throw.
  const running = dg.SecurityServicesRunning ?? [];
  const available = dg.AvailableSecurityProperties ?? [];

  // VBS itself (2 = running)
  if (dg.VirtualizationBasedSecurityStatus !== 2) {
    result.reasons.push('VBS is not running (status != 2)');
  }

  // HVCI (Memory Integrity) is service code 2
  if (!running.includes(2)) {
    result.reasons.push('HVCI / Memory Integrity is not running');
  }

  // Credential Guard is service code 1
  if (!running.includes(1)) {
    result.reasons.push('Credential Guard is not running');
  }

  // Required floor properties, using the AvailableSecurityProperties codes:
  // 1 = hypervisor support, 2 = Secure Boot, 3 = DMA protection
  const requiredFloor = [1, 2, 3];
  for (const r of requiredFloor) {
    if (!available.includes(r)) {
      result.reasons.push('Missing required security property: ' + r);
    }
  }

  result.ok = result.reasons.length === 0;
  return result;
}

const example = {
  VirtualizationBasedSecurityStatus: 2,
  SecurityServicesRunning: [1, 2, 3],
  AvailableSecurityProperties: [1, 2, 3, 4, 5],
};
console.log(JSON.stringify(checkVbsHealth(example), null, 2));
// -> ok: true, reasons: []
```

Note: Three commands, in order: msinfo32 for the human-readable summary; Get-CimInstance -ClassName Win32_DeviceGuard | Format-List * for the structured detail; bcdedit /enum {current} to confirm hypervisorlaunchtype Auto and the absence of DISABLE_VBS_* load options. If all three agree that VBS, HVCI, and Credential Guard are running, you are in the configuration this article describes.

12.2 Operational pitfalls

Two operational realities are worth flagging. First, HVCI has a driver block list and will refuse to enable Memory Integrity if any incompatible driver is installed; the usual offenders are older anti-cheat drivers, third-party virtualization clients (VMware Workstation pre-2021, VirtualBox pre-6.1), and certain disk-encryption or storage-filter drivers. Microsoft maintains a public block list; the Memory Integrity UI in Windows Security will report the specific blocking driver. Second, nested virtualization is supported for Hyper-V guests on Windows 10/11 client and Server 2016+, and is required by some development workflows (WSL2 with nested containers, certain Visual Studio device emulators). Nested virtualization changes the threat model -- the L0 hypervisor still owns the box, but the L1 guest now runs its own hypervisor with its own VTL split -- but a compromised L1 guest with VBS enabled still does not give an attacker a path to the L0 host.

12.3 The substrate cross-reference

This article is the substrate of the Windows security series at paragmali.com. The siblings build on what is here:

Secure Boot in Windows -- the static-RTM half of the boot trust chain that hands off to the hypervisor.
VBS Trustlets: What Actually Runs in the Secure Kernel -- the VTL1 internals that the hypervisor's secure-call ABI delivers requests to.
NTLMless: The Death of NTLM in Windows -- the Credential Guard story from inside LSAISO.
Adminless: Administrator Protection in Windows -- the user-mode admin trust model that the kernel-mode VBS boundary makes possible.
Can This Code Do This? Windows Access Control -- the access-control surface that VBS supplements but does not replace.

12.4 Frequently asked questions

The 10-30 percent number is folklore from the pre-SLAT era or from systems running HVCI-incompatible drivers in compatibility mode. For typical workloads on modern hardware (post-2018 CPUs with VT-x or AMD-V and SLAT), the measured overhead of VBS plus HVCI plus Credential Guard sits in the low single digits. Gaming and high-throughput I/O workloads can show larger gaps, especially on systems where the BIOS forces nested virtualization off or where IOMMU is disabled. The trade-off for that overhead is the security-boundary set described in this article.

No. VBS is a Virtual Trust Level split *inside* the root partition. There are no extra VMs. The normal Windows install is VTL0; the Secure Kernel plus its trustlets is VTL1. Both VTLs live in the same partition, share the same physical CPU, and are scheduled by the hypervisor as separate VTL contexts -- not as separate VMs. A Hyper-V guest VM, by contrast, is a child partition entirely separate from the root partition. The two architectures share a hypervisor binary but use different parts of it.

No. SYSTEM is a high VTL0 user-mode token; the hypervisor sits architecturally above all of Ring 0, which is where SYSTEM-loaded kernel drivers ultimately run. The point of the entire article is that "SYSTEM owns the box" is wrong on a VBS-enabled Windows install. SYSTEM is the most privileged Windows identity; the hypervisor is the most privileged *software*, and the two are not the same thing.

No. Secure Boot prevents loading an *unsigned* `hvix64.exe`. It does not prevent loading an older, signed-but-vulnerable `hvix64.exe` that has not been added to the UEFI revocation list. That gap is what CVE-2024-21302 (Windows Downdate) exploited, and the mitigation is mandatory-update servicing combined with prompt revocation-list (`dbx`) hygiene [@nvd-cve-2024-21302].

No. seL4 is formally verified at approximately ten thousand lines of code with a roughly twenty-five-person-year proof effort. The Microsoft hypervisor is unverified at an estimated one to two hundred thousand lines of code. The hypervisor's security argument is operational -- a small TCB, heavy continuous fuzzing, a standing $5K-$250K bounty, public servicing criteria, an unflinching public CVE record -- rather than mathematical [@sel4-whitepaper, @ms-msrc-bounty-hyperv].

Yes, in terms of binary identity, servicing criteria, and bounty eligibility. The Microsoft hypervisor that boots on a Windows 11 client laptop and the one that boots on an Azure host server are derived from the same codebase, ship with the same servicing commitments, and qualify for the same Hyper-V bounty. The threat model differs -- Azure adds multi-tenant guest-to-guest isolation, hardware confidential-VM extensions, and a different management surface -- but the substrate is shared.

12.5 Closing

The reason SYSTEM on a Windows 11 box cannot read LSASS, load an unsigned driver, or patch ntoskrnl.exe is now fully accounted for. An hvix64.exe or hvax64.exe loaded by hvloader.efi before winload.exe ever ran. A VTL split inside the root partition, made possible by Hepkin and Kishan's 2013 patent and shipped with Windows 10 RTM in 2015. Per-VTL SLAT enforcement that the NT kernel architecturally cannot touch, because the SLAT tables live in pages the hypervisor never maps into a VTL0 view. A Microsoft-published security boundary and a $5,000-$250,000 bounty calibrating the boundary's value, both of which are unique in the industry at this writing. A public CVE record of six worked examples across three narrow classes that the boundary has had to pay out on since 2018. And a residual attack surface -- firmware below, side channels above, IOMMU bypass beside, hypervisor rollback through the update pipeline -- that the substrate cannot, by construction, eliminate.

The hypervisor is what every other article in this series sits on. Now you have the substrate in hand. The Secure Kernel article reads differently when you have walked the per-VTL SLAT yourself. The Credential Guard article reads differently when you know that LSAISO is invoked through a hypercall-mediated secure call. The Secure Boot article reads differently when you know that the hypervisor's DRTM measurement re-establishes the trust root after firmware. The Adminless article reads differently when you know that the privilege ceiling on Windows 11 is not Ring 0 but a hardware boundary above it.

Above Ring Zero is not a metaphor. It is an instruction-set state. The Windows hypervisor lives there, owns the page tables that say what the OS can see, and is the architectural reason "SYSTEM-on-Windows-11" cannot do things SYSTEM used to be allowed to do.