Design of TDP MMU for TDX support
This document describes a high-level design for TDX support in the TDP MMU of x86 KVM.
In this document, we use "TD" or "guest TD" to differentiate a TDX-protected guest from the current "VM" (Virtual Machine), which is supported by KVM today.
Background of TDX
TD private memory is designed to hold TD private content, encrypted by the CPU using the TD ephemeral key. An encryption engine holds a table of encryption keys, and an encryption key is selected for each memory transaction based on a Host Key Identifier (HKID). By design, the host VMM does not have access to the encryption keys.
In the first generation of MKTME, HKID is “stolen” from the physical address by allocating a configurable number of bits from the top of the physical address. The HKID space is partitioned into shared HKIDs for legacy MKTME accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared HKID on the host so that MKTME can be opaque or bypassed on the host.
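To make the bit arithmetic concrete, here is a standalone C sketch of extracting the HKID from a physical address. The widths used (phys_bits, keyid_bits) and the example PA are made-up values; real code would read the widths from CPUID and the TME activation MSR:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical widths; real code reads them from CPUID and the
         * TME activation MSR instead of hard-coding them. */
        unsigned int phys_bits = 46;   /* platform physical address width */
        unsigned int keyid_bits = 6;   /* bits "stolen" from the top */

        /* The HKID occupies the top keyid_bits of the physical address. */
        uint64_t hkid_mask = ((1ULL << keyid_bits) - 1)
                             << (phys_bits - keyid_bits);

        /* HKID 0 is the shared key used by the host; TDs get private HKIDs. */
        uint64_t pa = (2ULL << 40) | 0x1000;   /* example PA carrying HKID 2 */
        uint64_t hkid = (pa & hkid_mask) >> (phys_bits - keyid_bits);

        printf("hkid_mask=%#llx hkid=%llu\n",
               (unsigned long long)hkid_mask, (unsigned long long)hkid);
        return 0;
    }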
During TDX non-root operation (i.e. in a guest TD), memory accesses can be qualified as either shared or private, based on the value of a new SHARED bit in the Guest Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT (Extended Page Table), called "Shared EPT" in this document, which resides in host VMM memory. The Shared EPT is directly managed by the host VMM, the same as with current VMX. Since guest TDs usually require I/O and the data exchange needs to be done via shared memory, KVM needs the current EPT functionality even for TDs.
The CPU translates private GPAs using a separate Secure EPT. The Secure EPT pages are encrypted and integrity-protected with the TD’s ephemeral private key. Secure EPT can be managed _indirectly_ by the host VMM, using the TDX interface functions (SEAMCALLs), and thus conceptually Secure EPT is a subset of EPT because not all functionalities are available.
Since executing such interface functions takes much longer than accessing memory directly, KVM uses the existing TDP code to mirror the Secure EPT for the TD. There are at least two options today for when to execute such SEAMCALLs:
synchronous, i.e. while walking the TDP page tables, or
post-walk, i.e. record what needs to be done to the real Secure EPT during the walk, and execute SEAMCALLs later.
Option 1 seems more intuitive and simpler, but the Secure EPT concurrency rules are different from those of the TDP MMU or EPT. For example, TDH.MEM.SEPT.RD() acquires shared access to the whole Secure EPT tree of the target TD.
Secure EPT (SEPT) operations
Secure EPT is an Extended Page Table for translating TD private GPAs to HPAs. A Secure EPT is designed to be encrypted with the TD's ephemeral private key. SEPT pages are allocated by the host VMM via Intel TDX functions, but their content is intended to be hidden and is not architectural.
Unlike the conventional EPT, the CPU can't directly read/write its entries. Instead, the TDX SEAMCALL API is used; several SEAMCALLs correspond to operations on EPT entries.
TDH.MEM.SEPT.ADD():
Add a secure EPT page to the secure EPT tree. This corresponds to updating a non-leaf EPT entry with the present bit set.
TDH.MEM.SEPT.REMOVE():
Remove a secure EPT page from the secure EPT tree. There is no corresponding EPT operation.
TDH.MEM.SEPT.RD():
Read a secure EPT entry. This corresponds to reading an EPT entry as memory. Note that this is much slower than a direct memory read.
TDH.MEM.PAGE.ADD() and TDH.MEM.PAGE.AUG():
Add a private page to the secure EPT tree. This corresponds to updating a leaf EPT entry with the present bit set.
TDH.MEM.PAGE.REMOVE():
Remove a private page from the secure EPT tree. There is no corresponding EPT operation.
TDH.MEM.RANGE.BLOCK():
This (mostly) corresponds to clearing the present bit of a leaf EPT entry. Note that the private page is still linked in the secure EPT. To remove it from the secure EPT, TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE() need to be called.
TDH.MEM.TRACK():
Increment the TLB epoch counter. This (mostly) corresponds to an EPT TLB flush. Note that the private page is still linked in the secure EPT. To remove it from the secure EPT, TDH.MEM.PAGE.REMOVE() needs to be called.
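KVM would reach these operations through thin wrappers around the SEAMCALL instruction. The following standalone C sketch shows only the rough shape of such a wrapper; the leaf number, the GPA-plus-level argument encoding, the TDR argument, and the seamcall() stub are assumptions for illustration, not the TDX module ABI:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical leaf number; real values come from the TDX module spec. */
    enum { TDH_MEM_RANGE_BLOCK_LEAF = 7 };

    /* Stub standing in for the SEAMCALL instruction entering the TDX module. */
    static uint64_t seamcall(uint64_t leaf, uint64_t rcx, uint64_t rdx)
    {
        (void)leaf; (void)rcx; (void)rdx;
        return 0;   /* pretend TDX_SUCCESS */
    }

    /*
     * Illustrative wrapper: the memory-management SEAMCALLs roughly take the
     * GPA combined with the mapping level in one register and the TD's root
     * (TDR) page in another; treat the exact encoding as an assumption.
     */
    static uint64_t tdh_mem_range_block(uint64_t tdr, uint64_t gpa, int level)
    {
        return seamcall(TDH_MEM_RANGE_BLOCK_LEAF, gpa | (uint64_t)level, tdr);
    }

    int main(void)
    {
        /* Block the 4K (level 0) mapping at a hypothetical private GPA. */
        uint64_t err = tdh_mem_range_block(0, 0x8000000, 0);

        printf("err=%#llx\n", (unsigned long long)err);
        return 0;
    }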
Adding a private page
The procedure for populating a private page is as follows.
TDH.MEM.SEPT.ADD(512G level)
TDH.MEM.SEPT.ADD(1G level)
TDH.MEM.SEPT.ADD(2M level)
TDH.MEM.PAGE.AUG(4K level)
Those operations correspond to updating the EPT entries.
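As a rough illustration of this ladder, here is a minimal standalone C sketch. sept_page_present(), alloc_sept_page(), and the SEAMCALL wrappers are hypothetical stubs, and the level numbering (4 = 512G, 3 = 1G, 2 = 2M) is an assumption:

    #include <stdint.h>

    /* Hypothetical stubs: a mirrored-EPT lookup, SEPT page allocation, and
     * the SEAMCALL wrappers. */
    static int sept_page_present(uint64_t gpa, int level)
    { (void)gpa; (void)level; return 0; }
    static uint64_t alloc_sept_page(void) { return 0; }
    static int tdh_mem_sept_add(uint64_t gpa, int level, uint64_t sept_hpa)
    { (void)gpa; (void)level; (void)sept_hpa; return 0; }
    static int tdh_mem_page_aug(uint64_t gpa, uint64_t hpa)
    { (void)gpa; (void)hpa; return 0; }

    /*
     * Populate one 4K private page: walking top-down, add any missing
     * non-leaf secure EPT pages (512G, 1G, 2M levels, numbered 4..2 here),
     * then map the leaf with TDH.MEM.PAGE.AUG().
     */
    int populate_private_4k(uint64_t gpa, uint64_t hpa)
    {
        for (int level = 4; level >= 2; level--) {
            if (sept_page_present(gpa, level))
                continue;
            int r = tdh_mem_sept_add(gpa, level, alloc_sept_page());
            if (r)
                return r;
        }
        return tdh_mem_page_aug(gpa, hpa);
    }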
Dropping private page and TLB shootdown
The procedure for dropping a private page is as follows (a C sketch of the shootdown sequence appears after this list).
TDH.MEM.RANGE.BLOCK(4K level)
This mostly corresponds to clearing the present bit in the EPT entry. It prevents (or blocks) new TLB entries from being created in the future. Note that the private page is still linked in the secure EPT tree and existing TLB entries aren't flushed.
TDH.MEM.TRACK(range) and TLB shootdown
This mostly corresponds to the EPT TLB shootdown. Because all vcpus share the same Secure EPT, all vcpus need to flush their TLBs.
One vcpu issues TDH.MEM.TRACK(range), which increments the global internal TLB epoch counter.
send IPI to remote vcpus
The other vcpus exit from the guest TD to the VMM and then re-enter via TDH.VP.ENTER().
TDH.VP.ENTER() checks the TLB epoch counter and, if the vcpu's TLB is stale, flushes it.
Note that only a single vcpu issues TDH.MEM.TRACK().
Note that the private page is still linked in the secure EPT tree, unlike the conventional EPT.
TDH.MEM.PAGE.PROMOTE(), TDH.MEM.PAGE.DEMOTE(), TDH.MEM.PAGE.RELOCATE(), or TDH.MEM.PAGE.REMOVE()
There is no corresponding operation in the conventional EPT.
When changing the page size (e.g. 4K <-> 2M), TDH.MEM.PAGE.PROMOTE() or TDH.MEM.PAGE.DEMOTE() is used. During those operations, the guest page is kept referenced in the Secure EPT.
When migrating a page, TDH.MEM.PAGE.RELOCATE() is used. It requires both the source page and the destination page.
When destroying a TD, TDH.MEM.PAGE.REMOVE() removes the private page from the secure EPT tree. In this case a TLB shootdown is not needed because the vcpus don't run anymore.
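Putting the block-then-track sequence above into code, here is a minimal standalone C sketch; the SEAMCALL wrappers are hypothetical stubs, and kick_all_vcpus() stands in for KVM's request-and-IPI machinery:

    #include <stdint.h>

    /* Hypothetical SEAMCALL wrappers and IPI helper (stubs for illustration). */
    static int tdh_mem_range_block(uint64_t gpa, int level)
    { (void)gpa; (void)level; return 0; }
    static int tdh_mem_track(void) { return 0; }
    static int tdh_mem_page_remove(uint64_t gpa, int level)
    { (void)gpa; (void)level; return 0; }
    static void kick_all_vcpus(void)
    { /* send IPIs; vcpus exit and later re-enter via TDH.VP.ENTER() */ }

    /*
     * Drop one 4K private page: block new TLB fills for the range, have one
     * vcpu bump the TLB epoch, force the other vcpus through TDH.VP.ENTER()
     * (which flushes stale TLBs), then unlink the page from the secure EPT.
     */
    int drop_private_4k(uint64_t gpa)
    {
        int r = tdh_mem_range_block(gpa, 0 /* 4K level */);

        if (r)
            return r;
        r = tdh_mem_track();   /* one vcpu increments the TLB epoch */
        if (r)
            return r;
        kick_all_vcpus();      /* remote vcpus exit and re-enter */
        return tdh_mem_page_remove(gpa, 0 /* 4K level */);
    }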
The basic idea for TDX support
Because the shared EPT is the same as the existing EPT, use the existing logic for the shared EPT. The secure EPT, on the other hand, requires additional operations instead of directly reading/writing the EPT entry.
On an EPT violation, the KVM MMU walks down the EPT tree from the root, determines the EPT entry to operate on, and updates the entry. If necessary, a TLB shootdown is done. Because it's very slow to walk the secure EPT directly via the TDX SEAMCALL TDH.MEM.SEPT.RD(), a mirror of the secure EPT is created and maintained, and hooks are added to the KVM MMU to reuse the existing code.
EPT violation on private GPA
EPT violation on private GPA or zapping private GPA
        |
        V
walk down the mirror of secure EPT tree    <--  mirror of secure EPT tree
(mostly same as the existing code)              (KVM MMU software only;
        |                                        reuse of the existing code)
        V
update the (mirrored) EPT entry
(mostly same as the existing code)
        |
        V
call the hooks with what EPT entry is changed
        |
        |  NEW: hooks in KVM MMU
        V
the TDX backend calls the necessary TDX    <--  secure EPT root
SEAMCALLs to update the real secure EPT         (CPU refers)
The major modification is to add hooks for the TDX backend for the additional operations, to pass down which EPT (shared EPT or private EPT) is being operated on, and to twist the behavior when operating on the private EPT.
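As a sketch of what such hooks could look like, here is a hypothetical hook table; the struct and member names are invented for illustration and are not the actual kvm_x86_ops members:

    #include <stdint.h>

    typedef uint64_t gpa_t;
    typedef uint64_t kvm_pfn_t;

    /*
     * Hypothetical hook table the TDP MMU could call whenever a mirrored
     * private-EPT entry changes; a TDX backend would implement each hook
     * with the corresponding SEAMCALL(s).
     */
    struct private_ept_ops {
        /* non-leaf entry became present: TDH.MEM.SEPT.ADD() */
        int (*link_private_spt)(gpa_t gpa, int level, void *sept_page);
        /* leaf entry became present: TDH.MEM.PAGE.ADD()/TDH.MEM.PAGE.AUG() */
        int (*set_private_spte)(gpa_t gpa, int level, kvm_pfn_t pfn);
        /* leaf entry zapped: TDH.MEM.RANGE.BLOCK() */
        int (*zap_private_spte)(gpa_t gpa, int level);
        /* TLB shootdown: TDH.MEM.TRACK() plus IPIs to the other vcpus */
        void (*flush_private_tlbs)(void);
    };

The shared-EPT path would never consult such a table, keeping the normal VMX case unaffected.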
The following depicts the relationship.
              KVM                            |      TDX module
               |                             |          |
      ---------+---------                    |          |
      |                 |                    |          |
      V                 V                    |          |
  shared GPA       private GPA               |          V
CPU shared EPT    KVM private EPT            |    CPU secure EPT
    pointer          pointer                 |       pointer
      |                 |                    |          |
      |                 |                    |          |
      V                 V                    |          V
  shared EPT       private EPT <------mirror------> Secure EPT
      |                 |                    |          |
      |                 \--------------------+------\   |
      |                                      |      |   |
      V                                      |      V   V
shared guest page                            |  private guest page
                                             |
                                             |
    non-encrypted memory                     |   encrypted memory
                                             |
- shared EPT: CPU and KVM walk with shared GPA
Maintained by the existing code
- private EPT: KVM walks with private GPA
Maintained by the twisted existing code
- secure EPT: CPU walks with private GPA.
Maintained by TDX module with TDX SEAMCALLs via hooks
Tracking private EPT pages
Shared EPT pages are managed by struct kvm_mmu_page. They are linked in a list structure, and when necessary the list is traversed to operate on them. Private EPT pages have different characteristics. For example, private pages can't be swapped out. When shrinking memory, we'd like to traverse only shared EPT pages and skip private EPT pages. Likewise, page migration isn't supported for private pages (yet). Introduce an additional list so that shared EPT pages and private EPT pages are tracked independently, as sketched below.
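A kernel-flavored sketch of the idea, with simplified structures and invented field names (the actual list names in KVM may differ):

    #include <linux/list.h>
    #include <linux/types.h>

    /* Simplified stand-ins for the real structures. */
    struct kvm_mmu_page {
        struct list_head link;
        bool is_private;
    };

    struct kvm_mmu_lists {
        struct list_head active_mmu_pages;   /* shared EPT pages */
        struct list_head private_mmu_pages;  /* private EPT pages (new) */
    };

    /*
     * Track each EPT page on its own list so that, for example, the memory
     * shrinker can walk active_mmu_pages without ever touching private EPT
     * pages, which can't be swapped out.
     */
    static void account_mmu_page(struct kvm_mmu_lists *l,
                                 struct kvm_mmu_page *sp)
    {
        list_add(&sp->link, sp->is_private ? &l->private_mmu_pages
                                           : &l->active_mmu_pages);
    }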
At the beginning of an EPT violation, the fault handler knows the faulting GPA, so it knows which EPT to operate on, private or shared. If it's the private EPT, an additional task is done, something like "if (private) { call a hook }". Since the fault handler has deep function calls, it's cumbersome to pass around the information of which EPT is being operated on. Options to mitigate this are:
Pass the information as an argument for the function call.
Record the information in struct kvm_mmu_page somehow.
Record the information in vcpu structure.
Option 2 was chosen, because option 1 requires modifying all the functions in the call chain, which would badly affect the normal case, and option 3 doesn't work well because in some cases we need to walk both private and shared EPT.
The role of the EPT page can be utilized: one bit can be carved out from the unused bits in struct kvm_mmu_page_role. The information is initialized when allocating the EPT page. struct kvm_mmu_page is available in most places because we're operating on EPT pages.
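A minimal standalone sketch of carving out the bit; the field layout below is invented, since the real struct kvm_mmu_page_role packs many more fields into the same word:

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented layout in the spirit of struct kvm_mmu_page_role. */
    union mmu_page_role {
        uint32_t word;
        struct {
            uint32_t level:4;
            uint32_t direct:1;
            uint32_t is_private:1;   /* NEW: page maps private GPAs */
            uint32_t unused:26;
        };
    };

    struct mmu_page {
        union mmu_page_role role;
    };

    /* Initialized once at page allocation, queried on every walk. */
    bool is_private_sp(const struct mmu_page *sp)
    {
        return sp->role.is_private;
    }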
The original TDP MMU and race condition
Because vcpus share the EPT, once an EPT entry is zapped, we need a TLB shootdown: send IPIs to the remote vcpus and have them flush their own TLBs. Until the TLB shootdown is done, vcpus may still reference the zapped guest page.
The TDP MMU takes the read lock of mmu_lock to mitigate vcpu contention; correctness then depends on atomic updates of the EPT entry. (The legacy MMU, on the other hand, takes the write lock.) When a vcpu is populating/zapping an EPT entry with the read lock held, another vcpu may be populating or zapping the same EPT entry at the same time.
To avoid this race condition, the entry is frozen: the EPT entry is set to the special value REMOVED_SPTE, which has the present bit cleared. Then, after the TLB shootdown, the EPT entry is updated to the final value.
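A standalone C sketch of the freezing step, assuming an invented REMOVED_SPTE encoding:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Invented non-present marker value; the real encoding differs. */
    #define REMOVED_SPTE ((uint64_t)0x5a0)

    /*
     * Freeze an EPT entry under the read lock: the compare-and-exchange
     * succeeds only if nobody changed the entry since we read old_spte.
     * On failure another vcpu raced with us and the page fault is restarted.
     */
    bool freeze_spte(_Atomic uint64_t *sptep, uint64_t old_spte)
    {
        return atomic_compare_exchange_strong(sptep, &old_spte, REMOVED_SPTE);
    }

    /* After the TLB shootdown, publish the final value. */
    void unfreeze_spte(_Atomic uint64_t *sptep, uint64_t new_spte)
    {
        atomic_store(sptep, new_spte);
    }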
Concurrent zapping
read lock
freeze the EPT entry (atomically set the value to REMOVED_SPTE); if another vcpu has already frozen the entry, restart the page fault.
TLB shootdown
send IPI to remote vcpus
TLB flush (local and remote)
For each entry update, a TLB shootdown is needed because of the concurrency.
atomically set the EPT entry to the final value
read unlock
Concurrent populating
In the case of populating a non-present EPT entry, atomically update the EPT entry.
read lock
atomically update the EPT entry; if another vcpu has frozen or updated the entry, restart the page fault.
read unlock
In the case of updating a present EPT entry (e.g. page migration), the operation is split into two: zapping the entry and populating the entry.
read lock
zap the EPT entry, following the concurrent zapping case.
populate the non-present EPT entry.
read unlock
Non-concurrent batched zapping
In some cases, zapping ranges is done exclusively with the write lock held. In this case, the TLB shootdowns are batched into one.
write lock
zap the EPT entries by traversing them
TLB shootdown
write unlock
For Secure EPT, TDX SEAMCALLs are needed in addition to updating the mirrored EPT entry.
TDX concurrent zapping
Add a hook for TDX SEAMCALLs at the step of the TLB shootdown.
read lock
freeze the EPT entry (set the value to REMOVED_SPTE)
TLB shootdown via a hook
TDH.MEM.RANGE.BLOCK()
TDH.MEM.TRACK()
send IPI to remote vcpus
set the EPT entry to the final value
read unlock
TDX concurrent populating
TDX SEAMCALLs are required in addition to operating on the mirrored EPT entry. To avoid the race condition, the entry is frozen, following the zapping case, and a hook is added.
read lock
freeze the EPT entry
hook
TDH.MEM.SEPT.ADD() for a non-leaf entry, or TDH.MEM.PAGE.AUG() for a leaf entry.
set the EPT entry to the final value
read unlock
Without freezing the entry, the following race can happen. Suppose two vcpus are faulting on the same GPA and the 2M and 4K level entries aren’t populated yet.
vcpu 1: update 2M level EPT entry
vcpu 2: update 4K level EPT entry
vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
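Freezing serializes the two vcpus: the SEAMCALL is issued only while the mirrored entry is frozen, so vcpu 2's walk sees the non-present 2M entry and retries instead of racing ahead of vcpu 1's TDH.MEM.SEPT.ADD(). Here is a standalone C sketch of the populate path under this scheme; REMOVED_SPTE, RET_PF_RETRY, and the SEAMCALL wrappers are invented stand-ins:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define REMOVED_SPTE ((uint64_t)0x5a0)   /* invented marker, as above */
    enum { RET_PF_RETRY = 1 };

    static bool freeze_spte(_Atomic uint64_t *sptep, uint64_t old_spte)
    {
        return atomic_compare_exchange_strong(sptep, &old_spte, REMOVED_SPTE);
    }

    /* Hypothetical SEAMCALL wrappers (stubs). */
    static int tdh_mem_sept_add(uint64_t gpa, int level)
    { (void)gpa; (void)level; return 0; }
    static int tdh_mem_page_aug(uint64_t gpa, uint64_t hpa)
    { (void)gpa; (void)hpa; return 0; }

    /* Populate one private EPT entry: freeze, SEAMCALL, then publish. */
    int tdx_populate_spte(_Atomic uint64_t *sptep, uint64_t old_spte,
                          uint64_t new_spte, uint64_t gpa, int level,
                          uint64_t hpa, bool leaf)
    {
        if (!freeze_spte(sptep, old_spte))
            return RET_PF_RETRY;   /* lost the race: restart the fault */

        int r = leaf ? tdh_mem_page_aug(gpa, hpa)
                     : tdh_mem_sept_add(gpa, level);
        if (r) {
            atomic_store(sptep, old_spte);   /* unfreeze on failure */
            return r;
        }
        atomic_store(sptep, new_spte);       /* final value; entry now valid */
        return 0;
    }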
TDX non-concurrent batched zapping
For simplicity, the procedure of TDX concurrent zapping is utilized. The procedure can be optimized later.
Co-existing with unmapping guest private memory
TODO. This needs to be addressed.
Optimizing TLB flush
It's inefficient to issue TDH.MEM.TRACK() for each EPT entry. Similar to the EPT TLB flush, multiple TDH.MEM.TRACK() calls and IPIs (TLB shootdowns) can be combined into one TDH.MEM.TRACK() and one IPI. After the TLB shootdown, the PFNs are still needed to unlink the private pages from the secure EPT, so each PFN needs to be stashed somewhere. The choice is to keep the PFN in the EPT entry with a special flag, SPTE_PRIVATE_ZAPPED, with the present flag cleared, and to handle such EPT entries specially: later, the PFN is retrieved, the private page is unlinked from the secure EPT, and the EPT entry is cleared to the normal zapped value. A sketch of the encoding follows the steps below.
lock
loop on EPT entries:
- set SPTE_PRIVATE_ZAPPED
- keep the PFN
- clear the other bits
- TDH.MEM.RANGE.BLOCK()
TLB shootdown via a hook, kvm_flush_remote_tlbs_with_address():
- TDH.MEM.TRACK()
- send IPI to remote vcpus
loop on EPT entries:
- check if SPTE_PRIVATE_ZAPPED is set
- get the PFN
- unlink the private page from the secure EPT if necessary
- reset the EPT entry to the initial zapped value (clear SPTE_PRIVATE_ZAPPED and the PFN)
unlock
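A standalone sketch of such an SPTE encoding; the bit positions and PFN mask below are invented for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented SPTE bits: a non-present entry that still carries the PFN so
     * the post-shootdown pass can unlink the page from the secure EPT. */
    #define SPTE_PRESENT         (1ULL << 0)
    #define SPTE_PRIVATE_ZAPPED  (1ULL << 62)
    #define SPTE_PFN_MASK        0x000ffffffffff000ULL

    /* Encode a blocked-but-not-yet-removed private mapping. */
    uint64_t make_private_zapped(uint64_t pfn)
    {
        return SPTE_PRIVATE_ZAPPED | ((pfn << 12) & SPTE_PFN_MASK);
    }

    bool is_private_zapped(uint64_t spte)
    {
        return !(spte & SPTE_PRESENT) && (spte & SPTE_PRIVATE_ZAPPED);
    }

    /* After the batched TDH.MEM.TRACK() and IPI, recover the PFN, unlink
     * the page, then clear the entry to the normal zapped value. */
    uint64_t private_zapped_pfn(uint64_t spte)
    {
        return (spte & SPTE_PFN_MASK) >> 12;
    }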
Restrictions or future work
The following features aren't supported yet:
optimizing non-concurrent zap
Page migration