path: root/arch/x86/kvm/svm
* KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active
  Maciej S. Szmigiero, 2025-09-10 (1 file: -2/+1)

  Commit 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") inhibited the pre-VMRUN sync of TPR from the LAPIC into VMCB::V_TPR in sync_lapic_to_cr8() when AVIC is active.

  AVIC does automatically sync between these two fields, however it does so only on explicit guest writes to one of them, not on a bare VMRUN. This meant that when AVIC is enabled, host changes to TPR in the LAPIC state might not get automatically copied into the V_TPR field of the VMCB. This is especially true when userspace sets the LAPIC state via the KVM_SET_LAPIC ioctl(), since userspace does not have access to the guest VMCB.

  Practice shows that it is V_TPR, not the TPR value in TASKPRI, that the AVIC actually uses to decide whether to issue pending interrupts to the CPU, so any leftover value in V_TPR will cause serious interrupt delivery issues in the guest when AVIC is enabled.

  Fix this issue by doing the pre-VMRUN TPR sync from the LAPIC into VMCB::V_TPR even when AVIC is enabled.

  Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
  Cc: [email protected]
  Signed-off-by: Maciej S. Szmigiero <[email protected]>
  Reviewed-by: Naveen N Rao (AMD) <[email protected]>
  Link: https://lore.kernel.org/r/c231be64280b1461e854e1ce3595d70cde3a2e9d.1756139678.git.maciej.szmigiero@oracle.com
  [sean: tag for stable@]
  Signed-off-by: Sean Christopherson <[email protected]>

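  A minimal sketch of the fixed sync path, assuming the int_ctl/V_TPR layout implied by the commit text; the exact upstream code may differ:

      static void sync_lapic_to_cr8(struct kvm_vcpu *vcpu)
      {
              struct vcpu_svm *svm = to_svm(vcpu);
              u64 cr8;

              /*
               * Do NOT skip the sync when AVIC is active: hardware only
               * propagates guest-initiated TPR writes, so host updates
               * (e.g. KVM_SET_LAPIC) would otherwise leave V_TPR stale.
               */
              if (nested_svm_virtualize_tpr(vcpu))
                      return;

              cr8 = kvm_get_cr8(vcpu);
              svm->vmcb->control.int_ctl &= ~V_TPR_MASK;
              svm->vmcb->control.int_ctl |= (cr8 & V_TPR_MASK);
      }
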
* KVM: SEV: don't check have_run_cpus in sev_writeback_caches()
  Yury Norov, 2025-08-18 (1 file: -7/+3)

  Drop KVM's check for an empty cpumask when flushing caches as memory is being reclaimed from an SEV VM, as smp_call_function_many_cond() naturally (and correctly) handles an empty cpumask. This avoids an extra O(n) lookup in the common case where at least one pCPU has entered the guest, which could be noticeable in some setups, e.g. if a small VM is pinned to the last few pCPUs in the system.

  Fixes: 6f38f8c57464 ("KVM: SVM: Flush cache only on CPUs running SEV guest")
  Signed-off-by: Yury Norov (NVIDIA) <[email protected]>
  [sean: rewrite changelog to capture performance angle]
  Signed-off-by: Sean Christopherson <[email protected]>

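  The shape of the simplification, as a hedged sketch; the mask-targeted flush helper name is an assumption, not the literal upstream diff:

      static void sev_writeback_caches(struct kvm *kvm)
      {
              struct kvm_sev_info *sev = to_kvm_sev_info(kvm);

              /*
               * Previously: if (cpumask_empty(sev->have_run_cpus)) return;
               * That scan is redundant because the IPI machinery underneath
               * (smp_call_function_many_cond()) already walks the mask and
               * simply does nothing when it is empty.
               */
              wbnoinvd_on_cpus_mask(sev->have_run_cpus);
      }
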
* Merge tag 'kvm-x86-sev-6.17' of https://github.com/kvm-x86/linux into HEAD
  Paolo Bonzini, 2025-07-29 (2 files: -28/+83)

  KVM SEV cache maintenance changes for 6.17

  - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM.

  - Use WBNOINVD instead of WBINVD when possible, for SEV cache maintenance, e.g. to minimize collateral damage when reclaiming memory from an SEV guest.

  - When reclaiming memory from an SEV guest, only do cache flushes on CPUs that have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs that can't possibly have cache lines with dirty, encrypted data.

| * KVM: SVM: Flush cache only on CPUs running SEV guest
  Zheyun Shen, 2025-07-14 (2 files: -9/+63)

  Because AMD CPUs don't ensure cache consistency for SEV guest memory, each memory page reclamation in an SEV guest triggers a WBNOINVD/WBINVD on all CPUs, thereby affecting the performance of other programs on the host. Typically, an AMD server may have 128 cores or more, while the SEV guest might only utilize 8 of those cores. Meanwhile, the host can use qemu-affinity to bind those 8 vCPUs to specific physical CPUs. Therefore, keeping a record of the physical core numbers each time a vCPU runs can help avoid flushing the cache on all CPUs every time.

  Take care to allocate the cpumask used to track which CPUs have run a vCPU when copying or moving an "encryption context", as nothing guarantees memory in a mirror VM is a strict subset of the ASID owner, and the destination VM for intrahost migration needs to maintain its own set of CPUs. E.g. for intrahost migration, if a CPU was used for the source VM but not the destination VM, then it can only have cached memory that was accessible to the source VM. And a CPU that ran in the source and is also used by the destination is no different than a CPU that ran in the destination only.

  Note, KVM is guaranteed to flush caches prior to sev_vm_destroy(), thanks to kvm_arch_guest_memory_reclaimed() for SEV and SEV-ES, and kvm_arch_gmem_invalidate() for SEV-SNP. I.e. it's safe to free the cpumask prior to unregistering encrypted regions and freeing the ASID.

  Opportunistically clean up sev_vm_destroy()'s comment regarding what is (and, implicitly, what isn't) skipped for mirror VMs.

  Cc: Srikanth Aithal <[email protected]>
  Reviewed-by: Tom Lendacky <[email protected]>
  Signed-off-by: Zheyun Shen <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Link: https://lore.kernel.org/all/[email protected]
  Co-developed-by: Sean Christopherson <[email protected]>
  Signed-off-by: Sean Christopherson <[email protected]>

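  A sketch of the tracking scheme under assumed names (a have_run_cpus cpumask in the per-VM SEV info, and a mask-targeted flush helper); illustrative, not the literal patch:

      /* On vCPU load, remember the physical CPU for targeted flushes later. */
      static void sev_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
      {
              struct kvm_sev_info *sev = to_kvm_sev_info(vcpu->kvm);

              /* Atomic bit op: vCPUs from the same VM can load concurrently. */
              cpumask_set_cpu(cpu, sev->have_run_cpus);
      }

      /* On reclaim, write back only CPUs that may hold dirty encrypted lines. */
      static void sev_writeback_caches(struct kvm *kvm)
      {
              wbnoinvd_on_cpus_mask(to_kvm_sev_info(kvm)->have_run_cpus);
      }
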
| * KVM: SEV: Prefer WBNOINVD over WBINVD for cache maintenance efficiency
  Kevin Loughlin, 2025-07-10 (1 file: -19/+26)

  AMD CPUs currently execute WBINVD in the host when unregistering SEV guest memory or when deactivating SEV guests. Such cache maintenance is performed to prevent data corruption, wherein the encrypted (C=1) version of a dirty cache line might otherwise only be written back after the memory is written in a different context (ex: C=0), yielding corruption.

  However, WBINVD is performance-costly, especially because it invalidates processor caches. Strictly speaking, unless the SEV ASID is being recycled (meaning the SNP firmware requires the use of WBINVD prior to DF_FLUSH), the cache invalidation triggered by WBINVD is unnecessary; only the writeback is needed to prevent data corruption in the remaining scenarios.

  To improve performance in these scenarios, use WBNOINVD when available instead of WBINVD. WBNOINVD still writes back all dirty lines (preventing host data corruption by SEV guests) but does *not* invalidate processor caches. Note that the implementation of wbnoinvd() ensures a fallback to WBINVD if WBNOINVD is unavailable.

  In anticipation of forthcoming optimizations to limit the WBNOINVD to only the physical CPUs that have executed SEV guests, place the call to wbnoinvd_on_all_cpus() in a wrapper function sev_writeback_caches().

  Signed-off-by: Kevin Loughlin <[email protected]>
  Reviewed-by: Mingwei Zhang <[email protected]>
  Reviewed-by: Tom Lendacky <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  [sean: tweak comment regarding CLFLUSH]
  Cc: Francesco Lavra <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

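  One way the safe fallback can be expressed; the alternatives-based definition below is our reading of how wbnoinvd() behaves per the commit's claim, so treat it as a sketch:

      /* Write back dirty cache lines, skipping the invalidate when possible. */
      static __always_inline void wbnoinvd(void)
      {
              /*
               * WBNOINVD is WBINVD with a REP (F3) prefix; CPUs without
               * X86_FEATURE_WBNOINVD ignore the prefix and execute plain
               * WBINVD, so falling back is architecturally safe.
               */
              alternative("wbinvd", "repz; wbinvd", X86_FEATURE_WBNOINVD);
      }
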
| * KVM: SVM: Remove wbinvd in sev_vm_destroy()
  Zheyun Shen, 2025-07-10 (1 file: -6/+0)

  Before sev_vm_destroy() is called, kvm_arch_guest_memory_reclaimed() has been called for SEV and SEV-ES, and kvm_arch_gmem_invalidate() has been called for SEV-SNP. These functions have already handled flushing the memory. Therefore, this wbinvd_on_all_cpus() can simply be dropped.

  Suggested-by: Sean Christopherson <[email protected]>
  Signed-off-by: Zheyun Shen <[email protected]>
  Reviewed-by: Tom Lendacky <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

* | Merge tag 'kvm-x86-svm-6.17' of https://github.com/kvm-x86/linux into HEAD
  Paolo Bonzini, 2025-07-29 (1 file: -5/+1)

  KVM SVM changes for 6.17

  Drop KVM's rejection of SNP's SMT and single-socket policy restrictions, and instead rely on firmware to verify that the policy can actually be supported. Don't bother checking that a requested policy can actually be satisfied, as an incompatible policy doesn't put the kernel at risk in any way, and providing guarantees with respect to the physical topology is outside of KVM's purview.

| * | KVM: SVM: Allow SNP guest policy to specify SINGLE_SOCKET
  Tom Lendacky, 2025-06-20 (1 file: -3/+0)

  KVM currently returns -EINVAL when it attempts to create an SNP guest if the SINGLE_SOCKET guest policy bit is set. The reason for this is that KVM would need specific support (SNP_ACTIVATE_EX command support) to honor the bit when running on a system with more than one socket. However, the SEV firmware will make the proper check and return POLICY_FAILURE during SNP_ACTIVATE if the single-socket guest policy bit is set and the system has more than one socket:

  - System with one socket
    - Guest policy SINGLE_SOCKET == 0 ==> SNP_ACTIVATE succeeds
    - Guest policy SINGLE_SOCKET == 1 ==> SNP_ACTIVATE succeeds
  - System with more than one socket
    - Guest policy SINGLE_SOCKET == 0 ==> SNP_ACTIVATE succeeds
    - Guest policy SINGLE_SOCKET == 1 ==> SNP_ACTIVATE fails with POLICY_FAILURE

  Remove the check for the SINGLE_SOCKET policy bit from snp_launch_start() and allow the firmware to perform the proper checking. This does have the effect of allowing an SNP guest with the SINGLE_SOCKET policy bit set to run on a single-socket system, but fail when run on a system with more than one socket. However, this should not affect existing SNP guests, as setting the SINGLE_SOCKET policy bit is not allowed today.

  Signed-off-by: Tom Lendacky <[email protected]>
  Link: https://lore.kernel.org/r/4c51018dd3e4f2c543935134d2c4f47076f109f6.1748553480.git.thomas.lendacky@amd.com
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Allow SNP guest policy to disallow running with SMT enabled
  Tom Lendacky, 2025-06-20 (1 file: -2/+1)

  KVM currently returns -EINVAL when it attempts to create an SNP guest if the SMT guest policy bit is not set. However, there is no reason to check this, as no KVM-specific support is required to support it. The SEV firmware will determine whether SMT has been enabled or disabled in the BIOS and process the policy in the proper way:

  - SMT enabled in BIOS
    - Guest policy SMT == 0 ==> SNP_LAUNCH_START fails with POLICY_FAILURE
    - Guest policy SMT == 1 ==> SNP_LAUNCH_START succeeds
  - SMT disabled in BIOS
    - Guest policy SMT == 0 ==> SNP_LAUNCH_START succeeds
    - Guest policy SMT == 1 ==> SNP_LAUNCH_START succeeds

  Remove the check for the SMT policy bit from snp_launch_start() and allow the firmware to perform the proper checking.

  Signed-off-by: Tom Lendacky <[email protected]>
  Link: https://lore.kernel.org/r/71043abdd9ef23b6f98fffa9c5c6045ac3a50187.1748553480.git.thomas.lendacky@amd.com
  Signed-off-by: Sean Christopherson <[email protected]>

* | Merge tag 'kvm-x86-mmu-6.17' of https://github.com/kvm-x86/linux into HEAD
  Paolo Bonzini, 2025-07-29 (1 file: -0/+2)

  KVM x86 MMU changes for 6.17

  - Exempt nested EPT from the !USER + CR0.WP logic, as EPT doesn't interact with CR0.WP.

  - Move the TDX hardware setup code to tdx.c to better co-locate TDX code and eliminate a few global symbols.

  - Dynamically allocate the shadow MMU's hashed page list, and defer allocating the hashed list until it's actually needed (the TDP MMU doesn't use the list).

| * | KVM: x86: Use kvzalloc() to allocate VM struct
  Sean Christopherson, 2025-06-24 (1 file: -0/+2)

  Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical allocation before falling back to __vmalloc(), to avoid the overhead of establishing the virtual mappings. For non-debug builds, the SVM and VMX (and TDX) structures are now just below 7000 bytes in the worst-case scenario, i.e. are order-1 allocations, and will likely remain that way for quite some time.

  Add compile-time assertions in vendor code to ensure the size of the structures, sans the memslot hash tables, are order-0 allocations, i.e. are less than 4KiB. There's nothing fundamentally wrong with a larger kvm_{svm,vmx,tdx} size, but given that the size of the structure (without the memslot hash tables) is below 2KiB after 18+ years of existence, more than doubling the size would be quite notable.

  Add sanity checks on the memslot hash table sizes, partly to ensure they aren't resized without accounting for the impact on VM structure size, and partly to document that the majority of the size of VM structures comes from the memslots.

  Reviewed-by: Xiaoyao Li <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

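  A hedged sketch of the allocation plus the kind of compile-time guard described; KVM_MEMSLOT_HASH_SIZE and the function shape are illustrative, not upstream names:

      /* Try a contiguous allocation first; kvzalloc() falls back to vmalloc. */
      static struct kvm *svm_vm_alloc(void)
      {
              struct kvm_svm *kvm_svm;

              /* Keep the non-memslot portion an order-0 (< 4KiB) allocation. */
              BUILD_BUG_ON(sizeof(struct kvm_svm) - KVM_MEMSLOT_HASH_SIZE > PAGE_SIZE);

              kvm_svm = kvzalloc(sizeof(*kvm_svm), GFP_KERNEL_ACCOUNT);
              return kvm_svm ? &kvm_svm->kvm : NULL;
      }
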
* | Merge tag 'kvm-x86-misc-6.17' of https://github.com/kvm-x86/linux into HEAD
  Paolo Bonzini, 2025-07-29 (4 files: -406/+357)

  KVM x86 misc changes for 6.17

  - Preserve the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the guest. Failure to honor FREEZE_IN_SMM can bleed host state into the guest.

  - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF.

  - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model.

  - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical.

  - Recalculate all MSR intercepts from the "source" on MSR filter changes, and drop the dedicated "shadow" bitmaps (and their awful "max" size defines).

  - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR.

  - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently.

  - Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED, a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON). Use the same approach KVM uses for dealing with "impossible" emulation when running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU has architecturally impossible state.

  - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can "virtualize" APERF/MPERF (with many caveats).

  - Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default" frequency is unsupported for VMs with a "secure" TSC, and there's no known use case for changing the default frequency for other VM types.

| * | KVM: x86: Provide a capability to disable APERF/MPERF read intercepts
  Jim Mattson, 2025-07-09 (2 files: -1/+8)

  Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs without interception.

  The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not handled at all. The MSR values are not zeroed on vCPU creation, saved on suspend, or restored on resume. No accommodation is made for processor migration or for sharing a logical processor with other tasks. No adjustments are made for non-unit TSC multipliers. The MSRs do not account for time the same way as the comparable PMU events, whether the PMU is virtualized by the traditional emulation method or the new mediated pass-through approach.

  Nonetheless, in a properly constrained environment, this capability can be combined with a guest CPUID table that advertises support for CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the effective physical CPU frequency in /proc/cpuinfo. Moreover, there is no performance cost for this capability.

  Signed-off-by: Jim Mattson <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Reviewed-by: Xiaoyao Li <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

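  How userspace might opt in, sketched against the existing KVM_CAP_X86_DISABLE_EXITS capability; pairing the new flag with that capability is our assumption from the merge-tag text:

      #include <sys/ioctl.h>
      #include <linux/kvm.h>
      #include <err.h>

      static void disable_aperfmperf_exits(int vm_fd)
      {
              struct kvm_enable_cap cap = {
                      .cap = KVM_CAP_X86_DISABLE_EXITS,
                      .args[0] = KVM_X86_DISABLE_EXITS_APERFMPERF,
              };

              /* VM-wide setting; assumed to require no vCPUs created yet. */
              if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
                      err(1, "KVM_ENABLE_CAP(DISABLE_EXITS_APERFMPERF)");
      }
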
| * | KVM: x86: Replace growing set of *_in_guest bools with a u64
  Jim Mattson, 2025-07-09 (1 file: -1/+1)

  Store each "disabled exit" boolean in a single bit rather than a byte. No functional change intended.

  Suggested-by: Sean Christopherson <[email protected]>
  Signed-off-by: Jim Mattson <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Reviewed-by: Xiaoyao Li <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Simplify MSR interception logic for IA32_XSS MSR
  Chao Gao, 2025-06-25 (1 file: -5/+3)

  Use svm_set_intercept_for_msr() directly to configure IA32_XSS MSR interception, ensuring consistency with other cases where MSRs are intercepted depending on guest capabilities and CPUID. No functional change intended.

  Signed-off-by: Chao Gao <[email protected]>
  Reviewed-by: Dapeng Mi <[email protected]>
  Reviewed-by: Binbin Wu <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: x86: Deduplicate MSR interception enabling and disabling
  Chao Gao, 2025-06-24 (2 files: -26/+13)

  Extract a common function from the MSR interception disabling logic and build the disabling and enabling functions on top of it. This removes most of the duplicated code for MSR interception disabling/enabling. No functional change intended.

  Signed-off-by: Chao Gao <[email protected]>
  Reviewed-by: Dapeng Mi <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  [sean: s/enable/set, inline the wrappers]
  Signed-off-by: Sean Christopherson <[email protected]>

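  The resulting shape, per the API names quoted elsewhere in this log and the "[sean: s/enable/set, inline the wrappers]" note; a sketch, not the exact upstream wrappers:

      /* The common primitive: intercept=true/false covers both directions. */
      void svm_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
                                     int type, bool intercept);

      static inline void svm_enable_intercept_for_msr(struct kvm_vcpu *vcpu,
                                                      u32 msr, int type)
      {
              svm_set_intercept_for_msr(vcpu, msr, type, true);
      }

      static inline void svm_disable_intercept_for_msr(struct kvm_vcpu *vcpu,
                                                       u32 msr, int type)
      {
              svm_set_intercept_for_msr(vcpu, msr, type, false);
      }
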
| * | KVM: x86: Simplify userspace filter logic when disabling MSR interception
  Sean Christopherson, 2025-06-20 (1 file: -14/+10)

  Refactor {svm,vmx}_disable_intercept_for_msr() to simplify the handling of userspace filters that disallow access to an MSR. The more complicated logic is no longer needed or justified now that KVM recalculates all MSR intercepts on a userspace MSR filter change, i.e. now that KVM doesn't need to also update shadow bitmaps. No functional change intended.

  Suggested-by: Dapeng Mi <[email protected]>
  Reviewed-by: Dapeng Mi <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Add a helper to allocate and initialize permissions bitmaps
  Sean Christopherson, 2025-06-20 (2 files: -17/+22)

  Add a helper to allocate and initialize an MSR or I/O permissions map, as the logic is identical between the two map types; the only difference is the size of the bitmap. Opportunistically add a comment to explain why the bitmaps are initialized with 0xff, e.g. instead of the more common zero-initialized behavior, which is the main motivation for deduplicating the code.

  No functional change intended.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

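  A plausible shape for such a helper; the name and allocation details are assumptions, the key point is the 0xff default:

      static void *svm_alloc_permissions_map(unsigned long size, gfp_t gfp_mask)
      {
              void *pm = (void *)__get_free_pages(gfp_mask, get_order(size));

              if (!pm)
                      return NULL;

              /*
               * A set bit means "intercept", so start with everything
               * intercepted (0xff) and punch holes for direct access later,
               * rather than the usual zero-initialization.
               */
              memset(pm, 0xff, size);
              return pm;
      }
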
| * | KVM: nSVM: Merge MSRPM in 64-bit chunks on 64-bit kernels
  Sean Christopherson, 2025-06-20 (1 file: -10/+11)

  When merging L0 and L1 MSRPMs as part of nested VMRUN emulation, access the bitmaps using "unsigned long" chunks, i.e. use 8-byte accesses for 64-bit kernels instead of arbitrarily working on 4-byte chunks.

  Opportunistically rename local variables in nested_svm_merge_msrpm() to more precisely/accurately reflect their purpose ("offset" in particular is extremely ambiguous).

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

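  The merge loop, sketched and simplified to walk the whole bitmap (upstream restricts itself to the offsets that can matter); since '1' means "intercept", the merge is a bitwise OR:

      static bool nested_svm_merge_msrpm(struct kvm_vcpu *vcpu)
      {
              unsigned long *msrpm02 = to_svm(vcpu)->nested.msrpm;
              unsigned long *msrpm01 = to_svm(vcpu)->msrpm;
              gpa_t gpa = to_svm(vcpu)->nested.ctl.msrpm_base_pa;
              int i;

              for (i = 0; i < MSRPM_SIZE / sizeof(unsigned long); i++) {
                      unsigned long l1_bits;

                      /* One uaccess-batched read per 8-byte chunk of vmcb12's map. */
                      if (kvm_vcpu_read_guest(vcpu, gpa + i * sizeof(l1_bits),
                                              &l1_bits, sizeof(l1_bits)))
                              return false;

                      /* Intercept if either L0 (vmcb01) or L1 (vmcb12) intercepts. */
                      msrpm02[i] = msrpm01[i] | l1_bits;
              }
              return true;
      }
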
| * | KVM: SVM: Return -EINVAL instead of MSR_INVALID to signal out-of-range MSR
  Sean Christopherson, 2025-06-20 (2 files: -11/+9)

  Return -EINVAL instead of MSR_INVALID from svm_msrpm_bit_nr() to indicate that the MSR isn't covered by one of the (currently) three MSRPM ranges, and delete the MSR_INVALID macro now that all users are gone.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: nSVM: Access MSRPM in 4-byte chunks only for merging L0 and L1 bitmaps
  Sean Christopherson, 2025-06-20 (1 file: -34/+18)

  Access the MSRPM using u32/4-byte chunks (and appropriately adjusted offsets) only when merging L0 and L1 bitmaps as part of emulating VMRUN. The only reason to batch accesses to MSRPMs is to avoid the overhead of uaccess operations (e.g. STAC/CLAC and bounds checks) when reading L1's bitmap pointed at by vmcb12. For all other uses, either per-bit accesses are more than fast enough (no uaccess), or KVM is only accessing a single bit (nested_svm_exit_handled_msr()) and so there's nothing to batch.

  In addition to (hopefully) documenting the uniqueness of the merging code, restricting chunked access to _just_ the merging code will allow for increasing the chunk size (to unsigned long) with minimal risk.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Store MSRPM pointer as "void *" instead of "u32 *"
  Sean Christopherson, 2025-06-20 (3 files: -10/+15)

  Store KVM's MSRPM pointers as "void *" instead of "u32 *" to guard against directly accessing the bitmaps outside of code that is explicitly written to access the bitmaps with a specific type.

  Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of open coding an equivalent.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Move svm_msrpm_offset() to nested.c
  Sean Christopherson, 2025-06-20 (3 files: -24/+23)

  Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the u32-index offsets is nested virtualization specific. No functional change intended.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Drop explicit check on MSRPM offset when emulating SEV-ES accesses
  Sean Christopherson, 2025-06-20 (1 file: -3/+2)

  Now that msr_write_intercepted() defaults to true, i.e. accurately reflects hardware behavior for out-of-range MSRs, and doesn't WARN (or BUG) on an out-of-range MSR, drop sev_es_prevent_msr_access()'s svm_msrpm_offset() check that guarded against calling msr_write_intercepted() with a "bad" index.

  Opportunistically clean up the helper's formatting.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Merge "after set CPUID" intercept recalc helpers
  Sean Christopherson, 2025-06-20 (1 file: -11/+7)

  Merge svm_recalc_intercepts_after_set_cpuid() and svm_recalc_instruction_intercepts() such that the "after set CPUID" helper simply invokes the type-specific helpers (MSRs vs. instructions), i.e. make svm_recalc_intercepts_after_set_cpuid() a single entry point for all intercept updates that need to be performed after a CPUID change. No functional change intended.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Fold svm_vcpu_init_msrpm() into its sole caller
  Sean Christopherson, 2025-06-20 (1 file: -16/+11)

  Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there is only the one caller (and because the "init" misnomer is even more misleading than it was in the past). No functional change intended.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Rename init_vmcb_after_set_cpuid() to make it intercepts specific
  Sean Christopherson, 2025-06-20 (1 file: -3/+4)

  Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid() to more precisely describe its role. Strictly speaking, the name isn't perfect, as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an intercept, but practically speaking it's close enough. No functional change intended.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: x86: Rename msr_filter_changed() => recalc_msr_intercepts()
  Sean Christopherson, 2025-06-20 (1 file: -7/+1)

  Rename msr_filter_changed() to recalc_msr_intercepts() and drop the trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc helper to react to the new userspace filter. No functional change intended.

  Reviewed-by: Xin Li (Intel) <[email protected]>
  Reviewed-by: Binbin Wu <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Manually recalc all MSR intercepts on userspace MSR filter change
  Sean Christopherson, 2025-06-20 (3 files: -280/+97)

  On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden.

  Given that KVM *must* be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps.

  Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate.

  Link: https://lore.kernel.org/all/[email protected]
  Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
  Cc: Francesco Lavra <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: x86: Move definition of X2APIC_MSR() to lapic.h
  Sean Christopherson, 2025-06-20 (1 file: -2/+0)

  Dedup the definition of X2APIC_MSR and put it in the local APIC code where it belongs. No functional change intended.

  Reviewed-by: Dapeng Mi <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

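  For reference, the macro in question maps an APIC register offset into the x2APIC MSR space; this matches the kernel's long-standing definition:

      /*
       * x2APIC MSRs start at 0x800. Registers are 16 bytes apart in the
       * xAPIC MMIO page but one MSR apart in x2APIC, hence the >> 4.
       */
      #define APIC_BASE_MSR   0x800
      #define X2APIC_MSR(x)   (APIC_BASE_MSR + ((x) >> 4))
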
| * | KVM: SVM: Drop "always" flag from list of possible passthrough MSRs
  Sean Christopherson, 2025-06-20 (1 file: -62/+62)

  Drop the "always" flag from the array of possible passthrough MSRs, and instead manually initialize the permissions for the handful of MSRs that KVM passes through by default. In addition to cutting down on boilerplate copy+paste code and eliminating a misleading flag (the MSRs aren't always passed through, e.g. thanks to MSR filters), this will allow for removing the direct_access_msrs array entirely.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Pass through GHCB MSR if and only if VM is an SEV-ES guest
  Sean Christopherson, 2025-06-20 (2 files: -2/+3)

  Disable interception of the GHCB MSR if and only if the VM is an SEV-ES guest. While the exact behavior is completely undocumented in the APM, common sense and testing on SEV-ES capable CPUs say that accesses to the GHCB MSR from non-SEV-ES guests will #GP. I.e. from the guest's perspective, no functional change intended.

  Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Implement and adopt VMX style MSR intercepts APIs
  Sean Christopherson, 2025-06-20 (3 files: -39/+90)

  Add and use SVM MSR interception APIs (in most paths) to match VMX's APIs and nomenclature. Specifically, add SVM variants of:

    vmx_disable_intercept_for_msr(vcpu, msr, type)
    vmx_enable_intercept_for_msr(vcpu, msr, type)
    vmx_set_intercept_for_msr(vcpu, msr, type, intercept)

  to eventually replace SVM's single helper:

    set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write)

  which is awkward to use (in all cases, KVM either applies the same logic for both reads and writes, or intercepts one of read or write), and is unintuitive due to using '0' to indicate interception should be *set*.

  Keep the guts of the old API for the moment to avoid churning the MSR filter code, as that mess will be overhauled in the near future. Leave behind a temporary comment to call out that the shadow bitmaps have inverted polarity relative to the bitmaps consumed by hardware.

  No functional change intended.

  Reviewed-by: Chao Gao <[email protected]>
  Reviewed-by: Binbin Wu <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

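  The readability win in miniature; the old call follows the commit's description of set_msr_interception(), and the specific MSR is chosen purely for illustration:

      /* Old helper: '0' meant "set the intercept", which read backwards. */
      set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 0, 0);

      /* New API: intent is explicit and mirrors VMX. */
      svm_enable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
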
| * | KVM: SVM: Add helpers for accessing MSR bitmap that don't rely on offsets
  Sean Christopherson, 2025-06-20 (2 files: -16/+58)

  Add macro-built helpers for testing, setting, and clearing MSRPM entries without relying on precomputed offsets. This sets the stage for eventually removing general KVM use of precomputed offsets, which are quite confusing and rather inefficient for the vast majority of KVM's usage.

  Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed offsets and accesses is at best unnecessary, and at worst introduces extra operations to retrieve the individual bit from within the offset u32 value. And simply calling them "offsets" is very confusing, as the "unit" of the offset isn't immediately obvious.

  Use the new helpers in set_msr_interception_bitmap() and msr_write_intercepted() to verify the math and operations, but keep the existing offset-based logic in set_msr_interception_bitmap() to sanity check the "clear" and "set" operations. Manipulating MSR interceptions isn't a hot path, and no kernel release is ever expected to contain this specific version of set_msr_interception_bitmap() (it will be removed entirely in the near future).

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

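  What such a helper can look like; the MSRPM architecturally uses two adjacent bits per MSR (read, then write), and svm_msrpm_bit_nr() returning a negative error for out-of-range MSRs follows the later commit in this log. Names and details are a sketch:

      /* Each MSR occupies two adjacent MSRPM bits: read, then write. */
      #define SVM_MSRPM_READ_BIT      0
      #define SVM_MSRPM_WRITE_BIT     1

      static inline bool svm_test_msr_bitmap_write(unsigned long *msrpm, u32 msr)
      {
              int bit_nr = svm_msrpm_bit_nr(msr);     /* range base + 2 * index */

              if (bit_nr < 0)
                      return true;    /* out-of-range MSRs are always intercepted */

              return test_bit(bit_nr + SVM_MSRPM_WRITE_BIT, msrpm);
      }
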
| * | KVM: nSVM: Don't initialize vmcb02 MSRPM with vmcb01's "always passthrough"
  Sean Christopherson, 2025-06-20 (3 files: -4/+3)

  Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough" MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge vmcb01 with vmcb12 and write the result to vmcb02. This will eventually allow for the removal of svm_vcpu_init_msrpm().

  Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm() (default to intercepting all MSRs), e.g. if there is a bug lurking elsewhere, the worst-case scenario from dropping the call to svm_vcpu_init_msrpm() should be that KVM would fail to passthrough MSRs to L2.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: nSVM: Omit SEV-ES specific passthrough MSRs from L0+L1 bitmap merge
  Sean Christopherson, 2025-06-20 (1 file: -8/+1)

  Don't merge bitmaps on nested VMRUN for MSRs that KVM passes through only for SEV-ES guests. KVM doesn't support nested virtualization for SEV-ES, and likely never will.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: nSVM: Use dedicated array of MSRPM offsets to merge L0 and L1 bitmaps
  Sean Christopherson, 2025-06-20 (3 files: -11/+78)

  Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap. This will eventually allow for the removal of direct_access_msrs, as the only path where tracking the offsets is truly justified is the merge for nested SVM, where merging in chunks is an easy way to batch uaccess reads/writes.

  Opportunistically omit the x2APIC MSRs from the merge-specific array instead of filtering them out at runtime.

  Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and TSC_AUX is mutually exclusive with nested virtualization, as KVM passes through those MSRs only for SEV-ES guests, and KVM doesn't support nested virtualization for SEV+ guests. Defer removing those MSRs to a future cleanup in order to make this refactoring as benign as possible.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Clean up macros related to architectural MSRPM definitions
  Sean Christopherson, 2025-06-20 (2 files: -9/+16)

  Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding helpers that are available to SVM code, and opportunistically replace a variety of open-coded literals with (hopefully) informative macros.

  Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified with "SVM_MSRPM".

  Deliberately leave the ranges as open-coded literals, as using macros to define the ranges actually introduces more potential failure points, since both the definitions and the usage have to be careful to use the correct index. The lack of clear intent behind the ranges will be addressed in future patches.

  No functional change intended.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Massage name and param of helper that merges vmcb01 and vmcb12 MSRPMs
  Sean Christopherson, 2025-06-20 (2 files: -9/+8)

  Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better capture its role, and opportunistically feed it @vcpu instead of @svm, as grabbing "svm" only to turn around and grab svm->vcpu is rather silly.

  No functional change intended.

  Reviewed-by: Dapeng Mi <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: x86: Use non-atomic bit ops to manipulate "shadow" MSR intercepts
  Sean Christopherson, 2025-06-20 (1 file: -6/+6)

  Manipulate the MSR bitmaps using the non-atomic bit op APIs (two underscores), as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is held.

  Reviewed-by: Binbin Wu <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

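  The pattern, illustrated; the shadow bitmap field name follows the pre-removal SVM code and is shown for flavor only:

      /*
       * vcpu->mutex is held, so the cheaper non-atomic ops are safe:
       * no LOCK prefix, and no concurrent writers are possible.
       */
      __set_bit(slot, svm->shadow_msr_intercept.read);
      __clear_bit(slot, svm->shadow_msr_intercept.write);
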
| * | KVM: SVM: Kill the VM instead of the host if MSR interception is buggy
  Sean Christopherson, 2025-06-20 (1 file: -4/+6)

  WARN and kill the VM instead of panicking the host if KVM attempts to set or query MSR interception for an unsupported MSR. Accessing the MSR interception bitmaps only meaningfully affects post-VMRUN behavior, and KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN, i.e. there is no need to panic the entire host.

  Opportunistically move the sanity checks above their use to index into the MSRPM, e.g. so that bugs only WARN and terminate the VM, as opposed to doing that _and_ generating an out-of-bounds load.

  Reviewed-by: Dapeng Mi <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

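  The guard, roughly sketched:

      int bit_nr = svm_msrpm_bit_nr(msr);

      /*
       * Terminate just this VM on a bad index: KVM_BUG_ON() WARNs and
       * blocks further VMRUNs, and checking *before* indexing avoids an
       * out-of-bounds load on top of the WARN.
       */
      if (KVM_BUG_ON(bit_nr < 0, vcpu->kvm))
              return;
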
| * | KVM: SVM: Use ARRAY_SIZE() to iterate over direct_access_msrs
  Sean Christopherson, 2025-06-20 (2 files: -8/+12)

  Drop the unnecessary and dangerous value-terminated behavior of direct_access_msrs, and simply iterate over the actual size of the array. The use in svm_set_x2apic_msr_interception() is especially sketchy, as it relies on unused capacity being zero-initialized, and on '0' being outside the range of x2APIC MSRs.

  To ensure the array and shadow_msr_intercept stay synchronized, simply assert that their sizes are identical (note the six 64-bit-only MSRs).

  Note, direct_access_msrs will soon be removed entirely; keeping the assert synchronized with the array isn't expected to be a long-term maintenance burden.

  Reviewed-by: Chao Gao <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Tag MSR bitmap initialization helpers with __init
  Sean Christopherson, 2025-06-20 (1 file: -2/+2)

  Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used only during hardware setup to map potential passthrough MSRs to offsets in the bitmap.

  Reviewed-by: Chao Gao <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Don't BUG if setting up the MSR intercept bitmaps fails
  Sean Christopherson, 2025-06-20 (1 file: -12/+13)

  WARN and reject module loading if there is a problem with KVM's MSR interception bitmaps. Panicking the host in this situation is inexcusable, since it is trivially easy to propagate the error up the stack.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Allocate IOPM pages after initial setup in svm_hardware_setup()
  Sean Christopherson, 2025-06-20 (1 file: -9/+9)

  Allocate pages for the IOPM after initial setup has been completed in svm_hardware_setup(), so that sanity checks can be added in the setup flow without needing to free the IOPM pages. The IOPM is only referenced (via iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need to allocate it early on.

  No functional change intended (beyond the obvious ordering differences, e.g. if the allocation fails).

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: SVM: Disable interception of SPEC_CTRL iff the MSR exists for the guest
  Sean Christopherson, 2025-06-20 (1 file: -3/+6)

  Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context switches) SPEC_CTRL, if and only if the MSR exists according to the vCPU's CPUID model. Letting the guest access SPEC_CTRL is generally benign, but the guest would see inconsistent behavior if KVM happened to emulate an access to the MSR.

  Fixes: d00b99c514b3 ("KVM: SVM: Add support for Virtual SPEC_CTRL")
  Reported-by: Chao Gao <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

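  The condition, sketched with the interception API introduced earlier in this series; the CPUID-capability helper name is an assumption:

      /*
       * Pass SPEC_CTRL through only if hardware context-switches it AND
       * the MSR exists according to the vCPU's CPUID model.
       */
      if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
              svm_set_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW,
                                        !guest_cpu_cap_has(vcpu, X86_FEATURE_SPEC_CTRL));
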
| * | KVM: x86: Drop kvm_x86_ops.set_dr6() in favor of a new KVM_RUN flag
  Sean Christopherson, 2025-06-20 (1 file: -4/+6)

  Instruct vendor code to load the guest's DR6 into hardware via a new KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to load vcpu->arch.dr6 into hardware when DR6 can be read/written directly by the guest.

  Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM thinks DR6 needs to be reloaded. TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6.

  Cc: [email protected]
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

| * | KVM: x86: Convert vcpu_run()'s immediate exit param into a generic bitmap
  Sean Christopherson, 2025-06-20 (1 file: -2/+2)

  Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter into a generic bitmap so that similar "take action" information can be passed to vendor code without creating a pile of boolean parameters. This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and will also allow for adding similar functionality for re-loading DEBUGCTL in the active VMCS.

  Opportunistically massage the TDX WARN and comment to prepare for adding more run_flags, all of which are expected to be mutually exclusive with TDX, i.e. should be WARNed on.

  No functional change intended.

  Cc: [email protected]
  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

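  A sketch of the flag-based parameter; beyond KVM_RUN_LOAD_GUEST_DR6, which the previous entry names, the flag values and signature are assumptions:

      /* "Take action" bits passed to kvm_x86_ops.vcpu_run(). */
      #define KVM_RUN_FORCE_IMMEDIATE_EXIT    BIT(0)
      #define KVM_RUN_LOAD_GUEST_DR6          BIT(1)

      /* In the vendor vcpu_run(), e.g. svm_vcpu_run(vcpu, run_flags): */
      if (run_flags & KVM_RUN_LOAD_GUEST_DR6)
              svm_set_dr6(vcpu, vcpu->arch.dr6);      /* illustrative helper */
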
* | Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEAD
  Paolo Bonzini, 2025-07-29 (3 files: -399/+325)

  KVM IRQ changes for 6.17

  - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve, as it likely requires new uAPI.

  - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *", to avoid making a simple concept unnecessarily difficult to understand.

  - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time.

  - Drop x86's irq_comm.c, and move a pile of IRQ-related code into irq.c.

  - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.

  - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation.

  - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry.

  - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs.

  - Dedup x86's device posted IRQ code, as the vast majority of the functionality can be shared verbatim between SVM and VMX.

  - Harden the device posted IRQ code against bugs and runtime errors.

  - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n).

  - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU.

  - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs.

  - Clean up and document/comment the irqfd assignment code.

  - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd throughout the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique.

| * | KVM: SVM: Generate GA log IRQs only if the associated vCPU is blocking
  Sean Christopherson, 2025-06-23 (1 file: -12/+51)

  Configure IRTEs to generate GA log interrupts for device posted IRQs that hit non-running vCPUs if and only if the target vCPU is blocking, i.e. actually needs a wake event. If the vCPU has exited to userspace or was preempted, generating GA log entries and interrupts is wasteful and unnecessary, as the vCPU will be re-loaded and/or scheduled back in irrespective of the GA log notification (avic_ga_log_notifier() is just a fancy wrapper for kvm_vcpu_wake_up()).

  Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to track whether or not the vCPU's associated IRTEs are configured to generate GA logs, but only set the synthetic bit in KVM's "cache", i.e. never set the should-be-zero bit in tables that are used by hardware.

  Use a synthetic bit instead of a dedicated boolean to minimize the odds of messing up the locking, i.e. so that all the existing rules that apply to avic_physical_id_entry for IS_RUNNING are reused verbatim for GA_LOG_INTR.

  Note, because KVM (by design) "puts" AVIC state in a "pre-blocking" phase, using kvm_vcpu_is_blocking() to track the need for notifications isn't a viable option.

  Link: https://lore.kernel.org/r/[email protected]
  Signed-off-by: Sean Christopherson <[email protected]>

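  A sketch of the synthetic-bit bookkeeping; the bit position is our assumption, while the invariant that the bit never reaches hardware-visible tables comes from the commit:

      /* Should-be-zero bit, used ONLY in KVM's cached copy of the entry. */
      #define AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR      BIT_ULL(61)

      static void avic_update_ga_log_intr(struct vcpu_svm *svm, bool ga_log_intr)
      {
              /*
               * Never written to the table hardware consumes; the cached
               * entry piggybacks on the existing IS_RUNNING locking rules.
               */
              if (ga_log_intr)
                      svm->avic_physical_id_entry |= AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
              else
                      svm->avic_physical_id_entry &= ~AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
      }
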