Attempting to boot the default 'bsd' kernel of OpenBSD 4.1 i386 in a guest
fails early in the kernel init inside p3_get_bus_clock while trying to read
the IA32_EBL_CR_POWERON MSR. KVM logs an 'unhandled MSR' message and the
guest kernel faults.
This patch is sufficient to allow OpenBSD to boot, after which it seems to
run fine. I'm not sure if this is the correct solution for dealing with
this particular MSR, but it works for me.
Signed-off-by: Matthew Gregan <kinetik@flim.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Everyone owns a piece of the exception bitmap, but they happily write to
the entire thing like there's no tomorrow. Centralize handling in
update_exception_bitmap() and have everyone call that.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The lightweight vmexit path avoids saving and reloading certain host
state. However in certain cases lightweight vmexit handling can schedule()
which requires reloading the host state.
So we store the host state in the vcpu structure, and reloaded it if we
relinquish the vcpu.
Signed-off-by: Avi Kivity <avi@qumranet.com>
A typical demand page/copy on write pattern is:
- page fault on vaddr
- kvm propagates fault to guest
- guest handles fault, updates pte
- kvm traps write, clears shadow pte, resumes guest
- guest returns to userspace, re-faults on same vaddr
- kvm installs shadow pte, resumes guest
- guest continues
So, three vmexits for a single guest page fault. But if instead of clearing
the page table entry, we update to correspond to the value that the guest
has just written, we eliminate the third vmexit.
This patch does exactly that, reducing kbuild time by about 10%.
Signed-off-by: Avi Kivity <avi@qumranet.com>
When a guest writes to a page that has an mmu shadow, we have to clear
the shadow pte corresponding to the memory location touched by the guest.
Now, in nonpae mode, a single guest page may have two or four shadow
pages (because a nonpae page maps 4MB or 4GB, whereas the pae shadow maps
2MB or 1GB), so we when we look up the page we find up to three additional
aliases for the page. Since we _clear_ the shadow pte, it doesn't matter
except for a slight performance penalty, but if we want to _update_ the
shadow pte instead of clearing it, it is vital that we don't modify the
aliases.
Fortunately, exactly which page is needed (the "quadrant") is easily
computed, and is accessible in the shadow page header. All we need is
to ignore shadow pages from the wrong quadrants.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of calling two functions and repeating expensive checks, call one
function and provide it with before/after information.
Signed-off-by: Avi Kivity <avi@qumranet.com>
i386 wants fs for accessing the pda even on a lightweight exit, so ensure
we can always restore it. This fixes a regression on i386 introduced by
the lightweight vmexit patch.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The kvm mmu tries to detects forks by looking for repeated writes to a
page table. If it sees a fork, it unshadows the page table so the page
table copying can proceed at native speed instead of being emulated.
However, the detector also triggered on simple demand paging access patterns:
a linear walk of memory would of course cause repeated writes to the same
pagetable page, causing it to unshadow prematurely.
Fix by resetting the fork detector if we detect a demand fault.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Many msrs and the like will only be used by the host if we schedule() or
return to userspace. Therefore, we avoid saving them if we handle the
exit within the kernel, and if a reschedule is not requested.
Based on a patch from Eddie Dong <eddie.dong@intel.com> with a couple of
fixes by me.
Signed-off-by: Yaozu(Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This allows us to remove write protection earlier than otherwise. Should
some mad OS choose to use byte writes to update pagetables, it will suffer
a performance hit, but still work correctly.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The PC debug port is used for IO delay and does not require emulation.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch enables IO bitmaps control on vmx and unmask the 0x80 port to
avoid VMEXITs caused by accessing port 0x80. 0x80 is used as delays (see
include/asm/io.h), and handling VMEXITs on its access is unnecessary but
slows things down. This patch improves kernel build test at around
3%~5%.
Because every VM uses the same io bitmap, it is shared between
all VMs rather than a per-VM data structure.
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The lazy fpu changes did not take into account that some vmexit handlers
can sleep. Move loading the guest state into the inner loop so that it
can be reloaded if necessary, and move loading the host state into
vmx_vcpu_put() so it can be performed whenever we relinquish the vcpu.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Fix following section mismatch warning in kvm-intel.o:
WARNING: o-i386/drivers/kvm/kvm-intel.o(.init.text+0xbd): Section mismatch: reference to .exit.text: (between 'hardware_setup' and 'vmx_disabled_by_bios')
The function free_kvm_area is used in the function alloc_kvm_area which
is marked __init.
The __exit area is discarded by some archs during link-time if a
module is built-in resulting in an oops.
Note: This warning is only seen by my local copy of modpost
but the change will soon hit upstream.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
First thing mm.h does is including sched.h solely for can_do_mlock() inline
function which has "current" dereference inside. By dealing with can_do_mlock()
mm.h can be detached from sched.h which is good. See below, why.
This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
getting them indirectly
Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being dependency for significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Refine some depends statements to limit their visibility to the
environments that are actually supported.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is illegal not to return from a pio or mmio request without completing
it, as mmio or pio is an atomic operation. Therefore, we can simplify
the userspace interface by avoiding the completion indication.
Signed-off-by: Avi Kivity <avi@qumranet.com>
When emulating an mmio read, we actually emulate twice: once to determine
the physical address of the mmio, and, after we've exited to userspace to
get the mmio value, we emulate again to place the value in the result
register and update any flags.
But we don't really need to enter the guest again for that, only to take
an immediate vmexit. So, if we detect that we're doing an mmio read,
emulate a single instruction before entering the guest again.
Signed-off-by: Avi Kivity <avi@qumranet.com>
We only have to save/restore MSR_GS_BASE on every VMEXIT. The rest can be
saved/restored when we leave the VCPU. Since we don't emulate the DEBUGCTL
MSRs and the guest cannot write to them, we don't have to worry about
saving/restoring them at all.
This shaves a whopping 40% off raw vmexit costs on AMD.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
It might have worked in this case since PT_PRESENT_MASK is 1, but let's
express this correctly.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Only save/restore the FPU host state when the guest is actually using the
FPU.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Set all of the host mask bits for CR0 so that we can maintain a proper
shadow of CR0. This exposes CR0.TS, paving the way for lazy fpu handling.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Avoid saving and restoring the guest fpu state on every exit. This
shaves ~100 cycles off the guest/host switch.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Make the exit statistics per-vcpu instead of global. This gives a 3.5%
boost when running one virtual machine per core on my two socket dual core
(4 cores total) machine.
Signed-off-by: Avi Kivity <avi@qumranet.com>
By checking if a reschedule is needed, we avoid dropping the vcpu.
[With changes by me, based on Anthony Liguori's observations]
Signed-off-by: Avi Kivity <avi@qumranet.com>
Intel hosts only support syscall/sysret in long more (and only if efer.sce
is enabled), so only reload the related MSR_K6_STAR if the guest will
actually be able to use it.
This reduces vmexit cost by about 500 cycles (6400 -> 5870) on my setup.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Some msrs are only used by x86_64 instructions, and are therefore
not needed when the guest is legacy mode. By not bothering to switch
them, we reduce vmexit latency by 2400 cycles (from about 8800) when
running a 32-bt guest on a 64-bit host.
Signed-off-by: Avi Kivity <avi@qumranet.com>
THe automatically switched msrs are never changed on the host (with
the exception of MSR_KERNEL_GS_BASE) and thus there is no need to save
them on every vm entry.
This reduces vmexit latency by ~400 cycles on i386 and by ~900 cycles (10%)
on x86_64.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Usually, guest page faults are detected by the kvm page fault handler,
which detects if they are shadow faults, mmio faults, pagetable faults,
or normal guest page faults.
However, in ceratin circumstances, we can detect a page fault much later.
One of these events is the following combination:
- A two memory operand instruction (e.g. movsb) is executed.
- The first operand is in mmio space (which is the fault reported to kvm)
- The second operand is in an ummaped address (e.g. a guest page fault)
The Windows 2000 installer does such an access, an promptly hangs. Fix
by adding the missing page fault injection on that path.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Some guests (Solaris) do not set up all four pdptrs, but leave some invalid.
kvm incorrectly treated these as valid page directories, pinning the
wrong pages and causing general confusion.
Fix by checking the valid bit of a pae pdpte. This closes sourceforge bug
1698922.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Solaris panics if it sees a cpu with no fpu, and it seems to rely on this
bit. Closes sourceforge bug 1698920.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The expression
sp - 6 < sp
where sp is a u16 is undefined in C since 'sp - 6' is promoted to int,
and signed overflow is undefined in C. gcc 4.2 actually warns about it.
Replace with a simpler test.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch enables the virtualization of the last branch record MSRs on
SVM if this feature is available in hardware. It also introduces a small
and simple check feature for specific SVM extensions.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
With this, we can specify that accesses to one physical memory range will
be remapped to another. This is useful for the vga window at 0xa0000 which
is used as a movable window into the (much larger) framebuffer.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Mapping a guest page to a host page is a common operation. Currently,
one has first to find the memory slot where the page belongs (gfn_to_memslot),
then locate the page itself (gfn_to_page()).
This is clumsy, and also won't work well with memory aliases. So simplify
gfn_to_page() not to require memory slot translation first, and instead do it
internally.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Functions that play around with the physical memory map
need a way to clear mappings to possibly nonexistent or
invalid memory. Both the mmu cache and the processor tlb
are cleared.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
On x86, bit operations operate on a string of bits that can reside in
multiple words. For example, 'btsl %eax, (blah)' will touch the word
at blah+4 if %eax is between 32 and 63.
The x86 emulator compensates for that by advancing the operand address
by (bit offset / BITS_PER_LONG) and truncating the bit offset to the
range (0..BITS_PER_LONG-1). This has a side effect of forcing the operand
size to 8 bytes on 64-bit hosts.
Now, a 32-bit guest goes and fork()s a process. It write protects a stack
page at 0xbffff000 using the 'btr' instruction, at offset 0xffc in the page
table, with bit offset 1 (for the write permission bit).
The emulator now forces the operand size to 8 bytes as previously described,
and an innocent page table update turns into a cross-page-boundary write,
which is assumed by the mmu code not to be a page table, so it doesn't
actually clear the corresponding shadow page table entry. The guest and
host permissions are out of sync and guest memory is corrupted soon
afterwards, leading to guest failure.
Fix by not using BITS_PER_LONG as the word size; instead use the actual
operand size, so we get a 32-bit write in that case.
Note we still have to teach the mmu to handle cross-page-boundary writes
to guest page table; but for now this allows Damn Small Linux 0.4 (2.4.20)
to boot.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Remove unused function
CC drivers/kvm/svm.o
drivers/kvm/svm.c:207: warning: ‘inject_db’ defined but not used
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
When a vcpu is migrated from one cpu to another, its timestamp counter
may lose its monotonic property if the host has unsynced timestamp counters.
This can confuse the guest, sometimes to the point of refusing to boot.
As the rdtsc instruction is rather fast on AMD processors (7-10 cycles),
we can simply record the last host tsc when we drop the cpu, and adjust
the vcpu tsc offset when we detect that we've migrated to a different cpu.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The kvm mmu keeps a shadow page for hugepage pdes; if several such pdes map
the same physical address, they share the same shadow page. This is a fairly
common case (kernel mappings on i386 nonpae Linux, for example).
However, if the two pdes map the same memory but with different permissions, kvm
will happily use the cached shadow page. If the access through the more
permissive pde will occur after the access to the strict pde, an endless pagefault
loop will be generated and the guest will make no progress.
Fix by making the access permissions part of the cache lookup key.
The fix allows Xen pae to boot on kvm and run guest domains.
Thanks to Jeremy Fitzhardinge for reporting the bug and testing the fix.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch forbids the guest to execute monitor/mwait instructions on
SVM. This is necessary because the guest can execute these instructions
if they are available even if the kvm cpuid doesn't report its
existence.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Some older (~2.6.7) kernels write MCG_STATUS register during kernel
boot (mce_clear_all() function, called from mce_init()). It's not
currently handled by kvm and will cause it to inject a GPF.
Following patch adds a "nop" handler for this.
Signed-off-by: Sergey Kiselev <sergey.kiselev@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
As usual, we need to mangle segment registers when emulating real mode
as vm86 has specific constraints. We special case the reset segment base,
and set the "access rights" (or descriptor flags) to vm86 comaptible values.
This fixes reboot on vmx.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The SET_SREGS ioctl modifies both cr0.pe (real mode/protected mode) and
guest segment registers. Since segment handling is modified by the mode on
Intel procesors, update the segment registers after the mode switch has taken
place.
Signed-off-by: Avi Kivity <avi@qumranet.com>
set_cr0_no_modeswitch() was a hack to avoid corrupting segment registers.
As we now cache the protected mode values on entry to real mode, this
isn't an issue anymore, and it interferes with reboot (which usually _is_
a modeswitch).
Signed-off-by: Avi Kivity <avi@qumranet.com>
The reset state has cs.selector == 0xf000 and cs.base == 0xffff0000,
which aren't compatible with vm86 mode, which is used for real mode
virtualization.
When we create a vcpu, we set cs.base to 0xf0000, but if we get there by
way of a reset, the values are inconsistent and vmx refuses to enter
guest mode.
Workaround by detecting the state and munging it appropriately.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The initial, noncaching, version of the kvm mmu flushed the all nonglobal
shadow page table translations (much like a native tlb flush). The new
implementation flushes translations only when they change, rendering global
pte tracking superfluous.
This removes the unused tracking mechanism and storage space.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The current string pio interface communicates using guest virtual addresses,
relying on userspace to translate addresses and to check permissions. This
interface cannot fully support guest smp, as the check needs to take into
account two pages at one in case an unaligned string transfer straddles a
page boundary.
Change the interface not to communicate guest addresses at all; instead use
a buffer page (mmaped by userspace) and do transfers there. The kernel
manages the virtual to physical translation and can perform the checks
atomically by taking the appropriate locks.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Some ioctls ignore their arguments. By requiring them to be zero now,
we allow a nonzero value to have some special meaning in the future.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This allows us to store offsets in the kernel/user kvm_run area, and be
sure that userspace has them mapped. As offsets can be outside the
kvm_run struct, userspace has no way of knowing how much to mmap.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Allow a special signal mask to be used while executing in guest mode. This
allows signals to be used to interrupt a vcpu without requiring signal
delivery to a userspace handler, which is quite expensive. Userspace still
receives -EINTR and can get the signal via sigwait().
Signed-off-by: Avi Kivity <avi@qumranet.com>
This is redundant, as we also return -EINTR from the ioctl, but it
allows us to examine the exit_reason field on resume without seeing
old data.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently, userspace is told about the nature of the last exit from the
guest using two fields, exit_type and exit_reason, where exit_type has
just two enumerations (and no need for more). So fold exit_type into
exit_reason, reducing the complexity of determining what really happened.
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM used to handle cpuid by letting userspace decide what values to
return to the guest. We now handle cpuid completely in the kernel. We
still let userspace decide which values the guest will see by having
userspace set up the value table beforehand (this is necessary to allow
management software to set the cpu features to the least common denominator,
so that live migration can work).
The motivation for the change is that kvm kernel code can be impacted by
cpuid features, for example the x86 emulator.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently when passing the a PIO emulation request to userspace, we
rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx
(on string instructions). This (a) requires two extra ioctls for getting
and setting the registers and (b) is unfriendly to non-x86 archs, when
they get kvm ports.
So fix by doing the register fixups in the kernel and passing to userspace
only an abstract description of the PIO to be done.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of passing a 'struct kvm_run' back and forth between the kernel and
userspace, allocate a page and allow the user to mmap() it. This reduces
needless copying and makes the interface expandable by providing lots of
free space.
Signed-off-by: Avi Kivity <avi@qumranet.com>
When auditing a 32-bit guest on a 64-bit host, sign extension of the page
table directory pointer table index caused bogus addresses to be shown on
audit errors.
Fix by declaring the index unsigned.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of twiddling the rip registers directly, use the
skip_emulated_instruction() function to do that for us.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The hypercall code mixes up the ->cache_regs() and ->decache_regs()
callbacks, resulting in guest register corruption.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Nonpae guest pdes are shadowed by two pae ptes, so we double the offset
twice: once to account for the pte size difference, and once because we
need to shadow pdes for a single guest pde.
But when writing to the upper guest pde we also need to truncate the
lower bits, otherwise the multiply shifts these bits into the pde index
and causes an access to the wrong shadow pde. If we're at the end of the
page (accessing the very last guest pde) we can even overflow into the
next host page and oops.
Signed-off-by: Avi Kivity <avi@qumranet.com>
failed VM entry on VMX might still change %fs or %gs, thus make sure
that KVM always reloads the segment selectors. This is crutial on both
x86 and x86_64: x86 has __KERNEL_PDA in %fs on which things like
'current' depends and x86_64 has 0 there and needs MSR_GS_BASE to work.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel virtualization extensions do not support virtualizing real mode. So
kvm uses virtualized vm86 mode to run real mode code. Unfortunately, this
virtualized vm86 mode does not support the so called "big real" mode, where
the segment selector and base do not agree with each other according to the
real mode rules (base == selector << 4).
To work around this, kvm checks whether a selector/base pair violates the
virtualized vm86 rules, and if so, forces it into conformance. On a
transition back to protected mode, if we see that the guest did not touch
a forced segment, we restore it back to the original protected mode value.
This pile of hacks breaks down if the gdt has changed in real mode, as it
can cause a segment selector to point to a system descriptor instead of a
normal data segment. In fact, this happens with the Windows bootloader
and the qemu acpi bios, where a protected mode memcpy routine issues an
innocent 'pop %es' and traps on an attempt to load a system descriptor.
"Fix" by checking if the to-be-restored selector points at a system segment,
and if so, coercing it into a normal data segment. The long term solution,
of course, is to abandon vm86 mode and use emulation for big real mode.
Signed-off-by: Avi Kivity <avi@qumranet.com>
PAGE_MASK is an unsigned long, so using it to mask physical addresses on
i386 (which are 64-bit wide) leads to truncation. This can result in
page->private of unrelated memory pages being modified, with disasterous
results.
Fix by not using PAGE_MASK for physical addresses; instead calculate
the correct value directly from PAGE_SIZE. Also fix a similar BUG_ON().
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM shadow page tables are always in pae mode, regardless of the guest
setting. This means that a guest pde (mapping 4MB of memory) is mapped
to two shadow pdes (mapping 2MB each).
When the guest writes to a pte or pde, we intercept the write and emulate it.
We also remove any shadowed mappings corresponding to the write. Since the
mmu did not account for the doubling in the number of pdes, it removed the
wrong entry, resulting in a mismatch between shadow page tables and guest
page tables, followed shortly by guest memory corruption.
This patch fixes the problem by detecting the special case of writing to
a non-pae pde and adjusting the address and number of shadow pdes zapped
accordingly.
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The vmx code currently treats the guest's sysenter support msrs as 32-bit
values, which breaks 32-bit compat mode userspace on 64-bit guests. Fix by
using the native word width of the machine.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Otherwise, the core module thinks the arch module is loaded, and won't
let you reload it after you've fixed the bug.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Use the standard magic.h for kvmfs.
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
A bogus 'return r' can cause an otherwise successful module load to fail.
This both denies users the use of kvm, and it also denies them the use of
their machine, as it leaves a filesystem registered with its callbacks
pointing into now-freed module memory.
Fix by returning a zero like a good module.
Thanks to Richard Lucassen <mailinglists@lucassen.org> (?) for reporting
the problem and for providing access to a machine which exhibited it.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Enabling dirty page logging is done using KVM_SET_MEMORY_REGION ioctl.
If the memory region already exists, we need to remove write accesses,
so writes will be caught, and dirty pages will be logged.
Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Since dirty_bitmap is an unsigned long array, the alignment and size need
to take that into account.
Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
A few places where we modify guest memory fail to call mark_page_dirty(),
causing live migration to fail. This adds the missing calls.
Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Allocate a distinct inode for every vcpu in a VM. This has the following
benefits:
- the filp cachelines are no longer bounced when f_count is incremented on
every ioctl()
- the API and internal code are distinctly clearer; for example, on the
KVM_GET_REGS ioctl, there is no need to copy the vcpu number from
userspace and then copy the registers back; the vcpu identity is derived
from the fd used to make the call
Right now the performance benefits are completely theoretical since (a) we
don't support more than one vcpu per VM and (b) virtualization hardware
inefficiencies completely everwhelm any cacheline bouncing effects. But
both of these will change, and we need to prepare the API today.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This reflects the changed scope, from device-wide to single vm (previously
every device open created a virtual machine).
Signed-off-by: Avi Kivity <avi@qumranet.com>
This avoids having filp->f_op and the corresponding inode->i_fop different,
which is a little unorthodox.
The ioctl list is split into two: global kvm ioctls and per-vm ioctls. A new
ioctl, KVM_CREATE_VM, is used to create VMs and return the VM fd.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The kvmfs inodes will represent virtual machines and vcpus, as necessary,
reducing cacheline bouncing due to inodes and filps being shared.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch changes the SVM code to intercept SMIs and handle it
outside the guest.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This adds a special MSR based hypercall API to KVM. This is to be
used by paravirtual kernels and virtual drivers.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Besides using an established api, this allows using kvm in older kernels.
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The whole thing is rotten, but this allows vmx to boot with the guest reboot
fix.
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
We fail to mark a page dirty in three cases:
- setting the accessed bit in a pte
- setting the dirty bit in a pte
- emulating a write into a pagetable
This fix adds the missing cases.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Convert the PDA code to use %fs rather than %gs as the segment for
per-processor data. This is because some processors show a small but
measurable performance gain for reloading a NULL segment selector (as %fs
generally is in user-space) versus a non-NULL one (as %gs generally is).
On modern processors the difference is very small, perhaps undetectable.
Some old AMD "K6 3D+" processors are noticably slower when %fs is used
rather than %gs; I have no idea why this might be, but I think they're
sufficiently rare that it doesn't matter much.
This patch also fixes the math emulator, which had not been adjusted to
match the changed struct pt_regs.
[frederik.deweerdt@gmail.com: fixit with gdb]
[mingo@elte.hu: Fix KVM too]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Ian Campbell <Ian.Campbell@XenSource.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Zachary Amsden <zach@vmware.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Add the necessary callbacks to suspend and resume a host running kvm. This is
just a repeat of the cpu hotplug/unplug work.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On hotplug, we execute the hardware extension enable sequence. On unplug, we
decache any vcpus that last ran on the exiting cpu, and execute the hardware
extension disable sequence.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Like the inline code it replaces, this function decaches the vmcs from the cpu
it last executed on. in addition:
- vcpu_clear() works if the last cpu is also the cpu we're running on
- it is faster on larger smps by virtue of using smp_call_function_single()
Includes fix from Ingo Molnar.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will allow us to iterate over all vcpus and see which cpus they are
running on.
[akpm@osdl.org: use standard (ugly) initialisers]
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vcpu_load() can return NULL and it sometimes does in failure paths (for
example when the userspace ABI version is too old) - causing a preemption
count underflow in the ->vcpu_free() later on. So check for NULL.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Or 32-bit userspace will get confused.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We report the value of cr8 to userspace on an exit. Also let userspace change
cr8 when we re-enter the guest. The lets 64-bit guest code maintain the tpr
correctly.
Thanks for Yaniv Kamay for the idea.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows us to run the mmu testsuite on amd.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kvm mmu relies on cr0.wp being set even if the guest does not set it. The
vmx code correctly forces cr0.wp at all times, the svm code does not, so it
can't boot solaris without this patch.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just like svm.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gva_to_gpa() needs to be updated to the new walk_addr() calling convention,
otherwise it may oops under some circumstances.
Use the opportunity to remove all the code duplication in gva_to_gpa(), which
essentially repeats the calculations in walk_addr().
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Forms like "0(%rsp)" generate an instruction with an unnecessary one byte
displacement under certain circumstances. replace with the equivalent
"(%rsp)".
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Intel hosts, without long mode, and with nx support disabled in the bios
have an efer that is readable but not writable. This causes a lockup on
switch to guest mode (even though it should exit with reason 34 according
to the documentation).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix what looks like an obvious typo in the file drivers/kvm/svm.c.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements forwarding of SHUTDOWN intercepts from the guest on to
userspace on AMD SVM. A SHUTDOWN event occurs when the guest produces a
triple fault (e.g. on reboot). This also fixes the bug that a guest reboot
actually causes a host reboot under some circumstances.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the recent guest page fault change, we perform access checks on our
own instead of relying on the cpu. This means we have to perform the nx
checks as well.
Software like the google toolbar on windows appears to rely on this
somehow.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Check pte permission bits in walk_addr(), instead of scattering the checks all
over the code. This has the following benefits:
1. We no longer set the accessed bit for accessed which fail permission checks.
2. Setting the accessed bit is simplified.
3. Under some circumstances, we used to pretend a page fault was fixed when
it would actually fail the access checks. This caused an unnecessary
vmexit.
4. The error code for guest page faults is now correct.
The fix helps netbsd further along booting, and allows kvm to pass the new mmu
testsuite.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows netbsd 3.1 i386 to get further along installing.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's an obvious typo in svm_{get,set}_idt, causing it to access the ldt
instead.
Because these functions are only called for save/load on AMD, the bug does not
impact normal operation. With the fix, save/load works as expected on AMD
hosts.
Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a page is marked as dirty in the guest pte, set_pte_common() can set the
writable bit on newly-instantiated shadow pte. This optimization avoids
a write fault after the initial read fault.
However, if a write fault instantiates the pte, fix_write_pf() incorrectly
reports the fault as a guest page fault, and the guest oopses on what appears
to be a correctly-mapped page.
Fix is to detect the condition and only report a guest page fault on a user
access to a kernel page.
With the fix, a kvm guest can survive a whole night of running the kernel
hacker's screensaver (make -j9 in a loop).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The various bit string instructions (bts, btc, etc.) fail to adjust the
address correctly if the bit address is beyond BITS_PER_LONG.
This bug creeped in as the emulator originally relied on cr2 to contain the
memory address; however we now decode it from the mod r/m bits, and must
adjust the offset to account for large bit indices.
The patch is rather large because it switches src and dst decoding around, so
that the bit index is available when decoding the memory address.
This fixes workloads like the FC5 installer.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kvm mmio read path looks like:
1. guest read faults
2. kvm emulates read, calls emulator_read_emulated()
3. fails as a read requires userspace help
4. exit to userspace
5. userspace emulates read, kvm sets vcpu->mmio_read_completed
6. re-enter guest, fault again
7. kvm emulates read, calls emulator_read_emulated()
8. succeeds as vcpu->mmio_read_emulated is set
9. instruction completes and guest is resumed
A problem surfaces if the userspace exit (step 5) also requests an interrupt
injection. In that case, the guest does not re-execute the original
instruction, but the interrupt handler. The next time an mmio read is
exectued (likely for a different address), step 3 will find
vcpu->mmio_read_completed set and return the value read for the original
instruction.
The problem manifested itself in a few annoying ways:
- little squares appear randomly on console when switching virtual terminals
- ne2000 fails under nfs read load
- rtl8139 complains about "pci errors" even though the device model is
incapable of issuing them.
Fix by skipping interrupt injection if an mmio read is pending.
A better fix is to avoid re-entry into the guest, and re-emulating immediately
instead. However that's a bit more complex.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes the vmwrite errors on vm shutdown go away.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both "=r" and "=g" breaks my build on i386:
$ make
CC [M] drivers/kvm/vmx.o
{standard input}: Assembler messages:
{standard input}:3318: Error: bad register name `%sil'
make[1]: *** [drivers/kvm/vmx.o] Error 1
make: *** [_module_drivers/kvm] Error 2
The reason is that setbe requires an 8-bit register but "=r" does not
constrain the target register to be one that has an 8-bit version on
i386.
According to
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10153
the correct constraint is "=q".
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No need to test for rflags.if as both VT and SVM specs assure us that on exit
caused from interrupt window opening, 'if' is set.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Prevent the guest's loading of a corrupt cr3 (pointing at no guest phsyical
page) from crashing the host.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If we emulate a write, we fail to set the dirty bit on the guest pte, leading
the guest to believe the page is clean, and thus lose data. Bad.
Fix by setting the guest pte dirty bit under such conditions.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It overwrites the right cr3 set from mmu setup. Happens only with the test
harness.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fixes oops on early close of /dev/kvm.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This will allow us to see the root cause when a vmwrite error happens.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If we reduce permissions on a pte, we must flush the cached copy of the pte
from the guest's tlb.
This is implemented at the moment by flushing the entire guest tlb, and can be
improved by flushing just the relevant virtual address, if it is known.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The mmu sometimes needs memory for reverse mapping and parent pte chains.
however, we can't allocate from within the mmu because of the atomic context.
So, move the allocations to a central place that can be executed before the
main mmu machinery, where we can bail out on failure before any damage is
done.
(error handling is deffered for now, but the basic structure is there)
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Because mmu pages have attached rmap and parent pte chain structures, we need
to zap them before freeing so the attached structures are freed.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
cmpxchg8b uses edx:eax as the compare operand, not edi:eax.
cmpxchg8b is used by 32-bit pae guests to set page table entries atomically,
and this is emulated touching shadowed guest page tables.
Also, implement it for 32-bit hosts.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We always need cr3 to point to something valid, so if we detect that we're
freeing a root page, simply push it back to the top of the active list.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In fork() (or when we protect a page that is no longer a page table), we can
experience floods of writes to a page, which have to be emulated. This is
expensive.
So, if we detect such a flood, zap the page so subsequent writes can proceed
natively.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A misaligned access affects two shadow ptes instead of just one.
Since a misaligned access is unlikely to occur on a real page table, just zap
the page out of existence, avoiding further trouble.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since we write protect shadowed guest page tables, there is no need to trap
page invalidations (the guest will always change the mapping before issuing
the invlpg instruction).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When beginning to process a page fault, make sure we have enough shadow pages
available to service the fault. If not, free some pages.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... and so must not free it unconditionally.
Move the freeing to kvm_mmu_zap_page().
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When removing a page table, we must maintain the parent_pte field all child
shadow page tables.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A page table may have been recycled into a regular page, and so any
instruction can be executed on it. Unprotect the page and let the cpu do its
thing.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Iterate over all shadow pages which correspond to a the given guest page table
and remove the mappings.
A subsequent page fault will reestablish the new mapping.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As the mmu write protects guest page table, we emulate those writes. Since
they are not mmio, there is no need to go to userspace to perform them.
So, perform the writes in the kernel if possible, and notify the mmu about
them so it can take the approriate action.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes a problem where set_pte_common() looked for shadowed pages based on
the page directory gfn (a huge page) instead of the actual gfn being mapped.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we cache a guest page table into a shadow page table, we need to prevent
further access to that page by the guest, as that would render the cache
incoherent.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Define a hashtable for caching shadow page tables. Look up the cache on
context switch (cr3 change) or during page faults.
The key to the cache is a combination of
- the guest page table frame number
- the number of paging levels in the guest
* we can cache real mode, 32-bit mode, pae, and long mode page
tables simultaneously. this is useful for smp bootup.
- the guest page table table
* some kernels use a page as both a page table and a page directory. this
allows multiple shadow pages to exist for that page, one per level
- the "quadrant"
* 32-bit mode page tables span 4MB, whereas a shadow page table spans
2MB. similarly, a 32-bit page directory spans 4GB, while a shadow
page directory spans 1GB. the quadrant allows caching up to 4 shadow page
tables for one guest page in one level.
- a "metaphysical" bit
* for real mode, and for pse pages, there is no guest page table, so set
the bit to avoid write protecting the page.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows further manipulation on the shadow page table.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This lets us not write protect a partial page, and is anyway what a real
processor does.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since we're not going to cache the pae-mode shadow root pages, allocate a
single pae shadow that will hold the four lower-level pages, which will act as
roots.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is never necessary to fetch a guest entry from an intermediate page table
level (except for large pages), so avoid some confusion by always descending
into the lowest possible level.
Rename init_walker() to walk_addr() as it is no longer restricted to
initialization.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In pae mode, a load of cr3 loads the four third-level page table entries in
addition to cr3 itself.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Saving the table gfns removes the need to walk the guest and host page tables
in lockstep.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Keep in each host page frame's page->private a pointer to the shadow pte which
maps it. If there are multiple shadow ptes mapping the page, set bit 0 of
page->private, and use the rest as a pointer to a linked list of all such
mappings.
Reverse mappings are needed because we when we cache shadow page tables, we
must protect the guest page tables from being modified by the guest, as that
would invalidate the cached ptes.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hardware virtualization implementations allow the guests to freely change some
of the bits in cr0 and cr4, but trap when changing the other bits. This is
useful to avoid excessive exits due to changing, for example, the ts flag.
It also means the kvm's copy of cr0 and cr4 may be stale with respect to these
bits. most of the time this doesn't matter as these bits are not very
interesting. Other times, however (for example when returning cr0 to
userspace), they are, so get the fresh contents of these bits from the guest
by means of a new arch operation.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current interrupt injection mechanism might delay an interrupt under
the following circumstances:
- if injection fails because the guest is not interruptible (rflags.IF clear,
or after a 'mov ss' or 'sti' instruction). Userspace can check rflags,
but the other cases or not testable under the current API.
- if injection fails because of a fault during delivery. This probably
never happens under normal guests.
- if injection fails due to a physical interrupt causing a vmexit so that
it can be handled by the host.
In all cases the guest proceeds without processing the interrupt, reducing
the interactive feel and interrupt throughput of the guest.
This patch fixes the situation by allowing userspace to request an exit
when the 'interrupt window' opens, so that it can re-inject the interrupt
at the right time. Guest interactivity is very visibly improved.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If we load the wrong arch module, it leaves behind kvm_arch_ops set, which
prevents loading of the correct arch module later.
Fix be not setting kvm_arch_ops until we're sure it's good.
Signed-off-by: Yoshimi Ichiyanagi <ichiyanagi.yoshimi@lab.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
KVM does kmalloc() in an atomic section while having preemption disabled via
vcpu_load(). Fix this by moving the ->*_msr setup from the vcpu_setup method
to the vcpu_create method.
(This is also a small speedup for setting up a vcpu, which can in theory be
more frequent than the vcpu_create method).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fix an GFP_KERNEL allocation in atomic section: kvm_dev_ioctl_create_vcpu()
called kvm_mmu_init(), which calls alloc_pages(), while holding the vcpu.
The fix is to set up the MMU state in two phases: kvm_mmu_create() and
kvm_mmu_setup().
(NOTE: free_vcpus does an kvm_mmu_destroy() call so there's no need for any
extra teardown branch on allocation/init failure here.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
__free_page() doesn't like a NULL argument, so check before calling it. A
NULL can only happen if memory is exhausted during allocation of a memory
slot.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
No need to append _MSR to msr names, a prefix should suffice.
Signed-off-by: Nguyen Anh Quynh <aquynh@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These msrs are referenced by benchmarking software when pretending to be an
Intel cpu.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The latest version of kvm doesn't initialize kvm_arch_ops in kvm_init(), which
causes an error with the following sequence.
1. Load the supported arch's module.
2. Load the unsupported arch's module.$B!!(B(loading error)
3. Unload the unsupported arch's module.
You'll get the following error message after step 3. "BUG: unable to handle
to handle kernel paging request at virtual address xxxxxxxx"
The problem here is that the unsupported arch's module overwrites kvm_arch_ops
of the supported arch's module at step 2.
This patch initializes kvm_arch_ops upon loading architecture specific kvm
module, and prevents overwriting kvm_arch_ops when kvm_arch_ops is already set
correctly.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of doing tricky stuff with the arch dependent virtualization
registers, take a peek at the guest's efer.
This simlifies some code, and fixes some confusion in the mmu branch.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
current_cpu_data invokes smp_processor_id(), which is inadvisable when
preemption is enabled. Switch to boot_cpu_data instead.
Resolves sourceforge bug 1621401.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add compile-time and run-time API versioning.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows plan9 to get a little further booting.
Signed-off-by: Michael Riepe <michael@mr511.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows opensolaris to boot on kvm/intel.
Signed-off-by: Michael Riepe <michael@mr511.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some msrs, such as MSR_STAR, are not available on all processors. Exporting
them causes qemu to try to fetch them, which will fail.
So, check all msrs for validity at module load time.
Signed-off-by: Michael Riepe <michael@mr511.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fixes sf bug 1614113 (segfaults in nbench).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is necessary for linux guests.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Consolidate the logic for checking whether a vcpu index is valid. Also, use
likely(), as a valid value should be the overwhelmingly common case.
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It seems macbooks set bit 2 but not bit 0, which is an "enabled but vmxon will
fault" setting.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Tested-by: Alex Larsson (sometimes testing helps)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They're not on speaking terms.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The pcd, pwt, and pat bits on page table entries affect the cpu cache. Since
the cache is a host resource, the guest should not be able to control it.
Moreover, the meaning of these bits changes depending on whether pat is
enabled or not.
So, force these bits to zero on shadow page table entries at all times.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The arch splitting patchset left an extra put_cpu() in core code, where it can
cause trouble for CONFIG_PREEMPT kernels.
Reported-by: Huihong Luo <huisinro@yahoo.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes the SET_SREGS ioctl behave symmetrically to the GET_SREGS ioctl wrt
the segment access rights flag.
Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As per akpm's request.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
By letting gcc choose the temporary register for us, we lose arch dependency
and some ugliness. Conceivably gcc will also generate marginally better code.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of in the main drivers menu.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
load_TR_desc() lives in asm/desc.h, so #include that file.
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>