Commit graph

2208 commits

Author SHA1 Message Date
Gleb Natapov
242ec97c35 KVM: x86: reset edge sense circuit of i8259 on init
The spec says that during initialization "The edge sense circuit is
reset which means that following initialization an interrupt request
(IR) input must make a low-to-high transition to generate an interrupt",
but currently if edge triggered interrupt is in IRR it is delivered
after i8259 initialization.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:57:30 +02:00
Avi Kivity
1a18a69b76 KVM: x86 emulator: reject SYSENTER in compatibility mode on AMD guests
If the guest thinks it's an AMD, it will not have prepared the SYSENTER MSRs,
and if the guest executes SYSENTER in compatibility mode, it will fails.

Detect this condition and #UD instead, like the spec says.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:57:20 +02:00
Julian Stecklina
a52315e1d5 KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)
If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge),
it is erroneously treated as INIT de-assert and ignored, but to quote the
spec: "For this delivery mode [INIT de-assert], the level flag must be set to
0 and trigger mode flag to 1."

Signed-off-by: Julian Stecklina <js@alien8.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:43 +02:00
Davidlohr Bueso
e2358851ef KVM: SVM: comment nested paging and virtualization module parameters
Also use true instead of 1 for enabling by default.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:43 +02:00
Takuya Yoshikawa
e4b35cc960 KVM: MMU: Remove unused kvm parameter from rmap_next()
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:43 +02:00
Takuya Yoshikawa
9373e2c057 KVM: MMU: Remove unused kvm parameter from __gfn_to_rmap()
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:42 +02:00
Avi Kivity
2adb5ad9fe KVM: x86 emulator: Remove byte-sized MOVSX/MOVZX hack
Currently we treat MOVSX/MOVZX with a byte source as a byte instruction,
and change the destination operand size with a hack.  Change it to be
a word instruction, so the destination receives its natural size, and
change the source to be SrcMem8.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-03-05 14:52:42 +02:00
Avi Kivity
28867cee75 KVM: x86 emulator: add 8-bit memory operands
Useful for MOVSX/MOVZX.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-03-05 14:52:42 +02:00
Boris Ostrovsky
2b036c6b86 KVM: SVM: Add support for AMD's OSVW feature in guests
In some cases guests should not provide workarounds for errata even when the
physical processor is affected. For example, because of erratum 400 on family
10h processors a Linux guest will read an MSR (resulting in VMEXIT) before
going to idle in order to avoid getting stuck in a non-C0 state. This is not
necessary: HLT and IO instructions are intercepted and therefore there is no
reason for erratum 400 workaround in the guest.

This patch allows us to present a guest with certain errata as fixed,
regardless of the state of actual hardware.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:21 +02:00
Davidlohr Bueso
4a58ae614a KVM: MMU: unnecessary NX state assignment
We can remove the first ->nx state assignment since it is assigned afterwards anyways.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:21 +02:00
Carsten Otte
5b1c1493af KVM: s390: ucontrol: export SIE control block to user
This patch exports the s390 SIE hardware control block to userspace
via the mapping of the vcpu file descriptor. In order to do so,
a new arch callback named kvm_arch_vcpu_fault  is introduced for all
architectures. It allows to map architecture specific pages.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:19 +02:00
Carsten Otte
e08b963716 KVM: s390: add parameter for KVM_CREATE_VM
This patch introduces a new config option for user controlled kernel
virtual machines. It introduces a parameter to KVM_CREATE_VM that
allows to set bits that alter the capabilities of the newly created
virtual machine.
The parameter is passed to kvm_arch_init_vm for all architectures.
The only valid modifier bit for now is KVM_VM_S390_UCONTROL.
This requires CAP_SYS_ADMIN privileges and creates a user controlled
virtual machine on s390 architectures.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:18 +02:00
Xiao Guangrong
a138fe7535 KVM: MMU: remove the redundant get_written_sptes
get_written_sptes is called twice in kvm_mmu_pte_write, one of them can be
removed

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:18 +02:00
Takuya Yoshikawa
6addd1aa2c KVM: MMU: Add missing large page accounting to drop_large_spte()
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:18 +02:00
Takuya Yoshikawa
37178b8bf0 KVM: MMU: Remove for_each_unsync_children() macro
There is only one user of it and for_each_set_bit() does the same.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:17 +02:00
Ingo Molnar
737f24bda7 Merge branch 'perf/urgent' into perf/core
Conflicts:
	tools/perf/builtin-record.c
	tools/perf/builtin-top.c
	tools/perf/perf.h
	tools/perf/util/top.h

Merge reason: resolve these cherry-picking conflicts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 09:20:08 +01:00
Joerg Roedel
1018faa6cf perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM disabled
It turned out that a performance counter on AMD does not
count at all when the GO or HO bit is set in the control
register and SVM is disabled in EFER.

This patch works around this issue by masking out the HO bit
in the performance counter control register when SVM is not
enabled.

The GO bit is not touched because it is only set when the
user wants to count in guest-mode only. So when SVM is
disabled the counter should not run at all and the
not-counting is the intended behaviour.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Avi Kivity <avi@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: stable@vger.kernel.org # v3.2
Link: http://lkml.kernel.org/r/1330523852-19566-1-git-send-email-joerg.roedel@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-02 12:16:39 +01:00
Ingo Molnar
c5905afb0e static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.

Typical usage scenarios:

        #include <linux/static_key.h>

        struct static_key key = STATIC_KEY_INIT_TRUE;

        if (static_key_false(&key))
                do unlikely code
        else
                do likely code

Or:

        if (static_key_true(&key))
                do likely code
        else
                do unlikely code

The static key is modified via:

        static_key_slow_inc(&key);
        ...
        static_key_slow_dec(&key);

The 'slow' prefix makes it abundantly clear that this is an
expensive operation.

I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.

On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-24 10:05:59 +01:00
Linus Torvalds
1361b83a13 i387: Split up <asm/i387.h> into exported and internal interfaces
While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.

So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.

The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.

The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module.  But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained.  Even if it
isn't perhaps as contained as it _could_ be.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21 14:12:54 -08:00
Linus Torvalds
f94edacf99 i387: move TS_USEDFPU flag from thread_info to task_struct
This moves the bit that indicates whether a thread has ownership of the
FPU from the TS_USEDFPU bit in thread_info->status to a word of its own
(called 'has_fpu') in task_struct->thread.has_fpu.

This fixes two independent bugs at the same time:

 - changing 'thread_info->status' from the scheduler causes nasty
   problems for the other users of that variable, since it is defined to
   be thread-synchronous (that's what the "TS_" part of the naming was
   supposed to indicate).

   So perfectly valid code could (and did) do

	ti->status |= TS_RESTORE_SIGMASK;

   and the compiler was free to do that as separate load, or and store
   instructions.  Which can cause problems with preemption, since a task
   switch could happen in between, and change the TS_USEDFPU bit. The
   change to TS_USEDFPU would be overwritten by the final store.

   In practice, this seldom happened, though, because the 'status' field
   was seldom used more than once, so gcc would generally tend to
   generate code that used a read-modify-write instruction and thus
   happened to avoid this problem - RMW instructions are naturally low
   fat and preemption-safe.

 - On x86-32, the current_thread_info() pointer would, during interrupts
   and softirqs, point to a *copy* of the real thread_info, because
   x86-32 uses %esp to calculate the thread_info address, and thus the
   separate irq (and softirq) stacks would cause these kinds of odd
   thread_info copy aliases.

   This is normally not a problem, since interrupts aren't supposed to
   look at thread information anyway (what thread is running at
   interrupt time really isn't very well-defined), but it confused the
   heck out of irq_fpu_usable() and the code that tried to squirrel
   away the FPU state.

   (It also caused untold confusion for us poor kernel developers).

It also turns out that using 'task_struct' is actually much more natural
for most of the call sites that care about the FPU state, since they
tend to work with the task struct for other reasons anyway (ie
scheduling).  And the FPU data that we are going to save/restore is
found there too.

Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to
the %esp issue.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Reported-and-tested-by: Raphael Prevost <raphael@buro.asia>
Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-18 10:19:41 -08:00
Linus Torvalds
6d59d7a9f5 i387: don't ever touch TS_USEDFPU directly, use helper functions
This creates three helper functions that do the TS_USEDFPU accesses, and
makes everybody that used to do it by hand use those helpers instead.

In addition, there's a couple of helper functions for the "change both
CR0.TS and TS_USEDFPU at the same time" case, and the places that do
that together have been changed to use those.  That means that we have
fewer random places that open-code this situation.

The intent is partly to clarify the code without actually changing any
semantics yet (since we clearly still have some hard to reproduce bug in
this area), but also to make it much easier to use another approach
entirely to caching the CR0.TS bit for software accesses.

Right now we use a bit in the thread-info 'status' variable (this patch
does not change that), but we might want to make it a full field of its
own or even make it a per-cpu variable.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16 13:33:12 -08:00
Gleb Natapov
5753785fa9 KVM: do not #GP on perf MSR writes when vPMU is disabled
Return to behaviour perf MSR had before introducing vPMU in case vPMU
is disabled. Some guests access those registers unconditionally and do
not expect it to fail.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01 11:44:46 +02:00
Stephan Bärwolf
c2226fc9e8 KVM: x86: fix missing checks in syscall emulation
On hosts without this patch, 32bit guests will crash (and 64bit guests
may behave in a wrong way) for example by simply executing following
nasm-demo-application:

    [bits 32]
    global _start
    SECTION .text
    _start: syscall

(I tested it with winxp and linux - both always crashed)

    Disassembly of section .text:

    00000000 <_start>:
       0:   0f 05                   syscall

The reason seems a missing "invalid opcode"-trap (int6) for the
syscall opcode "0f05", which is not available on Intel CPUs
within non-longmodes, as also on some AMD CPUs within legacy-mode.
(depending on CPU vendor, MSR_EFER and cpuid)

Because previous mentioned OSs may not engage corresponding
syscall target-registers (STAR, LSTAR, CSTAR), they remain
NULL and (non trapping) syscalls are leading to multiple
faults and finally crashs.

Depending on the architecture (AMD or Intel) pretended by
guests, various checks according to vendor's documentation
are implemented to overcome the current issue and behave
like the CPUs physical counterparts.

[mtosatti: cleanup/beautify code]

Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01 11:43:40 +02:00
Stephan Bärwolf
bdb42f5afe KVM: x86: extend "struct x86_emulate_ops" with "get_cpuid"
In order to be able to proceed checks on CPU-specific properties
within the emulator, function "get_cpuid" is introduced.
With "get_cpuid" it is possible to virtually call the guests
"cpuid"-opcode without changing the VM's context.

[mtosatti: cleanup/beautify code]

Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01 11:43:33 +02:00
Rusty Russell
476bc0015b module_param: make bool parameters really bool (arch)
module_param(bool) used to counter-intuitively take an int.  In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option.  For this version
it'll simply give a warning, but it'll break next kernel version.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-13 09:32:18 +10:30
Avi Kivity
222d21aa07 KVM: x86 emulator: implement RDPMC (0F 33)
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:43 +02:00
Avi Kivity
80bdec64c0 KVM: x86 emulator: fix RDPMC privilege check
RDPMC is only privileged if CR4.PCE=0.  check_rdpmc() already implements this,
so all we need to do is drop the Priv flag.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:41 +02:00
Gleb Natapov
a6c06ed1a6 KVM: Expose the architectural performance monitoring CPUID leaf
Provide a CPUID leaf that describes the emulated PMU.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:40 +02:00
Avi Kivity
fee84b079d KVM: VMX: Intercept RDPMC
Intercept RDPMC and forward it to the PMU emulation code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:38 +02:00
Avi Kivity
332b56e484 KVM: SVM: Intercept RDPMC
Intercept RDPMC and forward it to the PMU emulation code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:37 +02:00
Avi Kivity
022cd0e840 KVM: Add generic RDPMC support
Add a helper function that emulates the RDPMC instruction operation.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:35 +02:00
Gleb Natapov
f5132b0138 KVM: Expose a version 2 architectural PMU to a guests
Use perf_events to emulate an architectural PMU, version 2.

Based on PMU version 1 emulation by Avi Kivity.

[avi: adjust for cpuid.c]
[jan: fix anonymous field initialization for older gcc]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:24:29 +02:00
Avi Kivity
8934208221 KVM: Expose kvm_lapic_local_deliver()
Needed to deliver performance monitoring interrupts.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:23:39 +02:00
Takuya Yoshikawa
e0dac408d0 KVM: x86 emulator: Use opcode::execute for Group 9 instruction
Group 9: 0F C7

Rename em_grp9() to em_cmpxchg8b() and register it.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:23:38 +02:00
Takuya Yoshikawa
c04ec8393f KVM: x86 emulator: Use opcode::execute for Group 4/5 instructions
Group 4: FE
Group 5: FF

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:23:36 +02:00
Takuya Yoshikawa
c15af35f54 KVM: x86 emulator: Use opcode::execute for Group 1A instruction
Group 1A: 8F

Register em_pop() directly and remove em_grp1a().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:23:35 +02:00
Gleb Natapov
d546cb406e KVM: drop bsp_vcpu pointer from kvm struct
Drop bsp_vcpu pointer from kvm struct since its only use is incorrect
anyway.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:22:32 +02:00
Jan Kiszka
a647795efb KVM: x86: Consolidate PIT legacy test
Move the test for KVM_PIT_FLAGS_HPET_LEGACY into create_pit_timer
instead of replicating it on the caller site.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:22:30 +02:00
Jan Kiszka
bb5a798ad5 KVM: x86: Do not rely on implicit inclusions
Works so far by change, but it is not guaranteed to stay like this.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:22:29 +02:00
Avi Kivity
43771ebfc9 KVM: Make KVM_INTEL depend on CPU_SUP_INTEL
PMU virtualization needs to talk to Intel-specific bits of perf; these are
only available when CPU_SUP_INTEL=y.

Fixes

  arch/x86/built-in.o: In function `atomic_switch_perf_msrs':
  vmx.c:(.text+0x6b1d4): undefined reference to `perf_guest_get_msrs'

Reported-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:22:27 +02:00
Sasha Levin
ff5c2c0316 KVM: Use memdup_user instead of kmalloc/copy_from_user
Switch to using memdup_user when possible. This makes code more
smaller and compact, and prevents errors.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:21 +02:00
Sasha Levin
cdfca7b346 KVM: Use kmemdup() instead of kmalloc/memcpy
Switch to kmemdup() in two places to shorten the code and avoid possible bugs.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:20 +02:00
Jan Kiszka
234b639206 KVM: x86 emulator: Remove set-but-unused cr4 from check_cr_write
This was probably copy&pasted from the cr0 case, but it's unneeded here.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:16 +02:00
Jan Kiszka
3d56cbdf35 KVM: MMU: Drop unused return value of kvm_mmu_remove_some_alloc_mmu_pages
freed_pages is never evaluated, so remove it as well as the return code
kvm_mmu_remove_some_alloc_mmu_pages so far delivered to its only user.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:15 +02:00
Alex,Shi
086c985501 KVM: use this_cpu_xxx replace percpu_xxx funcs
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them
for further code clean up.

And in preempt safe scenario, __this_cpu_xxx funcs has a bit better
performance since __this_cpu_xxx has no redundant preempt_disable()

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:13 +02:00
Xiao Guangrong
e37fa7853c KVM: MMU: audit: inline audit function
inline audit function and little cleanup

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:12 +02:00
Xiao Guangrong
d750ea2886 KVM: MMU: remove oos_shadow parameter
The unsync code should be stable now, maybe it is the time to remove this
parameter to cleanup the code a little bit

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:10 +02:00
Xiao Guangrong
e459e3228d KVM: MMU: move the relevant mmu code to mmu.c
Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create()

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:09 +02:00
Xiao Guangrong
9edb17d55f KVM: x86: remove the dead code of KVM_EXIT_HYPERCALL
KVM_EXIT_HYPERCALL is not used anymore, so remove the code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:07 +02:00
Xiao Guangrong
0375f7fad9 KVM: MMU: audit: replace mmu audit tracepoint with jump-label
The tracepoint is only used to audit mmu code, it should not be exposed to
user, let us replace it with jump-label.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:05 +02:00
Sasha Levin
831bf664e9 KVM: Refactor and simplify kvm_dev_ioctl_get_supported_cpuid
This patch cleans and simplifies kvm_dev_ioctl_get_supported_cpuid by using a table
instead of duplicating code as Avi suggested.

This patch also fixes a bug where kvm_dev_ioctl_get_supported_cpuid would return
-E2BIG when amount of entries passed was just right.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:02 +02:00
Liu, Jinsong
fb215366b3 KVM: expose latest Intel cpu new features (BMI1/BMI2/FMA/AVX2) to guest
Intel latest cpu add 6 new features, refer http://software.intel.com/file/36945
The new feature cpuid listed as below:

1. FMA		CPUID.EAX=01H:ECX.FMA[bit 12]
2. MOVBE	CPUID.EAX=01H:ECX.MOVBE[bit 22]
3. BMI1		CPUID.EAX=07H,ECX=0H:EBX.BMI1[bit 3]
4. AVX2		CPUID.EAX=07H,ECX=0H:EBX.AVX2[bit 5]
5. BMI2		CPUID.EAX=07H,ECX=0H:EBX.BMI2[bit 8]
6. LZCNT	CPUID.EAX=80000001H:ECX.LZCNT[bit 5]

This patch expose these features to guest.
Among them, FMA/MOVBE/LZCNT has already been defined, MOVBE/LZCNT has
already been exposed.

This patch defines BMI1/AVX2/BMI2, and exposes FMA/BMI1/AVX2/BMI2 to guest.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:01 +02:00
Avi Kivity
00b27a3efb KVM: Move cpuid code to new file
The cpuid code has grown; put it into a separate file.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:21:49 +02:00
Takuya Yoshikawa
2b5e97e1fa KVM: x86 emulator: Use opcode::execute for INS/OUTS from/to port in DX
INSB       : 6C
INSW/INSD  : 6D
OUTSB      : 6E
OUTSW/OUTSD: 6F

The I/O port address is read from the DX register when we decode the
operand because we see the SrcDX/DstDX flag is set.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:46 +02:00
Xiao Guangrong
28a37544fb KVM: introduce id_to_memslot function
Introduce id_to_memslot to get memslot by slot id

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:39 +02:00
Xiao Guangrong
be6ba0f096 KVM: introduce kvm_for_each_memslot macro
Introduce kvm_for_each_memslot to walk all valid memslot

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:37 +02:00
Xiao Guangrong
be593d6286 KVM: introduce update_memslots function
Introduce update_memslots to update slot which will be update to
kvm->memslots

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:35 +02:00
Xiao Guangrong
93a5cef07d KVM: introduce KVM_MEM_SLOTS_NUM macro
Introduce KVM_MEM_SLOTS_NUM macro to instead of
KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:34 +02:00
Takuya Yoshikawa
ff227392cd KVM: x86 emulator: Use opcode::execute for BSF/BSR
BSF: 0F BC
BSR: 0F BD

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:32 +02:00
Takuya Yoshikawa
e940b5c20f KVM: x86 emulator: Use opcode::execute for CMPXCHG
CMPXCHG: 0F B0, 0F B1

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:31 +02:00
Takuya Yoshikawa
e1e210b0a7 KVM: x86 emulator: Use opcode::execute for WRMSR/RDMSR
WRMSR: 0F 30
RDMSR: 0F 32

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:29 +02:00
Takuya Yoshikawa
bc00f8d2c2 KVM: x86 emulator: Use opcode::execute for MOV to cr/dr
MOV: 0F 22 (move to control registers)
MOV: 0F 23 (move to debug registers)

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:28 +02:00
Takuya Yoshikawa
d4ddafcdf2 KVM: x86 emulator: Use opcode::execute for CALL
CALL: E8

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:26 +02:00
Takuya Yoshikawa
ce7faab24f KVM: x86 emulator: Use opcode::execute for BT family
BT : 0F A3
BTS: 0F AB
BTR: 0F B3
BTC: 0F BB

Group 8: 0F BA

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:25 +02:00
Takuya Yoshikawa
d7841a4b1b KVM: x86 emulator: Use opcode::execute for IN/OUT
IN : E4, E5, EC, ED
OUT: E6, E7, EE, EF

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:23 +02:00
Gleb Natapov
46199f33c2 KVM: VMX: remove unneeded vmx_load_host_state() calls.
vmx_load_host_state() does not handle msrs switching (except
MSR_KERNEL_GS_BASE) since commit 26bb0981b3. Remove call to it
where it is no longer make sense.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:22 +02:00
Takuya Yoshikawa
95d4c16ce7 KVM: Optimize dirty logging by rmap_write_protect()
Currently, write protecting a slot needs to walk all the shadow pages
and checks ones which have a pte mapping a page in it.

The walk is overly heavy when dirty pages in that slot are not so many
and checking the shadow pages would result in unwanted cache pollution.

To mitigate this problem, we use rmap_write_protect() and check only
the sptes which can be reached from gfns marked in the dirty bitmap
when the number of dirty pages are less than that of shadow pages.

This criterion is reasonable in its meaning and worked well in our test:
write protection became some times faster than before when the ratio of
dirty pages are low and was not worse even when the ratio was near the
criterion.

Note that the locking for this write protection becomes fine grained.
The reason why this is safe is descripted in the comments.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:20 +02:00
Takuya Yoshikawa
7850ac5420 KVM: Count the number of dirty pages for dirty logging
Needed for the next patch which uses this number to decide how to write
protect a slot.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:19 +02:00
Takuya Yoshikawa
9b9b149236 KVM: MMU: Split gfn_to_rmap() into two functions
rmap_write_protect() calls gfn_to_rmap() for each level with gfn fixed.
This results in calling gfn_to_memslot() repeatedly with that gfn.

This patch introduces __gfn_to_rmap() which takes the slot as an
argument to avoid this.

This is also needed for the following dirty logging optimization.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:17 +02:00
Takuya Yoshikawa
d6eebf8b80 KVM: MMU: Clean up BUG_ON() conditions in rmap_write_protect()
Remove redundant checks and use is_large_pte() macro.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:13 +02:00
Chris Wright
fb92045843 KVM: MMU: remove KVM host pv mmu support
The host side pv mmu support has been marked for feature removal in
January 2011.  It's not in use, is slower than shadow or hardware
assisted paging, and a maintenance burden.  It's November 2011, time to
remove it.

Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:10 +02:00
Jan Kiszka
3f2e5260f5 KVM: x86: Simplify kvm timer handler
The vcpu reference of a kvm_timer can't become NULL while the timer is
valid, so drop this redundant test. This also makes it pointless to
carry a separate __kvm_timer_fn, fold it into kvm_timer_fn.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:05 +02:00
Xiao Guangrong
a30f47cb15 KVM: MMU: improve write flooding detected
Detecting write-flooding does not work well, when we handle page written, if
the last speculative spte is not accessed, we treat the page is
write-flooding, however, we can speculative spte on many path, such as pte
prefetch, page synced, that means the last speculative spte may be not point
to the written page and the written page can be accessed via other sptes, so
depends on the Accessed bit of the last speculative spte is not enough

Instead of detected page accessed, we can detect whether the spte is accessed
after it is written, if the spte is not accessed but it is written frequently,
we treat is not a page table or it not used for a long time

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:02 +02:00
Xiao Guangrong
5d9ca30e96 KVM: MMU: fix detecting misaligned accessed
Sometimes, we only modify the last one byte of a pte to update status bit,
for example, clear_bit is used to clear r/w bit in linux kernel and 'andb'
instruction is used in this function, in this case, kvm_mmu_pte_write will
treat it as misaligned access, and the shadow page table is zapped

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:01 +02:00
Xiao Guangrong
889e5cbced KVM: MMU: split kvm_mmu_pte_write function
kvm_mmu_pte_write is too long, we split it for better readable

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:59 +02:00
Xiao Guangrong
f8734352c6 KVM: MMU: remove unnecessary kvm_mmu_free_some_pages
In kvm_mmu_pte_write, we do not need to alloc shadow page, so calling
kvm_mmu_free_some_pages is really unnecessary

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:58 +02:00
Xiao Guangrong
f57f2ef58f KVM: MMU: fast prefetch spte on invlpg path
Fast prefetch spte for the unsync shadow page on invlpg path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:56 +02:00
Xiao Guangrong
505aef8f30 KVM: MMU: cleanup FNAME(invlpg)
Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the
same code between FNAME(invlpg) and FNAME(sync_page)

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:54 +02:00
Xiao Guangrong
d01f8d5e02 KVM: MMU: do not mark accessed bit on pte write path
In current code, the accessed bit is always set when page fault occurred,
do not need to set it on pte write path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:53 +02:00
Xiao Guangrong
6f6fbe98c3 KVM: x86: cleanup port-in/port-out emulated
Remove the same code between emulator_pio_in_emulated and
emulator_pio_out_emulated

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:51 +02:00
Xiao Guangrong
1cb3f3ae5a KVM: x86: retry non-page-table writing instructions
If the emulation is caused by #PF and it is non-page_table writing instruction,
it means the VM-EXIT is caused by shadow page protected, we can zap the shadow
page and retry this instruction directly

The idea is from Avi

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:50 +02:00
Xiao Guangrong
d5ae7ce835 KVM: x86: tag the instructions which are used to write page table
The idea is from Avi:
| tag instructions that are typically used to modify the page tables, and
| drop shadow if any other instruction is used.
| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg,
| and cmpxchg8b.

This patch is used to tag the instructions and in the later path, shadow page
is dropped if it is written by other instructions

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:48 +02:00
Xiao Guangrong
f759e2b4c7 KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write
kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
function when spte is prefetched, unfortunately, we can not know how many
spte need to be prefetched on this path, that means we can use out of the
free  pte_list_desc object in the cache, and BUG_ON() is triggered, also some
path does not fill the cache, such as INS instruction emulated that does not
trigger page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:47 +02:00
Nadav Har'El
51cfe38ea5 KVM: nVMX: Fix warning-causing idt-vectoring-info behavior
When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".

Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that idt_vectoring_info
and exit_reason_external_interrupt can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.

In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.

Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
although we do enter L2 first, it will exit immediately after processing its
injection, allowing us to promptly inject to L1.

Note how we test vmcs12->idt_vectoring_info_field; This isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.

Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:45 +02:00
Nadav Har'El
d6185f20a0 KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.

We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.

Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.

On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:43 +02:00
Jan Kiszka
4d25a066b6 KVM: Don't automatically expose the TSC deadline timer in cpuid
Unlike all of the other cpuid bits, the TSC deadline timer bit is set
unconditionally, regardless of what userspace wants.

This is broken in several ways:
 - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
   deadline timer feature, a guest that uses the feature will break
 - live migration to older host kernels that don't support the TSC deadline
   timer will cause the feature to be pulled from under the guest's feet;
   breaking it
 - guests that are broken wrt the feature will fail.

Fix by not enabling the feature automatically; instead report it to userspace.
Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
KVM_GET_SUPPORTED_CPUID.

Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.

[avi: add the KVM_CAP + documentation]

Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-26 13:27:44 +02:00
Jan Kiszka
0924ab2cfa KVM: x86: Prevent starting PIT timers in the absence of irqchip support
User space may create the PIT and forgets about setting up the irqchips.
In that case, firing PIT IRQs will crash the host:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000128
IP: [<ffffffffa10f6280>] kvm_set_irq+0x30/0x170 [kvm]
...
Call Trace:
 [<ffffffffa11228c1>] pit_do_work+0x51/0xd0 [kvm]
 [<ffffffff81071431>] process_one_work+0x111/0x4d0
 [<ffffffff81071bb2>] worker_thread+0x152/0x340
 [<ffffffff81075c8e>] kthread+0x7e/0x90
 [<ffffffff815a4474>] kernel_thread_helper+0x4/0x10

Prevent this by checking the irqchip mode before starting a timer. We
can't deny creating the PIT if the irqchips aren't set up yet as
current user land expects this order to work.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-25 17:13:18 +02:00
Gleb Natapov
e7fc6f93b4 KVM: VMX: Check for automatic switch msr table overflow
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-17 16:28:09 +02:00
Gleb Natapov
d7cd97964b KVM: VMX: Add support for guest/host-only profiling
Support guest/host-only profiling by switch perf msrs on
a guest entry if needed.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-17 16:28:00 +02:00
Gleb Natapov
8bf00a5299 KVM: VMX: add support for switching of PERF_GLOBAL_CTRL
Some cpus have special support for switching PERF_GLOBAL_CTRL msr.
Add logic to detect if such support exists and works properly and extend
msr switching code to use it if available. Also extend number of generic
msr switching entries to 8.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-17 16:27:54 +02:00
Linus Torvalds
0cfdc72439 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
  iommu/core: Remove global iommu_ops and register_iommu
  iommu/msm: Use bus_set_iommu instead of register_iommu
  iommu/omap: Use bus_set_iommu instead of register_iommu
  iommu/vt-d: Use bus_set_iommu instead of register_iommu
  iommu/amd: Use bus_set_iommu instead of register_iommu
  iommu/core: Use bus->iommu_ops in the iommu-api
  iommu/core: Convert iommu_found to iommu_present
  iommu/core: Add bus_type parameter to iommu_domain_alloc
  Driver core: Add iommu_ops to bus_type
  iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API
  iommu/amd: Fix wrong shift direction
  iommu/omap: always provide iommu debug code
  iommu/core: let drivers know if an iommu fault handler isn't installed
  iommu/core: export iommu_set_fault_handler()
  iommu/omap: Fix build error with !IOMMU_SUPPORT
  iommu/omap: Migrate to the generic fault report mechanism
  iommu/core: Add fault reporting mechanism
  iommu/core: Use PAGE_SIZE instead of hard-coded value
  iommu/core: use the existing IS_ALIGNED macro
  iommu/msm: ->unmap() should return order of unmapped page
  ...

Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to
dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config
options" just happened to touch lines next to each other.
2011-10-30 15:46:19 -07:00
Jan Kiszka
f1c1da2bde KVM: SVM: Keep intercepting task switching with NPT enabled
AMD processors apparently have a bug in the hardware task switching
support when NPT is enabled. If the task switch triggers a NPF, we can
get wrong EXITINTINFO along with that fault. On resume, spurious
exceptions may then be injected into the guest.

We were able to reproduce this bug when our guest triggered #SS and the
handler were supposed to run over a separate task with not yet touched
stack pages.

Work around the issue by continuing to emulate task switches even in
NPT mode.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-10-30 12:24:10 +02:00
Joerg Roedel
1abb4ba596 Merge branches 'amd/fixes', 'debug/dma-api', 'arm/omap', 'arm/msm', 'core', 'iommu/fault-reporting' and 'api/iommu-ops-per-bus' into next
Conflicts:
	drivers/iommu/amd_iommu.c
	drivers/iommu/iommu.c
2011-10-21 14:38:55 +02:00
Joerg Roedel
a1b60c1cd9 iommu/core: Convert iommu_found to iommu_present
With per-bus iommu_ops the iommu_found function needs to
work on a bus_type too. This patch adds a bus_type parameter
to that function and converts all call-places.
The function is also renamed to iommu_present because the
function now checks if an iommu is present for a given bus
and does not check for a global iommu anymore.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2011-10-21 14:37:20 +02:00
Liu, Jinsong
a3e06bbe84 KVM: emulate lapic tsc deadline timer for guest
This patch emulate lapic tsc deadline timer for guest:
Enumerate tsc deadline timer capability by CPUID;
Enable tsc deadline timer mode by lapic MMIO;
Start tsc deadline timer by WRMSR;

[jan: use do_div()]
[avi: fix for !irqchip_in_kernel()]
[marcelo: another fix for !irqchip_in_kernel()]

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-10-05 15:34:56 +02:00
Avi Kivity
7460fb4a34 KVM: Fix simultaneous NMIs
If simultaneous NMIs happen, we're supposed to queue the second
and next (collapsing them), but currently we sometimes collapse
the second into the first.

Fix by using a counter for pending NMIs instead of a bool; since
the counter limit depends on whether the processor is currently
in an NMI handler, which can only be checked in vcpu context
(via the NMI mask), we add a new KVM_REQ_NMI to request recalculation
of the counter.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:59 +03:00
Avi Kivity
1cd196ea42 KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:58 +03:00
Avi Kivity
d4b4325fdb KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:57 +03:00
Avi Kivity
c191a7a0f4 KVM: x86 emulator: streamline decode of segment registers
The opcodes

  push %seg
  pop %seg
  l%seg, %mem, %reg  (e.g. lds/les/lss/lfs/lgs)

all have an segment register encoded in the instruction.  To allow reuse,
decode the segment number into src2 during the decode stage instead of the
execution stage.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:56 +03:00
Avi Kivity
41ddf9784c KVM: x86 emulator: simplify OpMem64 decode
Use the same technique as the other OpMem variants, and goto mem_common.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:55 +03:00
Avi Kivity
0fe5912884 KVM: x86 emulator: switch src decode to decode_operand()
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:54 +03:00
Avi Kivity
5217973ef8 KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack
OpReg decoding has a hack that inhibits byte registers for movsx and movzx
instructions.  It should be replaced by something better, but meanwhile,
qualify that the hack is only active for the destination operand.

Note these instructions only use OpReg for the destination, but better to
be explicit about it.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:53 +03:00
Avi Kivity
608aabe316 KVM: x86 emulator: switch OpImmUByte decode to decode_imm()
Similar to SrcImmUByte.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:52 +03:00
Avi Kivity
20c29ff205 KVM: x86 emulator: free up some flag bits near src, dst
Op fields are going to grow by a bit, we need two free bits.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:51 +03:00
Avi Kivity
4dd6a57df7 KVM: x86 emulator: switch src2 to generic decode_operand()
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:50 +03:00
Avi Kivity
b1ea50b2b6 KVM: x86 emulator: expand decode flags to 64 bits
Unifiying the operands means not taking advantage of the fact that some
operand types can only go into certain operands (for example, DI can only
be used by the destination), so we need more bits to hold the operand type.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:49 +03:00
Avi Kivity
a99455499a KVM: x86 emulator: split dst decode to a generic decode_operand()
Instead of decoding each operand using its own code, use a generic
function.  Start with the destination operand.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:48 +03:00
Avi Kivity
f09ed83e21 KVM: x86 emulator: move memop, memopp into emulation context
Simplifies further generalization of decode.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:47 +03:00
Avi Kivity
3329ece161 KVM: x86 emulator: convert group 3 instructions to direct decode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:46 +03:00
Jan Kiszka
9bc5791d4a KVM: x86: Add module parameter for lapic periodic timer limit
Certain guests, specifically RTOSes, request faster periodic timers than
what we allow by default. Add a module parameter to adjust the limit for
non-standard setups. Also add a rate-limited warning in case the guest
requested more.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:44 +03:00
Jan Kiszka
bd80158aff KVM: Clean up and extend rate-limited output
The use of printk_ratelimit is discouraged, replace it with
pr*_ratelimited or __ratelimit. While at it, convert remaining
guest-triggerable printks to rate-limited variants.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:43 +03:00
Jan Kiszka
7712de872c KVM: x86: Avoid guest-triggerable printks in APIC model
Convert remaining printks that the guest can trigger to apic_printk.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:42 +03:00
Jan Kiszka
1e2b1dd797 KVM: x86: Move kvm_trace_exit into atomic vmexit section
This avoids that events causing the vmexit are recorded before the
actual exit reason.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:41 +03:00
Avi Kivity
caa8a168e3 KVM: x86 emulator: disable writeback for TEST
The TEST instruction doesn't write its destination operand.  This
could cause problems if an MMIO register was accessed using the TEST
instruction.  Recently Windows XP was observed to use TEST against
the APIC ICR; this can cause spurious IPIs.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:40 +03:00
Avi Kivity
e8f2b1d621 KVM: x86 emulator: simplify emulate_1op_rax_rdx()
emulate_1op_rax_rdx() is always called with the same parameters.  Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:37 +03:00
Avi Kivity
9fef72ce10 KVM: x86 emulator: merge the two emulate_1op_rax_rdx implementations
We have two emulate-with-extended-accumulator implementations: once
which expect traps (_ex) and one which doesn't (plain).  Drop the
plain implementation and always use the one which expects traps;
it will simply return 0 in the _ex argument and we can happily ignore
it.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:36 +03:00
Avi Kivity
d1eef45d59 KVM: x86 emulator: simplify emulate_1op()
emulate_1op() is always called with the same parameters.  Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:35 +03:00
Avi Kivity
29053a60d7 KVM: x86 emulator: simplify emulate_2op_cl()
emulate_2op_cl() is always called with the same parameters.  Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:34 +03:00
Avi Kivity
761441b9f4 KVM: x86 emulator: simplify emulate_2op_cl()
emulate_2op_cl() is always called with the same parameters.  Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:33 +03:00
Avi Kivity
a31b9ceadb KVM: x86 emulator: simplify emulate_2op_SrcV()
emulate_2op_SrcV(), and its siblings, emulate_2op_SrcV_nobyte()
and emulate_2op_SrcB(), all use the same calling conventions
and all get passed exactly the same parameters.  Simplify them
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:32 +03:00
Kevin Tian
58fbbf26eb KVM: APIC: avoid instruction emulation for EOI writes
Instruction emulation for EOI writes can be skipped, since sane
guest simply uses MOV instead of string operations. This is a nice
improvement when guest doesn't support x2apic or hyper-V EOI
support.

a single VM bandwidth is observed with ~8% bandwidth improvement
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
<Based on earlier work from>:
Signed-off-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:52:17 +03:00
Nadav Har'El
45133ecaae KVM: SVM: Fix TSC MSR read in nested SVM
When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
read without exit), we need to return L2's notion of the TSC, not L1's.

The current code incorrectly returned L1 TSC, because svm_get_msr() was also
used in x86.c where this was assumed, but now that these places call the new
svm_read_l1_tsc(), the MSR read can be fixed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:03 +03:00
Nadav Har'El
27fc51b21c KVM: nVMX: Fix nested VMX TSC emulation
This patch fixes two corner cases in nested (L2) handling of TSC-related
issues:

1. Somewhat suprisingly, according to the Intel spec, if L1 allows WRMSR to
the TSC MSR without an exit, then this should set L1's TSC value itself - not
offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).

2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
the vmcs12.TSC_OFFSET.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Nadav Har'El
d5c1785d2f KVM: L1 TSC handling
KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct TSC read exit
emulation is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
by a different patch sent by Zachary Amsden (and not yet applied):
kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
course we didn't have to change the call of kvm_get_msr() to read_l1_tsc().

[avi: moved callback to kvm_x86_ops tsc block]

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Acked-by: Zachary Amsdem <zamsden@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Yang, Wei Y
cd46868c7f KVM: MMU: Fix SMEP failure during fetch
This patch fix kvm-unit-tests hanging and incorrect PT_ACCESSED_MASK
bit set in the case of SMEP fault.  The code updated 'eperm' after
the variable was checked.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Avi Kivity
e4e517b4be KVM: MMU: Do not unconditionally read PDPTE from guest memory
Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded.
On SVM, it is not possible to implement this, but on VMX this is possible
and was indeed implemented until nested SVM changed this to unconditionally
read PDPTEs dynamically.  This has noticable impact when running PAE guests.

Fix by changing the MMU to read PDPTRs from the cache, falling back to
reading from memory for the nested MMU.

Signed-off-by: Avi Kivity <avi@redhat.com>
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:18:01 +03:00
Julia Lawall
cf3ace79c0 KVM: VMX: trivial: use BUG_ON
Use BUG_ON(x) rather than if(x) BUG();

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ identifier x; @@
-if (x) BUG();
+BUG_ON(x);

@@ identifier x; @@
-if (!x) BUG();
+BUG_ON(!x);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:18:01 +03:00
Marcelo Tosatti
742bc67042 KVM: x86: report valid microcode update ID
Windows Server 2008 SP2 checked build with smp > 1 BSOD's during
boot due to lack of microcode update:

*** Assertion failed: The system BIOS on this machine does not properly
support the processor.  The system BIOS did not load any microcode update.
A BIOS containing the latest microcode update is needed for system reliability.
(CurrentUpdateRevision != 0)
***   Source File: d:\longhorn\base\hals\update\intelupd\update.c, line 440

Report a non-zero microcode update signature to make it happy.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:01 +03:00
Takuya Yoshikawa
1d2887e2d8 KVM: x86 emulator: Make x86_decode_insn() return proper macros
Return EMULATION_OK/FAILED consistently.  Also treat instruction fetch
errors, not restricted to X86EMUL_UNHANDLEABLE, as EMULATION_FAILED;
although this cannot happen in practice, the current logic will continue
the emulation even if the decoder fails to fetch the instruction.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:01 +03:00
Takuya Yoshikawa
7d88bb4803 KVM: x86 emulator: Let compiler know insn_fetch() rarely fails
Fetching the instruction which was to be executed by the guest cannot
fail normally.  So compiler should always predict that it will succeed.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:00 +03:00
Takuya Yoshikawa
e85a10852c KVM: x86 emulator: Drop _size argument from insn_fetch()
_type is enough to know the size.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:59 +03:00
Takuya Yoshikawa
807941b121 KVM: x86 emulator: Use ctxt->_eip directly in do_insn_fetch_byte()
Instead of passing ctxt->_eip from insn_fetch() call sites, get it from
ctxt in do_insn_fetch_byte().  This is done by replacing the argument
_eip of insn_fetch() with _ctxt, which should be better than letting the
macro use ctxt silently in its body.

Though this changes the place where ctxt->_eip is incremented from
insn_fetch() to do_insn_fetch_byte(), this does not have any real
effect.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:59 +03:00
Sasha Levin
743eeb0b01 KVM: Intelligent device lookup on I/O bus
Currently the method of dealing with an IO operation on a bus (PIO/MMIO)
is to call the read or write callback for each device registered
on the bus until we find a device which handles it.

Since the number of devices on a bus can be significant due to ioeventfds
and coalesced MMIO zones, this leads to a lot of overhead on each IO
operation.

Instead of registering devices, we now register ranges which points to
a device. Lookup is done using an efficient bsearch instead of a linear
search.

Performance test was conducted by comparing exit count per second with
200 ioeventfds created on one byte and the guest is trying to access a
different byte continuously (triggering usermode exits).
Before the patch the guest has achieved 259k exits per second, after the
patch the guest does 274k exits per second.

Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:59 +03:00
Stefan Hajnoczi
0d460ffc09 KVM: Use __print_symbolic() for vmexit tracepoints
The vmexit tracepoints format the exit_reason to make it human-readable.
Since the exit_reason depends on the instruction set (vmx or svm),
formatting is handled with ftrace_print_symbols_seq() by referring to
the appropriate exit reason table.

However, the ftrace_print_symbols_seq() function is not meant to be used
directly in tracepoints since it does not export the formatting table
which userspace tools like trace-cmd and perf use to format traces.

In practice perf dies when formatting vmexit-related events and
trace-cmd falls back to printing the numeric value (with extra
formatting code in the kvm plugin to paper over this limitation).  Other
userspace consumers of vmexit-related tracepoints would be in similar
trouble.

To avoid significant changes to the kvm_exit tracepoint, this patch
moves the vmx and svm exit reason tables into arch/x86/kvm/trace.h and
selects the right table with __print_symbolic() depending on the
instruction set.  Note that __print_symbolic() is designed for exporting
the formatting table to userspace and allows trace-cmd and perf to work.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:59 +03:00
Stefan Hajnoczi
e097e5ffd6 KVM: Record instruction set in all vmexit tracepoints
The kvm_exit tracepoint recently added the isa argument to aid decoding
exit_reason.  The semantics of exit_reason depend on the instruction set
(vmx or svm) and the isa argument allows traces to be analyzed on other
machines.

Add the isa argument to kvm_nested_vmexit and kvm_nested_vmexit_inject
so these tracepoints can also be self-describing.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:58 +03:00
Mike Waychison
d1613ad5d0 KVM: Really fix HV_X64_MSR_APIC_ASSIST_PAGE
Commit 0945d4b228 tried to fix the get_msr path for the
HV_X64_MSR_APIC_ASSIST_PAGE msr, but was poorly tested.  We should be
returning 0 if the read succeeded, and passing the value back to the
caller via the pdata out argument, not returning the value directly.

Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:58 +03:00
Mike Waychison
14fa67ee95 KVM: x86: get_msr support for HV_X64_MSR_APIC_ASSIST_PAGE
"get" support for the HV_X64_MSR_APIC_ASSIST_PAGE msr was missing, even
though it is explicitly enumerated as something the vmm should save in
msrs_to_save and reported to userland via the KVM_GET_MSR_INDEX_LIST
ioctl.

Add "get" support for HV_X64_MSR_APIC_ASSIST_PAGE.  We simply return the
guest visible value of this register, which seems to be correct as a set
on the register is validated for us already.

Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:17:58 +03:00
Sasha Levin
8c3ba334f8 KVM: x86: Raise the hard VCPU count limit
The patch raises the hard limit of VCPU count to 254.

This will allow developers to easily work on scalability
and will allow users to test high VCPU setups easily without
patching the kernel.

To prevent possible issues with current setups, KVM_CAP_NR_VCPUS
now returns the recommended VCPU limit (which is still 64) - this
should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS
returns the hard limit which is now 254.

Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Suggested-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:17:57 +03:00
Xiao Guangrong
22388a3c8c KVM: x86: cleanup the code of read/write emulation
Using the read/write operation to remove the same code

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:17 +03:00
Xiao Guangrong
77d197b2ca KVM: x86: abstract the operation for read/write emulation
The operations of read emulation and write emulation are very similar, so we
can abstract the operation of them, in larter patch, it is used to cleanup the
same code

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:17 +03:00
Xiao Guangrong
ca7d58f375 KVM: x86: fix broken read emulation spans a page boundary
If the range spans a page boundary, the mmio access can be broke, fix it as
write emulation.

And we already get the guest physical address, so use it to read guest data
directly to avoid walking guest page table again

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:17:17 +03:00
Avi Kivity
9be3be1f15 KVM: x86 emulator: fix Src2CL decode
Src2CL decode (used for double width shifts) erronously decodes only bit 3
of %rcx, instead of bits 7:0.

Fix by decoding %cl in its entirety.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:14:58 +03:00
Zhao Jin
41bc3186b3 KVM: MMU: fix incorrect return of spte
__update_clear_spte_slow should return original spte while the
current code returns low half of original spte combined with high
half of new spte.

Signed-off-by: Zhao Jin <cronozhj@gmail.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:13:25 +03:00
Linus Torvalds
ab7e2dbf9b Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: uses TASKSTATS, depends on NET
  KVM: fix TASK_DELAY_ACCT kconfig warning
2011-08-16 11:14:44 -07:00
Randy Dunlap
df3d8ae1f8 KVM: uses TASKSTATS, depends on NET
CONFIG_TASKSTATS just had a change to use netlink, including
a change to "depends on NET".  Since "select" does not follow
dependencies, KVM also needs to depend on NET to prevent build
errors when CONFIG_NET is not enabled.

Sample of the reported "undefined reference" build errors:

taskstats.c:(.text+0x8f686): undefined reference to `nla_put'
taskstats.c:(.text+0x8f721): undefined reference to `nla_reserve'
taskstats.c:(.text+0x8f8fb): undefined reference to `init_net'
taskstats.c:(.text+0x8f905): undefined reference to `netlink_unicast'
taskstats.c:(.text+0x8f934): undefined reference to `kfree_skb'
taskstats.c:(.text+0x8f9e9): undefined reference to `skb_clone'
taskstats.c:(.text+0x90060): undefined reference to `__alloc_skb'
taskstats.c:(.text+0x901e9): undefined reference to `skb_put'
taskstats.c:(.init.text+0x4665): undefined reference to `genl_register_family'
taskstats.c:(.init.text+0x4699): undefined reference to `genl_register_ops'
taskstats.c:(.init.text+0x4710): undefined reference to `genl_unregister_ops'
taskstats.c:(.init.text+0x471c): undefined reference to `genl_unregister_family'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-08-16 19:00:41 +03:00
Randy Dunlap
fd079facb3 KVM: fix TASK_DELAY_ACCT kconfig warning
Fix kconfig dependency warning:

warning: (KVM) selects TASK_DELAY_ACCT which has unmet direct dependencies (TASKSTATS)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-27 14:22:06 +03:00
Arun Sharma
60063497a9 atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:47 -07:00
Linus Torvalds
d3ec4844d4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  fs: Merge split strings
  treewide: fix potentially dangerous trailing ';' in #defined values/expressions
  uwb: Fix misspelling of neighbourhood in comment
  net, netfilter: Remove redundant goto in ebt_ulog_packet
  trivial: don't touch files that are removed in the staging tree
  lib/vsprintf: replace link to Draft by final RFC number
  doc: Kconfig: `to be' -> `be'
  doc: Kconfig: Typo: square -> squared
  doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
  drivers/net: static should be at beginning of declaration
  drivers/media: static should be at beginning of declaration
  drivers/i2c: static should be at beginning of declaration
  XTENSA: static should be at beginning of declaration
  SH: static should be at beginning of declaration
  MIPS: static should be at beginning of declaration
  ARM: static should be at beginning of declaration
  rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
  Update my e-mail address
  PCIe ASPM: forcedly -> forcibly
  gma500: push through device driver tree
  ...

Fix up trivial conflicts:
 - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
 - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
 - drivers/net/r8169.c (just context changes)
2011-07-25 13:56:39 -07:00
Linus Torvalds
5fabc487c9 Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
  KVM: IOMMU: Disable device assignment without interrupt remapping
  KVM: MMU: trace mmio page fault
  KVM: MMU: mmio page fault support
  KVM: MMU: reorganize struct kvm_shadow_walk_iterator
  KVM: MMU: lockless walking shadow page table
  KVM: MMU: do not need atomicly to set/clear spte
  KVM: MMU: introduce the rules to modify shadow page table
  KVM: MMU: abstract some functions to handle fault pfn
  KVM: MMU: filter out the mmio pfn from the fault pfn
  KVM: MMU: remove bypass_guest_pf
  KVM: MMU: split kvm_mmu_free_page
  KVM: MMU: count used shadow pages on prepareing path
  KVM: MMU: rename 'pt_write' to 'emulate'
  KVM: MMU: cleanup for FNAME(fetch)
  KVM: MMU: optimize to handle dirty bit
  KVM: MMU: cache mmio info on page fault path
  KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code
  KVM: MMU: do not update slot bitmap if spte is nonpresent
  KVM: MMU: fix walking shadow page table
  KVM guest: KVM Steal time registration
  ...
2011-07-24 09:07:03 -07:00
Xiao Guangrong
4f0226482d KVM: MMU: trace mmio page fault
Add tracepoints to trace mmio page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:41 +03:00