Commit graph

15750 commits

Author SHA1 Message Date
Matt Fleming
9ca8f72a92 x86, efi: Handover Protocol
As things currently stand, traditional EFI boot loaders and the EFI
boot stub are carrying essentially the same initialisation code
required to setup an EFI machine for booting a kernel. There's really
no need to have this code in two places and the hope is that, with
this new protocol, initialisation and booting of the kernel can be
left solely to the kernel's EFI boot stub. The responsibilities of the
boot loader then become,

   o Loading the kernel image from boot media

File system code still needs to be carried by boot loaders for the
scenario where the kernel and initrd files reside on a file system
that the EFI firmware doesn't natively understand, such as ext4, etc.

   o Providing a user interface

Boot loaders still need to display any menus/interfaces, for example
to allow the user to select from a list of kernels.

Bump the boot protocol number because we added the 'handover_offset'
field to indicate the location of the handover protocol entry point.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Jones <pjones@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Acked-and-Tested-by: Matthew Garrett <mjg@redhat.com>
Link: http://lkml.kernel.org/r/1342689828-16815-1-git-send-email-matt@console-pimps.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-07-20 16:18:58 -07:00
Alex Shi
7efa1c8796 x86/tlb: Fix build warning and crash when building for !SMP
The incompatible parameter of flush_tlb_mm_range cause build warning.
Fix it by correct parameter.

Ingo Molnar found that this could also cause a user space crash.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1342747103-19765-1-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-07-20 15:01:48 -07:00
H. Peter Anvin
30d5c4546a x86, cpufeature: Add the RDSEED and ADX features
Add the RDSEED and ADX features documented in section 9.1 of the Intel
Architecture Instruction Set Extensions Programming Reference,
document 319433, version 013b, available from
http://software.intel.com/en-us/avx/

The PREFETCHW bit is already supported in Linux under the name
3DNOWPREFETCH.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-lgr6482ufk1bvxzvc2hr8qbp@git.kernel.org
2012-07-20 13:36:41 -07:00
Michael S. Tsirkin
1a577b7247 KVM: fix race with level interrupts
When more than 1 source id is in use for the same GSI, we have the
following race related to handling irq_states race:

CPU 0 clears bit 0. CPU 0 read irq_state as 0. CPU 1 sets level to 1.
CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0).
Now ioapic thinks the level is 0 but irq_state is not 0.

Fix by performing all irq_states bitmap handling under pic/ioapic lock.
This also removes the need for atomics with irq_states handling.

Reported-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-20 16:12:00 -03:00
Geert Uytterhoeven
2e76c2838a module.c: spelling s/postition/position/g
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-20 10:38:35 +02:00
zhenzhong.duan
c3d93f8801 xen: populate correct number of pages when across mem boundary (v2)
When populate pages across a mem boundary at bootup, the page count
populated isn't correct. This is due to mem populated to non-mem
region and ignored.

Pfn range is also wrongly aligned when mem boundary isn't page aligned.

For a dom0 booted with dom_mem=3368952K(0xcd9ff000-4k) dmesg diff is:
 [    0.000000] Freeing 9e-100 pfn range: 98 pages freed
 [    0.000000] 1-1 mapping on 9e->100
 [    0.000000] 1-1 mapping on cd9ff->100000
 [    0.000000] Released 98 pages of unused memory
 [    0.000000] Set 206435 page(s) to 1-1 mapping
-[    0.000000] Populating cd9fe-cda00 pfn range: 1 pages added
+[    0.000000] Populating cd9fe-cd9ff pfn range: 1 pages added
+[    0.000000] Populating 100000-100061 pfn range: 97 pages added
 [    0.000000] BIOS-provided physical RAM map:
 [    0.000000] Xen: 0000000000000000 - 000000000009e000 (usable)
 [    0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
 [    0.000000] Xen: 0000000000100000 - 00000000cd9ff000 (usable)
 [    0.000000] Xen: 00000000cd9ffc00 - 00000000cda53c00 (ACPI NVS)
...
 [    0.000000] Xen: 0000000100000000 - 0000000100061000 (usable)
 [    0.000000] Xen: 0000000100061000 - 000000012c000000 (unusable)
...
 [    0.000000] MEMBLOCK configuration:
...
-[    0.000000]  reserved[0x4]       [0x000000cd9ff000-0x000000cd9ffbff], 0xc00 bytes
-[    0.000000]  reserved[0x5]       [0x00000100000000-0x00000100060fff], 0x61000 bytes

Related xen memory layout:
(XEN) Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009ec00 (usable)
(XEN)  00000000000f0000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 00000000cd9ffc00 (usable)

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
[v2: If xen_do_chunk fail(populate), abort this chunk and any others]
Suggested by David, thanks.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:52:06 -04:00
Olaf Hering
00e37bdb01 xen PVonHVM: move shared_info to MMIO before kexec
Currently kexec in a PVonHVM guest fails with a triple fault because the
new kernel overwrites the shared info page. The exact failure depends on
the size of the kernel image. This patch moves the pfn from RAM into
MMIO space before the kexec boot.

The pfn containing the shared_info is located somewhere in RAM. This
will cause trouble if the current kernel is doing a kexec boot into a
new kernel. The new kernel (and its startup code) can not know where the
pfn is, so it can not reserve the page. The hypervisor will continue to
update the pfn, and as a result memory corruption occours in the new
kernel.

One way to work around this issue is to allocate a page in the
xen-platform pci device's BAR memory range. But pci init is done very
late and the shared_info page is already in use very early to read the
pvclock. So moving the pfn from RAM to MMIO is racy because some code
paths on other vcpus could access the pfn during the small   window when
the old pfn is moved to the new pfn. There is even a  small window were
the old pfn is not backed by a mfn, and during that time all reads
return -1.

Because it is not known upfront where the MMIO region is located it can
not be used right from the start in xen_hvm_init_shared_info.

To minimise trouble the move of the pfn is done shortly before kexec.
This does not eliminate the race because all vcpus are still online when
the syscore_ops will be called. But hopefully there is no work pending
at this point in time. Also the syscore_op is run last which reduces the
risk further.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:52:05 -04:00
Olaf Hering
4ff2d06255 xen: simplify init_hvm_pv_info
init_hvm_pv_info is called only in PVonHVM context, move it into ifdef.
init_hvm_pv_info does not fail, make it a void function.
remove arguments from init_hvm_pv_info because they are not used by the
caller.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:52:04 -04:00
Olaf Hering
4648da7cb4 xen: remove cast from HYPERVISOR_shared_info assignment
Both have type struct shared_info so no cast is needed.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:52:03 -04:00
David Vrabel
1c32cdc633 xen/x86: avoid updating TLS descriptors if they haven't changed
When switching tasks in a Xen PV guest, avoid updating the TLS
descriptors if they haven't changed.  This improves the speed of
context switches by almost 10% as much of the time the descriptors are
the same or only one is different.

The descriptors written into the GDT by Xen are modified from the
values passed in the update_descriptor hypercall so we keep shadow
copies of the three TLS descriptors to compare against.

lmbench3 test     Before  After  Improvement
--------------------------------------------
lat_ctx -s 32 24   7.19    6.52  9%
lat_pipe          12.56   11.66  7%

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:51:57 -04:00
David Vrabel
59290362da xen/x86: add desc_equal() to compare GDT descriptors
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
[v1: Moving it to the Xen file]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:51:45 -04:00
David Vrabel
66a27dde9a xen/mm: zero PTEs for non-present MFNs in the initial page table
When constructing the initial page tables, if the MFN for a usable PFN
is missing in the p2m then that frame is initially ballooned out.  In
this case, zero the PTE (as in decrease_reservation() in
drivers/xen/balloon.c).

This is obviously safe instead of having an valid PTE with an MFN of
INVALID_P2M_ENTRY (~0).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:51:44 -04:00
David Vrabel
d095d43e78 xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable
In xen_set_pte() if batching is unavailable (because the caller is in
an interrupt context such as handling a page fault) it would fall back
to using native_set_pte() and trapping and emulating the PTE write.

On 32-bit guests this requires two traps for each PTE write (one for
each dword of the PTE).  Instead, do one mmu_update hypercall
directly.

During construction of the initial page tables, continue to use
native_set_pte() because most of the PTEs being set are in writable
and unpinned pages (see phys_pmd_init() in arch/x86/mm/init_64.c) and
using a hypercall for this is very expensive.

This significantly improves page fault performance in 32-bit PV
guests.

lmbench3 test  Before    After     Improvement
----------------------------------------------
lat_pagefault  3.18 us   2.32 us   27%
lat_proc fork  356 us    313.3 us  11%

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:51:43 -04:00
Liu, Jinsong
05e36006ad xen/mce: Register native mce handler as vMCE bounce back point
When Xen hypervisor inject vMCE to guest, use native mce handler
to handle it

Signed-off-by: Ke, Liping <liping.ke@intel.com>
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:51:38 -04:00
Liu, Jinsong
a8fccdb061 x86, MCE, AMD: Adjust initcall sequence for xen
there are 3 funcs which need to be _initcalled in a logic sequence:
1. xen_late_init_mcelog
2. mcheck_init_device
3. threshold_init_device

xen_late_init_mcelog must register xen_mce_chrdev_device before
native mce_chrdev_device registration if running under xen platform;

mcheck_init_device should be inited before threshold_init_device to
initialize mce_device, otherwise a a NULL ptr dereference will cause panic.

so we use following _initcalls
1. device_initcall(xen_late_init_mcelog);
2. device_initcall_sync(mcheck_init_device);
3. late_initcall(threshold_init_device);

when running under xen, the initcall order is 1,2,3;
on baremetal, we skip 1 and we do only 2 and 3.

Acked-and-tested-by: Borislav Petkov <bp@amd64.org>
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:51:37 -04:00
Liu, Jinsong
cef12ee52b xen/mce: Add mcelog support for Xen platform
When MCA error occurs, it would be handled by Xen hypervisor first,
and then the error information would be sent to initial domain for logging.

This patch gets error information from Xen hypervisor and convert
Xen format error into Linux format mcelog. This logic is basically
self-contained, not touching other kernel components.

By using tools like mcelog tool users could read specific error information,
like what they did under native Linux.

To test follow directions outlined in Documentation/acpi/apei/einj.txt

Acked-and-tested-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Ke, Liping <liping.ke@intel.com>
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-19 15:51:36 -04:00
David S. Miller
abaa72d7fd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
2012-07-19 11:17:30 -07:00
Rafael J. Wysocki
6148d38b37 Merge branch 'pm-acpi'
* pm-acpi: (24 commits)
  olpc-xo15-sci: Use struct dev_pm_ops for power management
  ACPI / PM: Drop PM callbacks from the ACPI bus type
  ACPI / PM: Drop legacy driver PM callbacks that are not used any more
  ACPI / PM: Do not execute legacy driver PM callbacks
  acpi_power_meter: Use struct dev_pm_ops for power management
  fujitsu-tablet: Use struct dev_pm_ops for power management
  classmate-laptop: Use struct dev_pm_ops for power management
  xo15-ebook: Use struct dev_pm_ops for power management
  toshiba_bluetooth: Use struct dev_pm_ops for power management
  panasonic-laptop: Use struct dev_pm_ops for power management
  sony-laptop: Use struct dev_pm_ops for power management
  hp_accel: Use struct dev_pm_ops for power management
  toshiba_acpi: Use struct dev_pm_ops for power management
  ACPI: Use struct dev_pm_ops for power management in the SBS driver
  ACPI: Use struct dev_pm_ops for power management in the power driver
  ACPI: Use struct dev_pm_ops for power management in the button driver
  ACPI: Use struct dev_pm_ops for power management in the battery driver
  ACPI: Use struct dev_pm_ops for power management in the AC driver
  ACPI: Use struct dev_pm_ops for power management in processor driver
  ACPI: Use struct dev_pm_ops for power management in the thermal driver
  ...
2012-07-19 00:03:35 +02:00
Avi Kivity
d63d3e6217 x86, hyper: fix build with !CONFIG_KVM_GUEST
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 17:01:48 -03:00
Ingo Molnar
a2fe194723 Merge branch 'linus' into perf/core
Pick up the latest ring-buffer fixes, before applying a new fix.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-18 11:17:17 +02:00
Michael S. Tsirkin
ebf7d2e993 Revert "apic: fix kvm build on UP without IOAPIC"
This reverts commit f9808b7fd4.
After commit 'kvm: switch to apic_set_eoi_write, apic_write'
the stubs are no longer needed as kvm does not look at apicdrivers anymore.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-16 12:51:56 +03:00
Michael S. Tsirkin
9053666406 KVM guest: switch to apic_set_eoi_write, apic_write
Use apic_set_eoi_write, apic_write to avoid meedling in core apic
driver data structures directly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-16 12:51:44 +03:00
Michael S. Tsirkin
1551df646d apic: add apic_set_eoi_write for PV use
KVM PV EOI optimization overrides eoi_write apic op with its own
version. Add an API for this to avoid meddling with core x86 apic driver
data structures directly.

For KVM use, we don't need any guarantees about when the switch to the
new op will take place, so it could in theory use this API after SMP init,
but it currently doesn't, and restricting callers to early init makes it
clear that it's safe as it won't race with actual APIC driver use.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-16 12:51:23 +03:00
Will Drewry
09d314425f vsyscall_64: add missing ifdef CONFIG_SECCOMP
vsyscall_seccomp introduced a dependency on __secure_computing.  On
configurations with CONFIG_SECCOMP disabled, compilation will fail.

Reported-by: feng xiangjun <fengxj325@gmail.com>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-14 12:01:36 -07:00
Will Drewry
5651721ede x86/vsyscall: allow seccomp filter in vsyscall=emulate
If a seccomp filter program is installed, older static binaries and
distributions with older libc implementations (glibc 2.13 and earlier)
that rely on vsyscall use will be terminated regardless of the filter
program policy when executing time, gettimeofday, or getcpu.  This is
only the case when vsyscall emulation is in use (vsyscall=emulate is the
default).

This patch emulates system call entry inside a vsyscall=emulate by
populating regs->ax and regs->orig_ax with the system call number prior
to calling into seccomp such that all seccomp-dependencies function
normally.  Additionally, system call return behavior is emulated in line
with other vsyscall entrypoints for the trace/trap cases.

[ v2: fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to luto@mit.edu) ]
Reported-and-tested-by: Owen Kibel <qmewlo@gmail.com>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-13 14:25:55 -07:00
Rafael J. Wysocki
18468843fa olpc-xo15-sci: Use struct dev_pm_ops for power management
Make the OLPC XO15 SCI driver define its resume callback through
a struct dev_pm_ops object rather than by using a legacy PM hook
in struct acpi_device_ops.

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
2012-07-12 22:36:28 +02:00
Mao, Junjie
ad756a1603 KVM: VMX: Implement PCID/INVPCID for guests with EPT
This patch handles PCID/INVPCID for guests.

Process-context identifiers (PCIDs) are a facility by which a logical processor
may cache information for multiple linear-address spaces so that the processor
may retain cached information when software switches to a different linear
address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
Volume 3A for details.

For guests with EPT, the PCID feature is enabled and INVPCID behaves as running
natively.
For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-12 13:07:34 +03:00
Ingo Molnar
bb65a764de Merge branch 'mce-ripvfix' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce
Merge memory fault handling fix from Tony Luck.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-11 22:37:48 +02:00
Tony Luck
6751ed65dc x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults
In commit dad1743e59 ("x86/mce: Only restart instruction after machine
check recovery if it is safe") we fixed mce_notify_process() to force a
signal to the current process if it was not restartable (RIPV bit not
set in MCG_STATUS). But doing it here means that the process doesn't
get told the virtual address of the fault via siginfo_t->si_addr. This
would prevent application level recovery from the fault.

Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
that we will provide the right information with the signal.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org    # 3.4+
2012-07-11 10:20:47 -07:00
Prarit Bhargava
fc73373b33 KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
While debugging I noticed that unlike all the other hypervisor code in the
kernel, kvm does not have an entry for x86_hyper which is used in
detect_hypervisor_platform() which results in a nice printk in the
syslog.  This is only really a stub function but it
does make kvm more consistent with the other hypervisors.


Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marcelo Tostatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 19:33:32 +03:00
Xiao Guangrong
6fbc277053 KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
The P bit of page fault error code is missed in this tracepoint, fix it by
passing the full error code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:22 +03:00
Xiao Guangrong
a72faf2504 KVM: MMU: trace fast page fault
To see what happen on this path and help us to optimize it

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:21 +03:00
Xiao Guangrong
c7ba5b48cc KVM: MMU: fast path of handling guest page fault
If the the present bit of page fault error code is set, it indicates
the shadow page is populated on all levels, it means what we do is
only modify the access bit which can be done out of mmu-lock

Currently, in order to simplify the code, we only fix the page fault
caused by write-protect on the fast path

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:20 +03:00
Xiao Guangrong
49fde3406f KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
This bit indicates whether the spte can be writable on MMU, that means
the corresponding gpte is writable and the corresponding gfn is not
protected by shadow page protection

In the later path, SPTE_MMU_WRITEABLE will indicates whether the spte
can be locklessly updated

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:19 +03:00
Xiao Guangrong
6e7d035407 KVM: MMU: fold tlb flush judgement into mmu_spte_update
mmu_spte_update() is the common function, we can easily audit the path

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:18 +03:00
Xiao Guangrong
4f5982a56a KVM: VMX: export PFEC.P bit on ept
Export the present bit of page fault error code, the later patch
will use it

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:17 +03:00
Xiao Guangrong
8e22f955fb KVM: MMU: cleanup spte_write_protect
Use __drop_large_spte to cleanup this function and comment spte_write_protect

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:16 +03:00
Xiao Guangrong
d13bc5b5a1 KVM: MMU: abstract spte write-protect
Introduce a common function to abstract spte write-protect to
cleanup the code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:14 +03:00
Xiao Guangrong
2f84569f97 KVM: MMU: return bool in __rmap_write_protect
The reture value of __rmap_write_protect is either 1 or 0, use
true/false instead of these

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:13 +03:00
Ingo Molnar
92254d3144 Linux 3.5-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJP+NMmAAoJEHm+PkMAQRiGPxEH/18YQN8FAzEIjcC10ytA3RC3
 KzPv31jXgJGZDy1UqmpKtJ7GDwb92AhqZxVnJimMa+6d1uA8NsZQq5EMOPPiX8Qi
 8P4AEaw5kSMmR/6zxsxguCGdbDLU3xZ1nJZkHyMgjo2UJbMU0jBPneb/79heWPhe
 0HOkLzN5VA6Yx3Nt70sWQ1zsuj0Ji5jCGO0iNTCBmTiv4J9ZlOx3xJQn4aK6JscO
 /3QRTM43GG0j6zToEOCTHrn8ajOq6rHQQkG0bPVR723nFrSGLoaCT6QVBXYug+AZ
 9Xay7zVNvrq2oH5x5jADG2t2vyaG+nEJpSrVjXznzxgDnK7tWjYqiuG5zqKhAq8=
 =IMfr
 -----END PGP SIGNATURE-----

Merge tag 'v3.5-rc6' into x86/mce

Merge Linux 3.5-rc6 before merging more code.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-11 09:41:37 +02:00
Johannes Goetzfried
a43478863b crypto: twofish-avx - remove useless instruction
The register %rdx is written, but never read till the end of the encryption
routine. Therefore let's delete the useless instruction.

Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-07-11 11:08:30 +08:00
Milan Broz
bf084d8f6e crypto: aesni-intel - fix wrong kfree pointer
kfree(new_key_mem) in rfc4106_set_key() should be called on malloced pointer,
not on aligned one, otherwise it can cause invalid pointer on free.

(Seen at least once when running tcrypt tests with debug kernel.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-07-11 11:06:13 +08:00
Jan Beulich
a7101d1526 x86/mm/mtrr: Slightly simplify print_mtrr_state()
high_width can be easily calculated in a single expression when
making use of __ffs64().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/4FF71053020000780008E1B5@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-10 10:38:15 +02:00
Jan Beulich
1ba9a29414 x86/mm/mtrr: Fix alignment determination in range_to_mtrr()
With the variable operated on being of "unsigned long" type,
neither ffs() nor fls() are suitable to use on them, as those
truncate their arguments to 32 bits. Using __ffs() and __fls()
respectively at once eliminates the need to subtract 1 from their
results.

Additionally, with the alignment value subsequently used as a
shift count, it must be enforced to be less than BITS_PER_LONG
(and on 64-bit there's no need for it to be any smaller).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/4FF70D54020000780008E179@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-10 10:38:14 +02:00
Avi Kivity
a27685c33a KVM: VMX: Emulate invalid guest state by default
Our emulation should be complete enough that we can emulate guests
while they are in big real mode, or in a mode transition that is not
virtualizable without unrestricted guest support.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:05 +03:00
Avi Kivity
8089000616 KVM: x86 emulator: implement LTR
Opcode 0F 00 /3.  Encountered during Windows XP secondary processor bringup.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:05 +03:00
Avi Kivity
869be99c75 KVM: x86 emulator: make loading TR set the busy bit
Guest software doesn't actually depend on it, but vmx will refuse us
entry if we don't.  Set the bit in both the cached segment and memory,
just to be nice.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:05 +03:00
Avi Kivity
e919464b53 KVM: x86 emulator: make read_segment_descriptor() return the address
Some operations want to modify the descriptor later on, so save the
address for future use.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:04 +03:00
Avi Kivity
a14e579f22 KVM: x86 emulator: emulate LLDT
Opcode 0F 00 /2. Used by isolinux durign the protected mode transition.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:04 +03:00
Avi Kivity
9299836e63 KVM: x86 emulator: emulate BSWAP
Opcodes 0F C8 - 0F CF.

Used by the SeaBIOS cdrom code (though not in big real mode).

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:04 +03:00
Avi Kivity
de5f70e0c6 KVM: VMX: Improve error reporting during invalid guest state emulation
If instruction emulation fails, report it properly to userspace.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:04 +03:00
Avi Kivity
de87dcddc7 KVM: VMX: Stop invalid guest state emulation on pending event
Process the event, possibly injecting an interrupt, before continuing.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:04 +03:00
Avi Kivity
612e89f015 KVM: x86 emulator: implement ENTER
Opcode C8.

Only ENTER with lexical nesting depth 0 is implemented, since others are
very rare.  We'll fail emulation if nonzero lexical depth is used so data
is not corrupted.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:03 +03:00
Avi Kivity
51ddff50cb KVM: x86 emulator: split push logic from push opcode emulation
This allows us to reuse the code without populating ctxt->src and
overriding ctxt->op_bytes.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:03 +03:00
Avi Kivity
361cad2b50 KVM: x86 emulator: fix byte-sized MOVZX/MOVSX
Commit 2adb5ad9fe removed ByteOp from MOVZX/MOVSX, replacing them by
SrcMem8, but neglected to fix the dependency in the emulation code
on ByteOp.  This caused the instruction not to have any effect in
some circumstances.

Fix by replacing the check for ByteOp with the equivalent src.op_bytes == 1.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:03 +03:00
Avi Kivity
2dd7caa092 KVM: x86 emulator: emulate LAHF
Opcode 9F.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:03 +03:00
Avi Kivity
7c068e4558 KVM: VMX: Continue emulating after batch exhausted
If we return early from an invalid guest state emulation loop, make
sure we return to it later if the guest state is still invalid.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:03 +03:00
Avi Kivity
bdea48e305 KVM: VMX: Fix interrupt exit condition during emulation
Checking EFLAGS.IF is incorrect as we might be in interrupt shadow.  If
that is the case, the main loop will notice that and not inject the interrupt,
causing an endless loop.

Fix by using vmx_interrupt_allowed() to check if we can inject an interrupt
instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:02 +03:00
Avi Kivity
96051572c8 KVM: x86 emulator: emulate SGDT/SIDT
Opcodes 0F 01 /0 and 0F 01 /1

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:02 +03:00
Avi Kivity
a6e3407bb1 KVM: Fix SS default ESP/EBP based addressing
We correctly default to SS when BP is used as a base in 16-bit address mode,
but we don't do that for 32-bit mode.

Fix by adjusting the default to SS when either ESP or EBP is used as the base
register.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:02 +03:00
Avi Kivity
cbd27ee783 KVM: x86 emulator: initialize memop
memop is not initialized; this can lead to a two-byte operation
following a 4-byte operation to see garbage values.  Usually
truncation fixes things fot us later on, but at least in one case
(call abs) it doesn't.

Fix by moving memop to the auto-initialized field area.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:02 +03:00
Avi Kivity
f47cfa3174 KVM: x86 emulator: emulate LEAVE
Opcode c9; used by some variants of Windows during boot, in big real mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:01 +03:00
Avi Kivity
b8405c184b KVM: VMX: Limit iterations with emulator_invalid_guest_state
Otherwise, if the guest ends up looping, we never exit the srcu critical
section, which causes synchronize_srcu() to hang.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:01 +03:00
Avi Kivity
f0495f9b99 KVM: VMX: Relax check on unusable segment
Some userspace (e.g. QEMU 1.1) munge the d and g bits of segment
descriptors, causing us not to recognize them as unusable segments
with emulate_invalid_guest_state=1.  Relax the check by testing for
segment not present (a non-present segment cannot be usable).

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:01 +03:00
Avi Kivity
510425ff33 KVM: x86 emulator: fix LIDT/LGDT in long mode
The operand size for these instructions is 8 bytes in long mode, even without
a REX prefix.  Set it explicitly.

Triggered while booting Linux with emulate_invalid_guest_state=1.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:01 +03:00
Avi Kivity
79d5b4c3cd KVM: x86 emulator: allow loading null SS in long mode
Null SS is valid in long mode; allow loading it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:01 +03:00
Avi Kivity
6d6eede4a0 KVM: x86 emulator: emulate cpuid
Opcode 0F A2.

Used by Linux during the mode change trampoline while in a state that is
not virtualizable on vmx without unrestricted_guest, so we need to emulate
it is emulate_invalid_guest_state=1.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:00 +03:00
Avi Kivity
0017f93a27 KVM: x86 emulator: change ->get_cpuid() accessor to use the x86 semantics
Instead of getting an exact leaf, follow the spec and fall back to the last
main leaf instead.  This lets us easily emulate the cpuid instruction in the
emulator.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:00 +03:00
Avi Kivity
62046e5a86 KVM: Split cpuid register access from computation
Introduce kvm_cpuid() to perform the leaf limit check and calculate
register values, and let kvm_emulate_cpuid() just handle reading and
writing the registers from/to the vcpu.  This allows us to reuse
kvm_cpuid() in a context where directly reading and writing registers
is not desired.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:00 +03:00
Avi Kivity
d881e6f6cf KVM: VMX: Return correct CPL during transition to protected mode
In protected mode, the CPL is defined as the lower two bits of CS, as set by
the last far jump.  But during the transition to protected mode, there is no
last far jump, so we need to return zero (the inherited real mode CPL).

Fix by reading CPL from the cache during the transition.  This isn't 100%
correct since we don't set the CPL cache on a far jump, but since protected
mode transition will always jump to a segment with RPL=0, it will always
work.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:00 +03:00
Avi Kivity
e676505ac9 KVM: MMU: Force cr3 reload with two dimensional paging on mov cr3 emulation
Currently the MMU's ->new_cr3() callback does nothing when guest paging
is disabled or when two-dimentional paging (e.g. EPT on Intel) is active.
This means that an emulated write to cr3 can be lost; kvm_set_cr3() will
write vcpu-arch.cr3, but the GUEST_CR3 field in the VMCS will retain its
old value and this is what the guest sees.

This bug did not have any effect until now because:
- with unrestricted guest, or with svm, we never emulate a mov cr3 instruction
- without unrestricted guest, and with paging enabled, we also never emulate a
  mov cr3 instruction
- without unrestricted guest, but with paging disabled, the guest's cr3 is
  ignored until the guest enables paging; at this point the value from arch.cr3
  is loaded correctly my the mov cr0 instruction which turns on paging

However, the patchset that enables big real mode causes us to emulate mov cr3
instructions in protected mode sometimes (when guest state is not virtualizable
by vmx); this mov cr3 is effectively ignored and will crash the guest.

The fix is to make nonpaging_new_cr3() call mmu_free_roots() to force a cr3
reload.  This is awkward because now all the new_cr3 callbacks to the same
thing, and because mmu_free_roots() is somewhat of an overkill; but fixing
that is more complicated and will be done after this minimal fix.

Observed in the Window XP 32-bit installer while bringing up secondary vcpus.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:18:59 +03:00
Pekka Enberg
c3b7cdf180 perf/x86: Fix intel_perfmon_event_mapformatting
Use tabs for "intel_perfmon_event_map" formatting in
perf_event_intel.c.

Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1341568786-7045-1-git-send-email-penberg@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-06 13:16:15 +02:00
Ingo Molnar
35c2f48c66 Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
Pull tracing updates from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-06 11:12:17 +02:00
Suresh Siddha
d872818dbb x86/apic/x2apic: Use multiple cluster members for the irq destination only with the explicit affinity
During boot or driver load etc, interrupt destination is setup
using default target cpu's. Later the user (irqbalance etc) or
the driver (irq_set_affinity/ irq_set_affinity_hint) can request
the interrupt to be migrated to some specific set of cpu's.

In the x2apic cluster routing, for the default scenario use
single cpu as the interrupt destination and when there is an
explicit interrupt affinity request, route the interrupt to
multiple members of a x2apic cluster specified in the cpumask of
the migration request.

This will minmize the vector pressure when there are lot of
interrupt sources and relatively few x2apic clusters (for
example a single socket server). This will allow the performance
critical interrupts to be routed to multiple cpu's in the x2apic
cluster (irqbalance for example uses the cache siblings etc
while specifying the interrupt destination) and allow
non-critical interrupts to be serviced by a single logical cpu.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/1340656709-11423-4-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-06 11:00:23 +02:00
Suresh Siddha
1ac322d0b1 x86/apic/x2apic: Limit the vector reservation to the user specified mask
For the x2apic cluster mode, vector for an interrupt is
currently reserved on all the cpu's that are part of the x2apic
cluster. But the interrupts will be routed only to the cluster
(derived from the first cpu in the mask) members specified in
the mask. So there is no need to reserve the vector in the
unused cluster members.

Modify __assign_irq_vector() to reserve the vectors based on the
user specified irq destination mask. If the new mask is a proper
subset of the currently used mask, cleanup the vector allocation
on the unused cpu members.

Also, allow the apic driver to tune the vector domain based on
the affinity mask (which in most cases is the user-specified
mask).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/1340656709-11423-3-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-06 11:00:22 +02:00
Suresh Siddha
b39f25a849 x86/apic: Optimize cpu traversal in __assign_irq_vector() using domain membership
Currently __assign_irq_vector() goes through each cpu in the
specified mask until it finds a free vector in all the cpu's
that are part of the same interrupt domain. We visit all the
interrupt domain sibling cpus to reserve the free vector. So,
when we fail to find a free vector in an interrupt domain, it is
safe to continue our search with a cpu belonging to a new
interrupt domain. No need to go through each cpu, if the domain
containing that cpu is already visited.

Use the irq_cfg's old_domain to track the visited domains and
optimize the cpu traversal while finding a free vector in the
given cpumask.

NOTE: We can also optimize the search by using for_each_cpu() and
skip the current cpu, if it is not the first cpu in the mask
returned by the vector_allocation_domain(). But re-using the
cfg->old_domain to track the visited domains will be slightly
faster.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/1340656709-11423-2-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-06 11:00:21 +02:00
Bjorn Helgaas
85a00dd391 Merge branch 'pci/myron-pcibios_setup' into next
* pci/myron-pcibios_setup:
  xtensa/PCI: factor out pcibios_setup()
  x86/PCI: adjust section annotations for pcibios_setup()
  unicore32/PCI: adjust section annotations for pcibios_setup()
  tile/PCI: factor out pcibios_setup()
  sparc/PCI: factor out pcibios_setup()
  sh/PCI: adjust section annotations for pcibios_setup()
  sh/PCI: factor out pcibios_setup()
  powerpc/PCI: factor out pcibios_setup()
  parisc/PCI: factor out pcibios_setup()
  MIPS/PCI: adjust section annotations for pcibios_setup()
  MIPS/PCI: factor out pcibios_setup()
  microblaze/PCI: factor out pcibios_setup()
  ia64/PCI: factor out pcibios_setup()
  cris/PCI: factor out pcibios_setup()
  alpha/PCI: factor out pcibios_setup()
  PCI: pull pcibios_setup() up into core
2012-07-05 15:31:05 -06:00
Myron Stowe
15fa325beb x86/PCI: adjust section annotations for pcibios_setup()
Make pcibios_setup() consistently use the "__init" section annotation.

Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-07-05 15:09:14 -06:00
Yan, Zheng
6a67943a18 perf/x86: Uncore filter support for SandyBridge-EP
This patch adds C-Box and PCU filter support for SandyBridge-EP
uncore. We can filter C-Box events by thread/core ID and filter
PCU events by frequency/voltage.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341381616-12229-5-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:56:01 +02:00
Yan, Zheng
4208969724 perf/x86: Detect number of instances of uncore CBox
The CBox manages the interface between the core and the LLC, so
the instances of uncore CBox is equal to number of cores.

Reported-by: Andrew Cooks <acooks@gmail.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341381616-12229-4-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:56:00 +02:00
Yan, Zheng
3b19e4c98c perf/x86: Fix event constraint for SandyBridge-EP C-Box
The constraint for C-Box event 0x1f should have overlap flag set.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340866596-22502-2-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:55:59 +02:00
Yan, Zheng
eca26c9950 perf/x86: Use 0xff as pseudo code for fixed uncore event
Stephane Eranian suggestted using 0xff as pseudo code for fixed
uncore event and using the umask value to determine which of the
fixed events we want to map to. So far there is at most one fixed
counter in a uncore PMU. So just change the definition of
UNCORE_FIXED_EVENT to 0xff.

Suggested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340780953-21130-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:55:58 +02:00
Peter Zijlstra
3e0091e2b6 perf/x86: Save a few bytes in 'struct x86_pmu'
All these are basically boolean flags, use a bitfield to save a few
bytes.

Suggested-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-vsevd5g8lhcn129n3s7trl7r@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:55:58 +02:00
Peter Zijlstra
c93dc84cbe perf/x86: Add a microcode revision check for SNB-PEBS
Recent Intel microcode resolved the SNB-PEBS issues, so conditionally
enable PEBS on SNB hardware depending on the microcode revision.

Thanks to Stephane for figuring out the various microcode revisions.

Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-v3672ziwh9damwqwh1uz3krm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:55:57 +02:00
Robert Richter
f285f92f7e perf/x86: Improve debug output in check_hw_exists()
It might be of interest which perfctr msr failed.

Signed-off-by: Robert Richter <robert.richter@amd.com>
[ added hunk to avoid GCC warn ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-5-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:19:42 +02:00
Robert Richter
b1dc3c4820 perf/x86/amd: Unify AMD's generic and family 15h pmus
There is no need for keeping separate pmu structs. We can enable
amd_{get,put}_event_constraints() functions also for family 15h event.

The advantage is that there is only a single pmu struct for all AMD
cpus. This patch introduces functions to setup the pmu to enabe core
performance counters or counter constraints.

Also, cpuid checks are used instead of family checks where
possible. Thus, it enables the code independently of cpu families if
the feature flag is set.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-4-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:19:41 +02:00
Robert Richter
a1eac7ac90 perf/x86: Move Intel specific code to intel_pmu_init()
There is some Intel specific code in the generic x86 path. Move it to
intel_pmu_init().

Since p4 and p6 pmus don't have fixed counters we may skip the check
in case such a pmu is detected.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:19:40 +02:00
Robert Richter
15c7ad51ad perf/x86: Rename Intel specific macros
There are macros that are Intel specific and not x86 generic. Rename
them into INTEL_*.

This patch removes X86_PMC_IDX_GENERIC and does:

 $ sed -i -e 's/X86_PMC_MAX_/INTEL_PMC_MAX_/g'           \
         arch/x86/include/asm/kvm_host.h                 \
         arch/x86/include/asm/perf_event.h               \
         arch/x86/kernel/cpu/perf_event.c                \
         arch/x86/kernel/cpu/perf_event_p4.c             \
         arch/x86/kvm/pmu.c
 $ sed -i -e 's/X86_PMC_IDX_FIXED/INTEL_PMC_IDX_FIXED/g' \
         arch/x86/include/asm/perf_event.h               \
         arch/x86/kernel/cpu/perf_event.c                \
         arch/x86/kernel/cpu/perf_event_intel.c          \
         arch/x86/kernel/cpu/perf_event_intel_ds.c       \
         arch/x86/kvm/pmu.c
 $ sed -i -e 's/X86_PMC_MSK_/INTEL_PMC_MSK_/g'           \
         arch/x86/include/asm/perf_event.h               \
         arch/x86/kernel/cpu/perf_event.c

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:19:39 +02:00
Ingo Molnar
1070505d18 Merge branch 'x86/microcode' into perf/core
Merge this branch because we want to rely on the newer (and saner)
microcode loading and checking facilities.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:13:57 +02:00
Ingo Molnar
b0338e99b2 Merge branch 'x86/cpu' into perf/core
Merge this branch because we changed the wrmsr*_safe() API and there's
a conflict.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:12:11 +02:00
Ingo Molnar
90574ebb7e Merge branch 'perf/urgent' into perf/core
Merge this branch to pick up a fixlet and to update to a more recent base.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 21:10:23 +02:00
Peter Zijlstra
ce5c1fe9a9 perf/x86: Fix USER/KERNEL tagging of samples
Several perf interrupt handlers (PEBS,IBS,BTS) re-write regs->ip but
do not update the segment registers. So use an regs->ip based test
instead of an regs->cs/regs->flags based test.

Reported-and-tested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-xxrt0a1zronm1sm36obwc2vy@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-05 20:59:07 +02:00
David S. Miller
c90a9bb907 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-07-05 03:44:25 -07:00
Xiao Guangrong
85b7059169 KVM: MMU: fix shrinking page from the empty mmu
Fix:

 [ 3190.059226] BUG: unable to handle kernel NULL pointer dereference at           (null)
 [ 3190.062224] IP: [<ffffffffa02aac66>] mmu_page_zap_pte+0x10/0xa7 [kvm]
 [ 3190.063760] PGD 104f50067 PUD 112bea067 PMD 0
 [ 3190.065309] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
 [ 3190.066860] CPU 1
[ ...... ]
 [ 3190.109629] Call Trace:
 [ 3190.111342]  [<ffffffffa02aada6>] kvm_mmu_prepare_zap_page+0xa9/0x1fc [kvm]
 [ 3190.113091]  [<ffffffffa02ab2f5>] mmu_shrink+0x11f/0x1f3 [kvm]
 [ 3190.114844]  [<ffffffffa02ab25d>] ? mmu_shrink+0x87/0x1f3 [kvm]
 [ 3190.116598]  [<ffffffff81150c9d>] ? prune_super+0x142/0x154
 [ 3190.118333]  [<ffffffff8110a4f4>] ? shrink_slab+0x39/0x31e
 [ 3190.120043]  [<ffffffff8110a687>] shrink_slab+0x1cc/0x31e
 [ 3190.121718]  [<ffffffff8110ca1d>] do_try_to_free_pages

This is caused by shrinking page from the empty mmu, although we have
checked n_used_mmu_pages, it is useless since the check is out of mmu-lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-03 17:31:50 -03:00
Guo Chao
2106a54812 KVM: VMX: code clean for vmx_init()
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-03 14:55:30 -03:00
Michael S. Tsirkin
f9808b7fd4 apic: fix kvm build on UP without IOAPIC
On UP i386, when APIC is disabled
# CONFIG_X86_UP_APIC is not set
# CONFIG_PCI_IOAPIC is not set

code looking at apicdrivers never has any effect but it
still gets compiled in. In particular, this causes
build failures with kvm, but it generally bloats the kernel
unnecessarily.

Fix by defining both __apicdrivers and __apicdrivers_end
to be NULL when CONFIG_X86_LOCAL_APIC is unset: I verified
that as the result any loop scanning __apicdrivers gets optimized out by
the compiler.

Warning: a .config with apic disabled doesn't seem to boot
for me (even without this patch). Still verifying why,
meanwhile this patch is compile-tested only.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-03 14:55:29 -03:00
Borislav Petkov
3d8986bc7f x86, microcode: Make reload interface per system
The reload interface should be per-system so that a full system ucode
reload happens (on each core) when doing

echo 1 > /sys/devices/system/cpu/microcode/reload

Move it to the cpu subsys directory instead of it being per-cpu.

Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1340280437-7718-3-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-07-01 10:24:09 -07:00
Borislav Petkov
c9fc3f778a x86, microcode: Sanitize per-cpu microcode reloading interface
Microcode reloading in a per-core manner is a very bad idea for both
major x86 vendors. And the thing is, we have such interface with which
we can end up with different microcode versions applied on different
cores of an otherwise homogeneous wrt (family,model,stepping) system.

So turn off the possibility of doing that per core and allow it only
system-wide.

This is a minimal fix which we'd like to see in stable too thus the
more-or-less arbitrary decision to allow system-wide reloading only on
the BSP:

$ echo 1 > /sys/devices/system/cpu/cpu0/microcode/reload
...

and disable the interface on the other cores:

$ echo 1 > /sys/devices/system/cpu/cpu23/microcode/reload
-bash: echo: write error: Invalid argument

Also, allowing the reload only from one CPU (the BSP in
that case) doesn't allow the reload procedure to degenerate
into an O(n^2) deal when triggering reloads from all
/sys/devices/system/cpu/cpuX/microcode/reload sysfs nodes
simultaneously.

A more generic fix will follow.

Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1340280437-7718-2-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
2012-07-01 10:24:05 -07:00
Linus Torvalds
c76760926a Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull ACPI & Power Management patches from Len Brown.

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  acpi_pad: fix power_saving thread deadlock
  ACPI video: Still use ACPI backlight control if _DOS doesn't exist
  ACPI, APEI, Avoid too much error reporting in runtime
  ACPI: Add a quirk for "AMILO PRO V2030" to ignore the timer overriding
  ACPI: Remove one board specific WARN when ignoring timer overriding
  ACPI: Make acpi_skip_timer_override cover all source_irq==0 cases
  ACPI, x86: fix Dell M6600 ACPI reboot regression via DMI
  ACPI sysfs.c strlen fix
2012-06-30 11:11:58 -07:00
Len Brown
6eca954e25 Merge branches 'acpi_pad-bugzilla-42981', 'apei-bugzilla-43282', 'video-bugzilla-43168', 'bugzilla-40002' and 'bugfix-misc' into release
bug fixes
2012-06-30 00:53:50 -04:00