we already have summit and etc depends on genericarch,
so use genericarch only.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so don't punish all other cpus without that problem when init highmem
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
and make 32-bit resource registration more like 64 bit.
also move probe_roms back to setup_32.c
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because of the size limits of struct boot_params (zero page), the
maximum number of E820 memory map entries can be passed to kernel is
128. As pointed by Paul Jackson, there is some machine produced by SGI
with so many nodes that the number of E820 memory map entries is more
than 128. To enabling Linux kernel on these system, a new setup data
type named SETUP_E820_EXT is defined to pass additional memory map
entries to Linux kernel.
This patch is based on x86/auto-latest branch of git-x86 tree and has
been tested on x86_64 and i386 platform.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use early_node_map to init high pages, so we can remove page_is_ram() and
page_is_reserved_early() in the big loop with add_one_highpage
also remove page_is_reserved_early(), it is not needed anymore.
v2: fix the build of other platforms
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
rename update_memory_range to e820_update_range
rename add_memory_region to e820_add_region
to make it more clear that they are about e820 map operations.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so we don't get the same value multiple times.
also make mp_config_acpi_legacy_irqs more readable by moving assignments
together.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Daniel Exner reported IO-APIC enumeration breakage in linux-next.
Alexey Starikovskiy found out that it might be related to
commit 2944e16b25 "x86: update mptable".
use enable_update_mptable to decide if need check before add mp_irqs array.
Reported-by: Daniel Exner <webmaster@dragonslave.de>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. move that before zone_sizes_init ...
2. add free_early for one old one, otherwise it will be be reserved again
when we init highmem.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
in case we have kva before ramdisk on a node, we still need to use
those ranges.
v2: reserve_early kva ram area, in case there are holes in highmem, to avoid
those area could be treat as free high pages.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. add reserve_bootmem_generic for 32bit
2. change len to unsigned long
3. make early_res_to_bootmem to use it
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
we are checking mptable early for numaq, so don't need to reserve_bootmem
for it. bootmem is not there yet.
do the same thing as 64-bit.
found it on 64g above system from 64-bit kernel kexec to 32 bit kernel with
numaq support.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
since we now have 32-bit support for e820_register_active_regions(),
we can merge the parsing of the mem=/memmap= boot parameters.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch uses reserve_bootmem_generic() instead of reserve_bootmem()
to reserve the crashkernel memory on x86_64. That's necessary for NUMA
machines, see 00212fef81:
[PATCH] Fix kdump Crash Kernel boot memory reservation for NUMA machines
This patch will fix a boot memory reservation bug that trashes memory on
the ES7000 when loading the kdump crash kernel.
The code in arch/x86_64/kernel/setup.c to reserve boot memory for the crash
kernel uses the non-numa aware "reserve_bootmem" function instead of the
NUMA aware "reserve_bootmem_generic". I checked to make sure that no other
function was using "reserve_bootmem" and found none, except the ones that
had NUMA ifdef'ed out.
I have tested this patch only on an ES7000 with NUMA on and off (numa=off)
in a single (non-NUMA) and multi-cell (NUMA) configurations.
Signed-off-by: Amul Shah <amul.shah@unisys.com>
Looks-good-to: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The switch-back to reserve_bootmem() was accidentally introduced in
5c3391f9f7 when adding the BOOTMEM_EXCLUSIVE
parameter.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a 'flags' parameter to reserve_bootmem_generic() like it
already has been added in reserve_bootmem() with commit
72a7fe3967.
It also changes all users to use BOOTMEM_DEFAULT, which doesn't effectively
change the behaviour. Since the change is x86-specific, I don't think it's
necessary to add a new API for migration. There are only 4 users of that
function.
The change is necessary for the next patch, using reserve_bootmem_generic()
for crashkernel reservation.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the trampoline code is now used for ACPI resume from suspend to RAM,
the trampoline page tables have to be fixed up during boot not only on SMP
systems, but also on UP systems that use the trampoline.
Reference: http://bugzilla.kernel.org/show_bug.cgi?id=10923
Reported-by: Dionisus Torimens <djtm@gmx.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: pm list <linux-pm@lists.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some Dell laptops enter resume with apparent garbage in the segment
descriptor registers (almost certainly the result of a botched
transition from protected to real mode.) The only way to clean that
up is to enter protected mode ourselves and clean out the descriptor
registers.
This fixes resume on Dell XPS M1210 and Dell D620.
Reference: http://bugzilla.kernel.org/show_bug.cgi?id=10927
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: pm list <linux-pm@lists.linux-foundation.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When converting the page number in a pte/pmd/pud/pgd between
machine and pseudo-physical addresses, the converted result was
being truncated at 32-bits. This caused failures on machines
with more than 4G of physical memory.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: "Christopher S. Aker" <caker@theshore.net>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix this warning:
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:524: warning: passing argument 2 of 'find_e820_area_size' from incompatible pointer type
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fedora reports that mem_init()'s zap_low_mappings(), extended to SMP in
61165d7a03 x86: fix app crashes after SMP
resume causes 32-bit Intel Mac machines to reboot very early when
booting with EFI.
The EFI code appears to manage low mappings for itself when needed; but
like many before it, confuses PSE with PAE. So it has only been mapping
half the space it needed when PSE but not PAE. This remained unnoticed
until we moved the SMP zap_low_mappings() before
efi_enter_virtual_mode(). Presumably could have been noticed years ago
if anyone ran a UP kernel on such machines?
Reported-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Peter Jones <pjones@redhat.com>
Cc: Glauber Costa <gcosta@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Peter Jones <pjones@redhat.com>
commit 4323838215
x86: change size of node ids from u8 to s16
set the range for NODES_SHIFT to 1..15.
The possible range is 1..9
Fixes Bugzilla #10726
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ptrace GET/SET FPXREGS broken
x86: fix cpu hotplug crash
x86: section/warning fixes
x86: shift bits the right way in native_read_tscp
When I update kernel 2.6.25 from 2.6.24, gdb does not work.
On 2.6.25, ptrace(PTRACE_GETFPXREGS, ...) returns ENODEV.
But 2.6.24 kernel's ptrace() returns EIO.
It is issue of compatibility.
I attached test program as pt.c and patch for fix it.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
struct user_fxsr_struct {
unsigned short cwd;
unsigned short swd;
unsigned short twd;
unsigned short fop;
long fip;
long fcs;
long foo;
long fos;
long mxcsr;
long reserved;
long st_space[32]; /* 8*16 bytes for each FP-reg = 128 bytes */
long xmm_space[32]; /* 8*16 bytes for each XMM-reg = 128 bytes */
long padding[56];
};
int main(void)
{
pid_t pid;
pid = fork();
switch(pid){
case -1:/* error */
break;
case 0:/* child */
child();
break;
default:
parent(pid);
break;
}
return 0;
}
int child(void)
{
ptrace(PTRACE_TRACEME);
kill(getpid(), SIGSTOP);
sleep(10);
return 0;
}
int parent(pid_t pid)
{
int ret;
struct user_fxsr_struct fpxregs;
ret = ptrace(PTRACE_GETFPXREGS, pid, 0, &fpxregs);
if(ret < 0){
printf("%d: %s.\n", errno, strerror(errno));
}
kill(pid, SIGCONT);
wait(pid);
return 0;
}
/* in the kerel, at kernel/i387.c get_fpxregs() */
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Vegard Nossum reported crashes during cpu hotplug tests:
http://marc.info/?l=linux-kernel&m=121413950227884&w=4
In function _cpu_up, the panic happens when calling
__raw_notifier_call_chain at the second time. Kernel doesn't panic when
calling it at the first time. If just say because of nr_cpu_ids, that's
not right.
By checking the source code, I found that function do_boot_cpu is the culprit.
Consider below call chain:
_cpu_up=>__cpu_up=>smp_ops.cpu_up=>native_cpu_up=>do_boot_cpu.
So do_boot_cpu is called in the end. In do_boot_cpu, if
boot_error==true, cpu_clear(cpu, cpu_possible_map) is executed. So later
on, when _cpu_up calls __raw_notifier_call_chain at the second time to
report CPU_UP_CANCELED, because this cpu is already cleared from
cpu_possible_map, get_cpu_sysdev returns NULL.
Many resources are related to cpu_possible_map, so it's better not to
change it.
Below patch against 2.6.26-rc7 fixes it by removing the bit clearing in
cpu_possible_map.
Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: arch/x86/mm/built-in.o(.text+0x3a1): Section mismatch in
reference from the function set_pte_phys() to the function
.init.text:spp_getpage()
The function set_pte_phys() references
the function __init spp_getpage().
This is often because set_pte_phys lacks a __init
annotation or the annotation of spp_getpage is wrong.
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:520: warning: passing argument 2 of
'find_e820_area_size' from incompatible pointer type
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm:
KVM: Remove now unused structs from kvm_para.h
x86: KVM guest: Use the paravirt clocksource structs and functions
KVM: Make kvm host use the paravirt clocksource structs
x86: Make xen use the paravirt clocksource structs and functions
x86: Add structs and functions for paravirt clocksource
KVM: VMX: Fix host msr corruption with preemption enabled
KVM: ioapic: fix lost interrupt when changing a device's irq
KVM: MMU: Fix oops on guest userspace access to guest pagetable
KVM: MMU: large page update_pte issue with non-PAE 32-bit guests (resend)
KVM: MMU: Fix rmap_write_protect() hugepage iteration bug
KVM: close timer injection race window in __vcpu_run
KVM: Fix race between timer migration and vcpu migration
This patch updates the kvm host code to use the pvclock structs
and functions, thereby making it compatible with Xen.
The patch also fixes an initialization bug: on SMP systems the
per-cpu has two different locations early at boot and after CPU
bringup. kvmclock must take that in account when registering the
physical address within the host.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch updates the kvm host code to use the pvclock structs.
It also makes the paravirt clock compatible with Xen.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch updates the xen guest to use the pvclock structs
and helper functions.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds structs for the paravirt clocksource ABI
used by both xen and kvm (pvclock-abi.h).
It also adds some helper functions to read system time and
wall clock time from a paravirtual clocksource (pvclock.[ch]).
They are based on the xen code. They are enabled using
CONFIG_PARAVIRT_CLOCK.
Subsequent patches of this series will put the code in use.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Non-PAE operation has been deprecated in Xen for a while, and is
rarely tested or used. xen-unstable has now officially dropped
non-PAE support. Since Xen/pvops' non-PAE support has also been
broken for a while, we may as well completely drop it altogether.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Switching msrs can occur either synchronously as a result of calls to
the msr management functions (usually in response to the guest touching
virtualized msrs), or asynchronously when preempting a kvm thread that has
guest state loaded. If we're unlucky enough to have the two at the same
time, host msrs are corrupted and the machine goes kaput on the next syscall.
Most easily triggered by Windows Server 2008, as it does a lot of msr
switching during bootup.
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM has a heuristic to unshadow guest pagetables when userspace accesses
them, on the assumption that most guests do not allow userspace to access
pagetables directly. Unfortunately, in addition to unshadowing the pagetables,
it also oopses.
This never triggers on ordinary guests since sane OSes will clear the
pagetables before assigning them to userspace, which will trigger the flood
heuristic, unshadowing the pagetables before the first userspace access. One
particular guest, though (Xenner) will run the kernel in userspace, triggering
the oops. Since the heuristic is incorrect in this case, we can simply
remove it.
Signed-off-by: Avi Kivity <avi@qumranet.com>
kvm_mmu_pte_write() does not handle 32-bit non-PAE large page backed
guests properly. It will instantiate two 2MB sptes pointing to the same
physical 2MB page when a guest large pte update is trapped.
Instead of duplicating code to handle this, disallow directory level
updates to happen through kvm_mmu_pte_write(), so the two 2MB sptes
emulating one guest 4MB pte can be correctly created by the page fault
handling path.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
rmap_next() does not work correctly after rmap_remove(), as it expects
the rmap chains not to change during iteration. Fix (for now) by restarting
iteration from the beginning.
Signed-off-by: Avi Kivity <avi@qumranet.com>
If a timer fires after kvm_inject_pending_timer_irqs() but before
local_irq_disable() the code will enter guest mode and only inject such
timer interrupt the next time an unrelated event causes an exit.
It would be simpler if the timer->pending irq conversion could be done
with IRQ's disabled, so that the above problem cannot happen.
For now introduce a new vcpu requests bit to cancel guest entry.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
A guest vcpu instance can be scheduled to a different physical CPU
between the test for KVM_REQ_MIGRATE_TIMER and local_irq_disable().
If that happens, the timer will only be migrated to the current pCPU on
the next exit, meaning that guest LAPIC timer event can be delayed until
a host interrupt is triggered.
Fix it by cancelling guest entry if any vcpu request is pending. This
has the side effect of nicely consolidating vcpu->requests checks.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Because NX is now enforced properly, we must put the hypercall page
into the .text segment so that it is executable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Stable: this isn't a bugfix in itself, but it's a pre-requiste
for "xen: don't drop NX bit" ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
General Software writes their own VSA2 module for their version
of the Geode BIOS, which returns a different ID then the standard
VSA2. This was causing the framebuffer driver to break for most
GSW boards.
Signed-off-by: Jordan Crouse <jordan.crouse@amd.com>
Cc: tglx@linutronix.de
Cc: linux-geode@lists.infradead.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch uses the BOOTMEM_EXCLUSIVE for crashkernel reservation also for
i386 and prints a error message on failure.
The patch is still for 2.6.26 since it is only bug fixing. The unification
of reserve_crashkernel() between i386 and x86_64 should be done for 2.6.27.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Booting 2.6.26-rc6 on my 486 DX/4 fails with a "BUG: Int 6"
(invalid opcode) and a kernel halt immediately after the
kernel has been uncompressed. The BUG shows EIP pointing
to an rdtsc instruction in native_read_tsc(), invoked from
native_sched_clock().
(This error occurs so early that not even the serial console
can capture it.)
A bisection showed that this bug first occurs in 2.6.26-rc3-git7,
via commit 9ccc906c97:
>x86: distangle user disabled TSC from unstable
>
>tsc_enabled is set to 0 from the command line switch "notsc" and from
>the mark_tsc_unstable code. Seperate those functionalities and replace
>tsc_enable with tsc_disable. This makes also the native_sched_clock()
>decision when to use TSC understandable.
>
>Preparatory patch to solve the sched_clock() issue on 32 bit.
>
>Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The core reason for this bug is that native_sched_clock() gets
called before tsc_init().
Before the commit above, tsc_32.c used a "tsc_enabled" variable
which defaulted to 0 == disabled, and which only got enabled late
in tsc_init(). Thus early calls to native_sched_clock() would skip
the TSC and use jiffies instead.
After the commit above, tsc_32.c uses a "tsc_disabled" variable
which defaults to 0, meaning that the TSC is Ok to use. Early calls
to native_sched_clock() now erroneously try to use the TSC on
!cpu_has_tsc processors, leading to invalid opcode exceptions.
My proposed fix is to initialise tsc_disabled to a "soft disabled"
state distinct from the hard disabled state set up by the "notsc"
kernel option. This fixes the native_sched_clock() problem. It also
allows tsc_init() to be simplified: instead of setting tsc_disabled = 1
on every error return, we just set tsc_disabled = 0 once when all
checks have succeeded.
I've verified that this lets my 486 boot again. I've also verified
that a Core2 machine still uses the TSC as clocksource after the patch.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>