It is dangerous to shutdown the apics in machine_crash_shutdown.
With my previous patch to initialize apics in init_IRQ we should be able to
boot a kernel without this. As long as we reinitialize the APICs we don't
care what state they were in during bootup.
This should make machine_crash_shutdown noticeably more reliable.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All kinds of ugliness exists because we don't initialize
the apics during init_IRQs.
- We calibrate jiffies in non apic mode even when we are using apics.
- We have to have special code to initialize the apics when non-smp.
- The legacy i8259 must exist and be setup correctly, even
when we won't use it past initialization.
- The kexec on panic code must restore the state of the io_apics.
- init/main.c needs a special case for !smp smp_init on x86
In addition to pure code movement I needed a couple
of non-obvious changes:
- Move setup_boot_APIC_clock into APIC_late_time_init for
simplicity.
- Use cpu_khz to generate a better approximation of loops_per_jiffies
so I can verify the timer interrupt is working.
- Call setup_apic_nmi_watchdog again after cpu_khz is initialized on
the boot cpu.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The per cpu nmi watchdog timer is based on an event counter. idle cpus
don't generate events so the NMI watchdog doesn't fire and the test to see
if the watchdog is working fails.
- Add nmi_cpu_busy so idle cpus don't mess up the test.
- kmalloc prev_nmi_count to keep kernel stack usage bounded.
- Improve the error message on failure so there is enough
information to debug problems.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently we attempt to restore virtual wire mode on reboot, which only
works if we can figure out where the i8259 is connected. This is very
useful when we kexec another kernel and likely helpful when dealing with a
BIOS that make assumptions about how the system is setup.
Since the acpi MADT table does not provide the location where the i8259 is
connected we have to look at the hardware to figure it out.
Most systems have the i8259 connected the local apic of the cpu so won't be
affected but people running Opteron and some serverworks chipsets should be
able to use kexec now.
In addition this patch removes the hard coded assumption that the io_apic
that delivers isa interrups is always known to the kernel as io_apic 0. As
there does not appear to be anything to guarantee that assumption is true.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is platform code update for ES7000: disables IRQ overrides for the
recent ES7000 (Rascal/Zorro), cleans up the compile warning. The patch
only affects the ES7000 subarch.
Signed-off-by: <Natalie.Protasevich@unisys.com>
Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The code that prints the cache size assumes that L3 always lives in chipset
and is shared across CPUs. Which is not really true.
I think all the cachesizes reported by cpuid are in the processor itself.
The attached patch changes the code to reflect that.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This was reported because someone was getting oopses reading /proc/iomem.
It was tracked down to a zero-sized 'struct resource' entry which was
located right at 4GB.
You need two conditions to hit this bug: a BIOS E820_RAM area starting at
exactly the boundary where you specify mem= (to get a zero-sized entry),
and for the legacy_init_iomem_resources() loop to skip that resource (which
only happens at exactly 4G).
I think the killing zero-sized e820 entry is the easiest way to fix this.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make P6 MCA initialization code complaint with guidelines in IA-32 SDM
Vol3. Bank 0 control register should not be set by OS and clear status
registers on all banks on reset.
This will prevent false MCE alarms on the systems that has some non-MCE
information left-over in MC0_STATUS on reboot.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add an accessor function for getting the per-CPU gdt. Callee must already
have the CPU.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The per-CPU initialization code is copying in bogus data into
thread->tls_array. Note that it copies &per_cpu(cpu_gdt_table, cpu), not
&per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TLS_MIN). That is totally broken
and unnecessary. Make the initialization explicitly NULL.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch allows physical bring-up of new processors (not initially present
in the configuration) from facilities such as driver/utility implemented on
a platform. The actual method of making processors available is up to the
platform implementation.
Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Zwane Mwaikambo <zwane@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Join together some common functions (pmd_page{,_kernel}) over 2level and
3level pages.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial internal version of Venki's cpuid(4) deterministic cache parameter
identification patch used static arrays of size MAX_CACHE_LEAVES. Final patch
which made to the base used dynamic array allocation, with this
MAX_CACHE_LEAVES limit hunk still in place.
cpuid(4) already has a mechanism to find out the number of cache levels
implemented and there is no need for this hardcoded MAX_CACHE_LEAVES limit.
So remove the MAX_CACHE_LEAVES limit from the routine which calculates the
number of cache levels using cpuid(4)
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There exists a field io_bitmap_owner in the TSS that is only checked, but
never set to anything else but NULL.
Signed-off-by: Bart Oldeman <bartoldeman@users.sourceforge.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mxcsr_feature_mask_init isn't needed in suspend/resume time (we can use
boot time mask). And actually it's harmful, as it clear task's saved
fxsave in resume. This bug is widely seen by users using zsh.
(akpm: my eyes. Fixed some surrounding whitespace mess)
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adjusts i386's cmpxchg patterns so that
- for word and long cmpxchg-es the compiler can utilize all possible
registers
- cmpxchg8b gets disabled when the minimum specified hardware architectur
doesn't support it (like was already happening for the byte, word, and
long ones).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I just found out that some precision is unnecessarily lost in the
arch/i386/kernel/timers/timer_tsc.c:set_cyc2ns_scale function. It uses a
cpu_mhz parameter when it could use a cpu_khz. In the specific case of an
Intel P4 running at 3001.171 Mhz, the truncation to 3001 Mhz leads to an
imprecision of 19 microseconds per second : this is very sad for a timer with
nearly nanosecond accuracy.
Fix the x86_64 architecture too.
Cc: george anzinger <george@mvista.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes a bunch of unecessary checks for (size_t < 0) in
selinuxfs.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
security/selinux/hooks.c: In function `selinux_inode_getxattr':
security/selinux/hooks.c:2193: warning: unused variable `sbsec'
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch allows SELinux to canonicalize the value returned from
getxattr() via the security_inode_getsecurity() hook, which is called after
the fs level getxattr() function.
The purpose of this is to allow the in-core security context for an inode
to override the on-disk value. This could happen in cases such as
upgrading a system to a different labeling form (e.g. standard SELinux to
MLS) without needing to do a full relabel of the filesystem.
In such cases, we want getxattr() to return the canonical security context
that the kernel is using rather than what is stored on disk.
The implementation hooks into the inode_getsecurity(), adding another
parameter to indicate the result of the preceding fs-level getxattr() call,
so that SELinux knows whether to compare a value obtained from disk with
the kernel value.
We also now allow getxattr() to work for mountpoint labeled filesystems
(i.e. mount with option context=foo_t), as we are able to return the
kernel value to the user.
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts SELinux code from kmalloc/memset to the new kazalloc
unction. On i386, this results in a text saving of over 1K.
Before:
text data bss dec hex filename
86319 4642 15236 106197 19ed5 security/selinux/built-in.o
After:
text data bss dec hex filename
85278 4642 15236 105156 19ac4 security/selinux/built-in.o
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add CONFIG_X86_32 for i386. This allows selecting options that only apply
to 32-bit systems.
(X86 && !X86_64) becomes X86_32
(X86 || X86_64) becomes X86
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since CONFIG_IKCONFIG_PROC already depends on CONFIG_IKCONFIG, adding
configs.o again is redundant.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ata_pci_init_one() receives an array of struct ata_port_info. Recent
updates to the code had always obtained port information from
array element 0, rather than array element N.
Change to avoid hardcoding port_info[0], thereby restoring proper
hardware information to secondary legacy ports.
The second argument to ata_qc_complete() was being used for two
purposes: communicate the ATA Status register to the completion
function, and indicate an error. On legacy PCI IDE hardware, the latter
is often implicit in the former. On more modern hardware, the driver
often completely emulated a Status register value, passing ATA_ERR as an
indication that something went wrong.
Now that previous code changes have eliminated the need to use drv_stat
arg to communicate the ATA Status register value, we can convert it to a
mask of possible error classes.
This will lead to more flexible error handling in the future.
move EXPORT_SYMBOL(filemap_populate) to the proper place: just after
function itself: it's easy to miss that function is exported otherwise.
Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In 'mm' change the explicit use of a for-loop using NR_CPUS into the
general for_each_cpu() constructs. This widens the scope of potential
future optimizations of the general constructs, as well as takes advantage
of the existing optimizations of first_cpu() and next_cpu(), which is
advantageous when the true CPU count is much smaller than NR_CPUS.
Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Policy contextualization is only useful for task based policies and not for
vma based policies. It may be useful to define allowed nodes that are not
accessible from this thread because other threads may have access to these
nodes. Without this patch strange memory policy situations may cause an
application to fail with out of memory.
Example:
Let's say we have two threads A and B that share the same address space and
a huge array computational array X.
Thread A is restricted by its cpuset to nodes 0 and 1 and thread B is
restricted by its cpuset to nodes 2 and 3.
Thread A now wants to restrict allocations to the first node and thus
applies a BIND policy on X to node 0 and 2. The cpuset limits this to node
0. Thus pages for X must be allocated on node 0 now.
Thread B now touches a page that has never been used in X and faults in a
page. According to the BIND policy of the vma for X the page must be
allocated on page 0. However, the cpuset of B does not allow allocation on
0 and 1. Now the application fails in alloc_pages with out of memory.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Do a separation between do_xxx and sys_xxx functions. sys_xxx functions
take variable sized bitmaps from user space as arguments. do_xxx functions
take fixed sized nodemask_t as arguments and may be used from inside the
kernel. Doing so simplifies the initialization code. There is no
fs = kernel_ds assumption anymore.
- Split up get_nodes into get_nodes (which gets the node list) and
contextualize_policy which restricts the nodes to those accessible
to the task and updates cpusets.
- Add comments explaining limitations of bind policy
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here is a set of ppc64 specific patches that at least allow
compilation/booting with the following configurations:
FLATMEM
SPARSEMEN
SPARSEMEM + MEMORY_HOTPLUG
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adds the necessary for non-NUMA hot-add of highmem to an existing zone on
i386.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
> I found the tests does not work well with Dave's patchset.
> I've found the followings:
>
> - setup_per_zone_pages_min() calls should be added in
> capture_page_range() and online_pages()
> - lru_add_drain() should be called before try_to_migrate_pages()
The following patch deals with the first item.
Signed-off-by: IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This basically keeps up from having to extern __kmalloc_section_memmap().
The vaddr_in_vmalloc_area() helper could go in a vmalloc header, but that
header gets hard to work with, because it needs some arch-specific macros.
Just stick it in here for now, instead of creating another header.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Lion Vollnhals <webmaster@schiggl.de>
Signed-off-by: Jiri Slaby <xslaby@fi.muni.cz>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds generic memory add/remove and supporting functions for memory
hotplug into a new file as well as a memory hotplug kernel config option.
Individual architecture patches will follow.
For now, disable memory hotplug when swsusp is enabled. There's a lot of
churn there right now. We'll fix it up properly once it calms down.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
See the "fixup bad_range()" patch for more information, but this actually
creates a the lock to protect things making assumptions about a zone's size
staying constant at runtime.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
pgdat->node_size_lock is basically only neeeded in one place in the normal
code: show_mem(), which is the arch-specific sysrq-m printing function.
Strictly speaking, the architectures not doing memory hotplug do no need this
locking in show_mem(). However, they are all included for completeness. This
should also make any future consolidation of all of the implementations a
little more straightforward.
This lock is also held in the sparsemem code during a memory removal, as
sections are invalidated. This is the place there pfn_valid() is made false
for a memory area that's being removed. The lock is only required when doing
pfn_valid() operations on memory which the user does not already have a
reference on the page, such as in show_mem().
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When doing memory hotplug operations, the size of existing zones can obviously
change. This means that zone->zone_{start_pfn,spanned_pages} can change.
There are currently no locks that protect these structure members. However,
they are rarely accessed at runtime. Outside of swsusp, the only place that I
can find is bad_range().
So, split bad_range() up into two pieces: one that needs to be locked and
anther that doesn't.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A little helper that we use in the hotplug code.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If a zone is empty at boot-time and then hot-added to later, it needs to run
the same init code that would have been run on it at boot.
This patch breaks out zone table and per-cpu-pages functions for use by the
hotplug code. You can almost see all of the free_area_init_core() function on
one page now. :)
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following series implements memory hot-add for ppc64 and i386. There are
x86_64 and ia64 implementations that will be submitted shortly as well,
through the normal maintainers.
This patch:
local_mapnr is unused, except for in an alpha header. Keep the alpha one,
kill the rest.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We had a problem on ppc64 where with more than 4 threads a large system
wouldn't scale well while faulting in the .text (most of the time was spent
in the kernel despite it was an userland compute intensive app). The
reason is the useless overwrite of the same pte from all cpu.
I fixed it this way (verified on an older kernel but the forward port is
almost identical). This will benefit all archs not just ppc64.
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>