want to remove arch_get_ram_range, and use early_node_map instead.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Making a variable page-aligned by using
__attribute__((section(".data.page_aligned"))) is fragile because if
sizeof(variable) is not also a multiple of page size, it leaves
variables in the remainder of the section unaligned.
This patch introduces two new qualifiers, __page_aligned_data and
__page_aligned_bss to set the section *and* the alignment of
variables. This makes page-aligned variables more robust because the
linker will make sure they're aligned properly. Unfortunately it
requires *all* page-aligned data to use these macros...
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a 'flags' parameter to reserve_bootmem_generic() like it
already has been added in reserve_bootmem() with commit
72a7fe3967.
It also changes all users to use BOOTMEM_DEFAULT, which doesn't effectively
change the behaviour. Since the change is x86-specific, I don't think it's
necessary to add a new API for migration. There are only 4 users of that
function.
The change is necessary for the next patch, using reserve_bootmem_generic()
for crashkernel reservation.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sparse mutters:
arch/x86/mm/numa_64.c:195:27: warning: symbol 'end_pfn' shadows an earlier one
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
plat_node_bdata, cmdline, nodemap_addr, nodemap_size are local to
numa_64.c. Make them static
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Consolidate node_to_cpumask operations and remove the 256k
byte node_to_cpumask_map. This is done by allocating the
node_to_cpumask_map array after the number of possible nodes
(nr_node_ids) is known.
* Debug printouts when CONFIG_DEBUG_PER_CPU_MAPS is active have
been increased. It now shows faults when calling node_to_cpumask()
and node_to_cpumask_ptr().
For inclusion into sched-devel/latest tree.
Based on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
+ sched-devel/latest .../mingo/linux-2.6-sched-devel.git
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Introduce a new PER_CPU macro called "EARLY_PER_CPU". This is
used by some per_cpu variables that are initialized and accessed
before there are per_cpu areas allocated.
["Early" in respect to per_cpu variables is "earlier than the per_cpu
areas have been setup".]
This patchset adds these new macros:
DEFINE_EARLY_PER_CPU(_type, _name, _initvalue)
EXPORT_EARLY_PER_CPU_SYMBOL(_name)
DECLARE_EARLY_PER_CPU(_type, _name)
early_per_cpu_ptr(_name)
early_per_cpu_map(_name, _idx)
early_per_cpu(_name, _cpu)
The DEFINE macro defines the per_cpu variable as well as the early
map and pointer. It also initializes the per_cpu variable and map
elements to "_initvalue". The early_* macros provide access to
the initial map (usually setup during system init) and the early
pointer. This pointer is initialized to point to the early map
but is then NULL'ed when the actual per_cpu areas are setup. After
that the per_cpu variable is the correct access to the variable.
The early_per_cpu() macro is not very efficient but does show how to
access the variable if you have a function that can be called both
"early" and "late". It tests the early ptr to be NULL, and if not
then it's still valid. Otherwise, the per_cpu variable is used
instead:
#define early_per_cpu(_name, _cpu) \
(early_per_cpu_ptr(_name) ? \
early_per_cpu_ptr(_name)[_cpu] : \
per_cpu(_name, _cpu))
A better method is to actually check the pointer manually. In the
case below, numa_set_node can be called both "early" and "late":
void __cpuinit numa_set_node(int cpu, int node)
{
int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
if (cpu_to_node_map)
cpu_to_node_map[cpu] = node;
else
per_cpu(x86_cpu_to_node_map, cpu) = node;
}
* Add a flag "arch_provides_topology_pointers" that indicates pointers
to topology cpumask_t maps are available. Otherwise, use the function
returning the cpumask_t value. This is useful if cpumask_t set size
is very large to avoid copying data on to/off of the stack.
* The coverage of CONFIG_DEBUG_PER_CPU_MAPS has been increased while
the non-debug case has been optimized a bit.
* Remove an unreferenced compiler warning in drivers/base/topology.c
* Clean up #ifdef in setup.c
For inclusion into sched-devel/latest tree.
Based on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
+ sched-devel/latest .../mingo/linux-2.6-sched-devel.git
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
so don't punish all other cpus without that problem when init highmem
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use early_node_map to init high pages, so we can remove page_is_ram() and
page_is_reserved_early() in the big loop with add_one_highpage
also remove page_is_reserved_early(), it is not needed anymore.
v2: fix the build of other platforms
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
in case we have kva before ramdisk on a node, we still need to use
those ranges.
v2: reserve_early kva ram area, in case there are holes in highmem, to avoid
those area could be treat as free high pages.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. add reserve_bootmem_generic for 32bit
2. change len to unsigned long
3. make early_res_to_bootmem to use it
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a 'flags' parameter to reserve_bootmem_generic() like it
already has been added in reserve_bootmem() with commit
72a7fe3967.
It also changes all users to use BOOTMEM_DEFAULT, which doesn't effectively
change the behaviour. Since the change is x86-specific, I don't think it's
necessary to add a new API for migration. There are only 4 users of that
function.
The change is necessary for the next patch, using reserve_bootmem_generic()
for crashkernel reservation.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use PAGE_OFFSET macro instead of using 0xffff810000000000UL directly.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: hpa@zytor.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
1) Remove __meminit from update_pages_count. It is used inside
split_pages()
2) Make the code depend on PROC_FS. Doing statistics for nothing is
useless and not adding useless code is nice to the Linux tiny folks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add information about the mapping state of the direct mapping to
/proc/meminfo. I chose /proc/meminfo because that is where all the other
memory statistics are too and it is a generally useful metric even
outside debugging situations. A lot of split kernel pages means the
kernel will run slower.
This way we can see how many large pages are really used for it and how
many are split.
Useful for general insight into the kernel.
v2: Add hotplug locking to 64bit to plug a very obscure theoretical race.
32bit doesn't need it because it doesn't support hotadd for lowmem.
Fix some typos
v3: Rename dpages_cnt
Add CONFIG ifdef for count update as requested by tglx
Expand description
v4: Fix stupid bugs added in v3
Move update_page_count to pageattr.c
Signed-off-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the #ifdef conditional because this comparison is already done in
user_mode_vm().
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Cc: akpm@osdl.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix this warning:
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:524: warning: passing argument 2 of 'find_e820_area_size' from incompatible pointer type
Signed-off-by: Ingo Molnar <mingo@elte.hu>
'man 3 printf' tells me that %p should be printed as if by %#x, but
this is not true for the kernel, which does not use the '0x' prefix
for the %p conversion specifier.
A small cast to (void *) is also prettier than #ifdef/#else/#endif.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: arch/x86/mm/built-in.o(.text+0x3a1): Section mismatch in
reference from the function set_pte_phys() to the function
.init.text:spp_getpage()
The function set_pte_phys() references
the function __init spp_getpage().
This is often because set_pte_phys lacks a __init
annotation or the annotation of spp_getpage is wrong.
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:520: warning: passing argument 2 of
'find_e820_area_size' from incompatible pointer type
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
... to strip down loop body in reserve_memtype.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
... and move last debug message out of locked section.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- add BUG statement to catch invalid start and end parameters
- No need to track the actual type in both req_type and actual_type --
keep req_type unchanged.
- removed (IMHO) superfluous comments
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- parse/ml => entry (within list_for_each and friends)
- new_entry => new
- ret_type => new_type (to avoid confusion with req_type)
(... to make it more readable...)
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix build failure:
arch/x86/mm/pgtable.c:280: warning: ‘enum fixed_addresses’ declared inside parameter list
arch/x86/mm/pgtable.c:280: warning: its scope is only this definition or declaration, which is probably not what you want
arch/x86/mm/pgtable.c:280: error: parameter 1 (‘idx’) has incomplete type
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In both cases, I went with the 32-bit behaviour.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean up over-complications in pat_x_mtrr_type().
And if reserve_memtype() ignores stray req_type bits when
pat_enabled, it's better to mask them off when not also.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Changed the call to find_e820_area_size to pass u64 instead of unsigned long.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix an incompatible pointer type warning on x86_64 compilations.
early_memtest() is passing a u64* to find_e820_area_size() which is expecting
an unsigned long. Change t_start and t_size to unsigned long as those are
also 64-bit types on x88_64.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Page faults in kernel address space between PAGE_OFFSET up to
VMALLOC_START should not try to map as vmalloc.
Fix rarely endless page faults inside mount_block_root for root
filesystem at boot time.
All 32bit kernels up to 2.6.25 can fail into this hole.
I can not present this under native linux kernel. I see, that the 64bit
has fixed the problem. I copied the same lines into 32bit part.
Recorded debugs are from coLinux kernel 2.6.22.18 (virtualisation):
http://www.henrynestler.com/colinux/testing/pfn-check-0.7.3/20080410-antinx/bug16-recursive-page-fault-endless.txt
The physicaly memory was trimmed down to 192MB to better catch the bug.
More memory gets the bug more rarely.
Details, how every x86 32bit system can fail:
Start from "mount_block_root",
http://lxr.linux.no/linux/init/do_mounts.c#L297
There the variable "fs_names" got one memory page with 4096 bytes.
Variable "p" walks through the existing file system types. The first
string is no problem.
But, with the second loop in mount_block_root the offset of "p" is not
at beginning of page, the offset is for example +9, if "reiserfs" is the
first in list.
Than calls do_mount_root, and lands in sys_mount.
Remember: Variable "type_page" contains now "fs_type+9" and not contains
a full page.
The sys_mount copies 4096 bytes with function "exact_copy_from_user()":
http://lxr.linux.no/linux/fs/namespace.c#L1540
Mostly exist pages after the buffer "fs_names+4096+9" and the page fault
handler was not called. No problem.
In the case, if the page after "fs_names+4096" is not mapped, the page
fault handler was called from http://lxr.linux.no/linux/fs/namespace.c#L1320
The do_page_fault gots an address 0xc03b4000.
It's kernel address, address >= TASK_SIZE, but not from vmalloc! It's
from "__getname()" alias "kmem_cache_alloc".
The "error_code" is 0. "vmalloc_fault" will be call:
http://lxr.linux.no/linux/arch/i386/mm/fault.c#L332
"vmalloc_fault" tryed to find the physical page for a non existing
virtual memory area. The macro "pte_present" in vmalloc_fault()
got a next page fault for 0xc0000ed0 at:
http://lxr.linux.no/linux/arch/i386/mm/fault.c#L282
No PTE exist for such virtual address. The page fault handler was trying
to sync the physical page for the PTE lockup.
This called vmalloc_fault() again for address 0xc000000, and that also
was not existing. The endless began...
In normal case the cpu would still loop with disabled interrrupts. Under
coLinux this was catched by a stack overflow inside printk debugs.
Signed-off-by: Henry Nestler <henry.nestler@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
BTW, what does pat_wc_enabled stand for? Does it mean
"write-combining"?
Currently it is used to globally switch on or off PAT support.
Thus I renamed it to pat_enabled.
I think this increases readability (and hope that I didn't miss
something).
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>