Commit graph

1230 commits

Author SHA1 Message Date
David Howells
165b239270 [PATCH] NOMMU: make SYSV SHM nattch work correctly
Make the SYSV SHM nattch counter work correctly by forcing multiple VMAs to
be produced to represent MAP_SHARED segments, even if they overlap exactly.

Using this test program:

	http://people.redhat.com/~dhowells/doshm.c

Run as:

	doshm sysv

I can see nattch going from one before the patch:

	# /doshm sysv
	Command: sysv
	shmid: 65536
	memory: 0xc3700000
	c0b00000-c0b04000 rw-p 00000000 00:00 0
	c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3180000-c31dede4 r-xs 00000000 00:0b 14582179  /lib/libuClibc-0.9.28.so
	c3520000-c352278c rw-p 00000000 00:0b 13763417  /doshm
	c3584000-c35865e8 r-xs 00000000 00:0b 13763417  /doshm
	c3588000-c358aa00 rw-p 00008000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3590000-c359b6c0 rw-p 00000000 00:00 0
	c3620000-c3640000 rwxp 00000000 00:00 0
	c3700000-c37fa000 rw-S 00000000 00:06 1411      /SYSV00000000 (deleted)
	c3700000-c37fa000 rw-S 00000000 00:06 1411      /SYSV00000000 (deleted)
	nattch 1

To two after the patch:

	# /doshm sysv
	Command: sysv
	shmid: 0
	memory: 0xc3700000
	c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3180000-c31dede4 r-xs 00000000 00:0b 14582179  /lib/libuClibc-0.9.28.so
	c3320000-c3340000 rwxp 00000000 00:00 0
	c3530000-c35325e8 r-xs 00000000 00:0b 13763417  /doshm
	c3534000-c353678c rw-p 00000000 00:0b 13763417  /doshm
	c3538000-c353aa00 rw-p 00008000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3590000-c359b6c0 rw-p 00000000 00:00 0
	c35a4000-c35a8000 rw-p 00000000 00:00 0
	c3700000-c37fa000 rw-S 00000000 00:06 1369      /SYSV00000000 (deleted)
	c3700000-c37fa000 rw-S 00000000 00:06 1369      /SYSV00000000 (deleted)
	nattch 2

That's +1 to nattch for each shmat() made.
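
For readers without access to the test program, a minimal user-space sketch of
the same measurement follows.  It is not doshm.c; the helper name
nattch_after_attaches() is hypothetical.  It attaches one private SYSV segment
several times and reads shm_nattch back with shmctl(IPC_STAT):

```c
#define _XOPEN_SOURCE 700	/* expose the SYSV IPC declarations */
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Hypothetical helper: attach a fresh one-page segment `count` times
 * (0..8) and return the kernel's nattch value, or -1 on any failure.
 */
long nattch_after_attaches(int count)
{
	void *addr[8];
	struct shmid_ds ds;
	long result = -1;
	int attached = 0;
	int id;

	assert(count >= 0 && count <= 8);

	id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
	if (id < 0)
		return -1;

	while (attached < count) {
		addr[attached] = shmat(id, NULL, 0);
		if (addr[attached] == (void *)-1)
			break;
		attached++;
	}

	/* Only report a value if every attach succeeded. */
	if (attached == count && shmctl(id, IPC_STAT, &ds) == 0)
		result = (long)ds.shm_nattch;

	/* Detach everything and mark the segment for removal. */
	while (attached > 0)
		shmdt(addr[--attached]);
	shmctl(id, IPC_RMID, NULL);

	return result;
}
```

With correct accounting, two attaches should report an nattch of 2; the helper
returns -1 where SYSV IPC is unavailable.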

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:06 -07:00
David Howells
d56e03cd27 [PATCH] NOMMU: supply get_unmapped_area() to fix NOMMU SYSV SHM
Supply a get_unmapped_area() to fix NOMMU SYSV SHM support.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:05 -07:00
Ankita Garg
35ae834fa0 [PATCH] oom fix: prevent oom from killing a process with children/sibling unkillable
Looking at oom_kill.c, found that the intention to not kill the selected
process if any of its children/siblings has OOM_DISABLE set, is not being
met.

Signed-off-by: Ankita Garg <ankita@in.ibm.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:06 -07:00
Peter Zijlstra
89a09141df [PATCH] nfs: fix congestion control
The current NFS client congestion logic is severely broken: it marks the
backing device congested during each nfs_writepages() call but doesn't
mirror this in nfs_writepage() which makes for deadlocks.  Also it
implements its own waitqueue.

Replace this by a more regular congestion implementation that puts a cap on
the number of active writeback pages and uses the bdi congestion waitqueue.

Also always use an interruptible wait since it makes sense to be able to
SIGKILL the process even for mounts without 'intr'.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:05 -07:00
Zach Brown
65b8291c40 [PATCH] dio: invalidate clean pages before dio write
This patch fixes a user-triggerable oops that was reported by Leonid
Ananiev as archived at http://lkml.org/lkml/2007/2/8/337.

dio writes invalidate clean pages that intersect the written region so that
subsequent buffered reads go to disk to read the new data.  If this fails
the interface tries to tell the caller that the cache is inconsistent by
returning EIO.

Before this patch we had the problem where this invalidation failure would
clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c.
Both fs/aio.c and bio completion call aio_complete() and we reference freed
memory, usually oopsing.

This patch addresses this problem by invalidating before the write so that
we can cleanly return -EIO before ->direct_IO() has had a chance to return
-EIOCBQUEUED.

There is a compromise here.  During the dio write we can fault in mmap()ed
pages which intersect the written range with get_user_pages() if the user
provided them for the source buffer.  This is a crazy thing to do, but we
can make it mostly work in most cases by trying the invalidation again.
The compromise is that we won't return an error if this second invalidation
fails if it's an AIO write and we have -EIOCBQUEUED.

This was tested by having two processes race performing large O_DIRECT and
buffered ordered writes.  Within minutes ext3 would see a race between
ext3_releasepage() and jbd holding a reference on ordered data buffers and
would cause invalidation to fail, panicking the box.  The test can be found
in the 'aio_dio_bugs' test group in test.kernel.org/autotest.  After this
patch the test passes.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Leonid Ananiev <leonid.i.ananiev@linux.intel.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:04 -07:00
Nick Piggin
00e9fa2d64 [PATCH] mm: fix madvise infinite loop
madvise(MADV_REMOVE) can go into an infinite loop or cause an oops if the
call covers a region that starts at the beginning of a vma and extends past
that vma.
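
The patch itself is not shown in this log.  As a general illustration only
(not the actual fix), the sketch below walks a request across several mappings
while clamping each step to the current mapping and always advancing, which is
the property a loop of this kind needs in order to terminate.  struct range
and count_pieces() are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma: the half-open range [start, end). */
struct range {
	unsigned long start;
	unsigned long end;
};

/*
 * Walk the request [start, end) over sorted ranges, handling one clamped
 * piece per range.  Returns the number of pieces handled.
 */
size_t count_pieces(const struct range *r, size_t n,
		    unsigned long start, unsigned long end)
{
	size_t pieces = 0;
	size_t i;

	assert(start <= end);
	for (i = 0; i < n && start < end; i++) {
		unsigned long stop;

		if (r[i].end <= start)
			continue;	/* entirely before the request */
		if (r[i].start > start)
			break;		/* hole: stop rather than spin */
		/* Clamp to this range so we never operate past its end. */
		stop = end < r[i].end ? end : r[i].end;
		pieces++;
		start = stop;		/* always advances */
	}
	return pieces;
}
```

A request that begins at the start of the first range and ends inside the
second is handled as two pieces, one per range.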

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:04 -07:00
Christoph Lameter
0dc952dc3e [PATCH] Page migration: Fix vma flag checking
Currently we do not check for vma flags if sys_move_pages is called to move
individual pages.  If sys_migrate_pages is called to move pages then we
check for vm_flags that indicate a non migratable vma but that still
includes VM_LOCKED and we can migrate mlocked pages.

Extract the vma_migratable check from mm/mempolicy.c, fix it and put it
into migrate.h so that it can be used from both locations.

Problem was spotted by Lee Schermerhorn.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05 07:57:51 -08:00
Hugh Dickins
759b9775c2 [PATCH] shmem and simple const super_operations
shmem's super_operations were missed from the recent const-ification;
and simple_fill_super()'s, which can share with get_sb_pseudo()'s.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05 07:57:51 -08:00
Trond Myklebust
7b965e0884 [PATCH] VM: invalidate_inode_pages2_range() should not exit early
Fix invalidate_inode_pages2_range() so that it does not immediately exit
just because a single page in the specified range could not be removed.
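
The intended behaviour can be sketched as a loop that remembers a failure but
keeps scanning.  This is a simplified model, not the kernel function: the
busy[] array and invalidate_all() are hypothetical stand-ins for pages that
cannot be removed and for the range walk.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Try to drop every entry; busy[i] != 0 means entry i cannot be removed.
 * Returns 0 if everything was removed, -1 if at least one entry was busy.
 * The count of removed entries is stored in *removed.
 */
int invalidate_all(const int *busy, size_t n, size_t *removed)
{
	int ret = 0;
	size_t i;

	assert(removed != NULL);
	*removed = 0;
	for (i = 0; i < n; i++) {
		if (busy[i]) {
			ret = -1;	/* record the failure ... */
			continue;	/* ... but do not stop the scan */
		}
		(*removed)++;
	}
	return ret;
}
```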

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:39 -08:00
Oleg Nesterov
34bbd70405 [PATCH] adapt page_lock_anon_vma() to PREEMPT_RCU
page_lock_anon_vma() uses spin_lock() to block RCU.  This doesn't work with
PREEMPT_RCU, we have to do rcu_read_lock() explicitly.  Otherwise, it is
theoretically possible that slab returns anon_vma's memory to the system
before we do spin_unlock(&anon_vma->lock).

[ Hugh points out that this only matters for PREEMPT_RCU, which isn't merged
  yet, and may never be.  Regardless, this patch is conceptually the
  right thing to do, even if it doesn't matter at this point.  - Linus ]

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:39 -08:00
Andrew Morton
232ea4d69d [PATCH] throttle_vm_writeout(): don't loop on GFP_NOFS and GFP_NOIO allocations
throttle_vm_writeout() is designed to wait for the dirty levels to subside.
But if the caller holds IO or FS locks, we might be holding up that writeout.

So change it to take a single nap to give other devices a chance to clean some
memory, then return.

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:38 -08:00
David Miller
d1af65d13f [PATCH] Bug in MM_RB debugging
The code is seemingly trying to make sure that rb_next() brings us to
successive increasing vma entries.

But the two variables, prev and pend, used to perform these checks, are
never advanced.
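
A minimal model of such a check, assuming an in-order walk over entries, is
sketched below.  struct ex_vma and count_order_bugs() are hypothetical; the
point is the two assignments at the end of the loop body, without which every
entry would be compared against the initial zero values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma: the range [start, end). */
struct ex_vma {
	unsigned long start;
	unsigned long end;
};

/* Count ordering violations seen during an in-order walk. */
int count_order_bugs(const struct ex_vma *v, size_t n)
{
	unsigned long prev = 0, pend = 0;
	int bugs = 0;
	size_t i;

	assert(v != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (v[i].start < prev)
			bugs++;		/* starts before the previous start */
		if (v[i].start < pend)
			bugs++;		/* overlaps the previous entry */
		if (v[i].start > v[i].end)
			bugs++;		/* malformed entry */
		/* Advance both trackers for the next comparison. */
		prev = v[i].start;
		pend = v[i].end;
	}
	return bugs;
}
```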

Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <andrea@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:38 -08:00
Nick Piggin
5409bae07a [PATCH] Rename PG_checked to PG_owner_priv_1
Rename PG_checked to PG_owner_priv_1 to reflect its availability as a
private flag for use by the owner/allocator of the page.  In the case of
pagecache pages (which might be considered to be owned by the mm),
filesystems may use the flag.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Randy Dunlap
05fb6bf0b2 [PATCH] kernel-doc fixes for 2.6.20-git15 (non-drivers)
Fix kernel-doc warnings in 2.6.20-git15 (lib/, mm/, kernel/, include/).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Adrian Bunk
9b83a6a852 [PATCH] mm/{,tiny-}shmem.c cleanups
shmem_{nopage,mmap} are no longer used in ipc/shm.c

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:35 -08:00
Christoph Lameter
8ef8286689 [PATCH] slab: reduce size of alien cache to cover only possible nodes
The alien cache is a per cpu per node array allocated for every slab on the
system.  Currently we size this array for all nodes that the kernel does
support.  For IA64 this is 1024 nodes.  So we allocate an array with 1024
objects even if we only boot a system with 4 nodes.

This patch uses "nr_node_ids" to determine the number of possible nodes
supported by a hardware configuration and only allocates an alien cache
sized for possible nodes.

The initialization of nr_node_ids occurred too late relative to the bootstrap
of the slab allocator and so I moved the setup_nr_node_ids() into
free_area_init_nodes().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
Christoph Lameter
74c7aa8b85 [PATCH] Replace highest_possible_node_id() with nr_node_ids
highest_possible_node_id() is currently used to calculate the last possible
node id so that the network subsystem can figure out how to size per node
arrays.

I think having the ability to determine the maximum number of nodes in a
system at runtime is useful but then we should name this entry
correspondingly, it should return the number of node_ids, and the value
needs to be set up only once on bootup.  The node_possible_map does not
change after bootup.

This patch introduces nr_node_ids and replaces the use of
highest_possible_node_id().  nr_node_ids is calculated on bootup when the
page allocator's pagesets are initialized.
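
The relationship between the possible-node map and the count can be sketched
as follows, assuming the map fits in one unsigned long with bit n set when
node n is possible.  ex_nr_node_ids() is a hypothetical name, not the kernel
routine:

```c
#include <assert.h>

/*
 * Number of node ids = highest possible node id + 1.
 * An empty mask yields 1, so per-node arrays always have one slot.
 */
unsigned int ex_nr_node_ids(unsigned long possible_mask)
{
	unsigned int highest = 0;
	unsigned int node;

	for (node = 0; node < 8 * sizeof(possible_mask); node++)
		if (possible_mask & (1UL << node))
			highest = node;
	return highest + 1;
}
```

Note that a sparse map such as nodes 0 and 2 still gives 3, because the value
is used to size arrays indexed by node id, not to count populated nodes.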

[deweerdt@free.fr: fix oops]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
KAMEZAWA Hiroyuki
8af5e2eb3c [PATCH] fix mempolicy's check on a system with memory-less-node
bind_zonelist() can create a zero-length zonelist if there is a
memory-less-node.  This patch checks the length of the zonelist and returns
-EINVAL if the length is 0.
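
As a simplified model of that check (ex_bind_zonelist() and the
zones_per_node[] input are hypothetical, and no real zonelist is built), the
length is summed over the selected nodes and an empty result is rejected:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Sum the populated zones of the selected nodes.  A memory-less node
 * contributes zero.  Returns 0 and stores the length in *out_len, or
 * -EINVAL if the resulting list would be empty.
 */
int ex_bind_zonelist(const unsigned int *zones_per_node, size_t n_nodes,
		     unsigned int *out_len)
{
	unsigned int len = 0;
	size_t i;

	assert(out_len != NULL);
	for (i = 0; i < n_nodes; i++)
		len += zones_per_node[i];
	if (len == 0)
		return -EINVAL;	/* nothing to allocate from */
	*out_len = len;
	return 0;
}
```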

Tested on ia64/NUMA with memory-less-node.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:13 -08:00
NeilBrown
29dbb3fc80 [PATCH] knfsd: stop NFSD writes from being broken into lots of little writes to filesystem
When NFSD receives a write request, the data is typically in a number of
1448 byte segments and writev is used to collect them together.

Unfortunately, generic_file_buffered_write passes these to the filesystem
one at a time, so e.g. a 32K overwrite becomes a series of partial-page
writes to each page, causing the filesystem to have to pre-read those pages
- wasted effort.

generic_file_buffered_write handles one segment of the vector at a time as
it has to pre-fault in each segment to avoid deadlocks.  When writing from
kernel-space (and nfsd does) this is not an issue, so
generic_file_buffered_write does not need to break an iovec from nfsd into
little pieces.

This patch avoids the splitting when get_fs is KERNEL_DS as it is
from NFSd.
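
To put rough numbers on the cost, the sketch below counts how many separate
writes result when each segment is handled on its own and every write also
stops at a page boundary.  It assumes 4096-byte pages, a page-aligned start
and uniform segments; ex_count_writes() is a hypothetical helper, not kernel
code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the writes needed for `total` bytes starting at offset 0 when a
 * write may cross neither a segment boundary nor a page boundary.
 */
size_t ex_count_writes(size_t total, size_t seg, size_t page)
{
	size_t pos = 0, writes = 0;

	assert(seg > 0 && page > 0);
	while (pos < total) {
		size_t seg_end = pos - pos % seg + seg;	   /* end of segment */
		size_t page_end = pos - pos % page + page; /* end of page */
		size_t stop = seg_end < page_end ? seg_end : page_end;

		if (stop > total)
			stop = total;
		writes++;
		pos = stop;
	}
	return writes;
}
```

Under these assumptions a 32K write in 1448-byte segments becomes 30
partial-page writes, while the same data treated as a single segment needs 8
full-page writes.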

This issue was introduced by commit 6527c2bdf1

Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Norman Weathers <norman.r.weathers@conocophillips.com>
Cc: Vladimir V. Saveliev <vs@namesys.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:14:01 -08:00
Nick Piggin
e0a04cffa4 [PATCH] mincore: vma crossing fix
My mincore also forgot about crossing vmas.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-15 09:57:03 -08:00
Nick Piggin
4a76ef036a [PATCH] mincore: fill in results properly
Paper bag time. Thanks to Randy for noticing that I didn't actually assign
'present' to anything.

Unfortunately my original patch passed the few simple test cases I gave it,
purely by coincidence.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-15 09:57:03 -08:00
Nick Piggin
30fcffed81 [PATCH] mincore: CONFIG_SWAP=n fix
Fix mincore-anon patch to compile with CONFIG_SWAP=n

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-15 09:57:03 -08:00
Arjan van de Ven
92e1d5be91 [PATCH] mark struct inode_operations const 2
Many struct inode_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.
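
The pattern looks roughly like this in miniature.  The struct and function
names are hypothetical and far smaller than the real inode_operations:

```c
#include <assert.h>

/* Hypothetical, cut-down operations table. */
struct ex_inode_ops {
	int (*getattr)(int ino);
};

static int ex_getattr(int ino)
{
	return ino + 1;
}

/*
 * Declared const: the table can be placed in read-only data, and an
 * assignment such as ex_ops.getattr = 0 is rejected by the compiler.
 */
static const struct ex_inode_ops ex_ops = {
	.getattr = ex_getattr,
};

/* Callers only ever read the table and call through it. */
int ex_call_getattr(int ino)
{
	return ex_ops.getattr(ino);
}
```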

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:46 -08:00
Nick Piggin
42da9cbd3e [PATCH] mm: mincore anon
Make mincore work for anon mappings, nonlinear, and migration entries.
Based on patch from Linus Torvalds <torvalds@linux-foundation.org>.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
Benjamin Herrenschmidt
22cd25ed31 [PATCH] Add NOPFN_REFAULT result from vm_ops->nopfn()
Add a NOPFN_REFAULT return code for vm_ops->nopfn() equivalent to
NOPAGE_REFAULT for vm_ops->nopage() indicating that the handler requests a
re-execution of the faulting instruction.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
Nick Piggin
e0dc0d8f4a [PATCH] add vm_insert_pfn()
Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn
functionality by installing their own pte and returning NULL.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:27 -08:00
Paul E. McKenney
aa0f030374 [PATCH] Change constant zero to NOTIFY_DONE in ratelimit_handler()
Change a hard-coded constant 0 to the symbolic equivalent NOTIFY_DONE in
the ratelimit_handler() CPU notifier handler function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 11:18:07 -08:00
Robert P. J. Day
72fd4a35a8 [PATCH] Numerous fixes to kernel-doc info in source files.
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
source files, including:

  * make multi-line initial descriptions single line
  * denote some function names, constants and structs as such
  * change erroneous opening '/*' to '/**' in a few places
  * reword some text for clarity

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:32 -08:00
Andrew Morton
fc0ecff698 [PATCH] remove invalidate_inode_pages()
Convert all calls to invalidate_inode_pages() into open-coded calls to
invalidate_mapping_pages().

Leave the invalidate_inode_pages() wrapper in place for now, marked as
deprecated.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:31 -08:00
Anton Altaparmakov
54bc485522 [PATCH] Export invalidate_mapping_pages() to modules
It makes no sense to me to export invalidate_inode_pages() and not
invalidate_mapping_pages() and I actually need invalidate_mapping_pages()
because of its range specification ability...

akpm: also remove the export of invalidate_inode_pages() by making it an
inlined wrapper.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:30 -08:00
Ingo Molnar
898552c9d8 [PATCH] lockdep: also check for freed locks in kmem_cache_free()
kmem_cache_free() was missing the check for freeing held locks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:26 -08:00
Ken Chen
daa88c8d21 [PATCH] do not disturb page referenced state when unmapping memory range
When the kernel unmaps an address range, it needs to transfer PTE state into
the page struct.  Currently, the kernel transfers the access bit via
mark_page_accessed().  The call to mark_page_accessed() in the unmap path
doesn't look logically correct.

At unmap time, calling mark_page_accessed() will cause the page LRU state to
be bumped up one step closer to the more recently used state.  It causes
quite a headache in a scenario where a process creates a shmem segment,
touches a whole bunch of pages, then unmaps it.  The unmapping takes a long
time because mark_page_accessed() will start moving pages from the inactive
to the active list.

I'm not too concerned with moving the page from one list to another in
LRU.  Sooner or later it might be moved because of multiple mappings from
various processes.  But it just doesn't look logical: when a user asks for a
range to be unmapped, the intention is that the process is no longer
interested in these pages.  Moving those pages to the active list (or bumping
up a state towards more active) seems to be an overreaction.  It also
prolongs unmapping latency, which is the core issue I'm trying to solve.

As suggested by Peter, we should still preserve the info on pte young
pages, but not more.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ken Chen <kenchen@google.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Ken Chen
767193253b [PATCH] simplify shmem_aops.set_page_dirty() method
A shmem-backed file does not have page writeback, nor does it participate in
the backing device's dirty or writeback accounting.  So using the generic
__set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit
overkill.  It unnecessarily prolongs shm unmap latency.

For example, on a densely populated large shm segment (several GBs), the
unmapping operation becomes painfully long, because at unmap the kernel
transfers the dirty bit in the PTE into the page struct and to the radix tree
tag.  The operation of tagging the radix tree is particularly expensive
because it has to traverse the tree from the root to the leaf node on every
dirty page.  What's bothersome is that the radix tree tag is used for page
writeback.  However, shmem is memory backed and there is no page writeback
for such a file system.  In the end, we spend all that time tagging the radix
tree and none of that fancy tagging will be used.  So let's simplify it by
introducing a new aops, __set_page_dirty_no_writeback, and this will speed up
shm unmap.
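
The idea can be modelled as a dirty operation that only flips a per-page flag
and touches no tree or accounting state.  struct ex_page and the return
convention (1 if the page was newly dirtied) are assumptions made for this
sketch, not taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page with nothing but a dirty flag. */
struct ex_page {
	bool dirty;
};

/*
 * Mark the page dirty by setting the flag only: no radix tree walk and
 * no writeback accounting.  Returns 1 if the page was newly dirtied.
 */
int ex_set_page_dirty_no_writeback(struct ex_page *page)
{
	assert(page != NULL);
	if (page->dirty)
		return 0;
	page->dirty = true;
	return 1;
}
```

The work per dirty page is constant, in contrast to a root-to-leaf tree
traversal for every page.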

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Christoph Lameter
5ac6da669e [PATCH] Set CONFIG_ZONE_DMA for arches with GENERIC_ISA_DMA
As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA
channel management.  Other functionality may still expect GFP_DMA to
provide memory below 16M.  So we need to make sure that CONFIG_ZONE_DMA is
set independent of CONFIG_GENERIC_ISA_DMA.  Undo the modifications to
mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and set
these explicitly in each arch's Kconfig.

Reviews must occur for each arch in order to determine if ZONE_DMA can be
switched off.  It can only be switched off if we know that all devices
supported by a platform are capable of performing DMA transfers to all of
memory (some arches already support this: uml, avr32, sh, sh64, parisc and
IA64/Altix).

In order to switch ZONE_DMA off conditionally, one would have to establish
a scheme by which one can assure that no drivers are enabled that are only
capable of doing I/O to a part of memory, or one needs to provide an
alternate means of performing an allocation from a specific range of memory
(like provided by alloc_pages_range()) and ensure that all drivers use that
call.  In that case the arch's alloc_dma_coherent() may need to be modified
to call alloc_pages_range() instead of relying on GFP_DMA.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Christoph Lameter
4b51d66989 [PATCH] optional ZONE_DMA: optional ZONE_DMA in the VM
Make ZONE_DMA optional in core code.

- ifdef all code for ZONE_DMA and related definitions following the example
  for ZONE_DMA32 and ZONE_HIGHMEM.

- Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
  0.

- Modify the VM statistics to work correctly without a DMA zone.

- Modify slab to not create DMA slabs if there is no ZONE_DMA.
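
The ZONES_SHIFT remark is simple arithmetic: the shift is the number of bits
needed to encode a zone index, so a single zone needs none.  A sketch, with
ex_zones_shift() as a hypothetical helper (the kernel derives the value at
compile time):

```c
#include <assert.h>

/* Bits needed to encode a zone index when there are nr_zones zones. */
unsigned int ex_zones_shift(unsigned int nr_zones)
{
	unsigned int shift = 0;

	assert(nr_zones >= 1);
	while ((1U << shift) < nr_zones)
		shift++;
	return shift;
}
```

One zone gives a shift of 0, two zones give 1, and three or four zones give 2.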

[akpm@osdl.org: cleanup]
[jdike@addtoit.com: build fix]
[apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
66701b1499 [PATCH] optional ZONE_DMA: introduce CONFIG_ZONE_DMA
This patch simply defines CONFIG_ZONE_DMA for all arches.  We later do special
things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work
without ZONE_DMA.

CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture
handles ISA DMA.

First if CONFIG_GENERIC_ISA_DMA is set by the arch then we know that the arch
needs ZONE_DMA because ISA DMA devices are supported.  We can catch this in
mm/Kconfig and do not need to modify arch code.

Second, arches may use ZONE_DMA in an unknown way.  We set CONFIG_ZONE_DMA for
all arches that do not set CONFIG_GENERIC_ISA_DMA in order to ensure backwards
compatibility.  The arches may later undefine ZONE_DMA if their arch code has
been verified to not depend on ZONE_DMA.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
6267276f3f [PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zone
This patchset follows up on the earlier work in Andrew's tree to reduce the
number of zones.  Those patches allow a minimum of 2 zones.  This one also
makes ZONE_DMA optional, so the number of zones can be reduced to one.

ZONE_DMA is usually used for ISA DMA devices.  There are a number of reasons
why we would not want to have ZONE_DMA:

1. Some arches do not need ZONE_DMA at all.

2. With the advent of IOMMUs DMA zones are no longer needed.
   The necessity of DMA zones may drastically be reduced
   in the future. This patchset allows a compilation of
   a kernel without that overhead.

3. Devices that require ISA DMA are becoming rare these days. None of
   my systems need ISA DMA.

4. The presence of an additional zone unnecessarily complicates
   VM operations because it must be scanned and balancing
   logic must operate on it.

5. With only ZONE_NORMAL one can reach the situation where
   we have only one zone. This will allow the unrolling of many
   loops in the VM and allows the optimization of various
   code paths in the VM.

6. Having only a single zone in a NUMA system results in a
   1-1 correspondence between nodes and zones. Various additional
   optimizations to critical VM paths become possible.

Many systems today can operate just fine with a single zone.  If you look at
what is in ZONE_DMA then one usually sees that nothing uses it.  The DMA slabs
are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
will be empty instead).

On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.  Why
constantly look at an empty zone in /proc/zoneinfo and empty slab in
/proc/slabinfo?  Non-i386 systems also frequently have no need for ZONE_DMA,
and the zones stay empty.

The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

The RFC posted earlier (see
http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
of #ifdefs in it.  An effort has been made to minimize the number of #ifdefs
and make this as compact as possible.  The job was made much easier by the
ongoing efforts of others to extract common arch specific functionality.

I have been running this for awhile now on my desktop and finally Linux is
using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:

christoph@pentium940:~$ cat /proc/zoneinfo
Node 0, zone   Normal
  pages free     4435
        min      1448
        low      1810
        high     2172
        active   241786
        inactive 210170
        scanned  0 (a: 0 i: 0)
        spanned  524224
        present  524224
    nr_anon_pages 61680
    nr_mapped    14271
    nr_file_pages 390264
    nr_slab_reclaimable 27564
    nr_slab_unreclaimable 1793
    nr_page_table_pages 449
    nr_dirty     39
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    cpu: 0 pcp: 0
              count: 156
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 9
              high:  62
              batch: 15
  vm stats threshold: 20
    cpu: 1 pcp: 0
              count: 177
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 12
              high:  62
              batch: 15
  vm stats threshold: 20
  all_unreclaimable: 0
  prev_priority:     12
  temp_priority:     12
  start_pfn:         0

This patch:

In two places in the VM we use ZONE_DMA to refer to the first zone.  If
ZONE_DMA is optional then other zones may be first.  So simply replace
ZONE_DMA with zone 0.

This also fixes ZONETABLE_PGSHIFT.  If we have only a single zone then
ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
number related to a pgdat.  However, we still need a zonetable to index all
the zones for each node if this is a NUMA system.  Therefore define
ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.

[apw@shadowen.org: fix mismerge]
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
65e458d43d [PATCH] Drop get_zone_counts()
Values are available via ZVC sums.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
05a0416be2 [PATCH] Drop __get_zone_counts()
Values are readily available via ZVC per node and global sums.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
9195481d2f [PATCH] Drop nr_free_pages_pgdat()
Function is unnecessary now.  We can use the summing features of the ZVCs to
get the values we need.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
9617729941 [PATCH] Drop free_pages()
nr_free_pages is now a simple access to a global variable.  Make it a macro
instead of a function.

nr_free_pages now requires vmstat.h to be included.  There is one
occurrence in power management where we need to add the include.  Directly
refer to global_page_state() there to clarify why the #include was added.
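
A minimal sketch of the function-to-macro change described above.  The
array, enum and helper names carry a sketch_ prefix because they are
illustrative stand-ins rather than the kernel's own definitions, and the
sketch ignores per-CPU deltas and atomicity:

```c
/* Illustrative counter items; the real enum has many more entries. */
enum sketch_stat_item { SKETCH_NR_FREE_PAGES, SKETCH_NR_STAT_ITEMS };

/* Hypothetical global sums, updated whenever a zone counter changes. */
static long sketch_vm_stat[SKETCH_NR_STAT_ITEMS];

/* Reading a global count is a single array access, no summing loop. */
static inline long sketch_global_page_state(enum sketch_stat_item item)
{
	return sketch_vm_stat[item];
}

/* The former function becomes a thin macro over the global counter. */
#define sketch_nr_free_pages() sketch_global_page_state(SKETCH_NR_FREE_PAGES)
```

Any file using the macro needs the header that declares the counter
array, which is why the extra #include shows up in power management.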

[akpm@osdl.org: arm build fix]
[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
51ed449127 [PATCH] Reorder ZVCs according to cacheline
The global and per zone counter sums are in arrays of longs.  Reorder the ZVCs
so that the most frequently used ZVCs are put into the same cacheline.  That
way calculations of the global, node and per zone vm state touch only a
single cacheline.  This is mostly important for 64 bit systems where one 128
byte cacheline takes only 8 longs.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Christoph Lameter
d23ad42324 [PATCH] Use ZVC for free_pages
This again simplifies some of the VM counter calculations through the use
of the ZVC consolidated counters.

[michal.k.k.piotrowski@gmail.com: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Christoph Lameter
c878538598 [PATCH] Use ZVC for inactive and active counts
The determination of the dirty ratio to determine writeback behavior is
currently based on the number of total pages on the system.

However, not all pages in the system may be dirtied.  Thus the ratio is always
too low and can never reach 100%.  The ratio may be particularly skewed if
large hugepage allocations, slab allocations or device driver buffers make
large sections of memory not available anymore.  In that case we may get into
a situation in which, for example, the background writeback ratio of 40%
cannot be reached anymore, which leads to undesired writeback behavior.

This patchset fixes that issue by determining the ratio based on the actual
pages that may potentially be dirty.  These are the pages on the active and
the inactive list plus free pages.
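
The difference between the two bases can be sketched as follows.  The
struct, its field names and both helpers are hypothetical and only
illustrate the arithmetic; they are not the kernel's writeback code:

```c
/* Hypothetical snapshot of page counts; field names are illustrative. */
struct sketch_vm_counts {
	unsigned long total;    /* all pages in the system */
	unsigned long active;   /* pages on the active list */
	unsigned long inactive; /* pages on the inactive list */
	unsigned long free;     /* free pages */
};

/* Pages that may potentially become dirty: active + inactive + free. */
static unsigned long sketch_dirtyable_pages(const struct sketch_vm_counts *c)
{
	return c->active + c->inactive + c->free;
}

/* Number of dirty pages at which a percentage threshold is reached. */
static unsigned long sketch_dirty_limit(unsigned long base_pages,
					unsigned int percent)
{
	return base_pages * percent / 100;
}
```

With 1000000 total pages of which only 400000 are dirtyable, a 40% limit
computed from the total is 400000 pages, so it is reached only once every
dirtyable page is dirty.  Computed from the dirtyable pages, the limit is
160000 pages.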

The problem with those counts has so far been that it is expensive to
calculate these because counts from multiple nodes and multiple zones will
have to be summed up.  This patchset makes these counters ZVC counters.  This
means that a current sum per zone, per node and for the whole system is always
available via global variables and not expensive anymore to calculate.
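
A simplified sketch of why the sums become cheap: every update is applied
to the per-zone counter and to the global sum at the same time, so
reading a total never loops over zones.  All names here are hypothetical,
and the sketch leaves out the per-node sums, per-CPU deltas and atomic
operations a real implementation would need:

```c
#define SKETCH_NR_ZONES 3

enum sketch_zvc_item {
	SKETCH_ZVC_ACTIVE,
	SKETCH_ZVC_INACTIVE,
	SKETCH_ZVC_FREE,
	SKETCH_ZVC_NR_ITEMS
};

/* One counter array per zone plus one running global sum per item. */
static long sketch_zone_stat[SKETCH_NR_ZONES][SKETCH_ZVC_NR_ITEMS];
static long sketch_global_stat[SKETCH_ZVC_NR_ITEMS];

/* Apply a delta to the zone counter and keep the global sum in step. */
static void sketch_zvc_mod(int zone, enum sketch_zvc_item item, long delta)
{
	sketch_zone_stat[zone][item] += delta;
	sketch_global_stat[item] += delta;
}

/* Per-zone value: a single read. */
static long sketch_zvc_zone(int zone, enum sketch_zvc_item item)
{
	return sketch_zone_stat[zone][item];
}

/* System-wide value: also a single read, no summing over zones. */
static long sketch_zvc_global(enum sketch_zvc_item item)
{
	return sketch_global_stat[item];
}
```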

The patchset results in some other good side effects:

- Removal of the various functions that sum up free, active and inactive
  page counts

- Cleanup of the functions that display information via the proc filesystem.

This patch:

The use of a ZVC for nr_inactive and nr_active allows a simplification of some
counter operations.  More ZVC functionality is used for sums etc. in the
following patches.

[akpm@osdl.org: UP build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Hugh Dickins
c3704ceb4a [PATCH] page_mkwrite caller race fix
After do_wp_page has tested page_mkwrite, it must release old_page after
acquiring page table lock, not before: at some stage that ordering got
reversed, leaving a (very unlikely) window in which old_page might be
truncated, freed, and reused in the same position.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Andrew Morton
5a88a13d06 [PATCH] /proc/zoneinfo: fix vm stats display
This early break prevents us from displaying info for the vm stats thresholds
if the zone doesn't have any pages in its per-cpu pagesets.

So my 800MB i386 box says:

Node 0, zone      DMA
  pages free     2365
        min      16
        low      20
        high     24
        active   0
        inactive 0
        scanned  0 (a: 0 i: 0)
        spanned  4096
        present  4044
    nr_anon_pages 0
    nr_mapped    1
    nr_file_pages 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
        protection: (0, 868, 868)
  pagesets
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         0
Node 0, zone   Normal
  pages free     199713
        min      934
        low      1167
        high     1401
        active   10215
        inactive 4507
        scanned  0 (a: 0 i: 0)
        spanned  225280
        present  222420
    nr_anon_pages 2685
    nr_mapped    1110
    nr_file_pages 12055
    nr_slab_reclaimable 2216
    nr_slab_unreclaimable 1527
    nr_page_table_pages 213
    nr_dirty     0
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
        protection: (0, 0, 0)
  pagesets
    cpu: 0 pcp: 0
              count: 152
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 13
              high:  62
              batch: 15
  vm stats threshold: 16
    cpu: 1 pcp: 0
              count: 34
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 10
              high:  62
              batch: 15
  vm stats threshold: 16
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096

Just nuke all that search-for-the-first-non-empty-pageset code.  Dunno why it
was there in the first place..

Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Mel Gorman
a6af2bc3d5 [PATCH] Avoid excessive sorting of early_node_map[]
find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
early_node_map[] on every call.  This is an excessive amount of sorting that
can be avoided.  This patch always searches the whole early_node_map[]
in find_min_pfn_for_node() instead of returning the first value found.  The
map is then only sorted once when required.  Successfully boot tested on a
number of machines.
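
The idea of scanning the whole map instead of sorting it first can be
sketched like this.  The struct layout, the function name and the choice
to return 0 when a node has no ranges are assumptions made for the
sketch, not the kernel's definitions:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical entry of an unsorted node map. */
struct sketch_node_range {
	int nid;                 /* node id */
	unsigned long start_pfn; /* first page frame of the range */
	unsigned long end_pfn;   /* one past the last page frame */
};

/*
 * Return the lowest start_pfn registered for node nid.  Every entry is
 * examined, so the result does not depend on the map being sorted.
 * Returns 0 if the node has no ranges (a choice made for this sketch).
 */
static unsigned long sketch_find_min_pfn(const struct sketch_node_range *map,
					 size_t nr, int nid)
{
	unsigned long min_pfn = ULONG_MAX;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (map[i].nid == nid && map[i].start_pfn < min_pfn)
			min_pfn = map[i].start_pfn;
	}

	return min_pfn == ULONG_MAX ? 0 : min_pfn;
}
```

A linear scan costs O(n) per call with no side effects, whereas sorting
on every call costs O(n log n) and rewrites the map each time.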

[akpm@osdl.org: cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Christoph Lameter
7c5cae368a [PATCH] slab: use parameter passed to cache_reap to determine pointer to work structure
Use the pointer passed to cache_reap to determine the work pointer and
consolidate exit paths.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Pekka Enberg
8c8cc2c10c [PATCH] slab: cache alloc cleanups
Clean up __cache_alloc and __cache_alloc_node functions a bit.  We no
longer need to do NUMA_BUILD tricks and the UMA allocation path is much
simpler.  No functional changes in this patch.

Note: saves a few kernel text bytes on an x86 NUMA build due to using gotos
in __cache_alloc_node() and moving the __GFP_THISNODE check into
fallback_alloc().

Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Christoph Lameter <christoph@lameter.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:16 -08:00
Pekka Enberg
6e40e73097 [PATCH] slab: remove broken PageSlab check from kfree_debugcheck
The PageSlab debug check in kfree_debugcheck() is broken for compound
pages.  It is also redundant as we already do BUG_ON for non-slab pages in
page_get_cache() and page_get_slab() which are always called before we free
any actual objects.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:16 -08:00