Commit graph

8319 commits

Author SHA1 Message Date
Stephen Rothwell
5c45708352 [SPARC/64]: Consolidate of_register_driver
Also of_unregister_driver.  These will shortly be used by the
PowerPC code as well.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:17:42 -07:00
Stephen Hemminger
c264c3dee9 napi_synchronize: waiting for NAPI
Some drivers with shared NAPI need a synchronization barrier.
Also suggested by Benjamin Herrenschmidt for EMAC.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-10-17 20:17:34 -04:00
Aneesh Kumar K.V
d8dd0b4543 ext4: Convert ext4_extent_idx.ei_leaf to ext4_extent_idx.ei_leaf_lo
Convert ext4_extent_idx.ei_leaf to ext4_extent_idx.ei_leaf_lo.
This helps in finding BUGs due to direct partial access of
these split 48 bit values.
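
As an illustration of why the rename helps: a 48-bit block number stored as a
32-bit low half plus a 16-bit high half has to be read through a helper that
combines both halves. The sketch below is a userspace illustration with
host-endian fields; the struct and helper names are hypothetical and this is
not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical index entry; the real on-disk fields are little-endian. */
struct idx_sketch {
    uint32_t leaf_lo;   /* low 32 bits of the 48-bit block number */
    uint16_t leaf_hi;   /* high 16 bits of the 48-bit block number */
};

/* Combine both halves; reading leaf_lo alone silently drops bits 32..47. */
static uint64_t idx_leaf_block(const struct idx_sketch *ix)
{
    return ((uint64_t)ix->leaf_hi << 32) | ix->leaf_lo;
}

/* Split a block number back into the two stored halves. */
static void idx_store_leaf_block(struct idx_sketch *ix, uint64_t block)
{
    assert(block < (UINT64_C(1) << 48));    /* must fit in 48 bits */
    ix->leaf_lo = (uint32_t)(block & 0xffffffffu);
    ix->leaf_hi = (uint16_t)(block >> 32);
}
```

With the field renamed to carry a _lo suffix, any remaining direct use of the
old name fails to compile, which points at code that should be calling a
helper like the one above.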

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:03 -04:00
Aneesh Kumar K.V
b377611d11 ext4: Convert ext4_extent.ee_start to ext4_extent.ee_start_lo
Convert ext4_extent.ee_start to ext4_extent.ee_start_lo
This helps in finding BUGs due to direct partial access of
these split 48 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:03 -04:00
Aneesh Kumar K.V
308ba3ece7 ext4: Convert s_r_blocks_count and s_free_blocks_count
Convert s_r_blocks_count and s_free_blocks_count to
s_r_blocks_count_lo and s_free_blocks_count_lo

This helps in finding BUGs due to direct partial access of
these split 64 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
6bc9feff14 ext4: Convert s_blocks_count to s_blocks_count_lo
Convert s_blocks_count to s_blocks_count_lo
This helps in finding BUGs due to direct partial access of
these split 64 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
5272f83727 ext4: Convert bg_inode_bitmap and bg_inode_table
Convert bg_inode_bitmap and bg_inode_table to bg_inode_bitmap_lo
and bg_inode_table_lo.  This helps in finding BUGs due to
direct partial access of these split 64 bit values

Also fix one direct partial access

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
3a14589cce ext4: Convert bg_block_bitmap to bg_block_bitmap_lo
Convert bg_block_bitmap to bg_block_bitmap_lo
This helps in catching some BUGS due to direct
partial access of these split fields.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:01 -04:00
Jose R. Santos
ce42158179 ext4: FLEX_BG Kernel support v2.
This feature relaxes check restrictions on where each block group's metadata
is located within the storage media.  This allows for the allocation
of bitmaps or inode tables outside the block group boundaries in cases
where bad blocks force us to look for new blocks which the owning block
group cannot satisfy.  This will also allow for new metadata allocation
schemes to improve performance and scalability.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-10-17 18:50:01 -04:00
Aneesh Kumar K.V
c1bddad949 ext4: Fix sparse warnings
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-10-17 18:50:01 -04:00
Andreas Dilger
717d50e497 Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use.  This is the most time consuming part
of the filesystem check.  The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.

With this feature, there is a high water mark of used inodes for each block
group.  Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time.  A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.

The feature is enabled through a mkfs option

	mke2fs /dev/ -O uninit_groups

A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.

The patches have been stress tested with fsstress and fsx.  In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem.  In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.

The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.

In each group descriptor if we have

EXT4_BG_INODE_UNINIT set in bg_flags:
        Inode table is not initialized/used in this group. So we can skip
        the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
        No block in the group is used. So we can skip the block bitmap
        verification for this group.

We also add two new fields to group descriptor as a part of
uninitialized group patch.

        __le16  bg_itable_unused;       /* Unused inodes count */
        __le16  bg_checksum;            /* crc16(sb_uuid+group+desc) */

bg_itable_unused:

If EXT4_BG_INODE_UNINIT is not set in bg_flags,
bg_itable_unused gives the offset within the inode table
up to which the inodes are used. fsck can use this
to skip the inodes that are marked unused.

bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure the group descriptor
is not corrupt. We add a checksum to the group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group as used.
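
The rules above can be sketched as a small decision helper for a checker.
This is a userspace illustration only: the struct, the helper names and the
flag values are assumptions made for the sketch, and bg_itable_unused is
treated as a count of unused inodes at the end of the table, following the
field comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values are assumptions for this sketch. */
#define BG_INODE_UNINIT 0x0001  /* inode table not initialized/used */
#define BG_BLOCK_UNINIT 0x0002  /* no block in the group is used */

/* Hypothetical, trimmed-down group descriptor. */
struct group_desc_sketch {
    uint16_t bg_flags;
    uint16_t bg_itable_unused;  /* unused inodes count */
};

/*
 * Number of inodes a checker has to scan in this group.
 * checksum_ok == 0 means the descriptor is corrupt, so nothing in it is
 * trusted and the whole group is treated as in use.
 */
static uint32_t inodes_to_scan(const struct group_desc_sketch *gd,
                               uint32_t inodes_per_group, int checksum_ok)
{
    assert(gd != NULL);
    if (!checksum_ok)
        return inodes_per_group;
    if (gd->bg_flags & BG_INODE_UNINIT)
        return 0;
    if (gd->bg_itable_unused > inodes_per_group)
        return inodes_per_group;    /* implausible count: scan everything */
    return inodes_per_group - gd->bg_itable_unused;
}

/* Whether the block bitmap of this group needs to be verified. */
static int must_check_block_bitmap(const struct group_desc_sketch *gd,
                                   int checksum_ok)
{
    assert(gd != NULL);
    return !checksum_ok || !(gd->bg_flags & BG_BLOCK_UNINIT);
}
```

Falling back to a full scan whenever the checksum does not verify mirrors the
conservative choice described above: a corrupt descriptor must never cause
used inodes or blocks to be skipped.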

Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:00 -04:00
Eric Sandeen
4074fe3736 ext4: remove #ifdef CONFIG_EXT4_INDEX
CONFIG_EXT4_INDEX is not an exposed config option in the kernel, and it is
unconditionally defined in ext4_fs.h.  tune2fs is already able to turn off
dir indexing, so at this point it's just cluttering up the code.  Remove
it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-10-17 18:50:00 -04:00
Coly Li
f077d0d7ea ext4: Remove (partial, never completed) fragment support
Fragment support in ext2/3/4 was never implemented, and it probably will
never be implemented.   So remove it from ext4.

Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:49:59 -04:00
Mingming Cao
cd02ff0b14 jbd2: JBD_XXX to JBD2_XXX naming cleanup
change JBD_XXX macros to JBD2_XXX in JBD2/Ext4

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:49:58 -04:00
Mingming Cao
2d917969bc JBD2: replace jbd_kmalloc with kmalloc directly.
This patch cleans up jbd_kmalloc and replaces it with kmalloc directly.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2007-10-17 18:49:57 -04:00
Mingming Cao
a5005da204 JBD: replace jbd_kmalloc with kmalloc directly
This patch cleans up jbd_kmalloc and replaces it with kmalloc directly.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2007-10-17 18:49:57 -04:00
Mingming Cao
af1e76d6b3 JBD2: jbd2 slab allocation cleanups
JBD2: Replace slab allocations with page allocations

JBD2 allocates memory for committed_data and frozen_data from slab. However,
JBD2 should not pass slab pages down to the block layer. Use page allocator
pages instead. This will also prepare JBD for the large blocksize patchset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2007-10-17 18:49:56 -04:00
Mingming Cao
c089d490df JBD: JBD slab allocation cleanups
JBD: Replace slab allocations with page allocations

JBD allocates memory for committed_data and frozen_data from slab. However,
JBD should not pass slab pages down to the block layer. Use page allocator
pages instead. This will also prepare JBD for the large blocksize patchset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2007-10-17 18:49:56 -04:00
Pierre Ossman
727c26ed78 net: libertas sdio driver
Add driver for Marvell's Libertas 8385 and 8686 wifi chips.

Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Acked-by: Dan Williams <dcbw@redhat.com>
2007-10-17 22:51:13 +02:00
Linus Torvalds
e6d5a11dad Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  sched: fix new task startup crash
  sched: fix !SYSFS build breakage
  sched: fix improper load balance across sched domain
  sched: more robust sd-sysctl entry freeing
2007-10-17 09:11:18 -07:00
Linus Torvalds
c548f08a4f Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (24 commits)
  [POWERPC] Fix vmemmap warning in init_64.c
  [POWERPC] Fix 64 bits vDSO DWARF info for CR register
  [POWERPC] Add 1TB workaround for PA6T
  [POWERPC] Enable NO_HZ and high res timers for pseries and ppc64 configs
  [POWERPC] Quieten cache information at boot
  [POWERPC] Quieten clockevent printk
  [POWERPC] Enable SLUB in *_defconfig
  [POWERPC] Fix 1TB segment detection
  [POWERPC] Fix iSeries_hpte_insert prototype
  [POWERPC] Fix copyright symbol
  [POWERPC] ibmebus: Move to of_device and of_platform_driver, match eHCA and eHEA drivers
  [POWERPC] ibmebus: Add device creation and bus probing based on of_device
  [POWERPC] ibmebus: Remove bus match/probe/remove functions
  [POWERPC] Move of_device allocation into of_device.[ch]
  [POWERPC] mpc52xx: device tree changes for FEC and MDIO
  [POWERPC] bestcomm: GenBD task support
  [POWERPC] bestcomm: FEC task support
  [POWERPC] bestcomm: ATA task support
  [POWERPC] bestcomm: core bestcomm support for Freescale MPC5200
  [POWERPC] mpc52xx: Update mpc52xx_psc structure with B revision changes
  ...
2007-10-17 09:05:55 -07:00
Linus Torvalds
5c8e191e84 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-x86setup
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-x86setup:
  Remove magic macros for screen_info structure members
  [x86] remove uses of magic macros for boot_params access
2007-10-17 09:00:30 -07:00
Adrian Bunk
cbfee34520 security/ cleanups
This patch contains the following cleanups that are now possible:
- remove the unused security_operations->inode_xattr_getsuffix
- remove the no longer used security_operations->unregister_security
- remove some no longer required exit code
- remove a bunch of no longer used exports

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
Serge E. Hallyn
b53767719b Implement file posix capabilities
Implement file posix capabilities.  This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.

This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php.  For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.

Changelog:
	Nov 27:
	Incorporate fixes from Andrew Morton
	(security-introduce-file-caps-tweaks and
	security-introduce-file-caps-warning-fix)
	Fix Kconfig dependency.
	Fix change signaling behavior when file caps are not compiled in.

	Nov 13:
	Integrate comments from Alexey: Remove CONFIG_ ifdef from
	capability.h, and use %zd for printing a size_t.

	Nov 13:
	Fix endianness warnings by sparse as suggested by Alexey
	Dobriyan.

	Nov 09:
	Address warnings of unused variables at cap_bprm_set_security
	when file capabilities are disabled, and simultaneously clean
	up the code a little, by pulling the new code into a helper
	function.

	Nov 08:
	For pointers to required userspace tools and how to use
	them, see http://www.friedhoff.org/fscaps.html.

	Nov 07:
	Fix the calculation of the highest bit checked in
	check_cap_sanity().

	Nov 07:
	Allow file caps to be enabled without CONFIG_SECURITY, since
	capabilities are the default.
	Hook cap_task_setscheduler when !CONFIG_SECURITY.
	Move capable(TASK_KILL) to end of cap_task_kill to reduce
	audit messages.

	Nov 05:
	Add secondary calls in selinux/hooks.c to task_setioprio and
	task_setscheduler so that selinux and capabilities with file
	cap support can be stacked.

	Sep 05:
	As Seth Arnold points out, uid checks are out of place
	for capability code.

	Sep 01:
	Define task_setscheduler, task_setioprio, cap_task_kill, and
	task_setnice to make sure a user cannot affect a process in which
	they called a program with some fscaps.

	One remaining question is the note under task_setscheduler: are we
	ok with CAP_SYS_NICE being sufficient to confine a process to a
	cpuset?

	It is a semantic change, as without fscaps, attach_task doesn't
	allow CAP_SYS_NICE to override the uid equivalence check.  But since
	it uses security_task_setscheduler, which elsewhere is used where
	CAP_SYS_NICE can be used to override the uid equivalence check,
	fixing it might be tough.

	     task_setscheduler
		 note: this also controls cpuset:attach_task.  Are we ok with
		     CAP_SYS_NICE being used to confine to a cpuset?
	     task_setioprio
	     task_setnice
		 sys_setpriority uses this (through set_one_prio) for another
		 process.  Need same checks as setrlimit

	Aug 21:
	Updated secureexec implementation to reflect the fact that
	euid and uid might be the same and nonzero, but the process
	might still have elevated caps.

	Aug 15:
	Handle endianness of xattrs.
	Enforce capability version match between kernel and disk.
	Enforce that no bits beyond the known max capability are
	set, else return -EPERM.
	With this extra processing, it may be worth reconsidering
	doing all the work at bprm_set_security rather than
	d_instantiate.

	Aug 10:
	Always call getxattr at bprm_set_security, rather than
	caching it at d_instantiate.

[morgan@kernel.org: file-caps clean up for linux/capability.h]
[bunk@kernel.org: unexport cap_inode_killpriv]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
Alexey Dobriyan
57c521ce61 ifdef struct task_struct::security
For those who don't care about CONFIG_SECURITY.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
James Morris
20510f2f4e security: Convert LSM into a static interface
Convert LSM into a static interface, as the ability to unload a security
module is not required by in-tree users and potentially complicates the
overall security architecture.

Needlessly exported LSM symbols have been unexported, to help reduce API
abuse.

Parameters for the capability and root_plug modules are now specified
at boot.

The SECURITY_FRAMEWORK_VERSION macro has also been removed.

In a nutshell, there is no safe way to unload an LSM.  The modular interface
is thus unnecessary and broken infrastructure.  It is used only by out-of-tree
modules, which are often binary-only, illegal, abusive of the API and
dangerous, e.g.  silently re-vectoring SELinux.

[akpm@linux-foundation.org: cleanups]
[akpm@linux-foundation.org: USB Kconfig fix]
[randy.dunlap@oracle.com: fix LSM kernel-doc]
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
Dave Hansen
ce8d2cdf3d r/o bind mounts: filesystem helpers for custom 'struct file's
Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem.  In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.

This has a number of uses.  It allows chroots to have parts of filesystems
writable.  It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
some filesystems.  This also replaces patches that vserver has had out of the
tree for several years.

It allows security enhancement by making sure that parts of your filesystem
are read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated.  I've been using the following script to test that the feature is
working as desired.  It takes a directory and makes a regular bind and a r/o
bind mount of it.  It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.

This patch:

Some filesystems forego the vfs and may_open() and create their own 'struct
file's.

This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.

Also, rename an existing, static-scope init_file() to a less generic name.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:04 -07:00
Bjorn Helgaas
402b310cb6 PNP: remove null pointer checks
Remove some null pointer checks.  Null pointers in these areas indicate
programming errors, and I think it's better to oops immediately rather than
return an error that is easily ignored.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:04 -07:00
Adrian Bunk
5ebf2c1260 bitmap.h: remove dead artifacts
bitmap_active() no longer exists and BITMAP_ACTIVE is no longer used.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
Martin J. Bligh
a686cd898b ext2 reservations
Val's cross-port of the ext3 reservations code into ext2.

[mbligh@mbligh.org: Small type error for printk]
[akpm@linux-foundation.org: fix types, sync with ext3]
[mbligh@mbligh.org: Bring ext2 reservations code in line with latest ext3]
[akpm@linux-foundation.org: kill noisy printk]
[akpm@linux-foundation.org: remember to dirty the gdp's block]
[akpm@linux-foundation.org: cross-port the missed 5dea5176e5]
[akpm@linux-foundation.org: cross-port e6022603b9]
[akpm@linux-foundation.org: Port the omitted 08fb306fe6]
[akpm@linux-foundation.org: Backport the missed 20acaa18d0]
[akpm@linux-foundation.org: fixes]
[cmm@us.ibm.com: fix reservation extension]
[bunk@stusta.de: make ext2_get_blocks() static]
[hugh@veritas.com: fix hang]
[hugh@veritas.com: ext2_new_blocks should reset the reservation window size]
[hugh@veritas.com: ext2 balloc: fix off-by-one against rsv_end]
[hugh@veritas.com: grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv should treat it as such]
[hugh@veritas.com: rbtree usage cleanup]
[pbadari@us.ibm.com: Fix for ext2 reservation]
[bunk@kernel.org: remove fs/ext2/balloc.c:reserve_blocks()]
[hugh@veritas.com: ext2 balloc: use io_error label]
Cc: "Martin J. Bligh" <mbligh@mbligh.org>
Cc: Valerie Henson <val_henson@linux.intel.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Joern Engel
1c0eeaf569 introduce I_SYNC
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect.  One of the purposes
now uses the new I_SYNC bit.

Also document the various bits and change their order from historical to
logical.

[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Fengguang Wu
2e6883bdf4 writeback: introduce writeback_control.more_io to indicate more io
After dirtying a 100M file, the normal behavior is to start the writeback
for all data after a 30s delay.  But sometimes the following happens instead:

	- after 30s:    ~4M
	- after 5s:     ~4M
	- after 5s:     all remaining 92M

Some analysis shows that the internal io dispatch queues go like this:

		s_io            s_more_io
		-------------------------
	1)	100M,1K         0
	2)	1K              96M
	3)	0               96M

1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)

nr_to_write > 0 in (3) fools the upper layer into thinking that the data have
all been written out.  The big dirty file is actually still sitting in
s_more_io.  We cannot simply splice s_more_io back to s_io as soon as s_io
becomes empty, and let the loop in generic_sync_sb_inodes() continue: this may
starve newly expired inodes in s_dirty.  It is also not an option to draw
inodes from both s_more_io and s_dirty, and let the loop go on: this might
lead to live locks, and might also starve other superblocks in sync time
(well, kupdate may still starve some superblocks, but that's another bug).

We have to return when a full scan of s_io completes.  So nr_to_write > 0 does
not necessarily mean that "all data are written".  This patch introduces a
flag writeback_control.more_io to indicate this situation.  With it the big
dirty file no longer has to wait for the next kupdate invocation 5s later.
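
The stop condition described above can be sketched as follows. This is a
minimal userspace illustration; struct wbc_sketch and writeback_finished()
are hypothetical stand-ins, not the kernel's writeback_control or its
caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct writeback_control. */
struct wbc_sketch {
    long nr_to_write;   /* page budget left over after one pass */
    int more_io;        /* set when inodes are still waiting for more io */
};

/*
 * Returns 1 when the caller may stop writing back.
 * Leftover budget alone is not enough: if more_io is set, data is still
 * parked and another pass is needed.
 */
static int writeback_finished(const struct wbc_sketch *wbc)
{
    assert(wbc != NULL);
    return wbc->nr_to_write > 0 && !wbc->more_io;
}
```

In state (3) of the table above, the budget is positive but the 96M file is
still queued, so a caller using this check would keep going instead of
waiting for the next kupdate run.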

Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Fengguang Wu
08d8e9749e writeback: fix ntfs with sb_has_dirty_inodes()
NTFS's if-condition on dirty inodes is not complete.  Fix it with
sb_has_dirty_inodes().

Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Ken Chen
0e0f4fc22e writeback: fix periodic superblock dirty inode flushing
The current -mm tree has a bucketful of bug fixes in the periodic writeback
path.  However, we still hit a glitch where dirty pages on a given inode aren't
completely flushed to the disk, and the system will accumulate a large amount
of dirty pages beyond what dirty_expire_interval is designed for.

The problem is that __sync_single_inode() will move an inode to the
sb->s_dirty list even when there are more pending dirty pages on that inode.
If there is another inode with a small number of dirty pages, we hit a case
where the loop iteration in wb_kupdate() terminates prematurely because
wbc.nr_to_write > 0.  This leaves the inode that has a large amount of dirty
pages behind, and it has to wait for another dirty_writeback_interval before
we flush it again.  We effectively only write out MAX_WRITEBACK_PAGES every
dirty_writeback_interval.  If the rate of dirtying is sufficiently high, the
system will start to accumulate a large number of dirty pages.

So fix it by having another sb->s_more_io list on which to park the inode
while we iterate through sb->s_io and to allow each dirty inode which resides
on that sb to have an equal chance of flushing some amount of dirty pages.
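
A toy model of the parking idea, assuming a fixed per-inode chunk per pass:
each inode flushes at most one chunk, and inodes that still have dirty pages
are counted as parked for the next pass. The function name and the chunk
constant are hypothetical, and the kernel works on inode lists rather than an
array of counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-inode cap for one pass, standing in for MAX_WRITEBACK_PAGES. */
#define CHUNK_PAGES 1024L

/*
 * One pass over the inodes queued for io. Every inode gets the same chance
 * to flush up to CHUNK_PAGES pages. Returns how many inodes still have
 * dirty pages afterwards, i.e. how many would be parked for another pass.
 */
static size_t flush_pass(long *dirty_pages, size_t n_inodes)
{
    size_t parked = 0;

    assert(dirty_pages != NULL || n_inodes == 0);
    for (size_t i = 0; i < n_inodes; i++) {
        long written = dirty_pages[i] < CHUNK_PAGES ? dirty_pages[i]
                                                    : CHUNK_PAGES;
        dirty_pages[i] -= written;
        if (dirty_pages[i] > 0)
            parked++;       /* not clean yet: revisit on the next pass */
    }
    return parked;
}
```

A caller would repeat flush_pass() while it returns a non-zero count, so a
large file is not left behind just because a small one finished early.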

Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Ingo Molnar
4749252776 printk: add KERN_CONT annotation
printk: add the KERN_CONT annotation (which is an empty string, but via
which checkpatch.pl can notice that the lacking KERN_ level is fine).
This is useful for multiple calls of hand-crafted printk output done by
early debug code or similar.
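
A userspace sketch of the pattern, using snprintf() in place of printk().
The macro definitions here are stand-ins chosen for the illustration, and
format_regs() is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-ins for the kernel macros; the values are assumptions. */
#define KERN_DEBUG "<7>"
#define KERN_CONT  ""

/*
 * Build one log line from several formatted pieces, the way early debug
 * code issues several printk() calls for a single line of output.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_regs(char *buf, size_t len, const unsigned *regs, size_t n)
{
    size_t used;
    int r;

    assert(buf != NULL && len > 0);
    r = snprintf(buf, len, KERN_DEBUG "regs:");
    if (r < 0 || (size_t)r >= len)
        return -1;
    used = (size_t)r;

    for (size_t i = 0; i < n; i++) {
        /* Continuation piece: KERN_CONT marks the missing level as intended. */
        r = snprintf(buf + used, len - used, KERN_CONT " %08x", regs[i]);
        if (r < 0 || (size_t)r >= len - used)
            return -1;
        used += (size_t)r;
    }

    r = snprintf(buf + used, len - used, KERN_CONT "\n");
    if (r < 0 || (size_t)r >= len - used)
        return -1;
    return (int)(used + (size_t)r);
}
```

Because the annotation expands to nothing, the output is unchanged; its only
job is to tell a reader or a checker that the level was left out on purpose.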

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00
Ulrich Drepper
22d2b35b20 F_DUPFD_CLOEXEC implementation
One more small change to extend the availability of creation of file
descriptors with FD_CLOEXEC set.  Adding a new command to fcntl() requires
no new system call and the overall impact on code size is minimal.

If this patch gets accepted we will also add this change to the next
revision of the POSIX spec.

To test the patch, use the following little program.  Adjust the value of
F_DUPFD_CLOEXEC appropriately.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_DUPFD_CLOEXEC
# define F_DUPFD_CLOEXEC 12
#endif

int
main (int argc, char *argv[])
{
  if  (argc > 1)
    {
      if (fcntl (3, F_GETFD) == 0)
	{
	  puts ("descriptor not closed");
	  exit (1);
	}
      if (errno != EBADF)
	{
	  puts ("error not EBADF");
	  exit (1);
	}

      exit (0);
    }
  int fd = fcntl (STDOUT_FILENO, F_DUPFD_CLOEXEC, 0);
  if (fd == -1 && errno == EINVAL)
    {
      puts ("F_DUPFD_CLOEXEC not supported");
      return 0;
    }
  if (fd != 3)
    {
      puts ("program called with descriptors other than 0,1,2");
      return 1;
    }

  execl ("/proc/self/exe", "/proc/self/exe", "1", NULL);
  puts ("execl failed");
  return 1;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
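
For everyday use, as opposed to the exec-based test above, the new command
can be wrapped in a small helper. This is a sketch that assumes the C library
headers already define F_DUPFD_CLOEXEC; dup_cloexec(), is_cloexec() and
check_dup_cloexec() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Duplicate fd onto the lowest free descriptor >= min_fd with FD_CLOEXEC
 * already set, so there is no window between duplicating the descriptor
 * and setting the flag with a separate fcntl(F_SETFD) call.
 * Returns the new descriptor, or -1 with errno set.
 */
static int dup_cloexec(int fd, int min_fd)
{
    return fcntl(fd, F_DUPFD_CLOEXEC, min_fd);
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if not, -1 on error. */
static int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Self-check: the new duplicate carries the flag, a plain dup() does not. */
static void check_dup_cloexec(void)
{
    int a = dup_cloexec(STDOUT_FILENO, 0);
    int b = dup(STDOUT_FILENO);

    assert(a >= 0 && is_cloexec(a) == 1);
    assert(b >= 0 && is_cloexec(b) == 0);
    close(a);
    close(b);
}
```

On a kernel without the command, dup_cloexec() is expected to fail with
EINVAL, which is the case the test program above reports as "not supported".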

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00
Alexey Dobriyan
18796aa002 task_struct: move ->fpu_counter and ->oomkilladj
There is a nice 2-byte hole after the struct task_struct::ioprio field
into which we can put two 1-byte fields: ->fpu_counter and ->oomkilladj.
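
The effect can be seen on a miniature layout. The two structs below are
hypothetical and far smaller than task_struct; the field types are
assumptions, and the exact sizes depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the 2-byte field is followed by padding up to pointer alignment. */
struct before_sketch {
    unsigned short ioprio;      /* 2 bytes, then a padding hole */
    void *next;                 /* pointer forces alignment */
    unsigned char fpu_counter;  /* 1-byte fields living elsewhere */
    unsigned char oomkilladj;
};

/* After: the two 1-byte fields sit in the hole right after ioprio. */
struct after_sketch {
    unsigned short ioprio;
    unsigned char fpu_counter;
    unsigned char oomkilladj;
    void *next;
};

/* Moving the fields must not push the pointer or grow the struct. */
static void check_layout(void)
{
    assert(offsetof(struct after_sketch, next) ==
           offsetof(struct before_sketch, next));
    assert(sizeof(struct after_sketch) <= sizeof(struct before_sketch));
}
```

Tools that print struct layouts make such holes easy to spot; the asserts
here only confirm that filling one does not move the following field.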

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00
Davide Libenzi
96358de6bc rename signalfd_siginfo fields
At Michael Kerrisk's request, the following patch renames signalfd_siginfo
fields in order to keep them consistent with the siginfo_t ones.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00
Eric Sandeen
059590f495 ext3: remove #ifdef CONFIG_EXT3_INDEX
CONFIG_EXT3_INDEX is not an exposed config option in the kernel, and it is
unconditionally defined in ext3_fs.h.  tune2fs is already able to turn off
dir indexing, so at this point it's just cluttering up the code.  Remove
it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00
Ahmed S. Darwish
b4471cbb09 Completely remove deprecated IRQ flags (SA_*)
Only very few files use the deprecated SA_* IRQ flags in the latest pull.  This
patch series removes such macros from the tree and transforms old code to the
new IRQF_* flags.

I've grepped the whole tree to make sure that no more files than the patched
ones use such deprecated macros.  I hope this series won't introduce build
errors.

Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:00 -07:00
Andrey Mirkin
fd5eea4214 change inotifyfs magic as the same magic is used for futexfs
Right now futexfs and inotifyfs share one magic, 0xBAD1DEA, which looks a
little bit confusing.  Use 0xBAD1DEA as magic for futexfs and 0x2BAD1DEA as
magic for inotifyfs.

Signed-off-by: Andrey Mirkin <major@openvz.org>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:00 -07:00
Olaf Hering
4f9a58d75b increase AT_VECTOR_SIZE to terminate saved_auxv properly
include/asm-powerpc/elf.h has 6 entries in ARCH_DLINFO.  fs/binfmt_elf.c
has 14 unconditional NEW_AUX_ENT entries and 2 conditional NEW_AUX_ENT
entries.  So in the worst case, saved_auxv does not get an AT_NULL entry at
the end.

The saved_auxv array must be terminated with an AT_NULL entry.  Make the
size of mm_struct->saved_auxv arch dependend, based on the number of
ARCH_DLINFO entries.
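
The sizing argument can be sketched as follows.  This is only an
illustration under stated assumptions: the enum names and the auxv_words()
helper are hypothetical, the entry counts are the ones quoted above for
powerpc, and each auxv entry is taken to be an (id, value) pair of words.

```c
#include <stddef.h>

/* Entry counts quoted in the commit message (powerpc worst case). */
enum {
	ARCH_DLINFO_ENTRIES = 6,	/* entries emitted by ARCH_DLINFO */
	BASE_ENTRIES = 14 + 2,		/* unconditional + conditional NEW_AUX_ENT */
};

/*
 * Hypothetical helper: number of words needed to hold the given entries
 * plus one terminating AT_NULL entry.  Every entry is an (id, value)
 * pair, hence the factor of two.
 */
static size_t auxv_words(size_t arch_entries, size_t base_entries)
{
	return 2 * (arch_entries + base_entries + 1);
}
```

With the counts above, auxv_words(ARCH_DLINFO_ENTRIES, BASE_ENTRIES) comes
to 46 words: 22 entries plus the AT_NULL terminator.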

Signed-off-by: Olaf Hering <olh@suse.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:00 -07:00
Pavel Emelyanov
1efd24fa05 Remove unused member from nsproxy
The nslock spinlock is not used in the kernel at all.  Remove it.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:59 -07:00
Alexey Dobriyan
970a8645ca user.c: #ifdef ->mq_bytes
For those who deselect POSIX message queues.

Reduces SLAB size of user_struct from 64 to 32 bytes here, SLUB size -- from
40 bytes to 32 bytes.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:59 -07:00
Alan Cox
5f519d7281 tty: expose new methods needed for drivers to get termios right
This adds three new functions (or in one case to be more exact makes it
always available)

tty_termios_copy_hw

Copies all the hardware settings from one termios structure to the other.
This is intended for drivers that support little or no hardware setting

tty_termios_encode_baud_rate

Allows you to set the input and output baud rate in a termios structure.  A
driver is supposed to set the resulting baud rate from a request so most
will want to use this function to set the resulting input and output rates
to match the hardware values.  Internally it knows about keeping Bxxx
encoding when possible to maximise compatibility.

tty_encode_baud_rate

As above but for the tty's own current termios structure

I suspect this will initially need some tweaking as it gets enabled by
driver patches over the next few mm cycles so consider this lot -mm only
for the moment so it can stabilize and end up neat before it goes to base.

I've tried not to break any obscure architectures - if you get a speed you
can't represent the code will print warnings on non updated termios systems
but not break.

Once this is merged and seems sane I've got a growing pile of driver
updates to use it - notably for USB serial drivers.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:58 -07:00
Denys Vlasenko
b9ec0339d8 add consts where appropriate in fs/nls/*
Add const modifiers to a few struct nls_table's member pointers in
include/linux/nls.h and adds a lot of const's in fs/nls/*.c files.

Resulting changes as visible by size:

   text    data     bss     dec     hex filename
 113612  481216    2368  597196   91ccc nls.org/built-in.o
 593548    3296     288  597132   91c8c nls/built-in.o

Apparently the compiler managed to optimize code a bit better
because of const-ness.

No other changes are made.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:58 -07:00
Denis V. Lunev
37c42524d6 shrink_dcache_sb speedup
This patch makes shrink_dcache_sb consistent with dentry pruning policy.

On the first pass we iterate over dentry unused list and prepare some
dentries for removal.

However, since the existing code moves evicted dentries to the beginning of
the LRU it can happen that fresh dentries from other superblocks will be
inserted *before* our dentries.

This can result in significant slowdown of shrink_dcache_sb().  Moreover,
for virtual filesystems like unionfs, which can call dput() while dentries
are being killed, the existing code results in O(n^2) complexity.

We observed shrink_dcache_sb() taking 2 minutes with only 35000 dentries.

To avoid these effects we propose to isolate sb dentries at the end
of the LRU list.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrey Mirkin <amirkin@openvz.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:57 -07:00
Emil Medve
1f7c8234c7 Make the pr_*() family of macros in kernel.h complete
Some other pr_*() macros are already defined in kernel.h, but pr_err() was
defined multiple times in several other places.

Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Tony Lindgren <tony@atomide.com>
Reviewed-by: Satyam Sharma <satyam@infradead.org>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:57 -07:00
David Howells
76181c134f KEYS: Make request_key() and co fundamentally asynchronous
Make request_key() and co fundamentally asynchronous to make it easier for
NFS to make use of them.  There are now accessor functions that do
asynchronous constructions, a wait function to wait for construction to
complete, and a completion function for the key type to indicate completion
of construction.

Note that the construction queue is now gone.  Instead, keys under
construction are linked in to the appropriate keyring in advance, and
anyone encountering one must wait for it to be complete before they can use
it.  This is done automatically for userspace.

The following auxiliary changes are also made:

 (1) Key type implementation stuff is split from linux/key.h into
     linux/key-type.h.

 (2) AF_RXRPC provides a way to allocate null rxrpc-type keys so that AFS does
     not need to call key_instantiate_and_link() directly.

 (3) Adjust the debugging macros so that they're -Wformat checked even if
     they are disabled, and make it so they can be enabled simply by defining
     __KDEBUG to be consistent with other code of mine.

 (4) Documentation.
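
The idea behind keeping disabled debugging macros -Wformat checked can be
sketched in userspace as below.  This is not the code from this patch:
DEMO_DEBUG, demo_debug() and demo_step() are hypothetical names, and the
`##__VA_ARGS__` form is a GNU C extension.

```c
#include <stdio.h>

/*
 * When DEMO_DEBUG is not defined the call sits behind "if (0)", so the
 * compiler still type-checks the format string against its arguments
 * but drops the call as dead code.
 */
#ifdef DEMO_DEBUG
#define demo_debug(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
#else
#define demo_debug(fmt, ...) \
	do { if (0) fprintf(stderr, fmt, ##__VA_ARGS__); } while (0)
#endif

/* Hypothetical caller: a mismatched format here warns in both builds. */
static int demo_step(int x)
{
	demo_debug("x=%d\n", x);
	return x + 1;
}
```

An empty macro body would hide format mistakes until someone turned
debugging on; the dead branch keeps them visible in every build.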

[alan@lxorguk.ukuu.org.uk: keys: missing word in documentation]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:57 -07:00
Matti Linnanvuori
f20fda4861 Mutex documentation is unclear about software interrupts, tasklets and timers
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:57 -07:00
Bill Nottingham
2e8ecb9db0 add CONFIG_VT_UNICODE
As of now, the kernel defaults to non-unicode and XLATE for the keyboard.
We've been changing this in Fedora, but that requires patching the defaults
in the kernel.

The attached introduces CONFIG_VT_UNICODE, which sets the console in
unicode mode by default on boot, including both the virtual terminal and
the keyboard driver.

Signed-off-by: Bill Nottingham <notting@redhat.com>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:56 -07:00
Jan Beulich
22e48eaf58 constify string/array kparam tracking structures
.. in an effort to make read-only whatever can be made, so that
CONFIG_DEBUG_RODATA can catch as many issues as possible.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:56 -07:00
Jan Beulich
d5aa0daf6d store __setup_str_* in a more compact way
__setup_str_* are referenced only during boot, hence there's no need to
waste image space for aligning these strings (with the aim of improving
performance).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:56 -07:00
Robert P. J. Day
b311e921b3 Add a "rounddown_pow_of_two" routine to log2.h
To go along with the existing "roundup_pow_of_two" routine, add one for
rounding down since that operation appears to crop up on a regular basis in
the source tree.
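
For illustration, the operation itself can be sketched in plain userspace
C.  This is not the log2.h implementation added by the patch;
rounddown_pow2() is a hypothetical name, and returning 0 for an input of 0
is an assumption made only for this sketch.

```c
#include <stdint.h>

/*
 * Largest power of two that is <= n.
 * n == 0 has no valid answer; this sketch returns 0 in that case.
 */
static uint64_t rounddown_pow2(uint64_t n)
{
	uint64_t p = 1;

	if (n == 0)
		return 0;
	/* Comparing against n / 2 keeps p from overflowing. */
	while (p <= n / 2)
		p *= 2;
	return p;
}
```

For example, rounddown_pow2(5) gives 4 and rounddown_pow2(8) gives 8.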

[m.kozlowski@tuxland.pl: fix unbalanced parentheses]
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:56 -07:00
Jan Kara
8e8934695d quota: send messages via netlink
Implement sending of quota messages via netlink interface.  The advantage
is that in userspace we can better decide what to do with the message - for
example display a dialogue in your X session or just write the message to
the console.  As a bonus, we can get rid of problems with console locking
deep inside filesystem code once we remove the old printing mechanism.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:56 -07:00
Adrian Bunk
b012d346c0 make kernel/profile.c:time_hook static
{,un}register_timer_hook() is the API that should be used.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:55 -07:00
Adrian Bunk
cba4fbbff2 remove include/asm-*/ipc.h
All asm/ipc.h files do only #include <asm-generic/ipc.h>.

This patch therefore removes all include/asm-*/ipc.h files and moves the
contents of include/asm-generic/ipc.h to include/linux/ipc.h.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:55 -07:00
Alexey Dobriyan
4af3c9cc4f Drop some headers from mm.h
mm.h doesn't use directly anything from mutex.h and backing-dev.h, so
remove them and add them back to files which need them.

Cross-compile tested on many configs and archs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:55 -07:00
Paul Clements
7fdfd4065c NBD: allow hung network I/O to be cancelled
Allow NBD I/O to be cancelled when a network outage occurs.  Previously, I/O
would just hang, and if enough I/O was hung in nbd, the system (at least
user-level) would completely hang until a TCP timeout (default, 15 minutes)
occurred.

The patch introduces a new ioctl NBD_SET_TIMEOUT that allows a transmit
timeout value (in seconds) to be specified.  Any network send that exceeds the
timeout will be cancelled and the nbd connection will be shut down.  I've
tested with various timeout values and 6 seconds seems to be a good choice for
the timeout.  If the NBD_SET_TIMEOUT ioctl is not called, you get the old (I/O
hang) behavior.

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:55 -07:00
Alan Cox
328dfd0f78 tty.h: remove dead define
No longer used. TTY_FLIPBUF_SIZE will also go soon but needs a couple of
other cleanups first

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:55 -07:00
Alexey Dobriyan
42b2dd0a02 Shrink task_struct if CONFIG_FUTEX=n
robust_list, compat_robust_list, pi_state_list, pi_state_cache are
only really used if futexes are on.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:55 -07:00
Ken'ichi Ohmichi
bcbba6c10e add-vmcore: add a prefix "VMCOREINFO_" to the vmcoreinfo macros
Add a prefix "VMCOREINFO_" to the vmcoreinfo macros.  Old vmcoreinfo macros
were defined as generic names SYMBOL/SIZE/OFFSET/LENGTH/CONFIG, and it is
impossible to grep for them.  So these names should be changed.  The
discussion is here:
http://www.ussg.iu.edu/hypermail/linux/kernel/0709.1/0415.html

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Ken'ichi Ohmichi
6cfa062f01 add-vmcore: add nodemask_t's size and NR_FREE_PAGES's value to vmcoreinfo_data
[2/3] Add nodemask_t's size and NR_FREE_PAGES's value to vmcoreinfo_data.
  The dump filtering command 'makedumpfile'(v1.1.6 or before) had assumed
  the above values, and it was not good from the reliability viewpoint.
  So makedumpfile v1.2.0 came to need these values and I created the patch
  to let the kernel output them.
  makedumpfile site:
  https://sourceforge.net/projects/makedumpfile/

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Ken'ichi Ohmichi
d768281e97 add-vmcore: cleanup the coding style according to Andrew's comments
[1/3] Cleanup the coding style according to Andrew's comments:
http://lists.infradead.org/pipermail/kexec/2007-August/000522.html
- vmcoreinfo_append_str() should have suitable __attribute__s so that
  the compiler can check its use.
- vmcoreinfo_max_size should have size_t.
- Use get_seconds() instead of xtime.tv_sec.
- Use init_uts_ns.name.release instead of UTS_RELEASE.
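
The first point can be sketched in userspace as follows.  This is an
illustration only, not the kernel function: info_append_str(), info_buf
and info_len are hypothetical stand-ins, and the attribute shown is the
GCC/Clang printf format attribute.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

static char info_buf[256];
static size_t info_len;

/* The format attribute lets the compiler check callers' arguments. */
static void info_append_str(const char *fmt, ...)
	__attribute__((format(printf, 1, 2)));

static void info_append_str(const char *fmt, ...)
{
	size_t room = sizeof(info_buf) - info_len;
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(info_buf + info_len, room, fmt, ap);
	va_end(ap);

	/* On truncation, only room - 1 characters were actually stored. */
	if (n > 0)
		info_len += ((size_t)n < room) ? (size_t)n : room - 1;
}
```

With the attribute in place, a call such as info_append_str("%s", 42)
draws a -Wformat warning at compile time instead of misbehaving at run
time.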

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Ken'ichi Ohmichi
fd59d231f8 Add vmcoreinfo
This patch set removes the restriction that makedumpfile users should install a
vmlinux file (including the debugging information) into each system.

makedumpfile command is the dump filtering feature for kdump.  It creates a
small dumpfile by filtering unnecessary pages for the analysis.  To
distinguish unnecessary pages, it needs a vmlinux file including the debugging
information.  These days, the debugging package becomes a huge file, and it is
hard to install it into each system.

To solve the problem, kdump developers discussed it at lkml and kexec-ml.  As
the result, we reached the conclusion that necessary information for dump
filtering (called "vmcoreinfo") should be embedded into the first kernel file
and it should be accessed through /proc/vmcore during the second kernel.
(http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.0/1806.html)

Dan Aloni created the patch set for the above implementation.
(http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.1/1053.html)

And I updated it for multiple architectures and memory models.
(http://lists.infradead.org/pipermail/kexec/2007-August/000479.html)

Signed-off-by: Dan Aloni <da-x@monatomic.org>
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Mathieu Desnoyers
2b47c3611d Fix f_version type: should be u64 instead of unsigned long
Fix f_version type: should be u64 instead of unsigned long

There is a type inconsistency between struct inode i_version and struct file
f_version.

fs.h:

struct inode
  u64                     i_version;

and

struct file
  unsigned long           f_version;

Users do:

fs/ext3/dir.c:

if (filp->f_version != inode->i_version) {

So why isn't f_version a u64?  It becomes a problem if versions get
higher than 2^32 and we are on an architecture where longs are 32 bits.
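
The failure mode can be modelled in userspace, using uint32_t as a
stand-in for a 32-bit unsigned long.  The function names are hypothetical
and the code is a sketch of the comparison only, not of the filesystem
code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old layout: the u64 inode version is stored into a 32-bit field. */
static bool versions_match_32(uint64_t i_version)
{
	uint32_t f_version = (uint32_t)i_version;	/* truncating store */

	/* f_version is widened again here, but the high bits are gone. */
	return f_version == i_version;
}

/* New layout: f_version is a u64 as well, so nothing is lost. */
static bool versions_match_64(uint64_t i_version)
{
	uint64_t f_version = i_version;

	return f_version == i_version;
}
```

Once i_version exceeds 2^32 the 32-bit variant never compares equal, so a
"has the directory changed" check like the one quoted above would always
report a change.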

This patch changes the f_version type to u64, and updates the users accordingly.

It applies to 2.6.23-rc2-mm2.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Martin Bligh <mbligh@google.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: <linux-ext4@vger.kernel.org>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
Adrian Bunk
4a239427f2 make fs/libfs.c:simple_commit_write() static
simple_commit_write() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
Adrian Bunk
ba2a631b14 kernel/time/timekeeping.c: cleanups
- remove the no longer required __attribute__((weak)) of xtime_lock
- remove the following no longer used EXPORT_SYMBOL's:
  - xtime
  - xtime_lock

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
Alexey Dobriyan
3befe7ceb8 Shrink struct task_struct::oomkilladj
oomkilladj is int, but values which can be assigned to it are -17, [-16,
15], thus fitting into s8.

While the patch itself doesn't help in making task_struct smaller, because of
natural alignment of ->link_count, it will make the picture clearer wrt further
task_struct reduction patches.  My plan is to move ->fpu_counter and
->oomkilladj after ->ioprio filling hole on i386 and x86_64.  But that's
for later, because bloated distro configs need looking at as well.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
Andi Drebes
ac8d35c565 cramfs: error message about endianess
The README file in the cramfs subdirectory says: "All data is currently in
host-endian format; neither mkcramfs nor the kernel ever do swabbing."

If somebody tries to mount a cramfs with the wrong endianness, cramfs only
complains about a wrong magic but doesn't inform the user that only the
endianness isn't right.

The following patch adds an error message to the cramfs sources.  If a user
tries to mount a cramfs with the wrong endianness using the patched sources,
cramfs will display the message "cramfs: wrong endianess".
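
A check of this kind can be sketched as follows.  It is a userspace
illustration, not the patched cramfs code: the enum, byteswap32() and
check_magic() are hypothetical names, and only the CRAMFS_MAGIC value is
taken from the kernel headers.

```c
#include <stdint.h>

#define CRAMFS_MAGIC 0x28cd3d45u	/* value from the cramfs headers */

enum magic_check { MAGIC_OK, MAGIC_WRONG_ENDIAN, MAGIC_BAD };

/* Reverse the byte order of a 32-bit value. */
static uint32_t byteswap32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
	       ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Tell a byte-swapped magic apart from one that is simply wrong. */
static enum magic_check check_magic(uint32_t magic)
{
	if (magic == CRAMFS_MAGIC)
		return MAGIC_OK;
	if (magic == byteswap32(CRAMFS_MAGIC))
		return MAGIC_WRONG_ENDIAN;
	return MAGIC_BAD;
}
```

A caller would print the new message only for MAGIC_WRONG_ENDIAN and keep
the existing wrong-magic complaint for MAGIC_BAD.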

Signed-off-by: Andi Drebes <lists-receive@programmierforen.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
Olaf Hering
7f44c3621a include linux/types.h in if_fddi.h
include/linux/if_fddi.h is an exported header.
It uses __be16. Include linux/types.h to get this type.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:52 -07:00
Olaf Hering
e30618cbd1 remove consolemap.h from header exports
Remove linux/consolemap.h from make headers_install

It contains no user interfaces.
The defines in this file are used only for kernel internal state.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:52 -07:00
Samuel Thibault
04c7197650 unicode diacritics support
There have been issues with non-latin1 diacritics and unicode.
http://bugzilla.kernel.org/show_bug.cgi?id=7746

Git 759448f459 `Kernel utf-8 handling'
partly resolved it by adding conversion between diacritics and
unicode. The patch below goes further by just turning diacritics into
unicode, hence providing better future support. The kbd support can be
fetched from
http://bugzilla.kernel.org/attachment.cgi?id=12313

This was tested in all of latin1, latin9, latin2 and unicode with french
and czech dead keys.

Turn the kernel accent_table into unicode, and extend ioctls KDGKBDIACR
and KDSKBDIACR into their equivalents KDGKBDIACRUC and KDSKBDIACRUC.

New function int conv_uni_to_8bit(u32 uni) for converting unicode into 8bit
_input_.  No, we don't want to store the translation, as it is potentially
sparse and large.

Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Jan Engelhardt <jengelh@gmx.de>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:52 -07:00
Roland McGrath
82df39738b Add MMF_DUMP_ELF_HEADERS
This adds the MMF_DUMP_ELF_HEADERS option to /proc/pid/coredump_filter.
This dumps the first page (only) of a private file mapping if it appears to
be a mapping of an ELF file.  Including these pages in the core dump may
give sufficient identifying information to associate the original DSO and
executable file images and their debugging information with a core file in
a generic way just from its contents (e.g.  when those binaries were built
with ld --build-id).  I expect this to become the default behavior
eventually.  Existing versions of gdb can be confused by the core dumps it
creates, so it won't be enabled by default for some time to come.  Soon many
people will have systems with a gdb that handles these dumps, so they can
arrange to set the bit at boot and have it inherited system-wide.
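
Setting the bit from userspace might look like the sketch below.  It
assumes the ELF-headers option is bit 4 of the coredump_filter mask, as
documented in core(5); the macro and both function names are hypothetical.

```c
#include <stdio.h>

/* Assumption: bit 4 of coredump_filter selects ELF headers (core(5)). */
#define DUMP_ELF_HEADERS_BIT 4

/* Return the filter mask with the ELF-headers bit switched on. */
static unsigned long with_elf_headers(unsigned long mask)
{
	return mask | (1UL << DUMP_ELF_HEADERS_BIT);
}

/* Read the current hex mask from path, set the bit, write it back. */
static int enable_elf_header_dump(const char *path)
{
	unsigned long mask;
	FILE *f = fopen(path, "r+");

	if (!f)
		return -1;
	if (fscanf(f, "%lx", &mask) != 1) {
		fclose(f);
		return -1;
	}
	rewind(f);
	if (fprintf(f, "%#lx\n", with_elf_headers(mask)) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

Calling enable_elf_header_dump("/proc/self/coredump_filter") early in a
parent process lets children inherit the setting.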

This also cleans up the checking of the MMF_DUMP_* flag bits, which did not
need to be using atomic macros.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:52 -07:00
Roland McGrath
ab799dede9 Add linux/elfcore-compat.h
This adds the linux/elfcore-compat.h header file, which is the CONFIG_COMPAT
analog of the linux/elfcore.h header.  Each arch that needs to fake out
fs/binfmt_elf.c for its compat code can use this header to replace the
hand-copied definitions of the compat variants of struct elf_prstatus et al.
Only the pr_reg field varies by arch, so asm/{compat,elf}.h must define
compat_elf_gregset_t before linux/elfcore-compat.h can be used.

It's a clean-up that every arch with compat core dumping code can benefit
from.  I only touched the ones I have handy to test at home.  Doing the same
for each other arch should be straightforward, and I'm happy to offer tips.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Christoph Hellwig
bcd6d4ecf6 ufs: move non-layout parts of ufs_fs.h to fs/ufs/
Move prototypes and in-core structures to fs/ufs/ similar to what most
other filesystems already do.

I made a few small modifications: also move the ufs debug macros and
mount option constants into fs/ufs/ufs.h, since this stuff is
also private to ufs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Mike Frysinger
0b15d04af3 printk: add interfaces for external access to the log buffer
Add two new functions for reading the kernel log buffer.  The intention is for
them to be used by recovery/dump/debug code so the kernel log can be easily
retrieved/parsed in a crash scenario, but they are generic enough for other
people to dream up other fun uses.

[akpm@linux-foundation.org: buncha fixes]
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Acked-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:50 -07:00
Roland McGrath
6d76013381 Add /sys/module/name/notes
This patch adds the /sys/module/<name>/notes/ magic directory, which has a
file for each allocated SHT_NOTE section that appears in <name>.ko.  This
is the counterpart for each module of /sys/kernel/notes for vmlinux.
Reading this delivers the contents of the module's SHT_NOTE sections.  This
lets userland easily glean any detailed information about that module's
build that was stored there at compile time (e.g.  by ld --build-id).

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:50 -07:00
Neil Horman
7dc0b22e3c core_pattern: ignore RLIMIT_CORE if core_pattern is a pipe
For some time /proc/sys/kernel/core_pattern has been able to set its output
destination as a pipe, allowing a user space helper to receive and
intelligently process a core.  This infrastructure however has some
shortcomings which can be enhanced.  Specifically:

1) The coredump code in the kernel should ignore RLIMIT_CORE limitation
   when core_pattern is a pipe, since file system resources are not being
   consumed in this case, unless the user application wishes to save the core,
   at which point the app is restricted by usual file system limits and
   restrictions.

2) The core_pattern code should be able to parse and pass options to the
   user space helper as an argv array.  The real core limit of the uid of the
   crashing process should also be passable to the user space helper (since it
   is overridden to zero when called).

3) Some miscellaneous bugs need to be cleaned up (specifically the
   recognition of a recursive core dump, should the user mode helper itself
   crash).  Also, the core dump code in the kernel should not wait for the user
   mode helper to exit, since the same context is responsible for writing to
   the pipe, and a read of the pipe by the user mode helper will result in a
   deadlock.

This patch:

Remove the check of RLIMIT_CORE if core_pattern is a pipe.  In the event that
core_pattern is a pipe, the entire core will be fed to the user mode helper.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Cc: <martin.pitt@ubuntu.com>
Cc: <wwoods@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:50 -07:00
Mark Fortescue
252e211e90 Add in SunOS 4.1.x compatible mode for UFS
Add in support for SunOS 4.1.x flavor of BSD 4.2 UFS filing system.  Macros have
been put in to allow support for the old static table Cylinder Groups but this
implementation does not use them yet.

This also fixes Solaris UFS filing system access by disabling fast symbolic
links as Sun's version of UFS does not support on-disk fast symbolic links.

Tested by:
  Partitioning a new disk using SunOS 4.1.1, creating a UFS filing system on
  one of the partitions and writing some files to the filing system.
  Using Linux-2.6.22 (patched) to read the files and then write a shed load of
  files to the UFS partition.
  Using SunOS 4.1.1 to verify the filing system is OK and to check the files.
The test host is a sun4c SS1 Clone.

[akpm@linux-foundation.org: coding style fixes]
[adobriyan@gmail.com: fix oops]
Signed-off-by: Mark Fortescue <mark@mtfhpc.demon.co.uk>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:49 -07:00
Denis Cheng
74bf17cffc fs: remove the unused mempages parameter
Since the mempages parameter is actually not used, it should be removed.

Now only files_init uses the mempages parameter,

 	files_init(mempages);

but I don't think the adaptation to mempages in files_init is really
useful; and if files_init also changed to the prototype void (*func)(void),
the wrapper vfs_caches_init would also not need the mempages parameter.

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:49 -07:00
Rusty Russell
af49d9248f Remove "unsafe" from module struct
Adrian Bunk points out that "unsafe" was used to mark modules touched by
the deprecated MOD_INC_USE_COUNT interface, which has long gone.  It's time
to remove the member from the module structure, as well.

If you want a module which can't unload, don't register an exit function.

(Vlad Yasevich says SCTP is now safe to unload, so just remove the
__unsafe there).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:49 -07:00
Miklos Szeredi
d9c9bef134 ext4: show all mount options
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:49 -07:00
Miklos Szeredi
571beed8d6 ext3: show all mount options
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:48 -07:00
Miklos Szeredi
93d44cb275 ext2: show all mount options
Using mtab is problematic for various reasons, one of them is that
unprivileged mounts won't turn up in there.  So we want to get rid of it, and
use /proc/mounts instead.

But most filesystems are lazy, and are not showing all mount options.  Which
means, that without mtab, the user won't be able to see some or all of the
options.

It would be nice if the generic code could remember the mount options, and
show them without the need to add extra code to filesystems.  But this is not
easy, because different filesystems handle mount options differently: some
apply only the given options and do not touch the rest.  This is not taken
into account by mount(8) either, so /etc/mtab will be broken in this case.

This series fixes up ->show_options() in ext[234].

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:48 -07:00
Alexey Dobriyan
4be28540ee Remove sysctl.h from fs.h
Rrrr, addition of sysctl.h to fs.h wasn't very smart, because simple
editing of the former will buy you a big recompile, where it shouldn't
have to.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:48 -07:00
Olaf Hering
4029a9177f unexport asm/shmparam.h
SHMLBA can't possibly be used in userspace; see the sparc versions of that
header.

Do not export asm/shmparam.h during make headers_install_all.
This removes another user-visible place of PAGE_SIZE.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Hans-Christian Egtvedt
eb1f293060 Driver for the Atmel on-chip SSC on AT32AP and AT91
The Synchronous Serial Controller (SSC) on Atmel microprocessors is
capable of transceiving many frame based protocols, like I2S.  Tested on the
AT32AP7000/ATSTK1000.

This driver is used in the ALSA sound driver for the AT73C213 external DAC
on the ATSTK1000 development board for AVR32.  This sound driver will be
submitted soon.

Hardware documentation can be found in the AT32AP7000 data sheet, which can
be downloaded from
http://www.atmel.com/dyn/products/datasheets.asp?family_id=682

[akpm@linux-foundation.org: init spinlock at compile time]
Signed-off-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: David Brownell <david-b@pacbell.net>
Cc: Andrew Victor <andrew@sanpeople.com>
Cc: Patrice Vilchez <patrice.vilchez@rfo.atmel.com>
Cc: Nicolas Ferre <nicolas.ferre@rfo.atmel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Robert P. J. Day
94f582f82a Force erroneous inclusions of compiler-*.h files to be errors
Replace worthless comments with actual preprocessor errors when including
the wrong versions of the compiler.h files.
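
A minimal single-file sketch of the guard pattern, assuming an umbrella
header that defines a marker macro before pulling in the compiler-specific
header.  All DEMO_ and demo_ names are hypothetical and are not the kernel's
identifiers.

```c
#include <assert.h>

/* What a hypothetical umbrella header "demo-compiler.h" would do first:
 * define a marker so that the specific headers know who included them. */
#define DEMO_COMPILER_H 1

/* What a hypothetical "demo-compiler-gcc.h" would start with: a hard
 * preprocessor error instead of a comment asking people not to include it. */
#ifndef DEMO_COMPILER_H
#error "Do not include demo-compiler-gcc.h directly; include demo-compiler.h."
#endif

/* Reached only when the marker is defined, i.e. the guard let us through. */
static int demo_guard_passed(void)
{
	assert(DEMO_COMPILER_H == 1);
	return 1;
}
```

Removing the #define line above makes the build stop at the #error, which is
the behaviour the commit asks for.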

[akpm@linux-foundation.org: make it work]
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Ravikiran G Thirumalai
c4f3b63fe1 softlockup: add a /proc tuning parameter
Control the trigger limit for softlockup warnings.  This is useful for
debugging softlockups, by lowering the softlockup_thresh to identify
possible softlockups earlier.

This patch:
1. Adds a sysctl softlockup_thresh with valid values of 1-60s
   (Higher value to disable false positives)
2. Changes the softlockup printk to print the cpu softlockup time

[akpm@linux-foundation.org: Fix various warnings and add definition of "two"]
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Paul E. McKenney
97b430320c Immunize rcu_dereference() against crazy compiler writers
Turns out that compiler writers are a bit more aggressive about optimizing
than one might expect.  This patch prevents a number of such optimizations
from messing up rcu_dereference().  This is not merely a theoretical problem, as
evidenced by the rmb() in mce_log().
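
One common way to stop a compiler from re-loading or caching a shared pointer
is to read it once through a volatile-qualified lvalue.  The sketch below
shows only that compiler-side idea, assuming GCC or Clang for __typeof__; the
DEMO_ACCESS_ONCE and demo_ names are hypothetical, and the memory barriers
that some architectures need are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Read x exactly once; the volatile cast forbids the compiler from
 * re-fetching or merging the access.  __typeof__ is a GCC/Clang extension. */
#define DEMO_ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct demo_node {
	int value;
};

/* Pointer that a writer would publish; NULL means "nothing published". */
static struct demo_node *demo_shared;

/* Reader: take one snapshot of the pointer, then use only the snapshot. */
static int demo_read_value(int fallback)
{
	struct demo_node *p = DEMO_ACCESS_ONCE(demo_shared);

	return p != NULL ? p->value : fallback;
}
```

Without the single snapshot, a compiler would be free to load demo_shared once
for the NULL test and again for the dereference, and the two loads could
observe different pointers.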

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Josh Triplett <josh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
Alexey Dobriyan
f6b450d489 Make unregister_binfmt() return void
list_del() can hardly fail, so checking for return value is pointless
(and current code always returns 0).

Nobody really cared about that return value anyway.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
Alexey Dobriyan
e4dc1b14d8 Use list_head in binfmt handling
Switch single-linked binfmt formats list to usual list_head's.  This leads
to one-liners in register_binfmt() and unregister_binfmt().  The downside
is one pointer more in struct linux_binfmt.  This is not a problem, since
the set of registered binfmts on typical box is very small -- (ELF +
something distro enabled for you).
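
A userspace sketch of why a doubly-linked list turns registration and
unregistration into one-liners.  The demo_ types and functions are
hypothetical stand-ins, not the kernel's list_head API, and locking is
omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list node. */
struct demo_list {
	struct demo_list *next, *prev;
};

/* Insert n right after the list head h. */
static void demo_list_add(struct demo_list *n, struct demo_list *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* Unlink n; needs no knowledge of the head and cannot fail. */
static void demo_list_del(struct demo_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* A format handler pays one extra pointer for the embedded list node. */
struct demo_binfmt {
	struct demo_list lh;
	const char *name;
};

/* Empty list: the head points at itself. */
static struct demo_list demo_formats = { &demo_formats, &demo_formats };

static void demo_register(struct demo_binfmt *f)
{
	demo_list_add(&f->lh, &demo_formats);
}

static void demo_unregister(struct demo_binfmt *f)
{
	demo_list_del(&f->lh);
}

/* Count registered formats by walking the list once. */
static int demo_count(void)
{
	int n = 0;
	struct demo_list *p;

	for (p = demo_formats.next; p != &demo_formats; p = p->next)
		n++;
	return n;
}
```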

Test-booted, played with executable .txt files, modprobe/rmmod binfmt_misc.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
Adrian Bunk
deba0f49b9 fs/reiserfs/: cleanups
- remove the following no longer used functions:
  - bitmap.c: reiserfs_claim_blocks_to_be_allocated()
  - bitmap.c: reiserfs_release_claimed_blocks()
  - bitmap.c: reiserfs_can_fit_pages()

- make the following functions static:
  - inode.c: restart_transaction()
  - journal.c: reiserfs_async_progress_wait()

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Vladimir V. Saveliev <vs@namesys.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes
d773ed6b85 mm: test and set zone reclaim lock before starting reclaim
Introduces new zone flag interface for testing and setting flags:

	int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)

Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone() is
called, this flag is test and set before starting zone reclaim.  Zone reclaim
starts in __alloc_pages() when a zone's watermark fails and the system is in
zone_reclaim_mode.  If it's already in reclaim, there's no need to start again
so it is simply considered full for that allocation attempt.
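
The test-and-set idea can be modelled in userspace with C11 atomics.  This is
a sketch under that assumption, not the kernel implementation; the demo_
names and the flag numbering are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

enum demo_zone_flag {
	DEMO_ZONE_ALL_UNRECLAIMABLE = 0,
	DEMO_ZONE_RECLAIM_LOCKED = 1
};

struct demo_zone {
	atomic_ulong flags;
};

/* Atomically set the flag and report whether it was already set. */
static int demo_zone_test_and_set_flag(struct demo_zone *zone,
				       enum demo_zone_flag flag)
{
	unsigned long bit = 1UL << flag;

	return (atomic_fetch_or(&zone->flags, bit) & bit) != 0;
}

static void demo_zone_clear_flag(struct demo_zone *zone,
				 enum demo_zone_flag flag)
{
	atomic_fetch_and(&zone->flags, ~(1UL << flag));
}

/* Returns 1 if this caller performed reclaim, 0 if the zone was already
 * in reclaim and is therefore treated as full for this attempt. */
static int demo_zone_reclaim(struct demo_zone *zone)
{
	if (demo_zone_test_and_set_flag(zone, DEMO_ZONE_RECLAIM_LOCKED))
		return 0;
	/* ... the actual reclaim work would run here ... */
	demo_zone_clear_flag(zone, DEMO_ZONE_RECLAIM_LOCKED);
	return 1;
}
```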

There is a change of behavior with regard to concurrent zone shrinking.  It is
now possible for try_to_free_pages() or kswapd to already be shrinking a
particular zone when __alloc_pages() starts zone reclaim.  In this case, it is
possible for two concurrent threads to invoke shrink_zone() for a single zone.

This change forbids a zone to be in zone reclaim twice, which was always the
behavior, but allows for concurrent try_to_free_pages() or kswapd shrinking
when starting zone reclaim.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes
9aad369e56 oom: add header file to Kbuild as unifdef
Preprocess include/linux/oom.h before exporting it to userspace.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes
172acf60f3 oom: prevent including sched.h in header file
It's not necessary to include all of linux/sched.h in linux/oom.h.  Instead,
simply include prototypes for the relevant structs and include linux/types.h
for gfp_t.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes
bbe373f2c6 oom: compare cpuset mems_allowed instead of exclusive ancestors
Instead of testing for overlap in the memory nodes of the nearest
exclusive ancestor of both current and the candidate task, it is better to
simply test for intersection between the task's mems_allowed in their task
descriptors.  This does not require taking callback_mutex since it is only
used as a hint in the badness scoring.

Tasks that do not have an intersection in their mems_allowed with the current
task are not explicitly restricted from being OOM killed because it is quite
possible that the candidate task has allocated memory there before and has
since changed its mems_allowed.

Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes
098d7f128a oom: add per-zone locking
OOM killer synchronization should be done with zone granularity so that memory
policy and cpuset allocations may have their corresponding zones locked and
allow parallel kills for other OOM conditions that may exist elsewhere in the
system.  DMA allocations can be targeted at the zone level, which would not be
possible if locking was done in nodes or globally.

Synchronization shall be done with a variation of "trylocks." The goal is to
put the current task to sleep and restart the failed allocation attempt later
if the trylock fails.  Otherwise, the OOM killer is invoked.

Each zone in the zonelist that __alloc_pages() was called with is checked for
the newly-introduced ZONE_OOM_LOCKED flag.  If any zone has this flag present,
the "trylock" to serialize the OOM killer fails and returns zero.  Otherwise,
all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function
returns non-zero.
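
The all-or-nothing "trylock" described above can be sketched as follows.  It
assumes a NULL-terminated array of zone pointers as the zonelist and a caller
that serializes calls; the demo_ names and the flag value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ZONE_OOM_LOCKED (1UL << 2)

struct demo_zone {
	unsigned long flags;
};

/* Returns 0 if any zone in the list is already OOM-locked; otherwise marks
 * every zone in the list and returns 1.  Assumes calls are serialized. */
static int demo_try_set_zone_oom(struct demo_zone **zonelist)
{
	struct demo_zone **z;

	for (z = zonelist; *z != NULL; z++)
		if ((*z)->flags & DEMO_ZONE_OOM_LOCKED)
			return 0;

	for (z = zonelist; *z != NULL; z++)
		(*z)->flags |= DEMO_ZONE_OOM_LOCKED;
	return 1;
}

/* Drop the mark from every zone in the list once the kill is done. */
static void demo_clear_zonelist_oom(struct demo_zone **zonelist)
{
	struct demo_zone **z;

	for (z = zonelist; *z != NULL; z++)
		(*z)->flags &= ~DEMO_ZONE_OOM_LOCKED;
}
```

Two zonelists that share no zone can both succeed, which is what allows
parallel kills for unrelated OOM conditions.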

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
David Rientjes
e815af95f9 oom: change all_unreclaimable zone member to flags
Convert the int all_unreclaimable member of struct zone to unsigned long
flags.  This can now be used to specify several different zone flags such as
all_unreclaimable and reclaim_in_progress, which can now be removed and
converted to a per-zone flag.

Flags are set and cleared as follows:

	zone_set_flag(struct zone *zone, zone_flags_t flag)
	zone_clear_flag(struct zone *zone, zone_flags_t flag)

Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
which have the same semantics as the old zone->all_unreclaimable and
zone->reclaim_in_progress, respectively.  Also converts all current users that
set or clear either flag to use the new interface.

Helper functions are defined to test the flags:

	int zone_is_all_unreclaimable(const struct zone *zone)
	int zone_is_reclaim_locked(const struct zone *zone)

All flag operators are of the atomic variety because there are currently
readers that are implemented that do not take zone->lock.
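
A userspace model of the flag word and its helpers, using C11 atomics in
place of the kernel's atomic bit operations.  The demo_ names and bit numbers
are hypothetical, and the test helpers take a non-const pointer here only to
stay portable across C11 library versions.

```c
#include <assert.h>
#include <stdatomic.h>

typedef enum {
	DEMO_ZONE_ALL_UNRECLAIMABLE,	/* bit 0 */
	DEMO_ZONE_RECLAIM_LOCKED	/* bit 1 */
} demo_zone_flags_t;

struct demo_zone {
	atomic_ulong flags;	/* one word replaces several int members */
};

static void demo_zone_set_flag(struct demo_zone *zone, demo_zone_flags_t flag)
{
	atomic_fetch_or(&zone->flags, 1UL << flag);
}

static void demo_zone_clear_flag(struct demo_zone *zone, demo_zone_flags_t flag)
{
	atomic_fetch_and(&zone->flags, ~(1UL << flag));
}

/* Lock-free readers: a single atomic load, no zone lock needed. */
static int demo_zone_is_all_unreclaimable(struct demo_zone *zone)
{
	return (int)((atomic_load(&zone->flags) >> DEMO_ZONE_ALL_UNRECLAIMABLE) & 1UL);
}

static int demo_zone_is_reclaim_locked(struct demo_zone *zone)
{
	return (int)((atomic_load(&zone->flags) >> DEMO_ZONE_RECLAIM_LOCKED) & 1UL);
}
```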

[akpm@linux-foundation.org: add needed include]
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
David Rientjes
70e24bdf6d oom: move constraints to enum
The OOM killer's CONSTRAINT definitions are really more appropriate in an
enum, so define them in include/linux/oom.h.

Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
David Rientjes
5a3135c2e7 oom: move prototypes to appropriate header file
Move the OOM killer's extern function prototypes to include/linux/oom.h and
include it where necessary.

[clg@fr.ibm.com: build fix]
Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Christoph Lameter
4ba9b9d0ba Slab API: remove useless ctor parameter and reorder parameters
Slab constructors currently have a flags parameter that is never used.  And
the order of the arguments is opposite to other slab functions.  The object
pointer is placed before the kmem_cache pointer.

Convert

        ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

        ctor(struct kmem_cache *s, void *object)

throughout the kernel
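
A sketch of what a constructor looks like after the conversion, using a
hypothetical userspace stand-in for struct kmem_cache.  Only the argument
order and the missing flags parameter are taken from the text above; the
demo_ types are assumptions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the cache descriptor. */
struct demo_kmem_cache {
	const char *name;
	size_t size;
	void (*ctor)(struct demo_kmem_cache *s, void *object);
};

struct demo_inode {
	int refcount;
	char tag[8];
};

/* New-style constructor: cache pointer first, object second, no flags. */
static void demo_inode_ctor(struct demo_kmem_cache *s, void *object)
{
	struct demo_inode *inode = object;

	(void)s;	/* unused here, kept for the common signature */
	memset(inode, 0, sizeof(*inode));
	inode->refcount = 1;
}

/* Stand-in for the allocator running the constructor on a fresh object. */
static void demo_cache_construct(struct demo_kmem_cache *s, void *object)
{
	if (s->ctor != NULL)
		s->ctor(s, object);
}
```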

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
3e26c149c3 mm: dirty balancing for tasks
Based on ideas of Andrew:
  http://marc.info/?l=linux-kernel&m=102912915020543&w=2

Scale the bdi dirty limit inversely with the task's dirty rate.
This makes heavy writers have a lower dirty limit than the occasional writer.

Andrea proposed something similar:
  http://lwn.net/Articles/152277/

The main disadvantage to his patch is that he uses an unrelated quantity to
measure time, which leaves him with a workload-dependent tunable. Other than
that the two approaches appear quite similar.

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
04fbfdc14e mm: per device dirty threshold
Scale writeback cache per backing device, proportional to its writeout speed.

By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:

 - mutual interference starvation (for any number of BDIs);
 - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A BDI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the BDIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.
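
The arithmetic behind "a larger share for faster devices" can be illustrated
with plain completion counts.  This sketch deliberately ignores the aging of
old completions that a floating proportion provides; the function name and
the zero-history behaviour are assumptions of the example.

```c
#include <assert.h>
#include <stdint.h>

/* Share of total_limit for a device that accounts for `completions` out of
 * `total_completions` writeback completions.  With no history at all, this
 * sketch hands out no share.  Overflow of the product is not handled. */
static uint64_t demo_bdi_limit(uint64_t total_limit, uint64_t completions,
			       uint64_t total_completions)
{
	assert(completions <= total_completions);
	if (total_completions == 0)
		return 0;
	return total_limit * completions / total_completions;
}
```

With a total limit of 1000 pages, a device responsible for 3 of 4 recent
completions would get 750 pages and the other device 250.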

[akpm@linux-foundation.org: fix warnings]
[hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
145ca25eb2 lib: floating proportions
Given a set of objects, floating proportions aims to efficiently give the
proportional 'activity' of a single item as compared to the whole set. Where
'activity' is a measure of a temporal property of the items.

It is efficient in that it need not inspect any other items of the set
in order to provide the answer. It is not even needed to know how many
other items there are.

It has one parameter, and that is the period of 'time' over which the
'activity' is measured.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
69cb51d18c mm: count writeback pages per BDI
Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
c9e51e4180 mm: count reclaimable pages per BDI
Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
b2e8fb6efa mm: scalable bdi statistics counters
Provide scalable per backing_dev_info statistics counters.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
e0bf68ddec mm: bdi init hooks
provide BDI constructor/destructor hooks

[akpm@linux-foundation.org: compile fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
dc62a30e27 lib: percpu_counter_init_irq
provide a way to tell lockdep about percpu_counters that are supposed to be
used from irq safe contexts.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
833f4077bf lib: percpu_counter_init error handling
alloc_percpu can fail, propagate that error.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
bf1d89c813 lib: percpu_count_sum()
Provide an accurate version of percpu_counter_read.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
52d9f3b409 lib: percpu_counter_sum_positive
s/percpu_counter_sum/&_positive/

Because it's consistent with percpu_counter_read*

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
3a587f47b8 lib: percpu_counter_set
Provide a method to set a percpu counter to a specified value.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
20e8976709 lib: make percpu_counter_add take s64
percpu_counter is an s64 counter, make _add consistent.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
252e0ba6b7 lib: percpu_counter variable batch
Because the current batch setup has a quadratic error bound on the counter,
allow for an alternative setup.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
3cb4f9fa0c lib: percpu_counter_sub
Hugh spotted that some code does:
  percpu_counter_add(&counter, -unsignedlong)

which, when the amount argument is of type s32, sort-of works thanks to
two's-complement. However when we'd change the type to s64 this breaks on 32bit
machines, because the promotion rules zero extend the unsigned number.

Provide percpu_counter_sub() to hide the s64 cast. That is:
  percpu_counter_sub(&counter, foo)
is equal to:
  percpu_counter_add(&counter, -(s64)foo);
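
A standalone illustration of the promotion problem, with uint32_t standing in
for unsigned long on a 32-bit machine and int64_t for s64.  The demo_ names
are hypothetical; only the cast pattern comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

static int64_t demo_counter;

/* Counter update taking a 64-bit signed amount. */
static void demo_counter_add(int64_t amount)
{
	demo_counter += amount;
}

/* Buggy caller pattern: the negation happens in 32-bit unsigned arithmetic
 * and wraps to a large positive value, which is then zero-extended. */
static void demo_sub_buggy(uint32_t amount)
{
	demo_counter_add(-amount);
}

/* Fixed helper: widen to 64-bit signed first, then negate. */
static void demo_counter_sub(uint32_t amount)
{
	demo_counter_add(-(int64_t)amount);
}
```

Starting from 100, demo_sub_buggy(5) adds 4294967291 instead of subtracting
5, while demo_counter_sub(5) leaves the counter at 95.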

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
aa0dff2d09 lib: percpu_counter_add
s/percpu_counter_mod/percpu_counter_add/

Because it's a better name, _mod implies modulo.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
c4dc4beed2 nfs: remove congestion_end()
These patches aim to improve balance_dirty_pages() and directly address three
issues:
  1) inter device starvation
  2) stacked device deadlocks
  3) inter process starvation

1 and 2 are a direct result from removing the global dirty limit and using
per device dirty limits. By giving each device its own dirty limit it will
no longer starve another device, and the cyclic dependency on the dirty limit
is broken.

In order to efficiently distribute the dirty limit across the independent
devices a floating proportion is used, this will allocate a share of the total
limit proportional to the device's recent activity.

3 is done by also scaling the dirty limit proportional to the current task's
recent dirty rate.

This patch:

nfs: remove congestion_end().  It's redundant, clear_bdi_congested() already
wakes the waiters.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Mark Nelson
1f7d6668c2 powerpc: add Altivec/VMX state to coredumps
Update dump_task_altivec() (which has so far never been put to use) so that
it dumps the Altivec/VMX registers (VR[0] - VR[31], VSCR and VRSAVE) in the
same format as the ptrace get_vrregs(), and add the appropriate glue
typedef and #defines to make it work.

A new note type of NT_PPC_VMX was chosen to be 0x100 (arbitrarily) because
it allows the low range values to be used for more generic purposes and
0x100 seems an adequate starting point for PowerPC extensions.

Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Christoph Hellwig
eead191153 partially fix up the lookup_one_noperm mess
Try to fix the mess created by sysfs braindamage.

 - refactor code internal to fs/namei.c a little to avoid too much
   duplication:
	o __lookup_hash_kern is renamed back to __lookup_hash
	o the old __lookup_hash goes away, permission checks moves to
	  the two callers
	o useless inline qualifiers on above functions go away
 - lookup_one_len_kern loses its last argument and is renamed to
   lookup_one_noperm to make its usage a little more clear
 - added kerneldoc comments to describe lookup_one_len as well as
   lookup_one_noperm and make it very clear that no one should use
   the latter ever.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Dhaval Giani
b1a8c172c3 sched: fix !SYSFS build breakage
When CONFIG_SYSFS is not set, CONFIG_FAIR_USER_SCHED fails to build
with

kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1488): undefined reference to `kernel_subsys'
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1490): undefined reference to `kernel_subsys'
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1480): undefined reference to `kernel_subsys'
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1494): undefined reference to `kernel_subsys'

This patch fixes this build error.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-17 16:55:11 +02:00
Joachim Fenkes
fec738dd48 [POWERPC] Move of_device allocation into of_device.[ch]
Extract generic of_device allocation code from of_platform_device_create()
and move it into of_device.[ch], called of_device_alloc(). Also, there's now
of_device_free() which puts the device node.

This way, bus drivers that build on of_platform (like ibmebus will) can
build upon this code instead of reinventing the wheel.

Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-10-17 22:30:07 +10:00
H. Peter Anvin
3ea3351000 Remove magic macros for screen_info structure members
Stop using magic macros for screen_info structure members.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2007-10-16 22:57:17 -07:00
Linus Torvalds
2b0460b534 Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6: (33 commits)
  amd74xx: remove /proc/ide/amd74xx
  amd74xx/via82cxxx: don't initialize drive->dn
  sis5513: remove /proc/ide/sis
  ide: remove CONFIG_IDEDMA_ONLYDISK
  ide: add "hdx=nodma" kernel parameter
  ide: remove hwif->autodma and drive->autodma
  ide: remove "idex=dma" kernel parameter
  ide: remove CONFIG_BLK_DEV_IDEDMA_FORCED
  ide: use PCI_VDEVICE() macro
  sis5513: clear prefetch and postwrite for ATAPI devices
  it8213/piix/slc90e66: "de-couple" PIO and UDMA modes
  ide: unexport noautodma
  ide: unexport ide_tune_dma
  ide: remove ->ide_dma_check (take 2)
  ide-pmac: add PIO autotune fallback to ->ide_dma_check
  ide-cris: add PIO autotune fallback to ->ide_dma_check
  sl82c105: add PIO autotune fallback to ->ide_dma_check
  cs5530/sc1200: add PIO autotune fallback to ->ide_dma_check
  ide: remove ide_use_fast_pio()
  ide: remove drive->init_speed zeroing
  ...
2007-10-16 16:56:35 -07:00
Linus Torvalds
b883a688ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6:
  SELinux: kills warnings in Improve SELinux performance when AVC misses
  SELinux: improve performance when AVC misses.
  SELinux: policy selectable handling of unknown classes and perms
  SELinux: Improve read/write performance
  SELinux: tune avtab to reduce memory usage
2007-10-16 16:53:20 -07:00
Yuichi Nakamura
788e7dd4c2 SELinux: Improve read/write performance
It reduces the selinux overhead on read/write by only revalidating
permissions in selinux_file_permission if the task or inode labels have
changed or the policy has changed since the open-time check.  A new LSM
hook, security_dentry_open, is added to capture the necessary state at open
time to allow this optimization.

(see http://marc.info/?l=selinux&m=118972995207740&w=2)

Signed-off-by: Yuichi Nakamura <ynakam@hitachisoft.jp>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
2007-10-17 08:59:31 +10:00
Stefan Richter
a64408b96b firewire: adopt read cycle timer ABI from raw1394
This duplicates the read cycle timer feature of raw1394 (added in Linux
2.6.21) in firewire-core's userspace ABI.  The argument to the ioctl is
reordered though to ensure 32/64 bit compatibility.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Kristian Høgsberg <krh@redhat.com>
2007-10-17 00:00:08 +02:00
Bartlomiej Zolnierkiewicz
c223701cf6 ide: add "hdx=nodma" kernel parameter
* Add "hdx=nodma" option allowing user to disallow DMA for a given device.

* Obsolete "ide=nodma" option.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-10-16 22:29:58 +02:00
Bartlomiej Zolnierkiewicz
9ff6f72f43 ide: remove hwif->autodma and drive->autodma
* hpt34x.c: disable DMA masks for HPT345
  (hwif->autodma is zero so DMA won't be enabled anyway).

* trm290.c: disable IDE_HFLAG_TRUST_BIOS_FOR_DMA flag
  (hwif->autodma is zero so DMA won't be enabled anyway).

* Check noautodma global variable instead of drive->autodma in ide_tune_dma().

  This fixes handling of "ide=nodma" kernel parameter for icside, ide-cris,
  au1xxx-ide, pmac, it821x, jmicron, sgiioc4 and siimage host drivers.

* Remove hwif->autodma (it was not checked by IDE core code anyway) and
  drive->autodma (was set by all host drivers - except HPT345/TRM290 special
  cases - unless "ide=nodma" was used).

While at it:
- remove needless printk() from icside.c
- remove stale FIXME/comment from ide-probe.c
- don't force DMA off if PCI bus-mastering had to be enabled in setup-pci.c
  (this setting was always later over-ridden by host drivers anyway)

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-10-16 22:29:58 +02:00
Bartlomiej Zolnierkiewicz
0ae2e17865 ide: remove ->ide_dma_check (take 2)
* Add IDE_HFLAG_TRUST_BIOS_FOR_DMA host flag for host drivers that depend
  on BIOS for programming device/controller for DMA.  Set it in cy82c693,
  generic, ns87415, opti621 and trm290 host drivers.

* Add IDE_HFLAG_VDMA host flag for host drivers using VDMA.  Set it in cs5520
  host driver.

* Teach ide_tune_dma() about IDE_HFLAG_TRUST_BIOS_FOR_DMA flag.

* Add generic ide_dma_check() helper and remove all open coded ->ide_dma_check
  implementations.  Fix all places checking for presence of ->ide_dma_check
  hook to check for ->ide_dma_on instead.

* Remove no longer needed code from config_drive_for_dma().

* Make ide_tune_dma() static.

v2:
* Fix config_drive_for_dma() return values.

* Fix ide-dma.c build for CONFIG_BLK_DEV_IDEDMA_PCI=n by adding
  dummy config_drive_for_dma() inline.

* Fix IDE_HFLAG_TRUST_BIOS_FOR_DMA handling in ide_dma_check().

* Fix init_hwif_it8213() comment.

There should be no functionality changes caused by this patch.

Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-10-16 22:29:55 +02:00
Bartlomiej Zolnierkiewicz
65c9cd23ca ide: remove ide_use_fast_pio()
Remove ide_use_fast_pio() and just re-tune PIO unconditionally if DMA tuning
has failed in ->ide_dma_check.  All host drivers using ide_use_fast_pio() set
drive->autotune so PIO is always tuned anyway and in some cases we _really_
need to re-tune PIO because PIO and DMA timings are shared.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-10-16 22:29:54 +02:00
Linus Torvalds
821f3eff7c Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (40 commits)
  kbuild: introduce ccflags-y, asflags-y and ldflags-y
  kbuild: enable 'make CPPFLAGS=...' to add additional options to CPP
  kbuild: enable use of AFLAGS and CFLAGS on commandline
  kbuild: enable 'make AFLAGS=...' to add additional options to AS
  kbuild: fix AFLAGS use in h8300 and m68knommu
  kbuild: check for wrong use of CFLAGS
  kbuild: enable 'make CFLAGS=...' to add additional options to CC
  kbuild: fix up CFLAGS usage
  kbuild: make modpost detect unterminated device id lists
  kbuild: call export_report from the Makefile
  kbuild: move Kai Germaschewski to CREDITS
  kconfig/menuconfig: distinguish between selected-by-another options and comments
  kconfig: tristate choices with mixed tristate and boolean values
  include/linux/Kbuild: remove duplicate entries
  kbuild: kill backward compatibility checks
  kbuild: kill EXTRA_ARFLAGS
  kbuild: fix documentation in makefiles.txt
  kbuild: call make once for all targets when O=.. is used
  kbuild: pass -g to assembler under CONFIG_DEBUG_INFO
  kbuild: update _shipped files for kconfig syntax cleanup
  ...

Fix up conflicts in arch/um/sys-{x86_64,i386}/Makefile manually.
2007-10-16 11:23:06 -07:00
Linus Torvalds
fc8a327db6 Merge branch 'linus' of master.kernel.org:/pub/scm/linux/kernel/git/perex/alsa
* 'linus' of master.kernel.org:/pub/scm/linux/kernel/git/perex/alsa: (264 commits)
  [ALSA] version 1.0.15
  [ALSA] Fix thinko in cs4231 mce down check
  [ALSA] sun-cs4231: improved waiting after MCE down
  [ALSA] sun-cs4231: use cs4231-regs.h
  [ALSA] This simplifies and fixes waiting loops of the mce_down()
  [ALSA] This patch adds support for a wavetable chip on
  [ALSA] This patch removes open_mutex from the ad1848-lib as
  [ALSA] fix bootup crash in snd_gus_interrupt()
  [ALSA] hda-codec - Fix SKU ID function for realtek codecs
  [ALSA] Support  ASUS P701 eeepc [0x1043 0x82a1] support
  [ALSA] hda-codec - Add array terminator for dmic in STAC codec
  [ALSA] hdsp - Fix zero division
  [ALSA] usb-audio - Fix double comment
  [ALSA] hda-codec - Fix STAC922x volume knob control
  [ALSA] Changed Jaroslav Kysela's e-mail from perex@suse.cz to perex@perex.cz
  [ALSA] hda-codec - Fix for Fujitsu Lifebook C1410
  [ALSA] mpu-401: remove MPU401_INFO_UART_ONLY flag
  [ALSA] mpu-401: do not require an ACK byte for the ENTER_UART command
  [ALSA] via82xx - Add DXS quirk for Shuttle AK31v2
  [ALSA] hda-codec - Fix input_mux numbers for vaio stac92xx
  ...
2007-10-16 10:13:38 -07:00
Linus Torvalds
92d15c2ccb Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block: (63 commits)
  Fix memory leak in dm-crypt
  SPARC64: sg chaining support
  SPARC: sg chaining support
  PPC: sg chaining support
  PS3: sg chaining support
  IA64: sg chaining support
  x86-64: enable sg chaining
  x86-64: update pci-gart iommu to sg helpers
  x86-64: update nommu to sg helpers
  x86-64: update calgary iommu to sg helpers
  swiotlb: sg chaining support
  i386: enable sg chaining
  i386 dma_map_sg: convert to using sg helpers
  mmc: need to zero sglist on init
  Panic in blk_rq_map_sg() from CCISS driver
  remove sglist_len
  remove blk_queue_max_phys_segments in libata
  revert sg segment size ifdefs
  Fixup u14-34f ENABLE_SG_CHAINING
  qla1280: enable use_sg_chaining option
  ...
2007-10-16 10:09:16 -07:00
Geert Uytterhoeven
9a054fbac8 fb: move and rename extern declaration for global_mode_option
Move the extern declaration for global_mode_option to <linux/fb.h> and rename
the variable to fb_mode_option.

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:22 -07:00
Antonino A. Daplas
e400b6ec4e vt/vgacon: Check if screen resize request comes from userspace
Various console drivers are able to resize the screen via the con_resize()
hook.  This hook is also visible in userspace via the TIOCWINSZ, VT_RESIZE and
VT_RESIZEX ioctl's.  One particular utility, SVGATextMode, expects that
con_resize() of the VGA console will always return success even if the
resulting screen is not compatible with the hardware.  However, this
particular behavior of the VGA console, as reported in Kernel Bugzilla Bug
7513, can cause undefined behavior if the user starts with a console size
larger than 80x25.

To work around this problem, add an extra parameter to con_resize().  This
parameter is ignored by drivers except for vgacon.  If this parameter is
non-zero, then the resize request came from a VT_RESIZE or VT_RESIZEX ioctl
and vgacon will always return success.  If this parameter is zero, vgacon will
return -EINVAL if the requested size is not compatible with the hardware.  The
latter is the more correct behavior.

With this change, SVGATextMode should still work correctly while in-kernel and
stty resize calls can expect correct behavior from vgacon.
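
The rule described above can be sketched in user-space C.  This is only an
illustration, not the vgacon code: the function name vgacon_resize_sketch(),
the hw_limit structure and the compatibility test are assumptions made for
the example.

```c
#include <errno.h>

/* Hypothetical hardware limits; the real vgacon compatibility check differs. */
struct hw_limit {
	unsigned int max_cols;
	unsigned int max_rows;
};

/*
 * Sketch of the rule in the commit message: a request flagged as coming
 * from a VT_RESIZE/VT_RESIZEX ioctl (user != 0) always succeeds, while any
 * other request (user == 0) is rejected with -EINVAL when the requested
 * size does not fit the hardware.
 */
int vgacon_resize_sketch(const struct hw_limit *hw, unsigned int cols,
			 unsigned int rows, unsigned int user)
{
	if (user)
		return 0;
	if (cols > hw->max_cols || rows > hw->max_rows)
		return -EINVAL;
	return 0;
}
```

Other console drivers would simply ignore the extra argument, which is why
the change can be made to the hook signature without altering their
behaviour.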

Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:20 -07:00
Krzysztof Halasa
394d3af7ba Intel FB: more interlaced mode support
Intel FB: allow odd- and even-field-first in interlaced modes, and
proper sync to vertical retrace

Signed-off-by: Krzysztof Halasa <khc@pm.waw.pl>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: <sylvain.meyer@worldonline.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:20 -07:00
Pavel Pisa
779121e9f1 fbdev: Support for byte-reversed framebuffer formats
Allow generic frame-buffer code to correctly write text and blit images for
1, 2 and 4 bit per pixel frame-buffer organizations when the pixels within a
byte are ordered in the opposite direction to the bytes within a long.

Overhead should be reasonable.  If the option is not selected, then the
compiler should eliminate all of the overhead.

The feature is disabled at compile time if CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is
not set.
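
As a rough illustration of what "byte-reversed" means here, the following
sketch reverses the order of the pixels packed into a single byte.  It is
not the code from this patch, which works on whole words inside the generic
drawing routines; the function name rev_pixels_in_byte() is hypothetical.

```c
#include <stdint.h>

/*
 * Reverse the order of the pixels packed into one byte.
 * bpp must be 1, 2 or 4; any other value returns the byte unchanged.
 */
uint8_t rev_pixels_in_byte(uint8_t b, unsigned int bpp)
{
	uint8_t mask, out = 0;
	unsigned int i, npix;

	if (bpp != 1 && bpp != 2 && bpp != 4)
		return b;

	mask = (uint8_t)((1u << bpp) - 1);	/* selects one pixel */
	npix = 8 / bpp;				/* pixels per byte */
	for (i = 0; i < npix; i++) {
		uint8_t pix = (b >> (i * bpp)) & mask;

		/* pixel i moves to position npix - 1 - i */
		out |= (uint8_t)(pix << ((npix - 1 - i) * bpp));
	}
	return out;
}
```

With 4 bits per pixel this swaps the two nibbles of a byte; with 1 bit per
pixel it reverses all eight bits.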

[adaplas]
Convert helper functions to macros if feature is not enabled.

Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:19 -07:00
Ralf Baechle
120c0b6d57 vt: Fix warnings in selection.h
<linux/selection.h> assumes that struct tty_struct has previously been
declared.  If not, this pile of warnings will result:

  CC [M]  drivers/video/console/newport_con.o
In file included from drivers/video/console/newport_con.c:18:
include/linux/selection.h:16: warning: 'struct tty_struct' declared inside parameter list
include/linux/selection.h:16: warning: its scope is only this definition or declaration, which is probably not what you want
include/linux/selection.h:17: warning: 'struct tty_struct' declared inside parameter list
include/linux/selection.h:20: warning: 'struct tty_struct' declared inside parameter list

Fixed by adding a forward declaration of struct tty_struct.
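
A minimal sketch of the technique, outside the kernel tree.  The function
name selection_sketch() and the one-field struct body are made up for the
example; only the forward declaration on the first line mirrors the fix.

```c
/* Forward declaration: makes the struct tag known at file scope. */
struct tty_struct;

/*
 * Without the line above, 'struct tty_struct' would be declared inside
 * this parameter list and its scope would end with the prototype.
 */
int selection_sketch(struct tty_struct *tty);

/* The full definition can follow later, or live in another header. */
struct tty_struct {
	int index;
};

int selection_sketch(struct tty_struct *tty)
{
	return tty ? tty->index : -1;
}
```

The header only passes pointers to the struct around, so an incomplete type
is sufficient and no extra #include is needed.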

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:17 -07:00
Ben Dooks
eb78f9b3fa sm501fb: Ensure panel interface is not tristated when setup
When we setup the panel interface whilst configuring the
framebuffer, we should ensure the panel interface is not
in tristate, in case the bootloader or previous setup has
not enabled it.

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:14 -07:00
Michal Januszewski
cc54f46e39 uvesafb: add connector entries
Add connector idx and val constants for v86d and uvesafb.

Signed-off-by: Michal Januszewski <spock@gentoo.org>
Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:13 -07:00
Michal Januszewski
3d62a44f74 connector: change connector's max message size
Change the maximum message size to 16k to allow transfers of VBE
data blocks from userspace.

Signed-off-by: Michal Januszewski <spock@gentoo.org>
Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:13 -07:00
Adrian Bunk
cce76f9b96 fs/nfsd/export.c: make 3 functions static
This patch makes the following needlessly global functions static:
- exp_get_by_name()
- exp_parent()
- exp_find()

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:10 -07:00
Matthias Kaehlcke
4e3dfacaa0 use mutex instead of semaphore in isdn subsystem common functions
The ISDN subsystem common functions use a semaphore as mutex. Use the
mutex API instead of the (binary) semaphore.

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Acked-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:10 -07:00
Masami Hiramatsu
f438d914b2 kprobes: support kretprobe blacklist
Introduce architecture dependent kretprobe blacklists to prohibit users
from inserting return probes on functions on which kprobes can be
inserted but kretprobes cannot.

This patch also removes the "__kprobes" mark from "__switch_to" on x86_64
and registers "__switch_to" in the blacklist on x86_64, because that mark
was only there to prohibit users from inserting kretprobes.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:10 -07:00
Tony Jones
49dce689ad spi doesn't need class_device
Make the SPI framework and drivers stop using class_device.  Update docs
accordingly ...  highlighting just which sysfs paths should be
"safe"/stable.

Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:10 -07:00
Paul Jackson
607717a65d cpuset: remove sched domain hooks from cpusets
Remove the cpuset hooks that defined sched domains depending on the setting
of the 'cpu_exclusive' flag.

The cpu_exclusive flag can only be set on a child if it is set on the
parent.

This made that flag painfully unsuitable for use as a flag defining a
partitioning of a system.

It was entirely unobvious to a cpuset user what partitioning of sched
domains they would be causing when they set that one cpu_exclusive bit on
one cpuset, because it depended on what CPUs were in the remainder of that
cpuset's siblings and child cpusets, after subtracting out other
cpu_exclusive cpusets.

Furthermore, there was no way on production systems to query the
result.

Using the cpu_exclusive flag for this was simply wrong from the get go.

Fortunately, it was sufficiently borked that so far as I know, almost no
successful use has been made of this.  One real time group did use it to
effectively isolate CPUs from any load balancing efforts.  They are willing
to adapt to alternative mechanisms for this, such as some way to manipulate
the list of isolated CPUs on a running system.  They can do without this
present cpu_exclusive based mechanism while we develop an alternative.

There is a real risk, to the best of my understanding, of users
accidentally setting up partitioned scheduler domains, inhibiting desired
load balancing across all their CPUs, due to the nonobvious (from the
cpuset perspective) side effects of the cpu_exclusive flag.

Furthermore, since there was no way on a running system to see what one was
doing with sched domains, this change will be invisible to any using code.
Unless they have real insight to the scheduler load balancing choices, they
will be unable to detect that this change has been made in the kernel's
behaviour.

Initial discussion on lkml of this patch has generated much comment.  My
(probably controversial) take on that discussion is that it has reached a
rough consensus that the current cpuset cpu_exclusive mechanism for
defining sched domains is borked.  There is no consensus on the
replacement.  But since we can remove this mechanism, and since its
continued presence risks causing unwanted partitioning of the scheduler's
load balancing, we should remove it while we can, as we proceed to work on
the replacement scheduler domain mechanisms.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:09 -07:00
Shannon Nelson
7589670f37 DCA: Add Direct Cache Access driver
Direct Cache Access (DCA) is a method for warming the CPU cache before data
is used, with the intent of lessening the impact of cache misses.  This
patch adds a manager and interface for matching up client requests for DCA
services with devices that offer DCA services.

In order to use DCA, a module must do bus writes with the appropriate tag
bits set to trigger a cache read for a specific CPU.  However, different
CPUs and chipsets can require different sets of tag bits, and the methods
for determining the correct bits may be simple hardcoding or may be a
hardware specific magic incantation.  This interface is a way for DCA
clients to find the correct tag bits for the targeted CPU without needing
to know the specifics.

    [Dave Miller] use DEFINE_SPINLOCK()

Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:09 -07:00
Shannon Nelson
3e037454bc I/OAT: Add support for MSI and MSI-X
Add support for MSI and MSI-X interrupt handling, including the ability
to choose the desired interrupt method.

Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
[bunk@kernel.org: drivers/dma/ioat_dma.c: make 3 functions static]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:09 -07:00
Shannon Nelson
223758c77a I/OAT: New device ids
Add device ids for new revs of the Intel I/OAT DMA engine

Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:09 -07:00
Adrian Bunk
dbcb0f19c8 mm/mempolicy.c: cleanups
This patch contains the following cleanups:
- every file should include the headers containing the prototypes for
  its global functions
- make the following needlessly global functions static:
  - migrate_to_node()
  - do_mbind()
  - sp_alloc()
  - mpol_rebind_policy()

[akpm@linux-foundation.org: fix uninitialised var warning]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:03 -07:00
Adrian Bunk
d8dc74f212 mm/shmem.c: make 3 functions static
This patch makes three needlessly global functions static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:03 -07:00
Adam Litke
54f9f80d65 hugetlb: Add hugetlb_dynamic_pool sysctl
The maximum size of the huge page pool can be controlled using the overall
size of the hugetlb filesystem (via its 'size' mount option).  However in the
common case this will not be set as the pool is traditionally fixed in
size at boot time.  In order to maintain the expected semantics, we need to
prevent the pool expanding by default.

This patch introduces a new sysctl controlling dynamic pool resizing.  When
this is enabled the pool will expand beyond its base size up to the size of
the hugetlb filesystem.  It is disabled by default.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Dave McCracken <dave.mccracken@oracle.com>
Cc: William Irwin <bill.irwin@oracle.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Ken Chen <kenchen@google.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:02 -07:00
Yasunori Goto
98f3cfc1dc memory hotplug: Hot-add with sparsemem-vmemmap
This patch avoids a panic when memory hot-add is executed with
sparsemem-vmemmap.  The current sparsemem-vmemmap code doesn't support
memory hot-add.  Vmemmap must be populated on hot-add.  This is for
2.6.23-rc2-mm2.

Todo: # Even if this patch is applied, the message "[xxxx-xxxx] potential
        offnode page_structs" is displayed. To allocate memmap on its own
        node, memmap (and pgdat) would have to be initialized by itself,
        which is a chicken-and-egg problem.

      # vmemmap_unpopulate will be necessary for the following:
         - Cancelling hot-add due to an error.
         - Unplug.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:02 -07:00
KAMEZAWA Hiroyuki
48e94196a5 fix memory hot remove not configured case.
Now, arch dependent code around CONFIG_MEMORY_HOTREMOVE is a mess.
This patch cleans it up. This is against 2.6.23-rc6-mm1.

 - fix compile failure on ia64/ CONFIG_MEMORY_HOTPLUG && !CONFIG_MEMORY_HOTREMOVE case.
 - For !CONFIG_MEMORY_HOTREMOVE, add generic no-op remove_memory(),
   which returns -EINVAL.
 - removed remove_pages() only used in powerpc.
 - removed no-op remove_memory() in i386, sh, sparc64, x86_64.

 - only powerpc returned -ENOSYS from its no-op memory hot remove; change
   it to return -EINVAL.
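
The generic no-op follows the usual pattern for an optional feature.  The
sketch below is a user-space approximation: the u64_sketch typedef stands in
for the kernel's u64, and the exact signature of remove_memory() is an
assumption here rather than something stated in this message.

```c
#include <errno.h>

typedef unsigned long long u64_sketch;	/* stand-in for the kernel's u64 */

#ifdef CONFIG_MEMORY_HOTREMOVE
/* An architecture that supports hot remove supplies the real function. */
int remove_memory(u64_sketch start, u64_sketch size);
#else
/* Generic no-op used when hot remove is not configured. */
static inline int remove_memory(u64_sketch start, u64_sketch size)
{
	(void)start;
	(void)size;
	return -EINVAL;
}
#endif
```

Callers can then use remove_memory() unconditionally and handle the error
return, instead of each architecture carrying its own stub.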

Note:
Currently, only ia64 supports CONFIG_MEMORY_HOTREMOVE. I welcome other
archs if there are requirements and testers.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:02 -07:00
KAMEZAWA Hiroyuki
0c0e619589 memory unplug: page offline
Logic:
 - Set all pages in [start,end) to the isolated migration type.
   This makes all free pages in the range unavailable for allocation.
 - Migrate all LRU pages in the range.
 - Test whether the refcnt of every page in the range is zero.

Todo:
 - Allocate migration destination pages from a better area.
 - Confirm that a page with page_count(page) == 0 && PageReserved(page) is
   safe to free.  (I don't like this kind of page, but...)
 - Find out which pages cannot be migrated.
 - More running tests.
 - Use reclaim for unplugging other memory type areas.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:02 -07:00
KAMEZAWA Hiroyuki
a5d76b54a3 memory unplug: page isolation
Implement generic chunk-of-pages isolation method by using page grouping ops.

This patch adds MIGRATE_ISOLATE to MIGRATE_TYPES. As a result:
 - MIGRATE_TYPES increases.
 - the bitmap for migratetype is enlarged.

Pages of MIGRATE_ISOLATE migratetype will not be allocated even if they are
free.  This lets you isolate *freed* pages from users.  How to free pages is
not the purpose of this patch; you may use the reclaim and migrate code to
free pages.

If start_isolate_page_range(start,end) is called,
 - the migratetype of the range becomes MIGRATE_ISOLATE if
   its type is MIGRATE_MOVABLE. (*) this check can be updated if other
   memory reclaiming work makes progress.
 - MIGRATE_ISOLATE is not on the migratetype fallback list.
 - All free pages and will-be-freed pages are isolated.
To check whether all pages in the range are isolated, use test_pages_isolated().
To cancel isolation, use undo_isolate_page_range().
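
The three operations can be modelled on a plain array of per-block
migratetypes.  This is a toy model under stated assumptions: every name
ending in _sketch is hypothetical, the free lists and zone locking are left
out entirely, and the roll-back on failure is a choice made for the example
rather than a description of the patch.

```c
#include <stddef.h>

enum migratetype_sketch {
	MT_UNMOVABLE,
	MT_MOVABLE,
	MT_ISOLATE,
};

/*
 * Mark blocks [start, end) as isolated.  Only movable blocks may be
 * isolated; on failure every block changed so far is restored.
 * Returns 0 on success and -1 on failure.
 */
int start_isolate_sketch(enum migratetype_sketch *blocks,
			 size_t start, size_t end)
{
	size_t i;

	for (i = start; i < end; i++) {
		if (blocks[i] != MT_MOVABLE)
			goto undo;
		blocks[i] = MT_ISOLATE;
	}
	return 0;
undo:
	while (i > start)
		blocks[--i] = MT_MOVABLE;
	return -1;
}

/* Cancel isolation: isolated blocks become movable again. */
void undo_isolate_sketch(enum migratetype_sketch *blocks,
			 size_t start, size_t end)
{
	size_t i;

	for (i = start; i < end; i++)
		if (blocks[i] == MT_ISOLATE)
			blocks[i] = MT_MOVABLE;
}

/* Return 1 if every block in [start, end) is isolated, else 0. */
int test_isolated_sketch(const enum migratetype_sketch *blocks,
			 size_t start, size_t end)
{
	size_t i;

	for (i = start; i < end; i++)
		if (blocks[i] != MT_ISOLATE)
			return 0;
	return 1;
}
```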

Changes V6 -> V7
 - removed unnecessary #ifdef

There are HOLES_IN_ZONE handling codes...I'm glad if we can remove them..

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:02 -07:00
KAMEZAWA Hiroyuki
75884fb1c6 memory unplug: memory hotplug cleanup
A clean up patch for "scanning memory resource [start, end)" operation.

Currently, the find_next_system_ram() function is used in memory hotplug, but
this interface is not easy to use and the resulting code is complicated.

This patch adds the walk_memory_resource(start,len,arg,func) function.
The function 'func' is called for each valid memory resource range in [start,pfn).
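
The callback style can be sketched in user-space C as follows.  The range
table, the _sketch names and the return convention (stop at the first
non-zero callback result, otherwise return 0) are assumptions made for the
example; only the (start, len, arg, func) argument order is taken from the
description above.

```c
#include <stddef.h>

/* Hypothetical table entry describing one valid memory range. */
struct mem_range_sketch {
	unsigned long start_pfn;
	unsigned long nr_pages;
};

typedef int (*walk_fn)(unsigned long start_pfn, unsigned long nr_pages,
		       void *arg);

/*
 * Call func for each part of [start_pfn, start_pfn + nr_pages) that
 * overlaps a range in the table.  Stops at, and returns, the first
 * non-zero value returned by func; returns 0 otherwise.
 */
int walk_ranges_sketch(const struct mem_range_sketch *tbl, size_t n,
		       unsigned long start_pfn, unsigned long nr_pages,
		       void *arg, walk_fn func)
{
	unsigned long end = start_pfn + nr_pages;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long s = tbl[i].start_pfn;
		unsigned long e = s + tbl[i].nr_pages;
		int ret;

		if (s < start_pfn)
			s = start_pfn;
		if (e > end)
			e = end;
		if (s >= e)
			continue;	/* no overlap with this range */
		ret = func(s, e - s, arg);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: add up the number of pages visited. */
int count_pages_sketch(unsigned long start_pfn, unsigned long nr_pages,
		       void *arg)
{
	(void)start_pfn;
	*(unsigned long *)arg += nr_pages;
	return 0;
}
```

The caller no longer has to loop over resources and clip them by hand; it
only supplies the per-range work and an opaque argument.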

[pbadari@us.ibm.com: Error handling in walk_memory_resource()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter
42a9fdbb12 SLUB: Optimize cacheline use for zeroing
We touch a cacheline in the kmem_cache structure for zeroing to get the
size. However, the hot paths in slab_alloc and slab_free do not reference
any other fields in kmem_cache, so we may have to just bring in the
cacheline for this one access.

Add a new field to kmem_cache_cpu that contains the object size. That
cacheline must already be used in the hotpaths. So we save one cacheline
on every slab_alloc if we zero.

We need to update the kmem_cache_cpu object size if an aliasing operation
changes the objsize of a non-debug slab.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter
4c93c355d5 SLUB: Place kmem_cache_cpu structures in a NUMA aware way
The kmem_cache_cpu structures introduced are currently an array placed in the
kmem_cache struct, meaning that the kmem_cache_cpu structures are
overwhelmingly on the wrong node on systems with a larger number of nodes.
These are performance critical structures since the per node information has
to be touched for every alloc and free in a slab.

In order to place the kmem_cache_cpu structure optimally we put an array
of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB).

However, the kmem_cache_cpu structures can now be allocated in a more
intelligent way.

We would like to put per cpu structures for the same cpu but different
slab caches in cachelines together to save space and decrease the cache
footprint. However, the slab allocator itself controls only allocations
per node. We set up a simple per cpu array for every processor with
100 per cpu structures which is usually enough to get them all set up right.
If we run out then we fall back to kmalloc_node. This also solves the
bootstrap problem since we do not have to use slab allocator functions
early in boot to get memory for the small per cpu structures.
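
The "fixed pool with fallback" idea can be sketched in user-space C.  This
is not the SLUB code: the _sketch names are hypothetical, malloc() stands in
for kmalloc_node(), and the from_pool flag is one possible way to tell the
two kinds of object apart when freeing.

```c
#include <stdlib.h>

#define NR_POOL_SKETCH 100	/* the message above uses 100 per cpu */

struct cpu_struct_sketch {
	struct cpu_struct_sketch *next_free;	/* links free pool entries */
	int from_pool;				/* 1 if part of the static pool */
	unsigned int objsize;
};

struct cpu_pool_sketch {
	struct cpu_struct_sketch slots[NR_POOL_SKETCH];
	struct cpu_struct_sketch *free_list;
};

/* Thread all slots onto the free list. */
void pool_init_sketch(struct cpu_pool_sketch *p)
{
	int i;

	p->free_list = NULL;
	for (i = NR_POOL_SKETCH - 1; i >= 0; i--) {
		p->slots[i].next_free = p->free_list;
		p->slots[i].from_pool = 1;
		p->free_list = &p->slots[i];
	}
}

/* Take a slot from the pool, or fall back to the general allocator. */
struct cpu_struct_sketch *pool_alloc_sketch(struct cpu_pool_sketch *p)
{
	struct cpu_struct_sketch *c = p->free_list;

	if (c) {
		p->free_list = c->next_free;
		return c;
	}
	c = malloc(sizeof(*c));		/* pool exhausted */
	if (c)
		c->from_pool = 0;
	return c;
}

/* Return a structure to wherever it came from. */
void pool_free_sketch(struct cpu_pool_sketch *p, struct cpu_struct_sketch *c)
{
	if (!c->from_pool) {
		free(c);
		return;
	}
	c->next_free = p->free_list;
	p->free_list = c;
}
```

Because the pool is statically sized, the first allocations need no
allocator at all, which is the property the bootstrap remark relies on.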

Pro:
	- NUMA aware placement improves memory performance
	- All global structures in struct kmem_cache become readonly
	- Dense packing of per cpu structures reduces cacheline
	  footprint in SMP and NUMA.
	- Potential avoidance of exclusive cacheline fetches
	  on the free and alloc hotpath since multiple kmem_cache_cpu
	  structures are in one cacheline. This is particularly important
	  for the kmalloc array.

Cons:
	- Additional reference to one read only cacheline (per cpu
	  array of pointers to kmem_cache_cpu) in both slab_alloc()
	  and slab_free().

[akinobu.mita@gmail.com: fix cpu hotplug offline/online path]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "Pekka Enberg" <penberg@cs.helsinki.fi>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter
b3fba8da65 SLUB: Move page->offset to kmem_cache_cpu->offset
We need the offset from the page struct during slab_alloc and slab_free. In
both cases we also reference the cacheline of the kmem_cache_cpu structure.
We can therefore move the offset field into the kmem_cache_cpu structure
freeing up 16 bits in the page struct.

Moving the offset allows an allocation from slab_alloc() without touching the
page struct in the hot path.

The only thing left in slab_free() that touches the page struct cacheline for
per cpu freeing is the checking of SlabDebug(page). The next patch deals with
that.

Use the available 16 bits to broaden page->inuse. More than 64k objects per
slab become possible and we can get rid of the checks for that limitation.

No need anymore to shrink the order of slabs if we boot with 2M sized slabs
(slub_min_order=9).

No need anymore to switch off the offset calculation for very large slabs
since the field in the kmem_cache_cpu structure is 32 bits and so the offset
field can now handle slab sizes of up to 8GB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter
8e65d24c7c SLUB: Do not use page->mapping
After moving the lockless_freelist to kmem_cache_cpu we no longer need
page->lockless_freelist. Restructure the use of the struct page fields in
such a way that we never touch the mapping field.

This in turn allows us to remove the special casing of SLUB when determining
the mapping of a page (needed for corner cases of machines with virtual caches
that need to flush caches of processors mapping a page).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter
dfb4f09609 SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab
A remote free may access the same page struct that also contains the lockless
freelist for the cpu slab. If objects have a short lifetime and are freed by
a different processor then remote frees back to the slab from which we are
currently allocating are frequent. The cacheline with the page struct needs
to be repeatedly acquired in exclusive mode by both the allocating thread and
the freeing thread. If this is frequent enough then performance will suffer
because of cacheline bouncing.

This patchset puts the lockless_freelist pointer in its own cacheline. In
order to make that happen we introduce a per cpu structure called
kmem_cache_cpu.

Instead of keeping an array of pointers to page structs we now keep an array
to a per cpu structure that--among other things--contains the pointer to the
lockless freelist. The freeing thread can then keep possession of exclusive
access to the page struct cacheline while the allocating thread keeps its
exclusive access to the cacheline containing the per cpu structure.

This works as long as the allocating cpu is able to service its request
from the lockless freelist. If the lockless freelist runs empty then the
allocating thread needs to acquire exclusive access to the cacheline with
the page struct and lock the slab.

The allocating thread will then check if new objects were freed to the per
cpu slab. If so it will keep the slab as the cpu slab and continue with the
recently remote freed objects. So the allocating thread can take a series
of just freed remote pages and dish them out again. Ideally allocations
could be just recycling objects in the same slab this way which will lead
to an ideal allocation / remote free pattern.

The number of objects that can be handled in this way is limited by the
capacity of one slab. Increasing slab size via slub_min_objects/
slub_max_order may increase the number of objects and therefore performance.

If the allocating thread runs out of objects and finds that no objects were
put back by the remote processor then it will retrieve a new slab (from the
partial lists or from the page allocator) and start with a whole
new set of objects while the remote thread may still be freeing objects to
the old cpu slab. This may then repeat until the new slab is also exhausted.
If remote freeing has freed objects in the earlier slab then that earlier
slab will now be on the partial freelist and the allocating thread will
pick that slab next for allocation. So the loop is extended. However,
both threads need to take the list_lock to make the swizzling via
the partial list happen.

It is likely that this kind of scheme will keep the objects being passed
around to a small set that can be kept in the cpu caches leading to increased
performance.

More code cleanups become possible:

- Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
  Allows reducing the number of parameters to various functions.
- Can define a new node_match() function for NUMA to encapsulate locality
  checks.

Effect on allocations:

Cachelines touched before this patch:

	Write:	page cache struct and first cacheline of object

Cachelines touched after this patch:

	Write:	kmem_cache_cpu cacheline and first cacheline of object
	Read: page cache struct (but see later patch that avoids touching
		that cacheline)

The handling when the lockless alloc list runs empty gets to be a bit more
complicated since another cacheline has now to be written to. But that is
halfway out of the hot path.

Effect on freeing:

Cachelines touched before this patch:

	Write: page_struct and first cacheline of object

Cachelines touched after this patch depending on how we free:

  Write(to cpu_slab):	kmem_cache_cpu struct and first cacheline of object
  Write(to other):	page struct and first cacheline of object

  Read(to cpu_slab):	page struct to id slab etc. (but see later patch that
  			avoids touching the page struct on free)
  Read(to other):	cpu local kmem_cache_cpu struct to verify it's not
  			the cpu slab.

Summary:

Pro:
	- Distinct cachelines so that concurrent remote frees and local
	  allocs on a cpuslab can occur without cacheline bouncing.
	- Avoids potential bouncing cachelines because of neighboring
	  per cpu pointer updates in kmem_cache's cpu_slab structure since
	  it now grows to a cacheline (Therefore remove the comment
	  that talks about that concern).

Cons:
	- Freeing objects now requires the reading of one additional
	  cacheline. That can be mitigated for some cases by the following
	  patches but it's not possible to completely eliminate these
	  references.

	- Memory usage grows slightly.

	The size of each per cpu object is blown up from one word
	(pointing to the page_struct) to one cacheline with various data.
	So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say
	NR_SLABS is 100 and a cache line size of 128 then we have just
	increased SLAB metadata requirements by 12.8k per cpu.
	(Another later patch reduces these requirements)
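The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical and not part of the patch, and like the text it counts the whole cacheline as new cost, ignoring the one word that was already used per cpu.

```c
#include <stddef.h>

/*
 * Hypothetical helpers (not from the patch): extra per-cpu metadata
 * when each per-cpu slot grows from one pointer to a full cacheline.
 */
static size_t extra_bytes_per_cpu(size_t nr_slabs, size_t cacheline_bytes)
{
	return nr_slabs * cacheline_bytes;
}

/* NR_CPUS * NR_SLABS * L1_BYTES, as written in the text. */
static size_t extra_bytes_total(size_t nr_cpus, size_t nr_slabs,
				size_t cacheline_bytes)
{
	return nr_cpus * extra_bytes_per_cpu(nr_slabs, cacheline_bytes);
}
```

With 100 slabs and 128-byte cache lines, extra_bytes_per_cpu() gives 12800 bytes, which is the 12.8k per cpu quoted above.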

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Mel Gorman
467c996c1e Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo
This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
The information is collected only on request so there is no runtime overhead.
The statistics are in three parts:

The first part prints information on the size of blocks that pages are
being grouped on and looks like

Page block order: 10
Pages per block:  1024

The second part is a more detailed version of /proc/buddyinfo and looks like

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type  Reclaimable      1      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Reserve      0      4      4      0      0      0      0      1      0      1      0
Node    0, zone   Normal, type    Unmovable    111      8      4      4      2      3      1      0      0      0      0
Node    0, zone   Normal, type  Reclaimable    293     89      8      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Movable      1      6     13      9      7      6      3      0      0      0      0
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      4

The third part looks like

Number of blocks type     Unmovable  Reclaimable      Movable      Reserve
Node 0, zone      DMA            0            1            2            1
Node 0, zone   Normal            3           17           94            4

To walk the zones within a node with interrupts disabled, walk_zones_in_node()
is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
/proc/pagetypeinfo to reduce code duplication.  It seems specific to what
vmstat.c requires but could be broken out as a general utility function in
mmzone.c if there were other potential users.
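The shared walker described above is essentially a visitor over the zones of one node. The following is a simplified userspace model, not the kernel function: the struct layouts, the sk_ names and the populated test are assumptions, and the locking is only indicated in comments.

```c
#include <stddef.h>

#define SK_MAX_NR_ZONES 3	/* assumption for this sketch */

struct sk_zone {
	const char *name;
	unsigned long present_pages;
};

struct sk_node {
	struct sk_zone zones[SK_MAX_NR_ZONES];
};

typedef void (*sk_zone_visitor)(void *ctx, struct sk_node *node,
				struct sk_zone *zone);

/* Call visit() once for every populated zone of the node. */
static void sk_walk_zones_in_node(void *ctx, struct sk_node *node,
				  sk_zone_visitor visit)
{
	for (size_t i = 0; i < SK_MAX_NR_ZONES; i++) {
		struct sk_zone *zone = &node->zones[i];

		if (zone->present_pages == 0)
			continue;	/* skip empty zones */

		/* kernel code would take the zone lock with irqs disabled here */
		visit(ctx, node, zone);
		/* ... and release it again here */
	}
}

/* Example visitor: counts the zones it is shown. */
static void sk_count_zone(void *ctx, struct sk_node *node,
			  struct sk_zone *zone)
{
	(void)node;
	(void)zone;
	(*(int *)ctx)++;
}
```

Each /proc file would then supply its own visitor while sharing the iteration and locking in one place.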

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
d9c2340052 Do not depend on MAX_ORDER when grouping pages by mobility
Currently mobility grouping works at the MAX_ORDER_NR_PAGES level.  This makes
sense for the majority of users where this is also the huge page size.
However, on platforms like ia64 where the huge page size is runtime
configurable it is desirable to group at a lower order.  On x86_64 and
occasionally on x86, the hugepage size may not always be MAX_ORDER_NR_PAGES.

This patch groups pages together based on the value of HUGETLB_PAGE_ORDER.  It
uses a compile-time constant if possible and a variable where the huge page
size is runtime configurable.

It is assumed that grouping should be done at the lowest sensible order and
that the user would not want to override this.  If this is not true,
page_block order could be forced to a variable initialised via a boot-time
kernel parameter.
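A minimal sketch of the "compile-time constant if possible, variable otherwise" arrangement follows. The switch SK_HUGE_PAGE_SIZE_VARIABLE, the setter and the value 9 for HUGETLB_PAGE_ORDER are assumptions made for illustration; the real configuration symbols and values depend on the architecture.

```c
/* Assumed value for illustration: 2 MB huge pages on 4 KB base pages. */
#define HUGETLB_PAGE_ORDER 9

#ifdef SK_HUGE_PAGE_SIZE_VARIABLE
/* Huge page size chosen at boot: the order has to live in a variable. */
static int pageblock_order = HUGETLB_PAGE_ORDER;

static void sk_set_pageblock_order(int order)
{
	pageblock_order = order;
}
#else
/* Huge page size fixed at build time: the order folds to a constant. */
#define pageblock_order HUGETLB_PAGE_ORDER
#endif

/* Number of base pages grouped into one mobility block. */
#define pageblock_nr_pages (1UL << pageblock_order)
```

Callers only ever use pageblock_order and pageblock_nr_pages, so they do not need to know which of the two variants was built.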

One potential issue with this patch is that IA64 now parses hugepagesz with
early_param() instead of __setup().  __setup() is called after the memory
allocator has been initialised and the pageblock bitmaps already setup.  In
tests on one IA64 there did not seem to be any problem with using
early_param() and in fact may be more correct as it guarantees the parameter
is handled before the parsing of hugepages=.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
64c5e135bf don't group high order atomic allocations
Grouping high-order atomic allocations together was intended to allow
bursty users of atomic allocations to work such as e1000 in situations
where their preallocated buffers were depleted.  This did not work in at
least one case with a wireless network adapter needing order-1 allocations
frequently.  To resolve that, the free pages used for min_free_kbytes were
moved to separate contiguous blocks with the patch
bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.

It is felt that keeping the free pages in the same contiguous blocks should
be sufficient for bursty short-lived high-order atomic allocations to
succeed, maybe even with the e1000.  Even if there is a failure, increasing
the value of min_free_kbytes will free pages as contiguous blocks in
contrast to the standard buddy allocator which makes no attempt to keep the
minimum number of free pages contiguous.

This patch backs out grouping high order atomic allocations together to
determine if it is really needed or not.  If a new report comes in about
high-order atomic allocations failing, the feature can be reintroduced to
determine if it fixes the problem or not.  As a side-effect, this patch
reduces by 1 the number of bits required to track the mobility type of
pages within a MAX_ORDER_NR_PAGES block.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
ac0e5b7a6b remove PAGE_GROUP_BY_MOBILITY
Grouping pages by mobility can be disabled at compile-time. This was
considered undesirable by a number of people. However, in the current stack of
patches, it is not a simple case of just dropping the configurable patch as it
would cause merge conflicts.  This patch backs out the configuration option.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
56fd56b868 Bias the location of pages freed for min_free_kbytes in the same MAX_ORDER_NR_PAGES blocks
The standard buddy allocator always favours the smallest block of pages.
The effect of this is that the pages kept free to satisfy min_free_kbytes tend
to be preserved since boot time at the same location in memory for a very
long time and as a contiguous block.  When an administrator sets the
reserve at 16384 at boot time, it tends to be the same MAX_ORDER blocks
that remain free.  This allows the occasional high atomic allocation to
succeed up until the point the blocks are split.  In practice, it is
difficult to split these blocks but when they do split, the benefit of
having min_free_kbytes for contiguous blocks disappears.  Additionally,
increasing min_free_kbytes once the system has been running for some time
has no guarantee of creating contiguous blocks.

On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large
blocks when there are no free pages of the appropriate type available.  A
side-effect of this is that all blocks in memory tend to be used up and
the contiguous free blocks from boot time are not preserved like in the
vanilla allocator.  This can cause a problem if a new caller is unwilling
to reclaim or does not reclaim for long enough.

A failure scenario was found for a wireless network device allocating
order-1 atomic allocations but the allocations were not intense or frequent
enough for a whole block of pages to be preserved for MIGRATE_HIGHALLOC.
This was reproduced on a desktop by booting with mem=256mb, forcing the
driver to allocate at order-1, running a bittorrent client (downloading a
debian ISO) and building a kernel with -j2.

This patch addresses the problem on the desktop machine booted with
mem=256mb.  It works by setting aside a reserve of MAX_ORDER_NR_PAGES
blocks, the number of which depends on the value of min_free_kbytes.  These
blocks are only fallen back to when there are no other free pages.  Then the
smallest possible page is used just like the normal buddy allocator instead
of the largest possible page to preserve contiguous pages.  The pages in free
lists in the reserve blocks are never taken for another migrate type.  The
result is that even if min_free_kbytes is set to a low value, contiguous
blocks will be preserved in the MIGRATE_RESERVE blocks.
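As a rough illustration of how the number of reserve blocks could follow from min_free_kbytes, the sketch below rounds the minimum free page count up to whole blocks. The rounding rule and the function name are assumptions for this example, not a statement of what the patch computes.

```c
/*
 * Assumption for illustration: the reserve is the minimum free page
 * count rounded up to whole MAX_ORDER_NR_PAGES blocks.
 */
static unsigned long sk_reserve_blocks(unsigned long min_free_pages,
				       unsigned long pages_per_block)
{
	return (min_free_pages + pages_per_block - 1) / pages_per_block;
}
```

Under this assumption even a very small min_free_kbytes reserves one full block, which matches the observation that contiguous blocks are preserved at low settings.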

This works better than the vanilla allocator because if min_free_kbytes is
increased, a new reserve block will be chosen based on the location of
reclaimable pages and the block will free up as contiguous pages.  In the
vanilla allocator, no effort is made to target a block of pages to free as
contiguous pages and min_free_kbytes pages are scattered randomly.

This effect has been observed on the test machine.  min_free_kbytes was set
initially low but it was kept as a contiguous free block within
MIGRATE_RESERVE.  min_free_kbytes was then set to a higher value and over a
period of time, the free blocks were within the reserve and coalescing.
How long it takes to free up depends on how quickly LRU is rotating.
Amusingly, this means that more activity will free the blocks faster.

This mechanism potentially replaces MIGRATE_HIGHALLOC as it may be more
effective than grouping contiguous free pages together.  It all depends on
whether the number of active atomic high allocations exceeds
min_free_kbytes or not.  If the number of active allocations exceeds
min_free_kbytes, it's worth it but maybe in that situation, min_free_kbytes
should be set higher.  Once there are no more reports of allocation
failures, a patch will be submitted that backs out MIGRATE_HIGHALLOC to
see if the reports stay missing.

Credit to Mariusz Kozlowski for discovering the problem, describing the
failure scenario and testing patches and scenarios.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
5c0e306647 Fix corruption of memmap on IA64 SPARSEMEM when mem_section is not a power of 2
There are problems in the use of SPARSEMEM and pageblock flags that cause
problems on ia64.

The first part of the problem is that units are incorrect in
SECTION_BLOCKFLAGS_BITS computation.  This results in a map_section's
section_mem_map being treated as part of a bitmap which isn't good.  This
was evident with an invalid virtual address when mem_init attempted to free
bootmem pages while relinquishing control from the bootmem allocator.

The second part of the problem occurs because the pageblock flags bitmap is
located within the mem_section.  The SECTIONS_PER_ROOT computation using
sizeof (mem_section) may not yield a power of 2 depending on the size of the
bitmap.  Masks and similar computations that assume a power-of-2 base then break.
This issue was seen with SPARSEMEM_EXTREME on ia64.  This patch moves the
bitmap outside of mem_section and uses a pointer instead in the
mem_section.  The bitmaps are allocated when the section is being
initialised.
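The power-of-2 issue can be shown with plain arithmetic. In this sketch the page size and the two structure sizes (16 bytes with a pointer to the bitmap, 24 bytes with an embedded one-word bitmap) are assumed values picked for illustration, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096UL	/* assumed page size */

static bool sk_is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* How many section descriptors fit into one page-sized root. */
static unsigned long sk_sections_per_root(size_t section_size)
{
	return SK_PAGE_SIZE / section_size;
}

/* Index inside a root via a mask: only valid for power-of-2 counts. */
static unsigned long sk_index_by_mask(unsigned long nr, unsigned long per_root)
{
	return nr & (per_root - 1);
}

/* Index inside a root via modulo: always valid, used as the reference. */
static unsigned long sk_index_by_modulo(unsigned long nr, unsigned long per_root)
{
	return nr % per_root;
}
```

With a 16-byte descriptor the root holds 256 entries and mask and modulo agree. With a 24-byte descriptor it holds 170 entries and the mask-based index no longer matches, which is the kind of breakage that moving the bitmap behind a pointer avoids.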

Note that sparse_early_usemap_alloc() does not use alloc_remap() like
sparse_early_mem_map_alloc().  The allocation required for the bitmap on
x86, the only architecture that uses alloc_remap(), is typically smaller than
a cache line.  alloc_remap() pads out allocations to the cache size which
would be a needless waste.

Credit to Bob Picco for identifying the original problem and effecting a
fix for the SECTION_BLOCKFLAGS_BITS calculation.  Credit to Andy Whitcroft
for devising the best way of allocating the bitmaps only when required for
the section.

[wli@holomorphy.com: warning fix]
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: William Irwin <bill.irwin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
e010487dbe Group high-order atomic allocations
In rare cases, the kernel needs to allocate a high-order block of pages
without sleeping.  For example, this is the case with e1000 cards configured
to use jumbo frames.  Migrating or reclaiming pages in this situation is not
an option.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE.  MIGRATE_HIGHATOMIC allocations are exactly what they sound
like.  Care is taken that pages of other migrate types do not use the same
blocks as high-order atomic allocations.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
e12ba74d8f Group short-lived and reclaimable kernel allocations
This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations.  When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
them and re-reading the information from elsewhere.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Mel Gorman
b92a6edd4b Add a configure option to group pages by mobility
The grouping mechanism has some memory overhead and a more complex allocation
path.  This patch allows the strategy to be disabled for small memory systems
or if it is known the workload is suffering because of the strategy.  It also
acts to show where the page groupings strategy interacts with the standard
buddy allocator.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Mel Gorman
b2a0ac8875 Split the free lists for movable and unmovable allocations
This patch adds the core of the fragmentation reduction strategy.  It works by
grouping pages together based on their ability to migrate or be reclaimed.
Basically, it works by breaking the list in zone->free_area list into
MIGRATE_TYPES number of lists.
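The data-structure change can be sketched as follows. This is a userspace model with sk_ prefixed names invented for the example; the two migrate types and the MAX_ORDER value of 11 are assumptions, and the real kernel structures carry more fields.

```c
#include <stdbool.h>

#define SK_MAX_ORDER 11	/* assumed value */

enum sk_migrate_type {
	SK_MIGRATE_UNMOVABLE,
	SK_MIGRATE_MOVABLE,
	SK_MIGRATE_TYPES	/* number of lists per order */
};

struct sk_list_head {
	struct sk_list_head *next, *prev;
};

/* Previously one list per order; now one list per order and type. */
struct sk_free_area {
	struct sk_list_head free_list[SK_MIGRATE_TYPES];
	unsigned long nr_free;
};

struct sk_zone {
	struct sk_free_area free_area[SK_MAX_ORDER];
};

/* Make every per-type list an empty circular list. */
static void sk_init_free_lists(struct sk_zone *zone)
{
	for (int order = 0; order < SK_MAX_ORDER; order++) {
		for (int t = 0; t < SK_MIGRATE_TYPES; t++) {
			struct sk_list_head *head =
				&zone->free_area[order].free_list[t];

			head->next = head;
			head->prev = head;
		}
		zone->free_area[order].nr_free = 0;
	}
}

/* Select the free list for a given order and migrate type. */
static struct sk_list_head *sk_free_list(struct sk_zone *zone, int order,
					 enum sk_migrate_type type)
{
	return &zone->free_area[order].free_list[type];
}

static bool sk_list_empty(const struct sk_list_head *head)
{
	return head->next == head;
}
```

Allocation and free paths then index the list by migrate type as well as by order, which is what keeps pages of one type clustered.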

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Mel Gorman
835c134ec4 Add a bitmap that is used to track flags affecting a block of pages
Here is the latest revision of the anti-fragmentation patches.  Of particular
note in this version is special treatment of high-order atomic allocations.
Care is taken to group them together and avoid grouping pages of other types
near them.  Artificial tests imply that it works.  I'm trying to get the
hardware together that would allow setting up of a "real" test.  If anyone
already has a setup and test that can trigger the atomic-allocation problem,
I'd appreciate a test of these patches and a report.  The second major change
is that these patches will apply cleanly with patches that implement
anti-fragmentation through zones.

kernbench shows effectively no performance difference varying between -0.2%
and +2% on a variety of test machines.  Success rates for huge page allocation
are dramatically increased.  For example, on a ppc64 machine, the vanilla
kernel was only able to allocate 1% of memory as a hugepage and this was due
to a single hugepage reserved as min_free_kbytes.  With these patches applied,
17% was allocatable as superpages.  With reclaim-related fixes from Andy
Whitcroft, it was 40% and further reclaim-related improvements should increase
this further.

Changelog Since V28
o Group high-order atomic allocations together
o It is no longer required to set min_free_kbytes to 10% of memory. A value
  of 16384 in most cases will be sufficient
o Now applied with zone-based anti-fragmentation
o Fix incorrect VM_BUG_ON within buffered_rmqueue()
o Reorder the stack so later patches do not back out work from earlier patches
o Fix bug where journal pages were being treated as movable
o Bias placement of non-movable pages to lower PFNs
o More aggressive clustering of reclaimable pages in reactions to workloads
  like updatedb that flood the size of inode caches

Changelog Since V27

o Renamed anti-fragmentation to Page Clustering. Anti-fragmentation was giving
  the mistaken impression that it was the 100% solution for high order
  allocations. Instead, it greatly increases the chances high-order
  allocations will succeed and lays the foundation for defragmentation and
  memory hot-remove to work properly
o Redefine page groupings based on ability to migrate or reclaim instead of
  basing on reclaimability alone
o Get rid of spurious inits
o Per-cpu lists are no longer split up per-type. Instead the per-cpu list is
  searched for a page of the appropriate type
o Added more explanation commentary
o Fix up bug in pageblock code where bitmap was used before being initialised

Changelog Since V26
o Fix double init of lists in setup_pageset

Changelog Since V25
o Fix loop order of for_each_rclmtype_order so that order of loop matches args
o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
o Rename get_pageblock_type() to get_page_rclmtype()
o Fix alignment problem in move_freepages()
o Add mechanism for assigning flags to blocks of pages instead of page->flags
o On fallback, do not examine the preferred list of free pages a second time

The purpose of these patches is to reduce external fragmentation by grouping
pages of related types together.  When pages are migrated (or reclaimed under
memory pressure), large contiguous pages will be freed.

This patch works by categorising allocations by their ability to migrate;

Movable - The pages may be moved with the page migration mechanism. These are
	generally userspace pages.

Reclaimable - These are allocations for some kernel caches that are
	reclaimable or allocations that are known to be very short-lived.

Unmovable - These are pages that are allocated by the kernel that
	are not trivially reclaimed. For example, the memory allocated for a
	loaded module would be in this category. By default, allocations are
	considered to be of this type

HighAtomic - These are high-order allocations belonging to callers that
	cannot sleep or perform any IO. In practice, this is restricted to
	jumbo frame allocation for network receive. It is assumed that the
	allocations are short-lived

Instead of having one MAX_ORDER-sized array of free lists in struct free_area,
there is one for each type of reclaimability.  Once a 2^MAX_ORDER block of
pages is split for a type of allocation, it is added to the free-lists for
that type, in effect reserving it.  Hence, over time, pages of the different
types can be clustered together.

When the preferred freelists are exhausted, the largest possible block is taken
from an alternative list.  Buddies that are split from that large block are
placed on the preferred allocation-type freelists to mitigate fragmentation.
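The fallback rule in the paragraph above can be modelled with simple counters. This sketch tracks only how many free blocks exist per type and order; the names, the three types and the MAX_ORDER value are assumptions, and the real allocator works on page lists rather than counts.

```c
#define SK_MAX_ORDER 11	/* assumed value */
#define SK_NR_TYPES  3	/* assumed number of migrate types */

/*
 * Fallback path only: take the largest free block from any other type,
 * split it down to the requested order, and credit the leftover buddies
 * to the preferred type. Returns the order of the block that was taken,
 * or -1 if no other type has a large enough block.
 */
static int sk_alloc_fallback(unsigned free_count[SK_NR_TYPES][SK_MAX_ORDER],
			     int preferred, int order)
{
	for (int cur = SK_MAX_ORDER - 1; cur >= order; cur--) {
		for (int t = 0; t < SK_NR_TYPES; t++) {
			if (t == preferred || free_count[t][cur] == 0)
				continue;

			free_count[t][cur]--;
			/* each split leaves one free buddy one order lower */
			for (int o = cur - 1; o >= order; o--)
				free_count[preferred][o]++;
			return cur;
		}
	}
	return -1;
}
```

Taking the largest block means a single steal supplies the preferred type with many pages at once, so later allocations of that type stay inside the stolen block instead of nibbling at further blocks of other types.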

This implementation gives best-effort for low fragmentation in all zones.
Ideally, min_free_kbytes needs to be set to a value equal to 4 * (1 <<
(MAX_ORDER-1)) pages in most cases.  This would be 16384 on x86 and x86_64 for
example.
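The 16384 figure follows directly from the formula. A small check, assuming MAX_ORDER is 11 and the page size is 4096 bytes (both assumed here as typical x86 values; the function name is hypothetical):

```c
/* 4 * (1 << (MAX_ORDER-1)) pages, converted to kilobytes. */
static unsigned long sk_recommended_min_free_kbytes(unsigned int max_order,
						    unsigned long page_size)
{
	unsigned long pages = 4UL * (1UL << (max_order - 1));

	return pages * (page_size / 1024);
}
```

Four blocks of 1024 pages at 4 KB each come to 16384 KB, which is the value quoted in the text.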

Our tests show that about 60-70% of physical memory can be allocated on a
desktop after a few days uptime.  In benchmarks and stress tests, we are
finding that 80% of memory is available as contiguous blocks at the end of the
test.  To compare, a standard kernel was getting < 1% of memory as large pages
on a desktop and about 8-12% of memory as large pages at the end of stress
tests.

Following this email are 12 patches that implement this page grouping feature.
The first patch introduces a mechanism for storing flags related to a whole
block of pages.  Then allocations are split between movable and all other
allocations.  Following that are patches to deal with per-cpu pages and make
the mechanism configurable.  The next patch moves free pages between lists
when partially allocated blocks are used for pages of another migrate type.
The second last patch groups reclaimable kernel allocations such as inode
caches together.  The final patch related to groupings keeps high-order atomic
allocations.

The last two patches are more concerned with control of fragmentation.  The
second last patch biases placement of non-movable allocations towards the
start of memory.  This is with a view of supporting memory hot-remove of DIMMs
with higher PFNs in the future.  The biasing could be enforced a lot heavier
but it would cost.  The last patch aggressively clusters reclaimable pages like
inode caches together.

The fragmentation reduction strategy needs to track if pages within a block
can be moved or reclaimed so that pages are freed to the appropriate list.
This patch adds a bitmap for flags affecting a whole MAX_ORDER block of
pages.

In non-SPARSEMEM configurations, the bitmap is stored in the struct zone and
allocated during initialisation.  SPARSEMEM statically allocates the bitmap in
a struct mem_section so that bitmaps do not have to be resized during memory
hotadd.  This wastes a small amount of memory per unused section (usually
sizeof(unsigned long)) but the complexity of dynamically allocating the memory
is quite high.

Additional credit to Andy Whitcroft who reviewed an earlier implementation
of the mechanism and suggested how to make it a *lot* cleaner.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
6cb062296f Categorize GFP flags
The function of GFP_LEVEL_MASK seems to be unclear.  In order to clear up
the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
flags:

GFP_RECLAIM_MASK	Flags used to control page allocator reclaim behavior.

GFP_CONSTRAINT_MASK	Flags used to limit where allocations can occur.

GFP_SLAB_BUG_MASK	Flags that the slab allocator BUG()s on.

These replace the uses of GFP_LEVEL mask in the slab allocators and in
vmalloc.c.

The use of the flags not included in these sets may occur as a result of a
slab allocation standing in for a page allocation when constructing scatter
gather lists.  Extraneous flags are cleared and not passed through to the
page allocator.  __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
now be ignored if passed to a slab allocator.

Change the allocation of allocator meta data in SLAB and vmalloc to not
pass through flags listed in GFP_CONSTRAINT_MASK.  SLAB already removes the
__GFP_THISNODE flag for such allocations.  Generalize that to also cover
vmalloc.  The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.
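The filtering described in the last two paragraphs amounts to masking. In the sketch below all bit values and the SK_ names are hypothetical stand-ins chosen for the example; only the masking pattern is the point, and the real masks contain more flags than shown.

```c
typedef unsigned int sk_gfp_t;

/* Hypothetical bit values, chosen only for this sketch. */
#define SK_GFP_WAIT      0x01u
#define SK_GFP_IO        0x02u
#define SK_GFP_HARDWALL  0x04u
#define SK_GFP_THISNODE  0x08u
#define SK_GFP_COLD      0x10u
#define SK_GFP_COMP      0x20u

#define SK_RECLAIM_MASK     (SK_GFP_WAIT | SK_GFP_IO)
#define SK_CONSTRAINT_MASK  (SK_GFP_HARDWALL | SK_GFP_THISNODE)

/* Flags a slab allocator forwards: extraneous bits are cleared. */
static sk_gfp_t sk_flags_for_page_allocator(sk_gfp_t flags)
{
	return flags & (SK_RECLAIM_MASK | SK_CONSTRAINT_MASK);
}

/* Flags for allocator metadata: placement constraints are dropped too. */
static sk_gfp_t sk_flags_for_metadata(sk_gfp_t flags)
{
	return sk_flags_for_page_allocator(flags) & ~SK_CONSTRAINT_MASK;
}
```

A caller passing a cold-page hint together with a node constraint would see the hint dropped for the object allocation, and both the hint and the constraint dropped for the metadata allocation.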

The impact of allocator metadata placement on access latency to the
cachelines of the object itself is minimal since metadata is only
referenced on alloc and free.  The attempt is still made to place the meta
data optimally but we consistently allow fallback both in SLAB and vmalloc
(SLUB does not need to allocate metadata like that).

Allocator metadata may serve multiple in kernel users and thus should not
be subject to the limitations arising from a single allocation context.

[akpm@linux-foundation.org: fix fallback_alloc()]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
0e1e7c7a73 Memoryless nodes: Use N_HIGH_MEMORY for cpusets
cpusets try to ensure that any node added to a cpuset's mems_allowed is
on-line and contains memory.  The assumption was that online nodes contained
memory.  Thus, it is possible to add memoryless nodes to a cpuset and then add
tasks to this cpuset.  This results in a continuous series of oom-kills and
an apparent system hang.

Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a.  node_memory_map] in
place of node_online_map when vetting memories.  Return error if admin
attempts to write a non-empty mems_allowed node mask containing only
memoryless-nodes.
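The vetting rule can be sketched with plain bitmasks. The function and type names are hypothetical, and whether memoryless nodes are silently dropped from a mixed mask is an assumption of this sketch; the text only states that a non-empty mask made up entirely of memoryless nodes is rejected.

```c
#include <errno.h>

typedef unsigned long sk_nodemask_t;	/* one bit per node */

/*
 * Validate a requested mems_allowed mask against the set of nodes that
 * have memory. Returns 0 and stores the accepted mask in *out, or
 * -EINVAL if a non-empty request names only memoryless nodes.
 */
static int sk_vet_mems_allowed(sk_nodemask_t requested,
			       sk_nodemask_t nodes_with_memory,
			       sk_nodemask_t *out)
{
	if (requested != 0 && (requested & nodes_with_memory) == 0)
		return -EINVAL;

	/* assumption: memoryless nodes are dropped from a mixed mask */
	*out = requested & nodes_with_memory;
	return 0;
}
```

Using the memory mask instead of the online mask is what prevents a cpuset from ending up with no usable memory at all.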

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
523b945855 Memoryless nodes: Fix GFP_THISNODE behavior
GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
first zone of a nodelist.  That only works if the node has memory.  A
memoryless node will have its first zone on another pgdat (node).

GFP_THISNODE currently will simply return memory on the first pgdat.  Thus it
is returning memory on other nodes.  GFP_THISNODE should fail if there is no
local memory on a node.

Add a new set of zonelists for each node that only contain the zones that
belong to the node itself so that no fallback is possible.

Then modify gfp_type to pickup the right zone based on the presence of
__GFP_THISNODE.
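A simplified model of the two zonelists per node is sketched below. It works at node granularity rather than zone granularity, uses round-robin order as a stand-in for distance ordering, and all sk_ names are hypothetical; it is meant only to show why selecting the list up front removes the need for a check in the hot path.

```c
#define SK_MAX_NODES 4
#define SK_NO_NODE   (-1)

struct sk_zonelists {
	/* every node with memory, starting at the local node, -1 terminated */
	int fallback[SK_MAX_NODES + 1];
	/* only the local node, and only if it has memory, -1 terminated */
	int thisnode[2];
};

static void sk_build_zonelists(struct sk_zonelists *zl,
			       const int has_memory[SK_MAX_NODES], int node)
{
	int k = 0;

	for (int i = 0; i < SK_MAX_NODES; i++) {
		int n = (node + i) % SK_MAX_NODES;

		if (has_memory[n])
			zl->fallback[k++] = n;
	}
	zl->fallback[k] = SK_NO_NODE;

	zl->thisnode[0] = has_memory[node] ? node : SK_NO_NODE;
	zl->thisnode[1] = SK_NO_NODE;
}

/* Pick the list by flag; the allocator then just walks the list. */
static int sk_first_node(const struct sk_zonelists *zl, int thisnode_flag)
{
	return thisnode_flag ? zl->thisnode[0] : zl->fallback[0];
}
```

For a memoryless node the node-only list is empty, so a this-node request fails instead of quietly returning memory from another node.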

Drop the existing GFP_THISNODE checks from the page_allocators hot path.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter
37c0708dbe Memoryless nodes: Add N_CPU node state
We need the check for a node with cpu in zone reclaim.  Zone reclaim will not
allow remote zone reclaim if a node has a cpu.

[Lee.Schermerhorn@hp.com: Move setup of N_CPU node state mask]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Christoph Lameter
7ea1530ab3 Memoryless nodes: introduce mask of nodes with memory
It is necessary to know if nodes have memory since we have recently begun to
add support for memoryless nodes.  For that purpose we introduce two new
node states: N_HIGH_MEMORY and N_NORMAL_MEMORY.

A node has its bit in N_HIGH_MEMORY set if it has any memory regardless of the
type of memory.  If a node has memory then it has at least one zone defined
in its pgdat structure that is located in the pgdat itself.

A node has its bit in N_NORMAL_MEMORY set if it has a lower zone than
ZONE_HIGHMEM.  This means it is possible to allocate memory that is not
subject to kmap.

N_HIGH_MEMORY and N_NORMAL_MEMORY can then be used in various places to ensure
that we do the right thing when we encounter a memoryless node.
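The two memory states can be pictured as one bitmask per state. This is a userspace sketch with sk_ prefixed names invented for the example and a single unsigned long standing in for a nodemask; the page counts passed in are hypothetical inputs.

```c
enum sk_node_state {
	SK_N_POSSIBLE,
	SK_N_ONLINE,
	SK_N_NORMAL_MEMORY,
	SK_N_HIGH_MEMORY,
	SK_NR_NODE_STATES
};

/* One bitmask per state; bit n describes node n. */
static unsigned long sk_state_mask[SK_NR_NODE_STATES];

static void sk_node_set_state(int node, enum sk_node_state state)
{
	sk_state_mask[state] |= 1UL << node;
}

static int sk_node_state(int node, enum sk_node_state state)
{
	return (int)((sk_state_mask[state] >> node) & 1UL);
}

/*
 * Any memory at all sets N_HIGH_MEMORY; memory below ZONE_HIGHMEM
 * additionally sets N_NORMAL_MEMORY.
 */
static void sk_note_node_memory(int node, unsigned long normal_pages,
				unsigned long highmem_pages)
{
	if (normal_pages || highmem_pages)
		sk_node_set_state(node, SK_N_HIGH_MEMORY);
	if (normal_pages)
		sk_node_set_state(node, SK_N_NORMAL_MEMORY);
}
```

Code that must avoid memoryless nodes can then test the N_HIGH_MEMORY bit, while code that needs memory usable without kmap tests the N_NORMAL_MEMORY bit.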

[akpm@linux-foundation.org: build fix]
[Lee.Schermerhorn@hp.com: update N_HIGH_MEMORY node state for memory hotadd]
[y-goto@jp.fujitsu.com: Fix memory hotplug + sparsemem build]
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Christoph Lameter
1380891071 Memoryless nodes: Generic management of nodemasks for various purposes
Why do we need to support memoryless nodes?

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> For fujitsu, problem is called "empty" node.
>
> When ACPI's SRAT table includes "possible nodes", ia64 bootstrap(acpi_numa_init)
> creates nodes, which includes no memory, no cpu.
>
> I tried to remove empty-node in past, but that was denied.
> It was because we can hot-add cpu to the empty node.
> (node-hotplug triggered by cpu is not implemented now. and it will be ugly.)
>
>
> For HP, (Lee can comment on this later), they have memory-less-node.
> As far as I hear, HP's machine can have following configuration.
>
> (example)
> Node0: CPU0   memory AAA MB
> Node1: CPU1   memory AAA MB
> Node2: CPU2   memory AAA MB
> Node3: CPU3   memory AAA MB
> Node4: Memory XXX GB
>
> AAA is very small value (below 16MB)  and will be omitted by ia64 bootstrap.
> After boot, only Node 4 has valid memory (but have no cpu.)
>
> Maybe this is memory-interleave by firmware config.

Christoph Lameter <clameter@sgi.com> wrote:

> Future SGI platforms (actually also current one can have but nothing like
> that is deployed to my knowledge) have nodes with only cpus. Current SGI
> platforms have nodes with just I/O that we so far cannot manage in the
> core. So the arch code maps them to the nearest memory node.

Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> For the HP platforms, we can configure each cell with from 0% to 100%
> "cell local memory".  When we configure with <100% CLM, the "missing
> percentages" are interleaved by hardware on a cache-line granularity to
> improve bandwidth at the expense of latency for numa-challenged
> applications [and OSes, but not our problem ;-)].  When we boot Linux on
> such a config, all of the real nodes have no memory--it all resides in a
> single interleaved pseudo-node.
>
> When we boot Linux on a 100% CLM configuration [== NUMA], we still have
> the interleaved pseudo-node.  It contains a few hundred MB stolen from
> the real nodes to contain the DMA zone.  [Interleaved memory resides at
> phys addr 0].  The memoryless-nodes patches, along with the zoneorder
> patches, support this config as well.
>
> Also, when we boot a NUMA config with the "mem=" command line,
> specifying less memory than actually exists, Linux takes the excluded
> memory "off the top" rather than distributing it across the nodes.  This
> can result in memoryless nodes, as well.
>

This patch:

Preparation for memoryless node patches.

Provide a generic way to keep nodemasks describing various characteristics of
NUMA nodes.

Remove the node_online_map and the node_possible_map and realize the same
functionality using two node states: N_POSSIBLE and N_ONLINE.
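
The idea of one nodemask per node state can be sketched in userspace C as
below.  This is only an illustration under stated assumptions: the demo_*
names, the 64-node limit and the single-word bitmask are hypothetical and are
not the kernel's nodemask implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_NODES 64	/* assumption: all nodes fit in one 64-bit word */

/* Characteristics tracked per node; the two states named in the text. */
enum demo_node_state {
	DEMO_N_POSSIBLE,	/* node could come online at some point */
	DEMO_N_ONLINE,		/* node is online right now */
	DEMO_NR_NODE_STATES
};

/* One bitmask per state: bit n set means node n currently has that state. */
static unsigned long long demo_node_states[DEMO_NR_NODE_STATES];

static void demo_node_set_state(int node, enum demo_node_state state)
{
	assert(node >= 0 && node < DEMO_MAX_NODES);
	demo_node_states[state] |= 1ULL << node;
}

static void demo_node_clear_state(int node, enum demo_node_state state)
{
	assert(node >= 0 && node < DEMO_MAX_NODES);
	demo_node_states[state] &= ~(1ULL << node);
}

static bool demo_node_state(int node, enum demo_node_state state)
{
	assert(node >= 0 && node < DEMO_MAX_NODES);
	return (demo_node_states[state] >> node) & 1ULL;
}
```

Adding a further characteristic (for example "has memory") would then only
need a new enum entry rather than a new global map.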

[Lee.Schermerhorn@hp.com: Initialize N_*_MEMORY and N_CPU masks for non-NUMA config]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Nick Piggin
55144768e1 fs: remove some AOP_TRUNCATED_PAGE
prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and
GFS2 were converted to the new aops, so we can make some simplifications
for that.

[michal.k.k.piotrowski@gmail.com: fix warning]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Nick Piggin
03158cd7eb fs: restore nobh
Implement nobh in new aops.  This is a bit tricky.  FWIW, nobh_truncate is
now implemented in a way that does not create blocks in sparse regions,
which is a silly thing for it to have been doing (isn't it?)

ext2 survives fsx and fsstress. jfs is converted as well... ext3
should be easy to do (but not done yet).

[akpm@linux-foundation.org: coding-style fixes]
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Nick Piggin
a20fa20c54 With reiserfs no longer using the weird generic_cont_expand, remove it completely.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:56 -07:00
Nick Piggin
89e107877b fs: new cont helpers
Rework the generic block "cont" routines to handle the new aops.  Supporting
cont_prepare_write would take quite a lot of code, so remove it
instead (and we later convert all filesystems to use the new helpers).

write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
generic_cont_expand, so filesystems can avoid the old hacks they used.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:55 -07:00
Nick Piggin
afddba49d1 fs: introduce write_begin, write_end, and perform_write aops
These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

[mark.fasheh@oracle.com: API design contributions, code review and fixes]
[akpm@linux-foundation.org: various fixes]
[dmonakhov@sw.ru: new aop block_write_begin fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:55 -07:00
Nick Piggin
2f718ffc16 mm: buffered write iterator
Add an iterator data structure to operate over an iovec.  Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:55 -07:00
Nick Piggin
08291429cf mm: fix pagecache write deadlocks
Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock if page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:54 -07:00
Lee Schermerhorn
754af6f5a8 Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
Allow an application to query the memories allowed by its context.

Updated numa_memory_policy.txt to mention that applications can use this to
obtain allowed memories for constructing valid policies.

TODO:  update out-of-tree libnuma wrapper[s], or maybe add a new
wrapper--e.g.,  numa_get_mems_allowed() ?

Also, update numa syscall man pages.

Tested with memtoy V>=0.13.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:54 -07:00
Martin Schwidefsky
c92ff1bde0 move mm_struct and vm_area_struct
Move the definitions of struct mm_struct and struct vm_area_struct to
include/linux/mm_types.h.  This allows more functions in asm/pgtable.h
and friends to be defined with inline assemblies instead of macros.  Compile tested on
i386, powerpc, powerpc64, s390-32, s390-64 and x86_64.

[aurelien@aurel32.net: build fix]
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Nick Piggin
c0bc9875b7 radix-tree: use indirect bit
Rather than sign direct radix-tree pointers with a special bit, sign the
indirect one that hangs off the root.  This means that, given a lookup_slot
operation, the invalid result will be differentiated from the valid
(previously, valid results could have the bit either set or clear).

This does not affect slot lookups which occur under lock -- they can never
return an invalid result.  This is needed in future for lockless pagecache.
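
Marking the indirect pointer with a special bit can be sketched as plain
low-bit pointer tagging.  The demo_* names and the choice of bit 0 are
assumptions made for illustration; this is not the radix-tree code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_INDIRECT_BIT ((uintptr_t)1)	/* assumption: bit 0 is the tag */

/* Tag a pointer to an internal node so it cannot be mistaken for an item. */
static void *demo_ptr_to_indirect(void *node)
{
	/* Needs at least 2-byte alignment so that bit 0 is free. */
	assert(((uintptr_t)node & DEMO_INDIRECT_BIT) == 0);
	return (void *)((uintptr_t)node | DEMO_INDIRECT_BIT);
}

/* True if the slot holds a tagged (indirect) pointer rather than an item. */
static bool demo_is_indirect(const void *slot)
{
	return ((uintptr_t)slot & DEMO_INDIRECT_BIT) != 0;
}

/* Strip the tag to recover the real node address. */
static void *demo_indirect_to_ptr(void *slot)
{
	return (void *)((uintptr_t)slot & ~DEMO_INDIRECT_BIT);
}
```

With this scheme a direct item pointer always has the bit clear, so a reader
that sees the bit set knows it did not get a valid item.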

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Nick Piggin
557ed1fa26 remove ZERO_PAGE
The commit b5810039a5 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- measuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
much use to them except on benchmarks.  All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Christoph Lameter
aadb4bc4a1 SLUB: direct pass through of page size or higher kmalloc requests
This gets rid of all kmalloc caches larger than page size.  A kmalloc
request larger than PAGE_SIZE / 2 is going to be passed through to the page
allocator.  This works both inline where we will call __get_free_pages
instead of kmem_cache_alloc and in __kmalloc.

kfree is modified to check if the object is in a slab page. If not then
the page is freed via the page allocator instead. Roughly similar to what
SLOB does.
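
A rough sketch of the size decision follows.  It assumes 4 KiB pages and a
pass-through threshold of half a page; both are assumptions for the example,
and the demo_* helpers are hypothetical stand-ins, not the SLUB functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Requests above half a page skip the kmalloc caches (assumed threshold). */
static bool demo_goes_to_page_allocator(size_t size)
{
	return size > DEMO_PAGE_SIZE / 2;
}

/* Smallest order such that (DEMO_PAGE_SIZE << order) covers size. */
static unsigned int demo_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = DEMO_PAGE_SIZE;

	assert(size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

For instance, a 5000 byte request would bypass the caches and be served as
an order-1 (two page) allocation under these assumptions.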

Advantages:
- Reduces memory overhead for kmalloc array
- Large kmalloc operations are faster since they do not
  need to pass through the slab allocator to get to the
  page allocator.
- Performance increase of 10%-20% on alloc and 50% on free for
  PAGE_SIZEd allocations.
  SLUB must call the page allocator for each alloc anyway, since
  the higher order pages that allowed avoiding the page allocator calls
  are not reliably available anymore. So we are basically removing
  useless slab allocator overhead.
- Large kmallocs yield page aligned objects, which is what
  SLAB did. Bad things like using page sized kmalloc allocations to
  stand in for page allocator allocs can be transparently handled and are not
  distinguishable from page allocator uses.
- Checking for too large objects can be removed since
  it is done by the page allocator.

Drawbacks:
- No accounting for large kmalloc slab allocations anymore
- No debugging of large kmalloc slab allocations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Fengguang Wu
57f6b96c09 filemap: convert some unsigned long to pgoff_t
Convert some 'unsigned long' to pgoff_t.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Fengguang Wu
535443f515 readahead: remove several readahead macros
Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu
6df8ba4f8a radixtree: introduce radix_tree_next_hole()
Introduce radix_tree_next_hole(root, index, max_scan) to scan radix tree for
the first hole.  It will be used in interleaved readahead.

The implementation is dumb and obviously correct.  It can help debug (and
document) the possible smart one in future.
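
The "dumb" linear scan might look roughly like this.  A bool array stands in
for the radix tree, and demo_next_hole() with its parameters is a
hypothetical helper written for this note, not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Stand-in for the tree: present[i] says whether index i holds an item.
 * Returns the first index at or after 'index' that has no item, or the
 * index reached after max_scan steps if no hole was found.
 */
static unsigned long demo_next_hole(const bool *present, size_t nslots,
				    unsigned long index,
				    unsigned long max_scan)
{
	unsigned long i;

	for (i = 0; i < max_scan; i++) {
		/* Anything beyond the array counts as a hole. */
		if (index >= nslots || !present[index])
			break;
		index++;
	}
	return index;
}
```

Note that when max_scan runs out the returned index may still be populated,
so a caller has to check how far the scan actually got.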

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu
f4e6b498d6 readahead: combine file_ra_state.prev_index/prev_offset into prev_pos
Combine the file_ra_state members
				unsigned long prev_index
				unsigned int prev_offset
into
				loff_t prev_pos

It is more consistent and better supports huge files.
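
As an illustration of the combined field, the packing and unpacking can be
sketched as below.  The 4 KiB page size, the int64_t stand-in for loff_t and
the demo_* helpers are assumptions for the example only.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)

/* Pack a page index and an in-page offset into one 64-bit byte position. */
static int64_t demo_make_pos(unsigned long index, unsigned int offset)
{
	assert(offset < DEMO_PAGE_SIZE);
	/* Widen before shifting so a 32-bit index cannot overflow. */
	return (int64_t)(((uint64_t)index << DEMO_PAGE_SHIFT) | offset);
}

/* Recover the page index from the byte position. */
static unsigned long demo_pos_index(int64_t pos)
{
	return (unsigned long)((uint64_t)pos >> DEMO_PAGE_SHIFT);
}

/* Recover the offset within the page from the byte position. */
static unsigned int demo_pos_offset(int64_t pos)
{
	return (unsigned int)((uint64_t)pos & (DEMO_PAGE_SIZE - 1));
}
```

The cast before the shift is the kind of detail a shift overflow fix has to
get right: shifting a 32-bit index first would lose the high bits.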

Thanks to Peter for the nice proposal!

[akpm@linux-foundation.org: fix shift overflow]
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu
0bb7ba6b9c readahead: mmap read-around simplification
Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss and make it an int.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Fengguang Wu
937085aa35 readahead: compacting file_ra_state
Use 'unsigned int' instead of 'unsigned long' for readahead sizes.

This helps reduce memory consumption on 64bit CPU when a lot of files are
opened.

CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Jesper Juhl
39e91e4331 Clean up duplicate includes in include/linux/memory_hotplug.h
This patch cleans up duplicate includes in
	include/linux/memory_hotplug.h

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:52 -07:00
Andy Whitcroft
29c71111d0 vmemmap: generify initialisation via helpers
Convert the common vmemmap population into initialisation helpers for use by
architecture vmemmap populators.  All architectures implementing the
SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate()
initialiser, which may make use of the helpers.

This allows us to clean up and remove the initialisation Kconfig entries.
With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
indicate use of that variant.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Christoph Lameter
8f6aac419b Generic Virtual Memmap support for SPARSEMEM
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
the arches.  It would be great if it could be the default so that we can get
rid of various forms of DISCONTIG and other variations on memory maps.  So far
what has hindered this are the additional lookups that SPARSEMEM introduces
for virt_to_page and page_address.  This goes so far that the code to do this
has to be kept in a separate function and cannot be used inline.

This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
is mapped into a virtually contiguous area and only the active sections are
physically backed.  This allows virt_to_page, page_address and cohorts to become
simple shift/add operations.  No page flag fields, no table lookups, nothing
involving memory is required.

The two key operations pfn_to_page and page_to_pfn become:

   #define __pfn_to_page(pfn)      (vmemmap + (pfn))
   #define __page_to_pfn(page)     ((page) - vmemmap)
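
A userspace sketch of the same pointer arithmetic is given here for
illustration.  A static array stands in for the virtually mapped memmap, and
struct demo_page and the demo_* helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_PAGES 1024	/* assumption: size of the toy memmap */

struct demo_page {
	unsigned long flags;
};

/*
 * Stand-in for the virtually contiguous memmap.  In the scheme described
 * above the base would be a fixed virtual address rather than an array.
 */
static struct demo_page demo_memmap[DEMO_NR_PAGES];
#define demo_vmemmap (demo_memmap)

/* pfn -> struct page: one add, no table lookup. */
static struct demo_page *demo_pfn_to_page(unsigned long pfn)
{
	assert(pfn < DEMO_NR_PAGES);
	return demo_vmemmap + pfn;
}

/* struct page -> pfn: one pointer subtraction. */
static unsigned long demo_page_to_pfn(const struct demo_page *page)
{
	return (unsigned long)(page - demo_vmemmap);
}
```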

By having a virtual mapping for the memmap we allow simple access without
wasting physical memory.  As kernel memory is typically already mapped 1:1
this introduces no additional overhead.  The virtual mapping must be big
enough to allow a struct page to be allocated and mapped for all valid
physical pages.  This will make a virtual memmap difficult to use on 32 bit
platforms that support 36 address bits.

However, if there is enough virtual space available and the arch already maps
its 1-1 kernel space using TLBs (f.e.  true of IA64 and x86_64) then this
technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
FLATMEM needs to read the contents of the mem_map variable to get the start of
the memmap and then add the offset to the required entry.  vmemmap is a
constant to which we can simply add the offset.

This patch has the potential to allow us to make SPARSEMEM the default (and
even the only) option for most systems.  It should be optimal on UP, SMP and
NUMA on most platforms.  Then we may even be able to remove the other memory
models: FLATMEM, DISCONTIG etc.

[apw@shadowen.org: config cleanups, resplit code etc]
[kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
[apw@shadowen.org: vmemmap: remove excess debugging]
[apw@shadowen.org: simplify initialisation code and reduce duplication]
[apw@shadowen.org: pull out the vmemmap code into its own file]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Andy Whitcroft
540557b943 sparsemem: record when a section has a valid mem_map
We have flags to indicate whether a section actually has a valid mem_map
associated with it.  This is never set and we rely solely on the present bit
to indicate a section is valid.  By definition a section is not valid if it
has no mem_map and there is a window during init where the present bit is set
but there is no mem_map, during which pfn_valid() will return true
incorrectly.

Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid
mem_map.  Switch valid_section{,_nr} and pfn_valid() to this bit.  Add new
present_section{,_nr} and pfn_present() interfaces for those users who care to
know that a section is going to be valid.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Guennadi Liakhovetski
b3b708fa27 wake up from a serial port
Enable wakeup from serial ports, make it run-time configurable over sysfs,
e.g.,

echo enabled > /sys/devices/platform/serial8250.0/tty/ttyS0/power/wakeup

Requires

# CONFIG_SYSFS_DEPRECATED is not set

Following suggestions from Alan and Russell moved the may_wake_up checks
to serial_core.c. This time actually tested - it does even work. Could
someone, please, verify, that put_device after device_find_child is
correct?

Also would be nice to test with a Natsemi UART, that can wake up the system,
if such systems exist.

For this you just have to apply the patch below, issue the above "echo"
command to one of your Natsemi port, suspend and resume your system, and
verify that your Natsemi port still works.  If you are actually capable of
waking up the system from that port, would be nice to test that as well.

Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Guennadi Liakhovetski
aa5346a212 provide stubs for enable_irq_wake() and disable_irq_wake()
Provide {enable,disable}_irq_wakeup dummies for undefined
cross-compilers for platforms without CONFIG_GENERIC_IRQ.

Needed by wake-up-from-a-serial-port.patch

Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Alan Cox
bf0df636e5 8250_pci: Autodetect mainpine cards
Add support for a whole range of boards. Some are partly autodetected but
not fully correctly, others (PCI Express notably) not at all. Stick all
the right entries in.

Thanks to Mainpine for information and testing.

Signed-off-by: Alan Cox <alan@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
James Bottomley
32e8f70230 introduce DMA_MASK_NONE as a signal for unable to do DMA
Some devices are incapable of DMA and need to be recognised as such.
Introduce a NONE dma mask to facilitate this plus an inline function:
is_device_dma_capable() to check this.
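
The check could be sketched as follows.  The zero value chosen for the NONE
mask, struct demo_device and demo_is_dma_capable() are assumptions made for
this illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DMA_MASK_NONE 0x0ULL	/* assumed value: no addressable bits */

struct demo_device {
	uint64_t *dma_mask;	/* NULL when no mask was ever set up */
};

/* A device can do DMA only if it has a mask and the mask is not NONE. */
static bool demo_is_dma_capable(const struct demo_device *dev)
{
	return dev->dma_mask != NULL && *dev->dma_mask != DEMO_DMA_MASK_NONE;
}
```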

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Natalie Protasevich <protasnb@gmail.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Ralf Baechle
0322a2b840 Add assembler equivalents to __init{,data}_refok
I need __INIT_REFOK to fix a MODPOST warning for a few MIPS configs which
have to call init code from .text very early in the game due to bootloader
issues.  __INITDATA_REFOK is just for consistency.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:49 -07:00
Randy Dunlap
bfe8df3d31 slow down printk during boot
Optionally add a boot delay after each kernel printk() call, crudely
measured in milliseconds, with a maximum delay of 10 seconds per printk.

Enable CONFIG_BOOT_PRINTK_DELAY=y and then add (e.g.):
"lpj=loops_per_jiffy boot_delay=100"
to the kernel command line.

It has been useful in cases like "during boot, my machine just reboots or the
screen goes black": by slowing down printk (and adding initcall_debug), we can
usually see the last thing that happened before the lights went out, which is
usually a valuable clue.

[akpm@linux-foundation.org: not all architectures implement CONFIG_HZ]
[akpm@linux-foundation.org: fix lots of stuff]
[bunk@stusta.de: kernel/printk.c: make 2 variables static]
[heiko.carstens@de.ibm.com: fix slow down printk on boot compile error]
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:49 -07:00
Timur Tabi
b0c813ceee [ALSA] ASoC CS4270 codec device driver
This patch adds ALSA SoC support for the Cirrus Logic CS4270 codec.  The
following features are supported:
1) Stand-alone and software mode
2) Software mode via I2C only
3) Master mode, not Slave
4) No power management

Signed-off-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jaroslav Kysela <perex@suse.cz>
2007-10-16 15:58:19 +02:00
Hans-Christian Egtvedt
eafe570847 [ALSA] ALSA sound driver for the AT73C213 DAC using Atmel SSC driver
This patch adds support for the AT73C213 DAC using the misc Atmel SSC driver in
I2S mode. The driver also requires a SPI to setup the registers and control
volume.
It has been tested with an AT32AP7000 on the ATSTK1000 development board. The
driver should also work with any Atmel device with an SSC module supported by
the Atmel SSC driver (atmel-ssc).
The atmel-ssc driver is just submitted to the Linux kernel. Please see mail
thread http://lkml.org/lkml/2007/7/16/32

Signed-off-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jaroslav Kysela <perex@suse.cz>
2007-10-16 15:57:50 +02:00
Jens Axboe
55c16a7004 IDE: sg chaining support
Acked-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:21:00 +02:00
Jens Axboe
ba2da2f8d6 i2o: sg chaining support
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:21:00 +02:00
Jens Axboe
8726021626 libata: convert to using sg helpers
This converts libata to using the sg helpers for looking up sg
elements, instead of doing it manually.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:14:12 +02:00
Jens Axboe
70eb8040dc Add chained sg support to linux/scatterlist.h
The core of the patch - allow the last sg element in a scatterlist
table to point to the start of a new table. We overload the LSB of
the page pointer to indicate whether this is a valid sg entry, or
merely a link to the next list.
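
The chaining trick can be sketched in userspace C as below.  struct demo_sg
and the demo_sg_*() helpers are hypothetical simplifications for this note
(a plain data pointer replaces the page pointer); they are not the
scatterlist API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_sg {
	uintptr_t page_link;	/* data pointer, or next table with bit 0 set */
	unsigned int length;
};

/* Fill a normal entry; the pointer must leave bit 0 free. */
static void demo_sg_set_data(struct demo_sg *sg, void *data, unsigned int len)
{
	assert(((uintptr_t)data & 1) == 0);
	sg->page_link = (uintptr_t)data;
	sg->length = len;
}

/* Turn the last slot of a table into a link to the next table. */
static void demo_sg_chain(struct demo_sg *last, struct demo_sg *next_table)
{
	last->page_link = (uintptr_t)next_table | 1;
	last->length = 0;
}

/* Step to the next entry, following a link slot if one is hit. */
static struct demo_sg *demo_sg_next(struct demo_sg *sg)
{
	sg++;
	if (sg->page_link & 1)
		sg = (struct demo_sg *)(sg->page_link & ~(uintptr_t)1);
	return sg;
}

/* Sum the lengths of nents data entries, crossing table boundaries. */
static unsigned int demo_sg_total(struct demo_sg *sg, unsigned int nents)
{
	unsigned int total = 0;

	while (nents--) {
		total += sg->length;
		if (nents)	/* only step when another entry exists */
			sg = demo_sg_next(sg);
	}
	return total;
}
```

Code that walks the list through a helper like demo_sg_next() never needs to
know where one table ends and the next begins.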

Includes a fix from Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
correcting the ifdef ARCH_HAS_SG_CHAIN guarding sg_last().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:08:51 +02:00
Jens Axboe
96b418c960 Add sg helpers for iterating over a scatterlist table
First step to being able to change the scatterlist setup without
having to modify drivers (a lot :-)

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:07:10 +02:00
Adrian Bunk
bb879463b5 remove ide_get_error_location()
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:05:06 +02:00
Jens Axboe
fd5d806266 block: convert blkdev_issue_flush() to use empty barriers
Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:05:02 +02:00
Jens Axboe
bf2de6f5a4 block: Initial support for data-less (or empty) barrier support
This implements functionality to pass down or insert a barrier
in a queue, without having data attached to it. The ->prepare_flush_fn()
infrastructure from data barriers is reused to provide this
functionality.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:56 +02:00
Jens Axboe
a0cd128542 block: add end_queued_request() and end_dequeued_request() helpers
We can use this helper in the elevator core for BLKPREP_KILL, and it'll
also be useful for the empty barrier patch.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:53 +02:00
Randy Dunlap
e6716b87d5 docbook: fix filesystems content
Fix filesystems docbook warnings.

Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'name'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'mode'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'parent'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'value'
Warning(linux-2.6.23-git8//include/linux/jbd.h:404): No description found for parameter 'h_lockdep_map'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-15 17:56:36 -07:00
Randy Dunlap
fd39c86b3d docbook: fix usb content
Fix USB docbook warnings.

Warning(linux-2.6.23-git8//include/linux/usb/gadget.h:487): No description found for parameter 'g'
Warning(linux-2.6.23-git8//include/linux/usb/gadget.h:506): No description found for parameter 'g'

Warning(linux-2.6.23-git8//drivers/usb/core/hub.c:1416): No description found for parameter 'usb_dev'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-15 17:56:36 -07:00
Linus Torvalds
65a6ec0d72 Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (95 commits)
  [ARM] 4578/1: CM-x270: PCMCIA support
  [ARM] 4577/1: ITE 8152 PCI bridge support
  [ARM] 4576/1: CM-X270 machine support
  [ARM] pxa: Avoid pxa_gpio_mode() in gpio_direction_{in,out}put()
  [ARM] pxa: move pxa_set_mode() from pxa2xx_mainstone.c to mainstone.c
  [ARM] pxa: move pxa_set_mode() from pxa2xx_lubbock.c to lubbock.c
  [ARM] pxa: Make cpu_is_pxaXXX dependent on configuration symbols
  [ARM] pxa: PXA3xx base support
  [NET] smc91x: fix PXA DMA support code
  [SERIAL] Fix console initialisation ordering
  [ARM] pxa: tidy up arch/arm/mach-pxa/Makefile
  [ARM] Update arch/arm/Kconfig for drivers/Kconfig changes
  [ARM] 4600/1: fix kernel build failure with build-id-supporting binutils
  [ARM] 4599/1: Preserve ATAG list for use with kexec (2.6.23)
  [ARM] Rename consistent_sync() as dma_cache_maint()
  [ARM] 4572/1: ep93xx: add cirrus logic edb9307 support
  [ARM] 4596/1: S3C2412: Correct IRQs for SDI+CF and add decoding support
  [ARM] 4595/1: ns9xxx: define registers as void __iomem * instead of volatile u32
  [ARM] 4594/1: ns9xxx: use the new gpio functions
  [ARM] 4593/1: ns9xxx: implement generic clockevents
  ...
2007-10-15 16:08:50 -07:00
Linus Torvalds
541010e4b8 Merge branch 'locks' of git://linux-nfs.org/~bfields/linux
* 'locks' of git://linux-nfs.org/~bfields/linux:
  nfsd: remove IS_ISMNDLCK macro
  Rework /proc/locks via seq_files and seq_list helpers
  fs/locks.c: use list_for_each_entry() instead of list_for_each()
  NFS: clean up explicit check for mandatory locks
  AFS: clean up explicit check for mandatory locks
  9PFS: clean up explicit check for mandatory locks
  GFS2: clean up explicit check for mandatory locks
  Cleanup macros for distinguishing mandatory locks
  Documentation: move locks.txt in filesystems/
  locks: add warning about mandatory locking races
  Documentation: move mandatory locking documentation to filesystems/
  locks: Fix potential OOPS in generic_setlease()
  Use list_first_entry in locks_wake_up_blocks
  locks: fix flock_lock_file() comment
  Memory shortage can result in inconsistent flocks state
  locks: kill redundant local variable
  locks: reverse order of posix_locks_conflict() arguments
2007-10-15 16:07:40 -07:00
Linus Torvalds
a52cefc80f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (42 commits)
  [IPV6]: Consolidate the ip6_pol_route_(input|output) pair
  [TCP]: Make snd_cwnd_cnt 32-bit
  [TCP]: Update the /proc/net/tcp documentation
  [NETNS]: Don't panic on creating the namespace's loopback
  [NEIGH]: Ensure that pneigh_lookup is protected with RTNL
  [INET]: kmalloc+memset -> kzalloc in frag_alloc_queue
  [ISDN]: Fix compile with CONFIG_ISDN_X25 disabled.
  [IPV6]: Replace sk_buff ** with sk_buff * in input handlers
  [SELINUX]: Update for netfilter ->hook() arg changes.
  [INET]: Consolidate the xxx_put
  [INET]: Small cleanup for xxx_put after evictor consolidation
  [INET]: Consolidate the xxx_evictor
  [INET]: Consolidate the xxx_frag_destroy
  [INET]: Consolidate xxx_the secret_rebuild
  [INET]: Consolidate the xxx_frag_kill
  [INET]: Collect common frag sysctl variables together
  [INET]: Collect frag queues management objects together
  [INET]: Move common fields from frag_queues in one place.
  [TG3]: Fix performance regression on 5705.
  [ISDN]: Remove local copy of device name to make sure renames work.
  ...
2007-10-15 14:06:58 -07:00
Linus Torvalds
f2e1d89f9b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (40 commits)
  Input: use full RCU API
  Input: remove tsdev interface
  Input: add support for Blackfin BF54x Keypad controller
  Input: appletouch - another fix for idle reset logic
  HWMON: hdaps - switch to using input-polldev
  Input: add support for SEGA Dreamcast keyboard
  Input: omap-keyboard - don't pretend we support changing keymap
  Input: lifebook - fix X and Y axis range
  Input: usbtouchscreen - add support for GeneralTouch devices
  Input: fix open count handling in input interfaces
  Input: keyboard - add CapsShift lock
  Input: adbhid - produce all CapsLock key events
  Input: ALPS - add signature for ThinkPad R61
  Input: jornada720_kbd - send MSC_SCAN events
  Input: add support for the HP Jornada 7xx (710/720/728) touchscreen
  Input: add support for HP Jornada 7xx onboard keyboard
  Input: add support for HP Jornada onboard keyboard (HP6XX)
  Input: ucb1400_ts - use schedule_timeout_uninterruptible
  Input: xpad - fix dependancy on LEDS class
  Input: auto-select INPUT for MAC_EMUMOUSEBTN option
  ...

Resolved conflicts manually in drivers/hwmon/applesmc.c: converting from
a class device to a device and converting to use input-polldev created a
few apparently trivial clashes.
2007-10-15 13:41:39 -07:00
Ilpo Järvinen
f78a1b3892 [TCP]: Make snd_cwnd_cnt 32-bit
Very little point of having 32-bit snd_cwnd if this is not
32-bit as well, as a number of snd_cwnd incrementation formulas
assume that snd_cwnd_cnt can be at least as large as snd_cwnd.

Whether 32-bit is useful was discussed when e0ef57cc56
was made:
  http://marc.info/?l=linux-netdev&m=117218144409825&w=2
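
A rough Python sketch of why the counter width matters, assuming the usual
additive-increase pattern in which snd_cwnd_cnt is bumped once per ACK and
compared against snd_cwnd. The function name, the bit masking used to emulate
a fixed-width field, and the max_acks cutoff are illustrative assumptions, not
kernel code.

```python
def acks_until_cwnd_grows(snd_cwnd, counter_bits, max_acks=200_000):
    """Count ACKs until snd_cwnd_cnt reaches snd_cwnd, or return None.

    Hypothetical model: the counter is incremented once per ACK and is
    truncated to counter_bits, emulating a fixed-width unsigned field.
    """
    mask = (1 << counter_bits) - 1
    snd_cwnd_cnt = 0
    for acks in range(1, max_acks + 1):
        snd_cwnd_cnt = (snd_cwnd_cnt + 1) & mask
        if snd_cwnd_cnt >= snd_cwnd:
            return acks
    return None  # the counter wrapped before it could reach snd_cwnd


window = 70_000  # larger than a 16-bit counter can represent
print(acks_until_cwnd_grows(window, 16))
print(acks_until_cwnd_grows(window, 32))
```

In this model a 16-bit counter wraps at 65536 and never reaches a window of
70000, while a 32-bit counter gets there after one ACK per window segment.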

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:59:43 -07:00
Karsten Keil
faca94ffae [ISDN]: Remove local copy of device name to make sure renames work.
Signed-off-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:37 -07:00
Herbert Xu
3db05fea51 [NETFILTER]: Replace sk_buff ** with sk_buff *
With all the users of the double pointers removed, this patch mops up by
finally replacing all occurrences of sk_buff ** in the netfilter API by
sk_buff *.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:29 -07:00
Herbert Xu
37d4187922 [NETFILTER]: Do not copy skb in skb_make_writable
Now that all callers of netfilter can guarantee that the skb is not shared,
we no longer have to copy the skb in skb_make_writable.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:27 -07:00
Herbert Xu
e0053ec07e [SKBUFF]: Add skb_morph
This patch creates a new function skb_morph that's just like skb_clone
except that it lets user provide the spare skb that will be overwritten
by the one that's to be cloned.

This will be used by IP fragment reassembly so that we get back the same
skb that went in last (rather than the head skb that we get now which
requires us to carry around double pointers all over the place).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:24 -07:00
Brice Goglin
eabd7e35c0 Add skb_is_gso_v6
Add skb_is_gso_v6().

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-10-15 14:24:07 -04:00
Mike Rapoport
a8fc078955 [ARM] 4577/1: ITE 8152 PCI bridge support
This patch provides driver for ITE 8152 PCI bridge.

Signed-off-by: Mike Rapoport <mike@compulab.co.il>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-10-15 18:53:59 +01:00
Linus Torvalds
f4921aff5b Merge git://git.linux-nfs.org/pub/linux/nfs-2.6
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (131 commits)
  NFSv4: Fix a typo in nfs_inode_reclaim_delegation
  NFS: Add a boot parameter to disable 64 bit inode numbers
  NFS: nfs_refresh_inode should clear cache_validity flags on success
  NFS: Fix a connectathon regression in NFSv3 and NFSv4
  NFS: Use nfs_refresh_inode() in ops that aren't expected to change the inode
  SUNRPC: Don't call xprt_release in call refresh
  SUNRPC: Don't call xprt_release() if call_allocate fails
  SUNRPC: Fix buggy UDP transmission
  [23/37] Clean up duplicate includes in
  [2.6 patch] net/sunrpc/rpcb_clnt.c: make struct rpcb_program static
  SUNRPC: Use correct type in buffer length calculations
  SUNRPC: Fix default hostname created in rpc_create()
  nfs: add server port to rpc_pipe info file
  NFS: Get rid of some obsolete macros
  NFS: Simplify filehandle revalidation
  NFS: Ensure that nfs_link() returns a hashed dentry
  NFS: Be strict about dentry revalidation when doing exclusive create
  NFS: Don't zap the readdir caches upon error
  NFS: Remove the redundant nfs_reval_fsid()
  NFSv3: Always use directory post-op attributes in nfs3_proc_lookup
  ...

Fix up trivial conflict due to sock_owned_by_user() cleanup manually in
net/sunrpc/xprtsock.c
2007-10-15 10:47:35 -07:00
Linus Torvalds
419217cb1d Merge branch 'v2.6.24-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep
* 'v2.6.24-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep:
  lockdep: annotate dir vs file i_mutex
  lockdep: per filesystem inode lock class
  lockdep: annotate kprobes irq fiddling
  lockdep: annotate rcu_read_{,un}lock{,_bh}
  lockdep: annotate journal_start()
  lockdep: s390: connect the sysexit hook
  lockdep: x86_64: connect the sysexit hook
  lockdep: i386: connect the sysexit hook
  lockdep: syscall exit check
  lockdep: fixup mutex annotations
  lockdep: fix mismatched lockdep_depth/curr_chain_hash
  lockdep: Avoid /proc/lockdep & lock_stat infinite output
  lockdep: maintainers
2007-10-15 10:40:41 -07:00
Linus Torvalds
b5869ce7f6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: (140 commits)
  sched: sync wakeups preempt too
  sched: affine sync wakeups
  sched: guest CPU accounting: maintain guest state in KVM
  sched: guest CPU accounting: maintain stats in account_system_time()
  sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields
  sched: guest CPU accounting: add guest-CPU /proc/stat field
  sched: domain sysctl fixes: add terminator comment
  sched: domain sysctl fixes: do not crash on allocation failure
  sched: domain sysctl fixes: unregister the sysctl table before domains
  sched: domain sysctl fixes: use for_each_online_cpu()
  sched: domain sysctl fixes: use kcalloc()
  Make scheduler debug file operations const
  sched: enable wake-idle on CONFIG_SCHED_MC=y
  sched: reintroduce topology.h tunings
  sched: allow the immediate migration of cache-cold tasks
  sched: debug, improve migration statistics
  sched: debug: increase width of debug line
  sched: activate task_hot() only on fair-scheduled tasks
  sched: reintroduce cache-hot affinity
  sched: speed up context-switches a bit
  ...
2007-10-15 08:22:16 -07:00
Linus Torvalds
df3d80f5a5 Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (207 commits)
  [SCSI] gdth: fix CONFIG_ISA build failure
  [SCSI] esp_scsi: remove __dev{init,exit}
  [SCSI] gdth: !use_sg cleanup and use of scsi accessors
  [SCSI] gdth: Move members from SCp to gdth_cmndinfo, stage 2
  [SCSI] gdth: Setup proper per-command private data
  [SCSI] gdth: Remove gdth_ctr_tab[]
  [SCSI] gdth: switch to modern scsi host registration
  [SCSI] gdth: gdth_interrupt() gdth_get_status() & gdth_wait() fixes
  [SCSI] gdth: clean up host private data
  [SCSI] gdth: Remove virt hosts
  [SCSI] gdth: Reorder scsi_host_template intitializers
  [SCSI] gdth: kill gdth_{read,write}[bwl] wrappers
  [SCSI] gdth: Remove 2.4.x support, in-kernel changelog
  [SCSI] gdth: split out pci probing
  [SCSI] gdth: split out eisa probing
  [SCSI] gdth: split out isa probing
  gdth: Make one abuse of scsi_cmnd less obvious
  [SCSI] NCR5380: Use scsi_eh API for REQUEST_SENSE invocation
  [SCSI] usb storage: use scsi_eh API in REQUEST_SENSE execution
  [SCSI] scsi_error: Refactoring scsi_error to facilitate in synchronous REQUEST_SENSE
  ...
2007-10-15 08:19:33 -07:00
Linus Torvalds
37ca506adc Merge branch 'nfs-server-stable' of git://linux-nfs.org/~bfields/linux
* 'nfs-server-stable' of git://linux-nfs.org/~bfields/linux:
  knfsd: query filesystem for NFSv4 getattr of FATTR4_MAXNAME
  knfsd: nfsv4 delegation recall should take reference on client
  knfsd: don't shutdown callbacks until nfsv4 client is freed
  knfsd: let nfsd manage timing out its own leases
  knfsd: Add source address to sunrpc svc errors
  knfsd: 64 bit ino support for NFS server
  svcgss: move init code into separate function
  knfsd: remove code duplication in nfsd4_setclientid()
  nfsd warning fix
  knfsd: fix callback rpc cred
  knfsd: move nfsv4 slab creation/destruction to module init/exit
  knfsd: spawn kernel thread to probe callback channel
  knfsd: nfs4 name->id mapping not correctly parsing negative downcall
  knfsd: demote some printk()s to dprintk()s
  knfsd: cleanup of nfsd4 cmp_* functions
  knfsd: delete code made redundant by map_new_errors
  nfsd: fix horrible indentation in nfsd_setattr
  nfsd: remove unused cache_for_each macro
  nfsd: tone down inaccurate dprintk
2007-10-15 08:16:53 -07:00
Jiri Kosina
57d292bd7e HID: fix HIDIOCGRDESC memory access in hidraw
Fix bogus copying of data into userspace when HIDIOCGRDESC is issued.
HID-transport layer makes sure that dev->hid->rdesc is not larger than
HID_MAX_DESCRIPTOR_SIZE.

Noticed-by: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-15 08:12:00 -07:00
Laurent Vivier
94886b84b1 sched: guest CPU accounting: maintain stats in account_system_time()
modify account_system_time() to add cputime to cpustat->guest if we are
running a VCPU. We add this cputime to cpustat->user instead of
cpustat->system because this part of KVM code is in fact user code
although it is executed in the kernel. We duplicate VCPU time between
guest and user to allow an unmodified "top(1)" to display correct value.
A modified "top(1)" is able to display good cpu user time and cpu guest
time by subtracting cpu guest time from cpu user time. Update "gtime" in
task_struct accordingly.
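
The subtraction a modified "top(1)" would perform can be sketched in Python as
follows. The field order of the aggregate "cpu" line (with "guest" as the
ninth column) is an assumption about the /proc/stat layout being parsed, the
helper names are hypothetical, and the sample numbers are made up.

```python
# Assumed column order of the aggregate "cpu" line in /proc/stat once a
# guest column is present.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest")


def parse_cpu_line(line):
    """Map the numeric columns of a 'cpu ...' line to field names."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def split_user_and_guest(stats):
    """Return (user time excluding guest, guest time)."""
    guest = stats.get("guest", 0)  # absent when no guest column is exported
    return stats["user"] - guest, guest


# Made-up sample values, not taken from a real machine.
sample = "cpu 5000 10 800 90000 120 5 7 0 1500"
print(split_user_and_guest(parse_cpu_line(sample)))
```

Because guest time is counted in both columns, an unmodified reader that only
looks at "user" still sees the full figure.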

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Laurent Vivier
9ac52315d4 sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields
like for cpustat, introduce the "gtime" (guest time of the task) and
"cgtime" (guest time of the task children) fields for the
tasks. Modify signal_struct and task_struct.

Modify /proc/<pid>/stat to display these new fields.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Laurent Vivier
5e84cfde51 sched: guest CPU accounting: add guest-CPU /proc/stat field
as recent CPUs introduce a third running state, after "user" and
"system", we need a new field, "guest", in cpustat to store the time
used by the CPU to run virtual CPU. Modify /proc/stat to display this
new field.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Ingo Molnar
7a6c6bcee0 sched: enable wake-idle on CONFIG_SCHED_MC=y
most multicore CPUs today have shared L2 caches, so tune things so
that the spreading amongst cores is more aggressive.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Ingo Molnar
95dbb421d1 sched: reintroduce topology.h tunings
reintroduce the 2.6.22 topology.h tunings again - they result in
slightly better balancing.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Ingo Molnar
cc367732ff sched: debug, improve migration statistics
add new migration statistics when SCHED_DEBUG and SCHEDSTATS
is enabled. Available in /proc/<PID>/sched.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:18 +02:00
Ingo Molnar
da84d96176 sched: reintroduce cache-hot affinity
reintroduce a simplified version of cache-hot/cold scheduling
affinity. This improves performance with certain SMP workloads,
such as sysbench.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:18 +02:00
Mike Galbraith
95938a35c5 sched: prevent wakeup over-scheduling
Prevent wakeup over-scheduling.  Once a task has been preempted by a
task of the same or lower priority, it becomes ineligible for repeated
preemption by same until it has been ticked, or slept.  Instead, the
task is marked for preemption at the next tick.  Tasks of higher
priority still preempt immediately.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Dhaval Giani
5cb350baf5 sched: group scheduling, sysfs tunables
Add tunables in sysfs to modify a user's cpu share.

A directory is created in sysfs for each new user in the system.

	/sys/kernel/uids/<uid>/cpu_share

Reading this file returns the cpu shares granted for the user.
Writing into this file modifies the cpu share for the user. Only an
administrator is allowed to modify a user's cpu share.

Ex:
	# cd /sys/kernel/uids/
	# cat 512/cpu_share
	1024
	# echo 2048 > 512/cpu_share
	# cat 512/cpu_share
	2048
	#

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Ingo Molnar
4cf86d77f5 sched: cleanup: rename task_grp to task_group
cleanup: rename task_grp to task_group. No need to save two characters
and 'grp' is annoying to read.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Mike Galbraith
af92723262 sched: cleanup, remove the TASK_NONINTERACTIVE flag
Here's another piece of low hanging obsolete fruit.

Remove obsolete TASK_NONINTERACTIVE.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:13 +02:00
Ingo Molnar
5522d5d5f7 sched: mark scheduling classes as const
mark scheduling classes as const. This speeds up the code
a bit and shrinks it:

   text    data     bss     dec     hex filename
  40027    4018     292   44337    ad31 sched.o.before
  40190    3842     292   44324    ad24 sched.o.after
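
A small Python check of the table above, for readers who want to see where the
saving comes from. It relies only on the numbers quoted in this message; the
parser is an illustrative helper, not part of any kernel tooling.

```python
def parse_size_row(row):
    """Parse one 'text data bss dec hex filename' row as printed by size(1)."""
    text, data, bss, dec, hex_, name = row.split()
    return {"text": int(text), "data": int(data), "bss": int(bss),
            "dec": int(dec), "hex": int(hex_, 16), "name": name}


before = parse_size_row("40027    4018     292   44337    ad31 sched.o.before")
after = parse_size_row("40190    3842     292   44324    ad24 sched.o.after")

for row in (before, after):
    # dec is the sum of the three sections; hex is the same total in base 16
    assert row["text"] + row["data"] + row["bss"] == row["dec"] == row["hex"]

delta = {k: after[k] - before[k] for k in ("text", "data", "bss", "dec")}
print(delta)
```

The per-column deltas show text growing by 163 bytes and data shrinking by
176, for a net reduction of 13 bytes.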

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:12 +02:00
Peter Zijlstra
5f6d858ecc sched: speed up and simplify vslice calculations
speed up and simplify vslice calculations.

[ From: Mike Galbraith <efault@gmx.de>: build fix ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:12 +02:00
Ingo Molnar
2d72376b3a sched: clean up schedstats, cnt -> count
rename all 'cnt' fields and variables to the less yucky 'count' name.

yuckage noticed by Andrew Morton.

no change in code, other than the /proc/sched_debug bkl_count string got
a bit larger:

   text    data     bss     dec     hex filename
  38236    3506      24   41766    a326 sched.o.before
  38240    3506      24   41770    a32a sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:12 +02:00
Ingo Molnar
94359f05cb sched: undo some of the recent changes
undo some of the recent changes that are not needed after all,
such as last_min_vruntime.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-15 17:00:11 +02:00
Peter Zijlstra
67e9fb2a39 sched: add vslice
add vslice: the load-dependent "virtual slice" a task should
run ideally, so that the observed latency stays within the
sched_latency window.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:10 +02:00
Ingo Molnar
c18b8a7cbc sched: remove unneeded tunables
remove unneeded tunables.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:10 +02:00
Ingo Molnar
b8efb56172 sched debug: BKL usage statistics
add per task and per rq BKL usage statistics.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:10 +02:00
Srivatsa Vaddagiri
24e377a832 sched: add fair-user scheduler
Enable user-id based fair group scheduling. This is useful for anyone
who wants to test the group scheduler w/o having to enable
CONFIG_CGROUPS.

A separate scheduling group (i.e. struct task_grp) is automatically created for
every new user added to the system. Upon uid change for a task, it is made to
move to the corresponding scheduling group.

A /proc tunable (/proc/root_user_share) is also provided to tune root
user's quota of cpu bandwidth.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri
9b5b77512d sched: clean up code under CONFIG_FAIR_GROUP_SCHED
With the view of supporting user-id based fair scheduling (and not just
container-based fair scheduling), this patch renames several functions
and makes them independent of whether they are being used for container
or user-id based fair scheduling.

Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating
less-sized array for tg->cfs_rq[] and tg->se[]).

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri
83b699ed20 sched: revert recent removal of set_curr_task()
Revert removal of set_curr_task.
Use put_prev_task/set_curr_task when changing groups/policies

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-15 17:00:08 +02:00
Dmitry Adamushko
f6b53205e1 sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()
rework enqueue/dequeue_entity() to get rid of 
sched_class::set_curr_task(). This simplifies sched_setscheduler(), 
rt_mutex_setprio() and sched_move_tasks().

   text    data     bss     dec     hex filename
  24330    2734      20   27084    69cc sched.o.before
  24233    2730      20   26983    6967 sched.o.after

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:08 +02:00
Dmitry Adamushko
4530d7ab0f sched: simplify sched_class::yield_task()
the 'p' (task_struct) parameter in the sched_class :: yield_task() is
redundant as the caller is always the 'current'. Get rid of it.

   text    data     bss     dec     hex filename
  24341    2734      20   27095    69d7 sched.o.before
  24330    2734      20   27084    69cc sched.o.after

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:08 +02:00
Dmitry Adamushko
30cfdcfc5f sched: do not keep current in the tree and get rid of sched_entity::fair_key
Get rid of 'sched_entity::fair_key'.

As a side effect, 'current' is not kept within the tree for
SCHED_NORMAL/BATCH tasks anymore. This simplifies some parts of code 
(e.g. entity_tick() and yield_task_fair()) and also somewhat optimizes 
them (e.g. a single update_curr() now vs. dequeue/enqueue() before in 
entity_tick()).

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:07 +02:00
Ingo Molnar
bbdba7c0e1 sched: remove wait_runtime fields and features
remove wait_runtime based fields and features, now that the CFS
math has been changed over to the vruntime metric.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:06 +02:00
Ingo Molnar
e22f5bbf86 sched: remove wait_runtime limit
remove the wait_runtime-limit fields and the code depending on it, now
that the math has been changed over to rely on the vruntime metric.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:06 +02:00
Ingo Molnar
e9acbff648 sched: introduce se->vruntime
introduce se->vruntime as a sum of weighted delta-exec's, and use that
as the key into the tree.

the idea to use absolute virtual time as the basic metric of scheduling
has been first raised by William Lee Irwin, advanced by Tong Li and first
prototyped by Roman Zippel in the "Really Fair Scheduler" (RFS) patchset.

also see:

   http://lkml.org/lkml/2007/9/2/76

for a simpler variant of this patch.
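
The idea of keying the tree on a sum of weighted delta-exec's can be sketched
in Python as below. This is a toy model under stated assumptions: a binary
heap stands in for the kernel's tree, NICE_0_LOAD = 1024 is used as the
scaling unit, and the Entity class and run() helper are hypothetical names
chosen for illustration.

```python
import heapq

NICE_0_LOAD = 1024  # assumed weight of a nice-0 entity, used as the unit


class Entity:
    def __init__(self, name, weight=NICE_0_LOAD):
        self.name = name
        self.weight = weight
        self.vruntime = 0

    def account(self, delta_exec):
        # Weighted delta-exec: heavier entities accumulate vruntime more slowly.
        self.vruntime += delta_exec * NICE_0_LOAD // self.weight


def run(entities, slice_ns, steps):
    """Repeatedly run the entity with the smallest vruntime.

    Returns the CPU time each entity received, keyed by name.
    """
    queue = [(e.vruntime, i, e) for i, e in enumerate(entities)]
    heapq.heapify(queue)
    cpu_time = {e.name: 0 for e in entities}
    for _ in range(steps):
        _, i, e = heapq.heappop(queue)  # minimum vruntime runs next
        e.account(slice_ns)
        cpu_time[e.name] += slice_ns
        heapq.heappush(queue, (e.vruntime, i, e))
    return cpu_time


print(run([Entity("a"), Entity("b", weight=2 * NICE_0_LOAD)], 1_000_000, 300))
```

In this model the entity with twice the weight ends up with twice the CPU
time, because its vruntime advances at half the rate.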

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:04 +02:00
Ingo Molnar
8ebc91d936 sched: remove stat_gran
remove the stat_gran code - it was disabled by default and it causes
unnecessary overhead.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:03 +02:00
Ingo Molnar
2bd8e6d422 sched: use constants if !CONFIG_SCHED_DEBUG
use constants if !CONFIG_SCHED_DEBUG.

this speeds up the code and reduces code-size:

    text    data     bss     dec     hex filename
   27464    3014      16   30494    771e sched.o.before
   26929    3010      20   29959    7507 sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:02 +02:00
Ingo Molnar
eba1ed4b7e sched: debug: track maximum 'slice'
track the maximum amount of time a task has executed while
the CPU load was at least 2x. (i.e. at least two nice-0
tasks were runnable)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:02 +02:00
Roland Dreier
744ea922c9 mlx4_core: Kill mlx4_write64_raw()
It's a leftover from development that's never used in the real driver.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-14 20:17:39 -07:00
Thomas Gleixner
1595f452f3 clockevents: introduce force broadcast notifier
The 64bit SMP bootup is slightly different to the 32bit one. It enables
the boot CPU local APIC timer before all CPUs are brought up. Some AMD C1E
systems have the C1E feature flag only set in the secondary CPU. Due to
the early enable of the boot CPU local APIC timer the APIC timer is
registered as a fully functional device. When we detect the wreckage during
the bringup of the secondary CPU, we need to force the boot CPU into
broadcast mode. 

Add a new notifier reason and implement the force broadcast in the clock
events layer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-14 22:57:45 +02:00
Linus Torvalds
4fa435018d Merge branch 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6
* 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6: (53 commits)
  hwmon: (vt8231) fix sparse warning
  hwmon: (sis5595) fix sparse warning
  hwmon: (w83627hf) don't assume bank 0
  hwmon: (w83627hf) Fix setting fan min right after driver load
  hwmon: (w83627hf) De-macro sysfs callback functions
  hwmon: Add new combined driver for FSC chips
  hwmon: (ibmpex) Release IPMI user if hwmon registration fails
  hwmon: (dme1737) Add sch311x support
  hwmon: (dme1737) group functions logically
  hwmon: (dme1737) cleanups
  hwmon: IBM power meter driver
  hwmon: (coretemp) Add support for Celeron 4xx
  hwmon: (lm87) Disable VID when it should be
  hwmon: (w83781d) Add individual alarm and beep files
  hwmon: VRM is not read from registers
  MAINTAINERS: update hwmon subsystem git trees
  hwmon: Fix the code examples in documentation
  hwmon: update sysfs interface document - error handling
  hwmon: (thmc50) Fix a debug message
  hwmon: (thmc50) Don't create temp3 if not enabled
  ...
2007-10-14 12:50:19 -07:00
Al Viro
f53f4137ba fix endianness bug in inet_lro
all uses of and almost all assignments to lro_desc->tcp_ack assume that it's
net-endian; one converts net-endian to host-endian and sticks it in
lro_desc->tcp_ack.
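
The class of bug described above can be illustrated with a short Python
sketch. It only models the byte-order mismatch; the helper names are
hypothetical, and an explicit little-endian read stands in for "host-endian"
so the result does not depend on the machine running it.

```python
import struct


def to_net32(value):
    """Pack a host integer as 4 bytes in network (big-endian) order."""
    return struct.pack("!I", value)


def from_net32(raw):
    """Unpack 4 network-order bytes into a host integer."""
    return struct.unpack("!I", raw)[0]


ack = 0x01020304
stored = to_net32(ack)  # what a net-endian field holds

# Reading the same bytes with the wrong byte order yields a different
# number, so mixing the two conventions in one field breaks comparisons.
misread = struct.unpack("<I", stored)[0]

print(hex(from_net32(stored)), hex(misread))
```

Keeping every assignment to the field in the same byte order, as the other
users already assume, avoids the mismatch.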

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-14 12:41:52 -07:00
Al Viro
9df7c98a0f inet_lro: trivial endianness annotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-14 12:41:52 -07:00
Al Viro
5ba253313d more low-hanging fruits - kernel, fs, lib signedness
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-14 12:41:52 -07:00
Al Viro
64b33619a3 long vs. unsigned long - low-hanging fruits in drivers
deal with signedness of the stuff passed to set_bit() et al.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-14 12:41:51 -07:00
Jiri Kosina
d057fd4cb8 Merge branch 'hidraw' into for-linus
2007-10-14 14:47:56 +02:00
Jiri Kosina
86166b7bcd HID: add hidraw interface
hidraw is an interface that is going to obsolete hiddev one
day.

Many userland applications are using libusb instead of using
kernel-provided hiddev interface. This is caused by various
reasons - the HID parser in kernel doesn't handle all the
HID hardware on the planet properly, some devices might require
their own specific quirks/drivers, etc.

hiddev interface tries to do its best to parse all the received
reports properly, and presents only parsed usages into userspace.
This is however often not enough, and that's the reason why
many userland applications just don't use hiddev at all, and
rather use libusb to read raw USB events and process them on
their own.

Another drawback of hiddev is that it is USB-specific.

hidraw interface provides userspace readers with really raw HID
reports, no matter what the low-level transport layer is (USB/BT),
and gives the userland applications all the freedom to process
the HID reports in a way they wish to.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2007-10-14 14:47:26 +02:00
Khelben Blackstaff
e2bca0749c Input: add KEY_LOGOFF
HUT 1.12 defines Logoff usage 0x19c in Consumer page. There are
keyboards out there emitting this usage code (for example Microsoft
Wireless Laser Keyboard 6000). Add this key so that HID code could
map usages to it.

Signed-off-by: Khelben Blackstaff <eye.of.the.8eholder@gmail.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2007-10-14 13:40:02 +02:00
Tomoya Adachi
08f06177f4 USBHID: report descriptor fix for MacBook JIS keyboard
This patch fixes the problem that the Japanese MacBook doesn't recognize some keys
like '\'(yen, or backslash), '|'(pipe), and '_'(underscore).

This is because the MacBook JIS keyboard (jp106) sends a wrong report descriptor.
It says "logical maximum = 0x65", so Keyboard.0089 is mapped to Key.Unknown,
while it should be accepted as Key.Yen.

Signed-off-by: Tomoya Adachi <adachi@il.is.s.u-tokyo.ac.jp>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2007-10-14 13:40:01 +02:00
Stelian Pop
0ce91cf9ce HID: enable hiddev for the SantaRosa MacBookPro IR receiver
The infrared remote receiver found in the SantaRosa MacBookPro
laptops (MacBookPro3,1) needs to be forced to expose a HIDDEV
interface (instead of HIDINPUT) so that lirc can access it using
the 'macmini' driver.

The patch below adds the required quirk for forcing the HIDDEV
interface to be activated (HID_QUIRK_HIDDEV) and introduces a new
quirk which forces the HIDINPUT interface to be ignored
(HID_QUIRK_IGNORE_HIDINPUT).

Note that Apple calls this receiver 'IRController4' (info taken
from Apple's driver Info.plist). Older Mac{Book,Mini,Pro}s seem
to all use the 'IRController1' device (USB id 05ac:8240) which
doesn't need those quirks.

Signed-off-by: Stelian Pop <stelian@popies.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2007-10-14 13:40:01 +02:00
Jiri Kosina
4dc21a8005 Input: add KEY_SPELLCHECK
HUT 1.12 defines Spell Check usage 0x1ab in Consumer page. There are
keyboards out there emitting this usage code (for example Microsoft
Natural Ergonomic Keyboard 4000). Add this key so that HID code could
map usages to it.

Acked-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2007-10-14 13:40:00 +02:00
Peter Zijlstra
14358e6dda lockdep: annotate dir vs file i_mutex
On Mon, 2007-09-24 at 22:13 -0400, Steven Rostedt wrote:
> The circular lock seems to be this:
> 
> #1:
> 
>   sys_mmap2:              down_write(&mm->mmap_sem);
>   nfs_revalidate_mapping: mutex_lock(&inode->i_mutex);
> 
> 
> #0:
> 
>   vfs_readdir:     mutex_lock(&inode->i_mutex);
>    - during the readdir (filldir64), we take a user fault (missing page?)
>     and call do_page_fault -
>   do_page_fault:   down_read(&mm->mmap_sem);
> 
> 
> So it does indeed look like a circular locking. Now the question is, "is
> this a bug?".  Looking like the inode of #1 must be a file or something
> else that you can mmap and the inode of #0 seems it must be a directory.
> I would say "no".
> 
> Now if you can readdir on a file or mmap a directory, then this could be
> an issue.
> 
> Otherwise, I'd love to see someone teach lockdep about this issue! ;-)

Make a distinction between file and dir usage of i_mutex.
The inode should be complete and unused at unlock_new_inode(), re-init
i_mutex depending on its type.
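
To see why splitting the class makes the report go away, here is a toy userspace model of a lock-order graph. It is only a sketch under stated assumptions: it is not how lockdep is implemented, and all names in it are hypothetical. It records the two orderings from the quoted traces and checks for a cycle, once with a single i_mutex class and once with separate file and directory classes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CLASSES 4

/* dep[a][b] is true when class b was acquired while holding class a. */
struct lock_graph {
    bool dep[MAX_CLASSES][MAX_CLASSES];
};

enum { MMAP_SEM, I_MUTEX_FILE, I_MUTEX_DIR };

void record_order(struct lock_graph *g, int held, int acquired)
{
    assert(g != NULL);
    assert(held >= 0 && held < MAX_CLASSES);
    assert(acquired >= 0 && acquired < MAX_CLASSES);
    g->dep[held][acquired] = true;
}

/* True if some class can reach itself, i.e. a possible deadlock. */
bool has_cycle(const struct lock_graph *g)
{
    bool reach[MAX_CLASSES][MAX_CLASSES];

    memcpy(reach, g->dep, sizeof reach);
    /* Warshall's transitive closure over the tiny graph */
    for (int k = 0; k < MAX_CLASSES; k++)
        for (int i = 0; i < MAX_CLASSES; i++)
            for (int j = 0; j < MAX_CLASSES; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = true;
    for (int i = 0; i < MAX_CLASSES; i++)
        if (reach[i][i])
            return true;
    return false;
}

/* Replay the two traces from the report, with or without the split. */
bool report_has_cycle(bool split_dir_class)
{
    struct lock_graph g;
    int dir = split_dir_class ? I_MUTEX_DIR : I_MUTEX_FILE;

    memset(&g, 0, sizeof g);
    /* #1: sys_mmap2 takes mmap_sem, then a file's i_mutex */
    record_order(&g, MMAP_SEM, I_MUTEX_FILE);
    /* #0: vfs_readdir takes a directory's i_mutex, then mmap_sem */
    record_order(&g, dir, MMAP_SEM);
    return has_cycle(&g);
}
```

With one shared class the two orderings form a cycle; with the directory in its own class they do not.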

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-14 01:38:33 +02:00
Peter Zijlstra
d475fd428c lockdep: per filesystem inode lock class
Give each filesystem its own inode lock class. The various filesystems have
different locking order wrt the inode locks; esp. the pseudo filesystems differ
from the rest.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-15 14:51:31 +02:00
David Brownell
6662cbb989 i2c: Rename the PEC functionality bit
Rename I2C_FUNC_SMBUS_HWPEC_CALC as I2C_FUNC_SMBUS_PEC, and list that
functionality as always available through the software implementation.
Update documentation accordingly (and list similar requirements).

The way it's currently packaged doesn't present the capability in a
useful way.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
2007-10-13 23:56:33 +02:00
David Brownell
08fb68bb4b i2c: Move i2c-dev interfaces to i2c-dev.h
Move the i2c-dev support into <linux/i2c-dev.h> where it should always
have lived.  Now <linux/i2c.h> no longer holds stuff related to the
optional userspace /dev/i2c-X interface.  Improve the descriptions
for these ioctl requests.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
2007-10-13 23:56:32 +02:00
David Brownell
53be795934 i2c: Remove i2c_algorithm.algo_control()
This removes:

 - An effectively unused hook:  i2c_algorithm.algo_control.

 - The i2c_control() call, used only by i2c-dev to call that
   unused hook or set two barely supported adapter params.

   (That param setting moves into i2c-dev.c ... still iffy
   due to lack of locking, but no other changes.)

As shown by diffstat, this is a net code shrink.  It also reduces the
complexity of the I2C adapter and /dev interfaces.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
2007-10-13 23:56:32 +02:00
David Brownell
a64ec07d3d i2c: Document struct i2c_msg
Clarify use of the I2C_M_* flags by highlighting the fact that
most of them depend on I2C_FUNC_PROTOCOL_MANGLING.

Also provide kerneldoc for i2c_smbus_read_block_data() and also
for "struct i2c_msg".

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
2007-10-13 23:56:31 +02:00
Adrian Bunk
83eaaed0d0 i2c-core: Make some code static
After the i2c-isa removal some code can become static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
2007-10-13 23:56:30 +02:00
David Brownell
3bbb835d4c i2c: New-style devices can support driver model wakeup flags
We need to be able to flag I2C devices, such as RTCs, which can issue wake
events (usually through IRQ lines).  This adds an i2c_board_info.flags bit,
and uses it to initialize the i2c device node.  (And shrinks a few lines
that were overly long.)

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
2007-10-13 23:56:29 +02:00
Jean Delvare
cee37ae407 i2c: Kill struct i2c_device_id
I2C devices do not have any form of ID as PCI or USB devices have.
No driver uses "MODULE_DEVICE_TABLE(i2c, ...)" because it doesn't
make sense. So we can get rid of struct i2c_device_id and the
associated support code.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Greg KH <greg@kroah.com>
2007-10-13 23:56:29 +02:00
Linus Torvalds
bcd11eaa22 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (27 commits)
  alim15x3: remove redundant m5229_revision check
  sc1200: fix ->dma_base equal zero handling
  cs5520: fix ->dma_base equal zero handling
  sgiioc4: add missing ->dma_base check
  cs5535: add missing ->dma_base check
  ide: remove CONFIG_IDEDMA_IVB config option
  ide: change master/slave IDENTIFY order
  ide: move ide_config_drive_speed() calls to upper layers (take 2)
  pdc202xx_new: check ide_config_drive_speed() return value
  cs5535: check ide_config_drive_speed() return value
  amd74xx/via82cxxx: check ide_config_drive_speed() return value
  au1xxx: fix au1xxx_set_pio_mode()
  icside: use ide_tune_dma()
  ide-pmac: fix PIO setup and enable autotune
  ide-pmac: use ide_tune_dma() (take 2)
  ide-pmac: remove pmac_ide_do_setfeature() (take 2)
  ide-pmac: remove nIEN clearing from pmac_ide_do_setfeature()
  ide-pmac: use __ide_wait_stat()
  ide-pmac: remove extra good status wait from pmac_ide_do_setfeature()
  ide: add __ide_wait_stat() helper
  ...
2007-10-13 10:13:27 -07:00
Linus Torvalds
c8c55bcb43 Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (91 commits)
  [MTD] [NAND] Blackfin on-chip NAND Flash Controller driver
  [MTD] [NOR] fix ctrl-alt-del can't reboot for intel flash bug
  [MTD] [NAND] Fix compiler warning in Alauda driver
  [JFFS2] Remove stray debugging printk
  [JFFS2] Handle dirents on the flash with embedded zero bytes in names.
  [JFFS2] Check for creation of dirents with embedded zero bytes in name.
  [JFFS2] Don't count all 'very dirty' blocks except in debug mode
  [JFFS2] Check whether garbage-collection actually obsoleted its victim.
  [JFFS2] Relax threshold for triggering GC due to dirty blocks.
  [MTD] [OneNAND] Fix typo related with recent commit
  [JFFS2] Trigger garbage collection when very_dirty_list size becomes excessive
  [MTD] [NAND] Avoid deadlock in erase callback; release chip lock first.
  [MTD] [NAND] Resume method for CAFÉ NAND controller
  [MTD] [NAND] Fix PCI ident table for CAFÉ NAND controller.
  [MTD] [NAND] s3c2410: fix arch moves
  [MTD] [OneNAND] fix numerous races
  [MTD] map driver for NOR flash on the Intel Vermilion Range chipset
  [JFFS2] Fix unpoint length
  [MTD] fix CFI point method for discontiguous maps
  [MTD] MAPS: Merge Lubbock and Mainstone drivers into common PXA2xx driver
  ...
2007-10-13 10:12:15 -07:00
Linus Torvalds
3749c66c67 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (106 commits)
  KVM: Replace enum by #define
  KVM: Skip pio instruction when it is emulated, not executed
  KVM: x86 emulator: popf
  KVM: x86 emulator: fix src, dst value initialization
  KVM: x86 emulator: jmp abs
  KVM: x86 emulator: lea
  KVM: X86 emulator: jump conditional short
  KVM: x86 emulator: imlpement jump conditional relative
  KVM: x86 emulator: sort opcodes into ascending order
  KVM: Improve emulation failure reporting
  KVM: x86 emulator: pushf
  KVM: x86 emulator: call near
  KVM: x86 emulator: push imm8
  KVM: VMX: Fix exit qualification width on i386
  KVM: Move main vcpu loop into subarch independent code
  KVM: VMX: Move vm entry failure handling to the exit handler
  KVM: MMU: Don't do GFP_NOWAIT allocations
  KVM: Rename kvm_arch_ops to kvm_x86_ops
  KVM: Simplify memory allocation
  KVM: Hoist SVM's get_cs_db_l_bits into core code.
  ...
2007-10-13 10:02:11 -07:00
Randy Dunlap
c4ea43c552 net core: fix kernel-doc for new function parameters
Fix networking code kernel-doc for newly added parameters.

Warning(linux-2.6.23-git2//net/core/sock.c:879): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:570): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:594): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:617): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:641): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:667): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:722): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:959): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:1195): No description found for parameter 'dev'
Warning(linux-2.6.23-git2//net/core/dev.c:2105): No description found for parameter 'n'
Warning(linux-2.6.23-git2//net/core/dev.c:3272): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:3445): No description found for parameter 'net'
Warning(linux-2.6.23-git2//include/linux/netdevice.h:1301): No description found for parameter 'cpu'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-13 09:52:26 -07:00
Linus Torvalds
dcf397f037 Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (124 commits)
  sh: allow building for both r2d boards in same binary.
  sh: fix r2d board detection
  sh: Discard .exit.text/.exit.data at runtime.
  sh: Fix up some section alignments in linker script.
  sh: Fix SH-4 DMAC CHCR masking.
  sh: Rip out left-over nommu cond syscall cruft.
  sh: Make kgdb i-cache flushing less inept.
  sh: kgdb section mismatches and tidying.
  sh: cleanup struct irqaction initializers.
  sh: early_printk tidying.
  video: pvr2fb: Add TV (RGB) support to Dreamcast PVR driver.
  sh: Conditionalize gUSA support.
  sh: Follow gUSA preempt changes in __switch_to().
  sh: Tidy up gUSA preempt handling.
  sh: __copy_user() optimizations for small copies.
  sh: clkfwk: Support multi-level clock propagation.
  sh: Fix URAM start address on SH7785.
  sh: Use boot_cpu_data for CPU probe.
  sh: Support extended mode TLB on SH-X3.
  sh: Bump MAX_ACTIVE_REGIONS for SH7785.
  ...
2007-10-13 09:49:04 -07:00
Bartlomiej Zolnierkiewicz
88b2b32bab ide: move ide_config_drive_speed() calls to upper layers (take 2)
* Convert {ide_hwif_t,ide_pci_device_t}->host_flag to be u16.

* Add IDE_HFLAG_POST_SET_MODE host flag to indicate the need to program 
  the host for the transfer mode after programming the device.  Set it
  in au1xxx-ide, amd74xx, cs5530, cs5535, pdc202xx_new, sc1200, pmac
  and via82cxxx host drivers.

* Add IDE_HFLAG_NO_SET_MODE host flag to indicate the need to completely
  skip programming of host/device for the transfer mode ("smart" hosts).
  Set it in it821x host driver and check it in ide_tune_dma().

* Add ide_set_pio_mode()/ide_set_dma_mode() helpers and convert all
  direct ->set_pio_mode/->speedproc users to use these helpers.

* Move ide_config_drive_speed() calls from ->set_pio_mode/->speedproc
  methods to callers.

* Rename ->speedproc method to ->set_dma_mode, make it void and update
  all implementations accordingly.

* Update ide_set_xfer_rate() comments.

* Unexport ide_config_drive_speed().

v2:
* Fix issues noticed by Sergei:
  - export ide_set_dma_mode() instead of moving ->set_pio_mode abuse wrt
    to setting DMA modes from sc1200_set_pio_mode() to do_special()
  - check IDE_HFLAG_NO_SET_MODE in ide_tune_dma()
  - check for (hwif->set_pio_mode) == NULL in ide_set_pio_mode()
  - check for (hwif->set_dma_mode) == NULL in ide_set_dma_mode()
  - return -1 from ide_set_{pio,dma}_mode() if ->set_{pio,dma}_mode == NULL
  - don't set ->set_{pio,dma}_mode on it821x in "smart" mode
  - fix build problem in pmac.c
  - minor fixes in au1xxx-ide.c/cs5530.c/siimage.c
  - improve patch description

Changes in behavior caused by this patch:
- HDIO_SET_PIO_MODE ioctl would now return -ENOSYS for attempts to change
  PIO mode if it821x controller is in "smart" mode
- removal of two debugging printk-s (from cs5530.c and sc1200.c)
- transfer modes 0x00-0x07 passed from user space may be programmed twice on
  the device (not really an issue since 0x00 is not supported correctly by
  any host driver ATM, 0x01 is not supported at all and 0x02-0x07 are invalid)
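
The NULL-method behavior described above can be sketched like this. It is a simplified userspace model with hypothetical demo_* names, not the actual ide_set_pio_mode() code: a host that leaves the method unset (as it821x does in "smart" mode) makes the helper return -1 instead of being called through a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the host interface structure. */
struct demo_hwif {
    void (*set_pio_mode)(struct demo_hwif *hwif, unsigned char pio);
    unsigned char programmed_pio;  /* records what the host was set to */
};

/* Helper in the spirit of the commit: fail when there is no method. */
int demo_set_pio_mode(struct demo_hwif *hwif, unsigned char pio)
{
    assert(hwif != NULL);
    if (hwif->set_pio_mode == NULL)
        return -1;
    hwif->set_pio_mode(hwif, pio);
    return 0;
}

/* Example host method that just remembers the requested mode. */
void demo_host_set_pio(struct demo_hwif *hwif, unsigned char pio)
{
    hwif->programmed_pio = pio;
}
```

A caller can then turn the -1 into an error for userspace rather than silently doing nothing.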

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-10-13 17:47:51 +02:00
Bartlomiej Zolnierkiewicz
75d7d963e3 icside: use ide_tune_dma()
* Add "good DMA drives" hack for icside to ide-dma.c::ide_find_dma_mode()
  (in the long-term it should be either removed or generalized for all hosts).

* Use ide_tune_dma() in icside.c::icside_dma_check().

  This results in the following changes in behavior:
  - pre-EIDE SWDMA modes are now also respected
  - drive->autodma is checked instead of hwif->autodma
    (doesn't really matter as icside sets both to "1")

* Make ide-dma.c::__ide_dma_good_drive() static and drop "__" prefix.

Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-10-13 17:47:50 +02:00
Bartlomiej Zolnierkiewicz
aedea5910c ide-pmac: remove pmac_ide_do_setfeature() (take 2)
Use ide_config_drive_speed() instead of pmac_ide_do_setfeature() and remove
the latter; ide-iops.c::__ide_wait_stat() can also be static again.

Since for the IDE PMAC host driver IDE_CONTROL_REG is always true, the device's
->quirk_list is always zero and ->ide_dma_host_{on,off} are nops, the only
changes in behavior are:

* if PIO mode is set then ->dma_off_quietly is called to disable DMA

* if setting transfer mode fails ide_dump_status() is called to dump status

v2:
* IDE PMAC controllers allow separate PIO and DMA timings and PPC userland
  depends on this fact, and calls "hdparm -p" without calling "hdparm -d".

  Therefore to compensate for DMA being disabled by ide_config_drive_speed()
  for PIO modes:

  - add IDE_HFLAG_SET_PIO_MODE_KEEP_DMA flag and set it in PMAC host driver

  - add handling of the new flag to ide-io.c::do_special()

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-10-13 17:47:50 +02:00
Bartlomiej Zolnierkiewicz
ddf151026a ide-pmac: use __ide_wait_stat()
* Use __ide_wait_stat() instead of wait_for_ready() in pmac_ide_do_setfeature().

While at it, make the following changes to match the __ide_wait_stat() call in
ide_config_drive_speed():

* Wait WAIT_CMD time (20 sec) instead of 2 sec for device to clear BUSY_STAT.

* Check DRQ_STAT bit (shouldn't be set for good device status).

Also remove no longer needed wait_for_ready() from ide-iops.c.

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-10-13 17:47:49 +02:00
Bartlomiej Zolnierkiewicz
74af21cf4d ide: add __ide_wait_stat() helper
* Split off checking of the status register from ide_wait_stat() to
  __ide_wait_stat() helper.

* Use the new helper in ide_config_drive_speed().  The only change in the
  functionality is that the function now fails if after 20 sec (WAIT_CMD)
  device is still busy (BUSY_STAT bit is set) while previously instead of
  failing the function continued with checking for the correct device status
  (which would give the device additional 10 usec to clear BUSY_STAT bit).

* Remove stale comment for ide_config_drive_speed().

* Remove duplicate comment for ide_wait_stat() from <linux/ide.h>.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-10-13 17:47:49 +02:00
David Woodhouse
ebf8889bd1 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 2007-10-13 14:58:23 +01:00
David Woodhouse
b160292cc2 Merge Linux 2.6.23 2007-10-13 14:43:54 +01:00
Kevin Hao
c4a9f88daf [MTD] [NOR] fix ctrl-alt-del can't reboot for intel flash bug
When we press ctrl-alt-del, kernel_restart_prepare invokes
cfi_intelext_reboot, which sets the flash to read array mode. Later,
however, device_shutdown is invoked, which may put the current work queue
to sleep; another process may then be scheduled and program the flash,
leaving it in a non-FL_READY mode again. So we can't boot up if this
flash is used for the bootloader.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2007-10-13 14:36:18 +01:00
Avi Kivity
8a45450d0a KVM: Replace enum by #define
Easier for existence test (#ifdef) in userspace.
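
A small sketch of the difference, using made-up names rather than real KVM identifiers: the preprocessor cannot see enumerators, so an #ifdef on an enum constant always takes the "not defined" branch, while the same test on a macro works.

```c
#include <assert.h>

/* Illustrative names only; not actual KVM ABI identifiers. */
enum { DEMO_FEATURE_AS_ENUM = 1 };
#define DEMO_FEATURE_AS_DEFINE 1

int enum_visible_to_ifdef(void)
{
#ifdef DEMO_FEATURE_AS_ENUM
    return 1;
#else
    return 0;  /* taken: the preprocessor never sees enumerators */
#endif
}

int define_visible_to_ifdef(void)
{
#ifdef DEMO_FEATURE_AS_DEFINE
    return 1;  /* taken: macros are visible to #ifdef */
#else
    return 0;
#endif
}
```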

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:29 +02:00
Eddie Dong
96ad2cc613 KVM: in-kernel LAPIC save and restore support
This patch adds a new vcpu-based IOCTL to save and restore the local
apic registers for a single vcpu. The kernel only copies the apic page as
a whole; extraction of registers is left to the userspace side. On restore,
the APIC timer is restarted from the initial count. This introduces a little
delay, but works fine.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:25 +02:00
He, Qing
6bf9e962d1 KVM: in-kernel IOAPIC save and restore support
This patch adds support for in-kernel ioapic save and restore (to
and from userspace). It uses the same get/set_irqchip ioctl as
in-kernel PIC.

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:25 +02:00
He, Qing
6ceb9d791e KVM: Add get/set irqchip ioctls for in-kernel PIC live migration support
This patch adds two new ioctls to dump and write kernel irqchips for
save/restore and live migration. PIC s/r and l/m is implemented in this
patch.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:25 +02:00
Eddie Dong
b6958ce44a KVM: Emulate hlt in the kernel
By sleeping in the kernel when hlt is executed, we simplify the in-kernel
guest interrupt path considerably.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:25 +02:00
Eddie Dong
97222cc831 KVM: Emulate local APIC in kernel
Because lightweight exits (exits which don't involve userspace) are many
times faster than heavyweight exits, it makes sense to emulate high usage
devices in the kernel.  The local APIC is one such device, especially for
Windows and for SMP, so we add an APIC model to kvm.

It also allows in-kernel host-side drivers to inject interrupts without
going through userspace.

[compile fix on i386 from Jindrich Makovicka]

Signed-off-by: Yaozu (Eddie) Dong <Eddie.Dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:25 +02:00
Eddie Dong
85f455f7dd KVM: Add support for in-kernel PIC emulation
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:24 +02:00
Yang, Sheng
253abdee5e KVM: Communicate cr8 changes to userspace
This allows running 64-bit Windows.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:23 +02:00
Jeff Dike
519ef35341 KVM: add hypercall nr to kvm_run
Add the hypercall number to kvm_run and initialize it.  This changes the ABI,
but as this particular ABI was unusable before this no users are affected.

Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:20 +02:00
Rusty Russell
9eb829ced8 KVM: Trivial: Use standard BITMAP macros, open-code userspace-exposed header
Creating one's own BITMAP macro seems suboptimal: if we use manual
arithmetic in the one place exposed to userspace, we can use standard
macros elsewhere.
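
The "manual arithmetic" in question is the usual round-up division for sizing a bitmap. A userspace sketch of it, with hypothetical demo_* names standing in for the kernel's bitmap helpers:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold nbits bits, rounding up. */
size_t demo_bits_to_longs(size_t nbits)
{
    return (nbits + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG;
}

/* Set one bit in a bitmap stored as an array of unsigned long. */
void demo_set_bit(unsigned long *map, size_t bit)
{
    assert(map != NULL);
    map[bit / DEMO_BITS_PER_LONG] |= 1UL << (bit % DEMO_BITS_PER_LONG);
}

/* Return 1 if the given bit is set, 0 otherwise. */
int demo_test_bit(const unsigned long *map, size_t bit)
{
    assert(map != NULL);
    return (int)((map[bit / DEMO_BITS_PER_LONG] >>
                  (bit % DEMO_BITS_PER_LONG)) & 1UL);
}
```

Open-coding the size expression in an exported header avoids depending on a kernel-only macro there.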

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:18 +02:00
Rusty Russell
dea8caee7b KVM: Trivial: /dev/kvm interface is no longer experimental.
KVM interface is no longer experimental.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:17 +02:00
Avi Kivity
24cbc7e9cb KVM: Future-proof the exit information union ABI
Note that as the size of struct kvm_run is not part of the ABI, we can add
things at the end.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:17 +02:00
Dmitry Torokhov
b981d8b3f5 Merge master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/macintosh/adbhid.c
2007-10-12 21:27:47 -04:00
Linus Torvalds
ab9c232286 Merge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: (119 commits)
  [libata] struct pci_dev related cleanups
  libata: use ata_exec_internal() for PMP register access
  libata: implement ATA_PFLAG_RESETTING
  libata: add @timeout to ata_exec_internal[_sg]()
  ahci: fix notification handling
  ahci: clean up PORT_IRQ_BAD_PMP enabling
  ahci: kill leftover from enabling NCQ over PMP
  libata: wrap schedule_timeout_uninterruptible() in loop
  libata: skip suppress reporting if ATA_EHI_QUIET
  libata: clear ehi description after initial host report
  pata_jmicron: match vendor and class code only
  libata: add ST9160821AS / 3.ALD to NCQ blacklist
  pata_acpi: ACPI driver support
  libata-core: Expose gtm methods for driver use
  libata: add HDT722516DLA380 to NCQ blacklist
  libata: blacklist NCQ on Seagate Barracuda ST380817AS
  [libata] Turn on ACPI by default
  libata_scsi: Fix ATAPI transfer lengths
  libata: correct handling of SRST reset sequences
  libata: Integrate ACPI-based PATA/SATA hotplug - version 5
  ...
2007-10-12 16:16:41 -07:00
Linus Torvalds
6a84258e5f Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6: (37 commits)
  PCI: merge almost all of pci_32.h and pci_64.h together
  PCI: X86: Introduce and enable PCI domain support
  PCI: Add 'nodomains' boot option, and pci_domains_supported global
  PCI: modify PCI bridge control ISA flag for clarity
  PCI: use _CRS for PCI resource allocation
  PCI: avoid P2P prefetch window for expansion ROMs
  PCI: skip ISA ioresource alignment on some systems
  PCI: remove transparent bridge sizing
  pci: write file size to inode on proc bus file write
  pci: use size stored in proc_dir_entry for proc bus files
  pci: implement "pci=noaer"
  PCI: fix IDE legacy mode resources
  MSI: Use correct data offset for 32-bit MSI in read_msi_msg()
  PCI: Fix incorrect argument order to list_add_tail() in PCI dynamic ID code
  PCI: i386: Compaq EVO N800c needs PCI bus renumbering
  PCI: Remove no longer correct documentation regarding MSI vector assignment
  PCI: re-enable onboard sound on "MSI K8T Neo2-FIR"
  PCI: quirk_vt82c586_acpi: Omit reading PCI revision ID
  PCI: quirk amd_8131_mmrbc: Omit reading pci revision ID
  cpqphp: Use PCI_CLASS_REVISION instead of PCI_REVISION_ID for read
  ...
2007-10-12 15:50:23 -07:00
Linus Torvalds
efefc6eb38 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (75 commits)
  PM: merge device power-management source files
  sysfs: add copyrights
  kobject: update the copyrights
  kset: add some kerneldoc to help describe what these strange things are
  Driver core: rename ktype_edd and ktype_efivar
  Driver core: rename ktype_driver
  Driver core: rename ktype_device
  Driver core: rename ktype_class
  driver core: remove subsystem_init()
  sysfs: move sysfs file poll implementation to sysfs_open_dirent
  sysfs: implement sysfs_open_dirent
  sysfs: move sysfs_dirent->s_children into sysfs_dirent->s_dir
  sysfs: make sysfs_root a regular directory dirent
  sysfs: open code sysfs_attach_dentry()
  sysfs: make s_elem an anonymous union
  sysfs: make bin attr open get active reference of parent too
  sysfs: kill unnecessary NULL pointer check in sysfs_release()
  sysfs: kill unnecessary sysfs_get() in open paths
  sysfs: reposition sysfs_dirent->s_mode.
  sysfs: kill sysfs_update_file()
  ...
2007-10-12 15:49:37 -07:00
Linus Torvalds
117494a1b6 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/usb-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/usb-2.6: (142 commits)
  USB: fix race in autosuspend reschedule
  atmel_usba_udc: Keep track of the device status
  USB: Nikon D40X unusual_devs entry
  USB: serial core should respect driver requirements
  USB: documentation for USB power management
  USB: skip autosuspended devices during system resume
  USB: mutual exclusion for EHCI init and port resets
  USB: allow usbstorage to have LUNS greater than 2Tb
  USB: Adding support for SHARP WS011SH to ipaq.c
  USB: add atmel_usba_udc driver
  USB: ohci SSB bus glue
  USB: ehci build fixes on au1xxx, ppc-soc
  USB: add runtime frame_no quirk for big-endian OHCI
  USB: funsoft: Fix termios
  USB: visor: termios bits
  USB: unusual_devs entry for Nikon DSC D2Xs
  USB: re-remove <linux/usb_sl811.h>
  USB: move <linux/usb_gadget.h> to <linux/usb/gadget.h>
  USB: Export URB statistics for powertop
  USB: serial gadget: Disable endpoints on unload
  ...
2007-10-12 15:49:10 -07:00
Linus Torvalds
4d5709a7b7 Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq
* master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq:
  [CPUFREQ] Don't take semaphore in cpufreq_quick_get()
  [CPUFREQ] Support different families in fid/did to frequency conversion
  [CPUFREQ] cpufreq_stats: misc cpuinit section annotations
  [CPUFREQ] implement !CONFIG_CPU_FREQ stub for  cpufreq_unregister_notifier()
  [CPUFREQ] mark hotplug notifier callback as __cpuinit
  [CPUFREQ] Only check for transition latency on problematic governors (kconfig fix)
  [CPUFREQ] allow ondemand and conservative cpufreq governors to be used as default
  [CPUFREQ] move policy's governor initialisation out of low-level drivers into cpufreq core
  [CPUFREQ] Longhaul - Add support for PM133 northbridge
  [CPUFREQ] x86: use num_online_nodes to get physical cpus numbers for
2007-10-12 15:42:01 -07:00
Linus Torvalds
57c5b9998e Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86: (40 commits)
  x86: HPET add another ICH7 PCI id
  x86: HPET force enable ICH5 suspend/resume fix
  x86: HPET force enable for ICH5
  x86: HPET try to activate force detected hpet
  x86: HPET force enable o ICH7 and later
  x86: HPET restructure hpet code for hpet force enable
  clock events: allow replacement of broadcast timer
  i386/x8664: cleanup the shared hpet code
  i386: Remove the useless #ifdef in i8253.h
  ACPI: remove the now unused ifdef code
  jiffies: remove unused macros
  x86_64: cleanup apic.c after clock events switch
  x86_64: remove now unused code
  x86: unify timex.h variants
  x86: kill 8253pit.h
  x86: disable apic timer for AMD C1E enabled CPUs
  x86: Fix irq0 / local apic timer accounting
  x86_64: convert to clock events
  x86_64: Add (not yet used) clock event functions
  x86_64: prepare idle loop for dynamic ticks
  ...
2007-10-12 15:39:39 -07:00
Jeff Garzik
32a2eea795 PCI: Add 'nodomains' boot option, and pci_domains_supported global
* Introduce pci_domains_supported global, hardcoded to zero if
  !CONFIG_PCI_DOMAINS.

* Introduce 'nodomains' boot option, which clears pci_domains_supported
  on platforms that enable it by default (x86, x86-64, and others when
  they are converted to use this).

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 15:03:18 -07:00
Gary Hade
11949255d9 PCI: modify PCI bridge control ISA flag for clarity
Modify PCI Bridge Control ISA flag for clarity

This patch changes PCI_BRIDGE_CTL_NO_ISA to PCI_BRIDGE_CTL_ISA
and modifies its clarifying comment and locations where used.
The change reduces the chance of future confusion since it makes
the set/unset meaning of the bit the same in both the bridge
control register and bridge_ctl field of the pci_bus struct.

Signed-off-by: Gary Hade <garyhade@us.ibm.com>
Acked-by: Linas Vepstas <linas@austin.ibm.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 15:03:18 -07:00
Alex Chiang
9f672153ba PCI: Add missing PCI capability IDs
These IDs are in pciutils, but haven't been added to the kernel
yet.

Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 15:03:16 -07:00
Thomas Backlund
b205f6b267 i386: add support for picopower irq router
Add support for PicoPower PT86C523 IRQ router to be used with the in-kernel
yenta driver for CardBus.  With this patch cardbus works on e.g.  Dell
Latitude XPi P150CD.

Initial patch for kernel 2.4 series by Sune Mølgaard
http://molgaard.org/code/linux-2.4.31-picopower.patch

Ported to 2.6.20 by Chmouel Boudjnah (http://www.chmouel.com)

Testing and confirmation that it works by Austin Acton

Cleaned up a little for inclusion in a 2.6.21-rc7 based kernel.

Added some more cleanups according to CodingStyle, as noted by
Randy Dunlap on LKML.

[akpm@linux-foundation.org: build fixes]
Signed-off-by: Thomas Backlund <tmb@mandriva.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 15:03:15 -07:00
Alan Stern
063a2da8f0 USB: serial core should respect driver requirements
This patch (as997) fixes a bug in the USB serial core.  The core needs
to pay attention to drivers' requirements regarding the number and
type of endpoints a device has.

At the same time, the patch changes the NUM_DONT_CARE constant (which
is stored in a single-byte field) from -1 to a safer, unsigned value.
It also improves the kerneldoc for several fields in the
usb_serial_driver structure.

Finally, the patch replaces a list_for_each() with list_for_each_entry().

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:34 -07:00
Alan Stern
271f9e68f3 USB: skip autosuspended devices during system resume
System suspends and hibernation are supposed to be as transparent as
possible.  By this reasoning, if a USB device is already autosuspended
before the system sleep begins then it should remain autosuspended
after the system wakes up.

This patch (as1001) adds a skip_sys_resume flag to the usb_device
structure and uses it to avoid waking up devices which were suspended
when a system sleep began.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:34 -07:00
David Brownell
27f5d75afa USB: re-remove <linux/usb_sl811.h>
Remove <linux/usb_sl811.h> ... somehow this was recreated when
the Blackfin arch was merged, instead of using <linux/usb/sl811.h>
which is the correct header.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:31 -07:00
David Brownell
9454a57ab5 USB: move <linux/usb_gadget.h> to <linux/usb/gadget.h>
Move <linux/usb_gadget.h> to <linux/usb/gadget.h>, reducing
some of the clutter in the main include directory.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:31 -07:00
Sarah Sharp
4d59d8a113 USB: Export URB statistics for powertop
powertop currently tracks interrupts generated by uhci, ehci, and ohci,
but it has no way of telling which USB device to blame USB bus activity on.
This patch exports the number of URBs that are submitted for a given device.
Cat the file 'urbnum' in /sys/bus/usb/devices/.../

Signed-off-by: Sarah Sharp <sarah.a.sharp@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:30 -07:00
Alan Stern
a691efa988 USB: remove USB_QUIRK_NO_AUTOSUSPEND
This patch (as995) cleans up the remains of the former NO_AUTOSUSPEND
quirk.  Since autosuspend is disabled by default, we will let
userspace worry about which devices can safely be suspended.  Thus the
lengthy series of quirk entries is no longer needed, and neither is
the quirk ID.  I suppose someone might eventually run across a hub
that can't be suspended; let's ignore the possibility for now.

The patch also cleans up the hasty way in which autosuspend gets
disabled.  Setting udev->autosuspend_delay to -1 wasn't quite right,
because the value is always supposed to be a multiple of HZ.  It's
better to leave the delay value alone and set autosuspend_disabled,
which is what the quirk routine used to do.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:28 -07:00
Alan Stern
6840d2555a USB: flush outstanding URBs when suspending
This patch (as989) makes usbcore flush all outstanding URBs for each
device as the device is suspended.  This will be true even when
CONFIG_USB_SUSPEND is not enabled.

In addition, an extra can_submit flag is added to the usb_device
structure.  That flag will be turned off whenever a suspend request
has been received for the device, even if the device isn't actually
suspended because CONFIG_USB_SUSPEND isn't set.

It's no longer necessary to check for the device state being equal to
USB_STATE_SUSPENDED during URB submission; that check can be replaced
by a check of the can_submit flag.  This also permits us to remove
some questionable references to the deprecated power.power_state field.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:26 -07:00
Alan Stern
1431d2a44c USB: get rid of urb->lock
Now that urb->status isn't used, urb->lock doesn't protect anything.
This patch (as980) removes it and replaces it with a private mutex in
the one remaining place it was still used: usb_kill_urb.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:23 -07:00
Alan Stern
eb23105462 USB: add urb->unlinked field
This patch (as970) adds a new urb->unlinked field, which is used to
store the status of unlinked URBs since we can't use urb->status for
that purpose any more.  To help simplify the HCDs, usbcore will check
urb->unlinked before calling the completion handler; if the value is
set it will automatically override the status reported by the HCD.
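
The override described above can be sketched roughly as follows. This is a reduced userspace illustration, not the usbcore code; `demo_urb` and `demo_giveback_status` are hypothetical names, and the assumption is that the unlinked field holds 0 or a negative errno value.

```c
#include <errno.h>

/* Hypothetical, reduced stand-in for struct urb. */
struct demo_urb {
	int unlinked;	/* 0, or the negative errno recorded when the URB was unlinked */
};

/* Choose the status that would be passed to the completion handler. */
static int demo_giveback_status(const struct demo_urb *urb, int hcd_status)
{
	if (urb->unlinked)
		return urb->unlinked;	/* unlink status overrides the HCD's report */
	return hcd_status;
}
```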

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: David Brownell <david-b@pacbell.net>
CC: Olav Kongas <ok@artecdesign.ee>
CC: Yoshihiro Shimoda <shimoda.yoshihiro@renesas.com>
CC: Tony Olech <tony.olech@elandigitalsystems.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:19 -07:00
Inaky Perez-Gonzalez
da04b7a427 usb: introduce usb_device authorization bits
This just modifies 'struct usb_device' to contain the 'authorized'
bit. It also adds a 'wusb' bit. This is needed because nonauthorized
(and thus non-authenticated) wusb devices will fail certain kinds of
simple requests (such as string descriptors). By knowing the device is
WUSB, we just avoid them.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:04 -07:00
David Brownell
a4e3ef5597 USB: gadget: gadget_is_{dualspeed,otg} predicates and cleanup
This adds two small inlines to the gadget stack, which will
often evaluate to compile-time constants.  That can help
shrink object code and remove #ifdeffery.

 - gadget_is_dualspeed(), currently always a compile-time
   constant (depending on which controller is selected).

 - gadget_is_otg(), usually a compile time "false", but this
   is a runtime test if the platform enables OTG (since it's
   reasonable to populate boards with different USB sockets).
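
A minimal sketch of how such predicates might look, assuming build-time switches decide whether each test folds to a constant. `DEMO_DUALSPEED`, `DEMO_OTG`, and `demo_gadget` are made-up names standing in for the real configuration symbols and structure.

```c
/* Hypothetical, reduced stand-in for struct usb_gadget. */
struct demo_gadget {
	unsigned is_dualspeed:1;
	unsigned is_otg:1;
};

/* Compile-time constant either way: depends only on the build switch. */
static inline int demo_gadget_is_dualspeed(const struct demo_gadget *g)
{
	(void)g;
#ifdef DEMO_DUALSPEED
	return 1;
#else
	return 0;
#endif
}

/* Constant "false" unless OTG is built in; then a runtime test. */
static inline int demo_gadget_is_otg(const struct demo_gadget *g)
{
#ifdef DEMO_OTG
	return g->is_otg;
#else
	(void)g;
	return 0;
#endif
}
```

A caller can then write `if (demo_gadget_is_otg(g)) { ... }` and let the compiler discard the branch when the switch is off, instead of wrapping the code in #ifdef.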

It also updates two peripheral controller drivers to use these:

 - fsl_usb2_udc, mostly OTG-related bugfixes:  non-OTG devices
   must follow the rules about drawing VBUS power, and OTG ones
   need to reject invalid SET_FEATURE requests.

 - omap_udc, just scrubbing a bit of #ifdeffery.

And also gadgetfs, which lost some #ifdefs and moved to a more
standard handling of DEBUG and VERBOSE_DEBUG.

The main benefits come from patches which will follow.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:03 -07:00
Alan Stern
d617bc83ff USB: cleanup for previous patches
This patch (as951) cleans up a few loose ends from earlier patches.
Redundant checks for non-NULL urb->dev are removed, as are checks of
urb->dev->bus (which can never be NULL).  Conversely, a check for
non-NULL urb->ep is added to the unlink paths.

A homegrown round-down-to-power-of-2 loop is simplified by using the
ilog2 routine.  The comparison in usb_urb_dir_in() is made more
transparent.
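
The round-down simplification can be illustrated in userspace as follows. `demo_ilog2` is a hypothetical stand-in for the kernel's ilog2, and the loop shown is only a guess at what a homegrown version might look like, not the removed code. Both functions assume a nonzero argument.

```c
/* Stand-in for ilog2(): index of the highest set bit; n must be nonzero. */
static unsigned demo_ilog2(unsigned n)
{
	unsigned log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Homegrown form: double p while it still fits under n. */
static unsigned round_down_pow2_loop(unsigned n)
{
	unsigned p = 1;

	while (p <= n / 2)
		p *= 2;
	return p;
}

/* Simplified form: one shift by the log of n. */
static unsigned round_down_pow2_ilog2(unsigned n)
{
	return 1u << demo_ilog2(n);
}
```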

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:01 -07:00
Alan Stern
5e60a16139 USB: avoid using urb->pipe in usbcore
This patch (as946) eliminates many of the uses of urb->pipe in
usbcore.  Unfortunately there will have to be a significant API
change, affecting all USB drivers, before we can remove it entirely.
This patch contents itself with changing only the interface to
usb_buffer_map_sg() and friends: The pipe argument is replaced with a
direction flag.  That can be done easily because those routines get
used in only one place.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:00 -07:00
Alan Stern
fea3409112 USB: add direction bit to urb->transfer_flags
This patch (as945) adds a bit to urb->transfer_flags for recording the
direction of the URB.  The bit is set/cleared automatically in
usb_submit_urb() so drivers don't have to worry about it (although as
a result, it isn't valid until the URB has been submitted).  Inline
routines are added for easily checking an URB's direction.  They
replace calls to usb_pipein in the DMA-mapping parts of hcd.c.

For non-control endpoints, the direction is determined directly from
the endpoint descriptor.  However control endpoints are
bi-directional; for them the direction is determined from the
bRequestType byte and the wLength value in the setup packet.
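
The two rules above can be sketched as follows. This is a userspace illustration with hypothetical names (`demo_setup`, `demo_ctrl_is_in`, `demo_nonctrl_is_in`). It assumes the USB convention that bit 7 (0x80) marks the IN direction, and that a control transfer with no data stage counts as OUT.

```c
#include <stdint.h>

#define DEMO_DIR_IN 0x80	/* bit 7: device-to-host */

/* Hypothetical, reduced view of a control setup packet. */
struct demo_setup {
	uint8_t  bRequestType;
	uint16_t wLength;
};

/* Non-control endpoints: direction comes from the endpoint address. */
static int demo_nonctrl_is_in(uint8_t bEndpointAddress)
{
	return (bEndpointAddress & DEMO_DIR_IN) != 0;
}

/* Control endpoints: IN only if the request says IN and there is data to move. */
static int demo_ctrl_is_in(const struct demo_setup *s)
{
	return (s->bRequestType & DEMO_DIR_IN) && s->wLength != 0;
}
```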

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:00 -07:00
Alan Stern
bdd016ba64 USB: add ep->enable
This patch (as944) adds an explicit "enabled" field to the
usb_host_endpoint structure and uses it in place of the current
mechanism.  This is merely a time-space tradeoff; it makes checking
whether URBs may be submitted to an endpoint simpler.  The existing
mechanism is efficient when converting urb->pipe to an endpoint
pointer, but it's not so efficient when urb->ep is used instead.

As a side effect, the procedure for enabling an endpoint is now a
little more complicated.  The ad-hoc inline code in usb.c and hub.c
for enabling ep0 is now replaced with calls to usb_enable_endpoint,
which is no longer static.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:00 -07:00
Alan Stern
5b653c79c0 USB: add urb->ep
This patch (as943) prepares the way for eliminating urb->pipe by
introducing an endpoint pointer into struct urb.  For now urb->ep
is set by usb_submit_urb() from the pipe value; eventually drivers
will set it themselves and we will remove urb->pipe completely.

The patch also adds new inline routines to retrieve an endpoint
descriptor's number and transfer type, essentially as replacements for
usb_pipeendpoint and usb_pipetype.
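
Such accessors amount to masking two descriptor bytes. The sketch below uses hypothetical names (`demo_ep_desc`, `demo_endpoint_num`, `demo_endpoint_type`) and assumes the standard descriptor layout: the low four bits of bEndpointAddress hold the endpoint number, and the low two bits of bmAttributes hold the transfer type.

```c
#include <stdint.h>

#define DEMO_EP_NUMBER_MASK	0x0f	/* bEndpointAddress bits 3..0 */
#define DEMO_EP_XFERTYPE_MASK	0x03	/* bmAttributes bits 1..0 */

/* Hypothetical, reduced endpoint descriptor. */
struct demo_ep_desc {
	uint8_t bEndpointAddress;
	uint8_t bmAttributes;
};

/* Endpoint number, 0..15. */
static int demo_endpoint_num(const struct demo_ep_desc *epd)
{
	return epd->bEndpointAddress & DEMO_EP_NUMBER_MASK;
}

/* Transfer type: 0 control, 1 isochronous, 2 bulk, 3 interrupt. */
static int demo_endpoint_type(const struct demo_ep_desc *epd)
{
	return epd->bmAttributes & DEMO_EP_XFERTYPE_MASK;
}
```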

usb_submit_urb(), usb_hcd_submit_urb(), and usb_hcd_unlink_urb() are
converted to use the new field and new routines.  Other parts of
usbcore will be converted in later patches.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:55:00 -07:00
David Brownell
efc9052e01 USB: usb_gadget.h whitespace fixes
This just fixes some whitespace bugs in <linux/usb_gadget.h>,
mostly extraneous spaces where a single tab suffices.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:54:59 -07:00
Tejun Heo
6d66f5cd26 sysfs: add copyrights
Sysfs has gone through a considerable amount of reimplementation.  Add
copyrights.  Any objections?  :-)

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:51:12 -07:00
Greg Kroah-Hartman
f0e7e1bd77 kobject: update the copyrights
I've been hacking on these files for a while now, might as well make it
official...

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:51:12 -07:00
Greg Kroah-Hartman
6adf7554b9 kset: add some kerneldoc to help describe what these strange things are
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:51:12 -07:00
Greg Kroah-Hartman
e4bc16621d driver core: remove subsystem_init()
There is only one user of it, and it is only a wrapper for kset_init().

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:51:11 -07:00
Tejun Heo
a4e8b91254 sysfs: move sysfs file poll implementation to sysfs_open_dirent
Sysfs file poll implementation is scattered over sysfs and kobject.
Event numbering is done in sysfs_dirent but wait itself is done on
kobject.  This not only unnecessarily bloats both kobject and
sysfs_dirent but is also buggy - if a sysfs_dirent is removed while
there still are pollers, the association between the kobject and
sysfs_dirent breaks and kobject may be freed with the pollers still
sleeping on it.

This patch moves whole poll implementation into sysfs_open_dirent.
Each time a sysfs_open_dirent is created, event number restarts from 1
and pollers sleep on sysfs_open_dirent.  As event sequence number is
meaningless without any open file and pollers should have open file
and thus sysfs_open_dirent, this ephemeral event counting works and is
a saner implementation.

This patch fixes the dangling sleepers bug and reduces the sizes of
kobject and sysfs_dirent by one pointer.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:51:11 -07:00