asgardius/android_kernel_motorola_sm6225

Author	SHA1	Message	Date
Rusty Russell	cb38fa23c1	virtio: de-structify virtio_block status byte Ron Minnich points out that a struct containing a char is not always sizeof(char); simplest to remove the structure to avoid confusion. Cc: "ron minnich" <rminnich@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-02 21:50:45 +10:00
Christian Borntraeger	8147313287	virtio: export more headers to userspace Rusty, is there a reason why we don't export the virtio headers for 9p, balloon, console, pci, and virtio_ring? kvm uses make sync, but I think it is still useful to have these headers exported as they might be useful for other userspace tools. I don't export virtio.h, because it does not seem to have useful information for userspace and it requires scatterlist.h which is also not exported. See also my other mail about your "virtio: change config to guest endian." patch. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-02 21:50:44 +10:00
Thomas Gleixner	1adb0850a1	genirq: reenable a nobody cared disabled irq when a new driver arrives Uwe Kleine-Koenig has some strange hardware where one of the shared interrupts can be asserted during boot before the appropriate driver loads. Requesting the shared irq line from another driver results in a spurious interrupt storm which finally disables the interrupt line. I have seen similar behaviour on resume before (the hardware does not work anymore so I can not verify). Change the spurious disable logic to increment the disable depth and mark the interrupt with an extra flag which allows us to reenable the interrupt when a new driver arrives which requests the same irq line. In the worst case this will disable the irq again via the spurious trap, but there is a decent chance that the new driver is the one which can handle the already asserted interrupt and makes the box usable again. Eric Biederman said further: This case also happens on a regular basis in kdump kernels where we deliberately don't shut down the hardware before starting the new kernel. This patch should reduce the need for using irqpoll in that situation by a small amount. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-and-Acked-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>	2008-05-02 13:40:34 +02:00
Randy Dunlap	9941d945f4	[RAPIDIO] fix current kernel-doc notation Fix current (-git16) missing docbook/kernel-doc notation in RapidIO files. Warning(linux-2.6.25-git16//include/linux/rio.h:187): No description found for parameter 'sys_size' Warning(linux-2.6.25-git16//include/linux/rio.h:187): No description found for parameter 'phy_type' Warning(linux-2.6.25-git16//arch/powerpc/sysdev/fsl_rio.c:188): No description found for parameter 'mport' Warning(linux-2.6.25-git16//arch/powerpc/sysdev/fsl_rio.c:224): No description found for parameter 'mport' Warning(linux-2.6.25-git16//arch/powerpc/sysdev/fsl_rio.c:245): No description found for parameter 'mport' Warning(linux-2.6.25-git16//arch/powerpc/sysdev/fsl_rio.c:270): No description found for parameter 'mport' Warning(linux-2.6.25-git16//arch/powerpc/sysdev/fsl_rio.c:311): No description found for parameter 'mport' Warning(linux-2.6.25-git16//arch/powerpc/sysdev/fsl_rio.c:996): No description found for parameter 'dev' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>	2008-05-01 23:01:54 -05:00
Kirill A. Shutemov	2218228392	Make linux/wireless.h be able to compile Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-05-01 17:38:35 -04:00
Linus Torvalds	2c4aabcca8	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: [MTD][NOR] Add physical address to point() method [JFFS2] Track parent inode for directories (for NFS export) [JFFS2] Invert last argument of jffs2_gc_fetch_inode(), make it boolean. [JFFS2] Quiet lockdep false positive. [JFFS2] Clean up jffs2_alloc_inode() and jffs2_i_init_once() [MTD] Delete long-unused jedec.h header file. [MTD] [NAND] at91_nand: use at91_nand_{en,dis}able consistently.	2008-05-01 11:15:28 -07:00
Jared Hulbert	a98889f3d8	[MTD][NOR] Add physical address to point() method Adding the ability to get a physical address from point() in addition to virtual address. This physical address is required for XIP of userspace code from flash. Signed-off-by: Jared Hulbert <jaredeh@gmail.com> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Nicolas Pitre <nico@cam.org> Acked-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2008-05-01 18:59:11 +01:00
Al Viro	2030a42cec	[PATCH] sanitize anon_inode_getfd() a) none of the callers even looks at inode or file returned by anon_inode_getfd() b) any caller that would try to look at those would be racy, since by the time it returns we might have raced with close() from another thread and that file would be pining for fjords. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-05-01 13:08:50 -04:00
Al Viro	9f3acc3140	[PATCH] split linux/file.h Initial splitoff of the low-level stuff; taken to fdtable.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-05-01 13:08:16 -04:00
Al Viro	a2dcb44c3c	[PATCH] make osf_select() use core_sys_select() ... instead of open-coding it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-05-01 13:07:28 -04:00
Linus Torvalds	03fc922f40	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: add MODULE_STATE_GOING notifier call module: Enhance verify_export_symbols module: set unused_gpl_crcs instead of overwriting unused_crcs module: neaten __find_symbol, rename to find_symbol module: reduce module image and resident size module: make module_sect_attrs private to kernel/module.c	2008-05-01 08:26:56 -07:00
Jan Kara	c32e026efc	quota: add a convenience macro for filesystems Note that it cannot be an inline function because we don't have struct super_block prototype... Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:04:01 -07:00
Scott Kilau	99da9047e6	jsm: add new supported board to jsm serial driver Add new PCI Express Neo/JSM board to the supported list of drivers in the JSM driver. Signed-off-by: Scott Kilau <scottk@digi.com> Acked-by: Ananda V <avenkat@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:04:01 -07:00
Randy Dunlap	2850699c59	sysfs: sysfs_update_group stub for CONFIG_SYSFS=n scsi_transport_spi uses sysfs_update_group() when CONFIG_SYSFS=n, so provide a stub for it. next-20080423/drivers/scsi/scsi_transport_spi.c:1467: error: implicit declaration of function 'sysfs_update_group' make[3]: *** [drivers/scsi/scsi_transport_spi.o] Error 1 Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Greg KH <greg@kroah.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
David Brownell	34990cf702	Add a new sysfs_streq() string comparison function Add a new sysfs_streq() string comparison function, which ignores the trailing newlines found in sysfs inputs. By example: sysfs_streq("a", "b") ==> false sysfs_streq("a", "a") ==> true sysfs_streq("a", "a\n") ==> true sysfs_streq("a\n", "a") ==> true This is intended to simplify parsing of sysfs inputs, letting them avoid the need to manually strip off newlines from inputs. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
Roman Zippel	7dffa3c673	ntp: handle leap second via timer Remove the leap second handling from second_overflow(), which doesn't have to check for it every second anymore. With CONFIG_NO_HZ this also makes sure the leap second is handled close to the full second. Additionally this makes it possible to abort a leap second properly by resetting the STA_INS/STA_DEL status bits. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
Roman Zippel	8383c42399	ntp: remove current_tick_length() current_tick_length used to do a little more, but now it just returns tick_length, which we can also access directly at the few places, where it's needed. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
Roman Zippel	7fc5c78409	ntp: rename TICK_LENGTH_SHIFT to NTP_SCALE_SHIFT As TICK_LENGTH_SHIFT is used for more than just the tick length, the name isn't quite appropriate anymore, so this renames it to NTP_SCALE_SHIFT. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
Roman Zippel	153b5d054a	ntp: support for TAI This adds support for setting the TAI value (International Atomic Time). The value is reported back to userspace via timex (as we don't have a ntp_gettime() syscall). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
Roman Zippel	9f14f669d1	ntp: increase time_offset resolution time_offset is already a 64bit value but its resolution is barely used, so this makes better use of it by replacing SHIFT_UPDATE with TICK_LENGTH_SHIFT. Side note: the SHIFT_HZ in SHIFT_UPDATE was incorrect for CONFIG_NO_HZ and the primary reason for changing time_offset to 64bit to avoid the overflow. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	074b3b8794	ntp: increase time_freq resolution This changes time_freq to a 64bit value and makes it static (the only outside user had no real need to modify it). Intermediate values were already 64bit, so the change isn't that big, but it saves a little in shifts by replacing SHIFT_NSEC with TICK_LENGTH_SHIFT. PPM_SCALE is then used to convert between user space and kernel space representation. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	eea83d896e	ntp: NTP4 user space bits update This adds a few more things from the ntp nanokernel related to user space. It's now possible to select the resolution used for some values via STA_NANO and the kernel reports in which mode it works (pll/fll). If some values for adjtimex() are outside the acceptable range, they are now simply normalized instead of letting the syscall fail. I removed MOD_CLKA/MOD_CLKB as the mapping didn't really make any sense, the kernel doesn't support setting the clock. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	f8bd2258e2	remove div_long_long_rem x86 is the only arch right now which provides an optimized version of div_long_long_rem, and it has the downside that one has to be very careful that the divide doesn't overflow. The API is a little awkward, as the arguments for the unsigned divide are signed. The signed version also doesn't handle a negative divisor and produces worse code on 64bit archs. There is little incentive to keep this API alive, so this converts the few users to the new API. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	6f6d6a1a6a	rename div64_64 to div64_u64 Rename div64_64 to div64_u64 to make it consistent with the other divide functions, so it clearly includes the type of the divide. Move its definition to math64.h as currently no architecture overrides the generic implementation. They can still override it of course, but the duplicated declarations are avoided. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Avi Kivity <avi@qumranet.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	2418f4f28f	introduce explicit signed/unsigned 64bit divide The current do_div doesn't explicitly say that it's unsigned and the signed counterpart is missing, which is e.g. needed when dealing with time values. This introduces 64bit signed/unsigned divide functions which also attempt to clean up the somewhat awkward calling API, which often requires the use of temporary variables for the dividend. To avoid the need for temporary variables everywhere for the remainder, each divide variant also provides a version which doesn't return the remainder. Each architecture can now provide optimized versions of these functions, otherwise generic fallback implementations will be used. As an example I provided an alternative for the current x86 divide, which avoids the asm casts and using a union allows gcc to generate better code. It also avoids the upper divide in a few more cases, where the result is known (i.e. upper quotient is zero). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Rusty Russell	ea01e798e2	module: reduce module image and resident size Resulting reduction (x86-64, gcc 4.1.2) with my (special purpose, i.e. much reduced) configurations: - 16k kernel resident size - 180k module resident size - 10k module image size Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-01 21:14:59 +10:00
Rusty Russell	a58730c421	module: make module_sect_attrs private to kernel/module.c No-one else is using these afaics. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-01 21:14:59 +10:00
David S. Miller	c2a3b23345	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/linville/wireless-2.6	2008-05-01 02:06:32 -07:00
Luis Carlos Cobo	51ceddade0	mac80211: use 4-byte mesh sequence number This follows the new 802.11s/D2.0 draft. Signed-off-by: Luis Carlos Cobo <luisca@cozybit.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-04-30 20:34:26 -04:00
Greg Kroah-Hartman	c3bb7fadaf	klist: fix coding style errors in klist.h and klist.c Finally clean up the odd spacing in these files. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-04-30 16:52:58 -07:00
Kay Sievers	c3b19ff06e	driver core: remove no longer used "struct class_device" Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-04-30 16:52:49 -07:00
Kumar Gala	4f452e8aa4	devres: support addresses greater than an unsigned long via dev_ioremap Use a resource_size_t instead of unsigned long since some archs are capable of having ioremap deal with addresses greater than the size of an unsigned long. Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-04-30 16:52:48 -07:00
Randy Dunlap	1cbfb7a5ac	sysfs: sysfs_update_group stub for CONFIG_SYSFS=n scsi_transport_spi uses sysfs_update_group() when CONFIG_SYSFS=n, so provide a stub for it. next-20080423/drivers/scsi/scsi_transport_spi.c:1467: error: implicit declaration of function 'sysfs_update_group' make[3]: *** [drivers/scsi/scsi_transport_spi.o] Error 1 Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-04-30 16:52:47 -07:00
Tejun Heo	93dd40013f	klist: implement klist_add_{after|before}() Add klist_add_after() and klist_add_before() which puts a new node after and before an existing node, respectively. This is useful for callers which need to keep klist ordered. Note that synchronizing between simultaneous additions for ordering is the caller's responsibility. Signed-off-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-04-30 16:52:47 -07:00
Tejun Heo	1da43e4a9e	klist: implement KLIST_INIT() and DEFINE_KLIST() klist is missing static initializers and definition helper. Add them. Signed-off-by: Tejun Heo <htejun@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-04-30 16:52:47 -07:00
Linus Torvalds	08acd4f8af	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (179 commits) ACPI: Fix acpi_processor_idle and idle= boot parameters interaction acpi: fix section mismatch warning in pnpacpi intel_menlo: fix build warning ACPI: Cleanup: Remove unneeded, multiple local dummy variables ACPI: video - fix permissions on some proc entries ACPI: video - properly handle errors when registering proc elements ACPI: video - do not store invalid entries in attached_array list ACPI: re-name acpi_pm_ops to acpi_suspend_ops ACER_WMI/ASUS_LAPTOP: fix build bug thinkpad_acpi: fix possible NULL pointer dereference if kstrdup failed ACPI: check a return value correctly in acpi_power_get_context() #if 0 acpi/bay.c:eject_removable_drive() eeepc-laptop: add hwmon fan control eeepc-laptop: add backlight eeepc-laptop: add base driver ACPI: thinkpad-acpi: bump up version to 0.20 ACPI: thinkpad-acpi: fix selects in Kconfig ACPI: thinkpad-acpi: use a private workqueue ACPI: thinkpad-acpi: fluff really minor fix ACPI: thinkpad-acpi: use uppercase for "LED" on user documentation ... Fixed conflicts in drivers/acpi/video.c and drivers/misc/intel_menlow.c manually.	2008-04-30 11:52:52 -07:00
Len Brown	008238b54a	Merge branch 'pnp' into release	2008-04-30 13:59:05 -04:00
Ingo Molnar	ae3a0064e6	inlining: do not allow gcc below version 4 to optimize inlining fix the condition to match intention: always use the old inlining behavior on all gcc versions below 4. this should solve the UML build problem. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:42:49 -07:00
Robert P. J. Day	969a19f1c4	Drop the exporting of empty <linux/byteorder/generic.h> Fix up the contents of <linux/byteorder/> so that it doesn't export a content-free generic.h to user space. This involves: * Removing the __KERNEL__ tests from generic.h and dropping it from Kbuild. * Wrapping the inclusions of generic.h in both big_endian.h and little_endian.h in __KERNEL__ tests. * Shifting big_endian.h and little_endian.h from header-y to unifdef-y in Kbuild. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:54 -07:00
Robert P. J. Day	735643ee6c	Remove "#ifdef __KERNEL__" checks from unexported headers Remove the "#ifdef __KERNEL__" tests from unexported header files in linux/include whose entire contents are wrapped in that preprocessor test. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:54 -07:00
Thomas Gleixner	237fc6e7a3	add hrtimer specific debugobjects code hrtimers have now dynamic users in the network code. Put them under debugobjects surveillance as well. Add calls to the generic object debugging infrastructure and provide fixup functions which allow to keep the system alive when recoverable problems have been detected by the object debugging core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:53 -07:00
Thomas Gleixner	c6f3a97f86	debugobjects: add timer specific object debugging code Add calls to the generic object debugging infrastructure and provide fixup functions which allow to keep the system alive when recoverable problems have been detected by the object debugging core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:53 -07:00
Thomas Gleixner	3ac7fe5a4a	infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. 
The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers, a warning message and a stack trace are printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has its own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:53 -07:00
Thomas Gleixner	30327acf78	slab: add a flag to prevent debug_free checks on a kmem_cache This is a preparatory patch for the debugobjects infrastructure. The flag prevents debug_free checks on kmem_caches. This is necessary to avoid recursive calls into a debug mechanism which uses a kmem_cache itself. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:53 -07:00
Harvey Harrison	bdf4bbaaee	Add macros similar to min/max/min_t/max_t Also, change the variable names used in the min/max macros to avoid shadowed variable warnings when min/max min_t/max_t are nested. Small formatting changes to make all the macros have a similar form. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix v4l build] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Jeff Garzik <jeff@garzik.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Michael Buesch <mb@bu3sch.de> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:53 -07:00
Samuel Thibault	f7511d5f66	Basic braille screen reader support This adds a minimalistic braille screen reader support. This is meant to be used by blind people e.g. on boot failures or when / cannot be mounted etc and thus the userland screen readers can not work. [akpm@linux-foundation.org: fix exports] Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: Jiri Kosina <jikos@jikos.cz> Cc: Dmitry Torokhov <dtor@mail.ru> Acked-by: Alan Cox <alan@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:52 -07:00
Christoph Hellwig	86098fa011	reiserfs: use open_bdev_excl Use the proper helper to open a blockdevice by name for filesystem use, this makes sure it's properly claimed (also added for open-by-number) and gets rid of the struct file abuse. Tested by mounting a reiserfs filesystem with external journal. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jeff Mahoney <jeffm@suse.com> Acked-by: Edward Shishkin <edward.shishkin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:51 -07:00
Miklos Szeredi	fc3ba692a4	mm: Add NR_WRITEBACK_TEMP counter Fuse will use temporary buffers to write back dirty data from memory mappings (normal writes are done synchronously). This is needed, because there cannot be any guarantee about the time in which a write will complete. By using temporary buffers, from the MM's point of view the page is written back immediately. If the writeout was due to memory pressure, this effectively migrates data from a full zone to a less full zone. This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used as temporary buffers. [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:50 -07:00
Miklos Szeredi	dd5656e59c	mm: bdi: export bdi_writeout_inc() Fuse needs this for writable mmap support. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:50 -07:00
Miklos Szeredi	e4ad08fe64	mm: bdi: add separate writeback accounting capability Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is set, then don't update the per-bdi writeback stats from test_set_page_writeback() and test_clear_page_writeback(). Misc cleanups: - convert bdi_cap_writeback_dirty() and friends to static inline functions - create a flag that includes all three dirty/writeback related flags, since almost all users will want to have them together Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:50 -07:00
Miklos Szeredi	76f1418b48	mm: bdi: move statistics to debugfs Move BDI statistics to debugfs: /sys/kernel/debug/bdi/<bdi>/stats Use postcore_initcall() to initialize the sysfs class and debugfs, because debugfs is initialized in core_initcall(). Update descriptions in ABI documentation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:50 -07:00
Peter Zijlstra	a42dde0415	mm: bdi: allow setting a maximum for the bdi dirty limit Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of the global dirty threshold allocated to this bdi. [mszeredi@suse.cz] - fix parsing in max_ratio_store(). - export bdi_set_max_ratio() to modules - limit bdi_dirty with bdi->max_ratio - document new sysfs attribute Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:50 -07:00
Peter Zijlstra	189d3c4a94	mm: bdi: allow setting a minimum for the bdi dirty limit Under normal circumstances each device is given a part of the total write-back cache that relates to its current avg writeout speed in relation to the other devices. min_ratio - allows one to assign a minimum portion of the write-back cache to a particular device. This is useful in situations where you might want to provide a minimum QoS. (One request for this feature came from flash based storage people who wanted to avoid writing out at all costs - they of course needed some pdflush hacks as well) max_ratio - allows one to assign a maximum portion of the dirty limit to a particular device. This is useful in situations where you want to avoid one device taking all or most of the write-back cache. Eg. an NFS mount that is prone to get stuck, or a FUSE mount which you don't trust to play fair. Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of the global dirty threshold allocated to this bdi. [mszeredi@suse.cz] - fix parsing in min_ratio_store() - document new sysfs attribute Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:50 -07:00
Peter Zijlstra	cf0ca9fe5d	mm: bdi: export BDI attributes in sysfs Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object. This allows us to see and set the various BDI specific variables. In particular this properly exposes the read-ahead window for all relevant users and /sys/block/<block>/queue/read_ahead_kb should be deprecated. With patient help from Kay Sievers and Greg KH [mszeredi@suse.cz] - split off NFS and FUSE changes into separate patches - document new sysfs attributes under Documentation/ABI - do bdi_class_init as a core_initcall, otherwise the "default" BDI won't be initialized - remove bdi_init_fmt macro, it's not used very much [akpm@linux-foundation.org: fix ia64 warning] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Greg KH <greg@kroah.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:49 -07:00
Pavel Emelyanov	caafa43243	pidns: make pid->level and pid_ns->level unsigned These values represent the nesting level of a namespace and pids living in it, and it's always non-negative. Turning this from int to unsigned int saves some space in pid.c (11 bytes on x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit. E.g. on ia64 this removes the sign extension calls, which the compiler adds to optimize access to pid->numbers[ns->level]. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:49 -07:00
Oleg Nesterov	24336eaeec	pids: introduce change_pid() helper Based on Eric W. Biederman's idea. Without tasklist_lock held task_session()/task_pgrp() can return NULL if the caller races with setpgrp()/setsid() which does detach_pid() + attach_pid(). This can happen even if task == current. Introduce the new helper, change_pid(), which should be used instead. This way the caller always sees the special pid != NULL, either old or new. Also change the prototype of attach_pid(), it always returns 0 and nobody checks the returned value. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:48 -07:00
Pavel Emelyanov	5cd204550b	Deprecate find_task_by_pid() There are some places that are known to operate on tasks' global pids only: * the rest_init() call (called on boot) * the kgdb's getthread * the create_kthread() (since the kthread is run in init ns) So use the find_task_by_pid_ns(..., &init_pid_ns) there and schedule the find_task_by_pid for removal. [sukadev@us.ibm.com: Fix warning in kernel/pid.c] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:48 -07:00
Sukadev Bhattiprolu	718a916338	devpts: factor out PTY index allocation Factor out the code used to allocate/free a pts index into new interfaces, devpts_new_index() and devpts_kill_index(). This localizes the external data structures used in managing the pts indices. [akpm@linux-foundation.org: undo accidental mutex2sem conversion] Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:48 -07:00
Alan Cox	39c2e60f8c	tty: add throttle/unthrottle helpers Something Arjan suggested which allows us to clean up the code nicely Signed-off-by: Alan Cox <alan@redhat.com> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:47 -07:00
Alan Cox	f34d7a5b70	tty: The big operations rework - Operations are now a shared const function block as with most other Linux objects - Introduce wrappers for some optional functions to get consistent behaviour - Wrap put_char which used to be patched by the tty layer - Document which functions are needed/optional - Make put_char report success/fail - Cache the driver->ops pointer in the tty as tty->ops - Remove various surplus lock calls we no longer need - Remove proc_write method as noted by Alexey Dobriyan - Introduce some missing sanity checks where certain driver/ldisc combinations would oops as they didn't check needed methods were present [akpm@linux-foundation.org: fix fs/compat_ioctl.c build] [akpm@linux-foundation.org: fix isicom] [akpm@linux-foundation.org: fix arch/ia64/hp/sim/simserial.c build] [akpm@linux-foundation.org: fix kgdb] Signed-off-by: Alan Cox <alan@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Cc: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:47 -07:00
Alan Cox	76b25a5509	char: switch gs, cyclades and esp to return int for put_char Signed-off-by: Alan Cox <alan@redhat.com> Cc: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:45 -07:00
Alan Cox	5d0fdf1e01	tty_io: fix remaining pid struct locking This fixes the last couple of pid struct locking failures I know about. [oleg@tv-sign.ru: clean up do_task_stat()] Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:40 -07:00
Alan Cox	47f86834bb	redo locking of tty->pgrp Historically tty->pgrp and friends were pid_t and the code "knew" they were safe. The change to pid structs opened up a few races and the removal of the BKL in places made them quite hittable. We put tty->pgrp under the ctrl_lock for the tty. Signed-off-by: Alan Cox <alan@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:40 -07:00
Alan Cox	04f378b198	tty: BKL pushdown - Push the BKL down into the line disciplines - Switch the tty layer to unlocked_ioctl - Introduce a new ctrl_lock spin lock for the control bits - Eliminate much of the lock_kernel use in n_tty - Prepare to (but don't yet) call the drivers with the lock dropped on the paths that historically held the lock BKL now primarily protects open/close/ldisc change in the tty layer [jirislaby@gmail.com: a couple of fixes] Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:40 -07:00
Oleg Nesterov	53b6f9fbd3	ptrace: introduce ptrace_reparented() helper Add another trivial helper for the sake of grep. It also auto-documents the fact that ->parent != real_parent implies ->ptrace. No functional changes. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:38 -07:00
Roland McGrath	f3de272b82	signals: use HAVE_SET_RESTORE_SIGMASK Change all the #ifdef TIF_RESTORE_SIGMASK conditionals in non-arch code to #ifdef HAVE_SET_RESTORE_SIGMASK. If arch code defines it first, the generic set_restore_sigmask() using TIF_RESTORE_SIGMASK is not defined. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:37 -07:00
Roland McGrath	7648d961fc	signals: set_restore_sigmask TIF_SIGPENDING Set TIF_SIGPENDING in set_restore_sigmask. This lets arch code take TIF_RESTORE_SIGMASK out of the set of bits that will be noticed on return to user mode. On some machines those bits are scarce, and we can free this unneeded one up for other uses. It is probably the case that TIF_SIGPENDING is always set anyway everywhere set_restore_sigmask() is used. But this is some cheap paranoia in case there is an arcane case where it might not be. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:37 -07:00
Roland McGrath	4e4c22c711	signals: add set_restore_sigmask This adds the set_restore_sigmask() inline in <linux/thread_info.h> and replaces every set_thread_flag(TIF_RESTORE_SIGMASK) with a call to it. No change, but abstracts the details of the flag protocol from all the calls. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:37 -07:00
Oleg Nesterov	fae5fa44f1	signals: fix /sbin/init protection from unwanted signals The global init has a lot of long standing problems with the unhandled fatal signals. - The "is_global_init(current)" check in get_signal_to_deliver() protects only the main thread. Sub-thread can dequeue the fatal signal and shutdown the whole thread group except the main thread. If it dequeues SIGSTOP /sbin/init will be stopped, this is not right either. Note that we can't use is_global_init(->group_leader), this breaks exec and this can't solve other problems we have. - Even if afterwards ignored, the fatal signals sets SIGNAL_GROUP_EXIT on delivery. This breaks exec, has other bad implications, and this is just wrong. Introduce the new SIGNAL_UNKILLABLE flag to fix these problems. It also helps to solve some other problems addressed by the subsequent patches. Currently we use this flag for the global init only, but it could also be used by kthreads and (perhaps) by the sub-namespace inits. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:37 -07:00
Oleg Nesterov	ac5c215383	signals: join send_sigqueue() with send_group_sigqueue() We export send_sigqueue() and send_group_sigqueue() for the only user, posix_timer_event(). This is a bit silly, because both are just trivial helpers on top of do_send_sigqueue() and because the we pass the unused .si_signo parameter. Kill them both, rename do_send_sigqueue() to send_sigqueue(), and export it. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Oleg Nesterov	6ca25b5513	kill_pid_info: don't take now unneeded tasklist_lock Previously handle_stop_signal(SIGCONT) could drop ->siglock. That is why kill_pid_info(SIGCONT) takes tasklist_lock to make sure the target task can't go away after unlock. Not needed now. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Oleg Nesterov	e442055193	signals: re-assign CLD_CONTINUED notification from the sender to reciever Based on discussion with Jiri and Roland. In short: currently handle_stop_signal(SIGCONT, p) sends the notification to p->parent, with this patch p itself notifies its parent when it becomes running. handle_stop_signal(SIGCONT) has to drop ->siglock temporarily in order to notify the parent with do_notify_parent_cldstop(). This leads to multiple problems: - as Jiri Kosina pointed out, the stopped task can resume without actually seeing SIGCONT which may have a handler. - we race with another sig_kernel_stop() signal which may come in that window. - we race with sig_fatal() signals which may set SIGNAL_GROUP_EXIT in that window. - we can't avoid taking tasklist_lock() while sending SIGCONT. With this patch handle_stop_signal() just sets the new SIGNAL_CLD_CONTINUED flag in p->signal->flags and returns. The notification is sent by the first task which returns from finish_stop() (there should be at least one) or any other signalled thread from get_signal_to_deliver(). This is a user-visible change. Say, currently kill(SIGCONT, stopped_child) can't return without seeing SIGCHLD, with this patch SIGCHLD can be delayed unpredictably. Another difference is that if the child is ptraced by another process, CLD_CONTINUED may be delivered to ->real_parent after ptrace_detach() while currently it always goes to the tracer which doesn't actually need this notification. Hopefully not a problem. The patch asks for the further obvious cleanups, I'll send them separately. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Dan Williams	6bfe0b4990	md: support blocking writes to an array on device failure Allows a userspace metadata handler to take action upon detecting a device failure. Based on an original patch by Neil Brown. Changes: -added blocked_wait waitqueue to rdev -don't qualify Blocked with Faulty always let userspace block writes -added md_wait_for_blocked_rdev to wait for the block device to be clear, if userspace misses the notification another one is sent every 5 seconds -set MD_RECOVERY_NEEDED after clearing "blocked" -kill DoBlock flag, just test mddev->external Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Bryan Wu	2f3517418d	Blackfin serial driver: this driver enable SPORTs on Blackfin emulate UART Signed-off-by: Bryan Wu <bryan.wu@analog.com> Cc: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:30 -07:00
Yinghai Lu	70b9f7dc14	x86/pci: remove flag in pci_cfg_space_size_ext so let pci_cfg_space_size call it directly without flag. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2008-04-29 15:34:05 -07:00
Christoph Hellwig	3dcf54515a	ext4: move headers out of include/linux Move ext4 headers out of include/linux. This is just the trivial move, there's some more thing that could be done later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-04-29 18:13:32 -04:00
Sam Ravnborg	98db6f193c	x86: fix section mismatch in pci_scan_bus Fix following section mismatch warning: WARNING: vmlinux.o(.text+0x275616): Section mismatch in reference from the function pci_scan_bus() to the function .devinit.text:pci_scan_bus_parented() The warning was seen with a CONFIG_DEBUG_SECTION_MISMATCH=y build. The inline function pci_scan_bus refer to functions annotated __devinit - so annotate it __devinit too. This revealed a few x86 specific functions that were only used from __init or __devinit context. So annotate these __devinit and the warning was killed. The added include in pci.h was not strictly required but added to avoid being dependent on indirect includes. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Jesse Barnes <jbarnes@hobbes.lan>	2008-04-29 13:41:59 -07:00
Jens Axboe	7663c1e279	Improve queue_is_locked() spin_is_locked() doesn't work on UP without spinlock debugging. Make it safer and just return 1 on UP, so we don't get false positives. The plan is to kill this debug function during the -rc cycle. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 12:36:54 -07:00
Linus Torvalds	9781db7b34	Merge branch 'audit.b50' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b50' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: [PATCH] new predicate - AUDIT_FILETYPE [patch 2/2] Use find_task_by_vpid in audit code [patch 1/2] audit: let userspace fully control TTY input auditing [PATCH 2/2] audit: fix sparse shadowed variable warnings [PATCH 1/2] audit: move extern declarations to audit.h Audit: MAINTAINERS update Audit: increase the maximum length of the key field Audit: standardize string audit interfaces Audit: stop deadlock from signals under load Audit: save audit_backlog_limit audit messages in case auditd comes back Audit: collect sessionid in netlink messages Audit: end printk with newline	2008-04-29 11:41:22 -07:00
Linus Torvalds	a217656cb2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (21 commits) pciehp: fix error message about getting hotplug control pci/irq: let pci_device_shutdown to call pci_msi_shutdown v2 pci/irq: restore mask_bits in msi shutdown -v3 doc: replace yet another dev with pdev for consistency in DMA-mapping.txt PCI: don't expose struct pci_vpd to userspace doc: fix an incorrect suggestion to pass NULL for PCI like buses Consistently use pdev as the variable of type struct pci_dev *. pciehp: Fix command write shpchp: fix slot name make pciehp_acpi_get_hp_hw_control_from_firmware() pciehp: Clean up pcie_init() pciehp: Mask hotplug interrupt at controller release pciehp: Remove useless hotplug interrupt enabling pciehp: Fix wrong slot capability check pciehp: Fix wrong slot control register access pciehp: Add missing memory barrier pciehp: Fix interrupt event handlig pciehp: fix slot name Update MAINTAINERS with location of PCI tree PCI: Add Intel SCH PCI IDs ...	2008-04-29 10:17:59 -07:00
Linus Torvalds	8f45c1a58a	block: fix queue locking verification The new queue_flag_set/clear() functions verify that the queue is locked, but in doing so they will actually instead oops if the queue lock hasn't been initialized at all. So fix the lock debug test to consider the "no lock" case to be unlocked. This way you get a nice WARN_ON_ONCE() instead of a fatal oops. Bug introduced by commit `75ad23bc0f` ("block: make queue flags non-atomic"). Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 10:16:38 -07:00
Yinghai Lu	d52877c7b1	pci/irq: let pci_device_shutdown to call pci_msi_shutdown v2 [PATCH 2/2] pci/irq: let pci_device_shutdown to call pci_msi_shutdown v2 this change | commit `23a274c8a5` | Author: Prakash, Sathya <sathya.prakash@lsi.com> | Date: Fri Mar 7 15:53:21 2008 +0530 | | [SCSI] mpt fusion: Enable MSI by default for SAS controllers | | This patch modifies the driver to enable MSI by default for all SAS chips. | | Signed-off-by: Sathya Prakash <sathya.prakash@lsi.com> | Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> | Causes the kexec of a RHEL 5.1 kernel to fail. Root cause: the RHEL 5.1 kernel still uses INTx emulation, and mptscsih_shutdown doesn't call pci_disable_msi to reenable INTx on the kexec path. So call pci_msi_shutdown in the shutdown path to do the same thing to msix Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Jesse Barnes <jbarnes@hobbes.lan>	2008-04-29 09:12:51 -07:00
Yinghai Lu	8e149e09f9	pci/irq: restore mask_bits in msi shutdown -v3 [PATCH 1/2] pci/irq: restore mask_bits in msi shutdown -v3 Yinghai found that kexec'ing a RHEL 5.1 kernel with 2.6.25-rc3+ kernels prevents his NIC from working. He bisected to | commit `89d694b9db` | Author: Thomas Gleixner <tglx@linutronix.de> | Date: Mon Feb 18 18:25:17 2008 +0100 | | genirq: do not leave interupts enabled on free_irq | | The default_disable() function was changed in commit: | | `76d2160147` | genirq: do not mask interrupts by default | For MSI, default_shutdown will call mask_bit for msi device. All mask bits will be left disabled after free_irq. Then in the kexec case, the next kernel can only use msi_enable bit, so all device's MSI can not be used. So let's restore the mask bit to its pci reset defined value (enabled) when we disable the kernels use of msi to be a little friendlier to kexec'd kernels. Extend msi_set_mask_bit to msi_set_mask_bits to take mask, so we can fully restore that to 0x00 instead of 0xfe. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Jesse Barnes <jbarnes@hobbes.lan>	2008-04-29 09:11:12 -07:00
Linus Torvalds	5f78e4d339	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-pci * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-pci: x86: add pci=check_enable_amd_mmconf and dmi check x86: work around io allocation overlap of HT links acpi: get boot_cpu_id as early for k8_scan_nodes x86_64: don't need set default res if only have one root bus x86: double check the multi root bus with fam10h mmconf x86: multi pci root bus with different io resource range, on 64-bit x86: use bus conf in NB conf fun1 to get bus range on, on 64-bit x86: get mp_bus_to_node early x86 pci: remove checking type for mmconfig probe x86: remove unneeded check in mmconf reject driver core: try parent numa_node at first before using default x86: seperate mmconf for fam10h out from setup_64.c x86: if acpi=off, force setting the mmconf for fam10h x86_64: check MSR to get MMCONFIG for AMD Family 10h x86_64: check and enable MMCONFIG for AMD Family 10h x86_64: set cfg_size for AMD Family 10h in case MMCONFIG x86: mmconf enable mcfg early x86: clear pci_mmcfg_virt when mmcfg get rejected x86: validate against acpi motherboard resources Fixed up fairly trivial conflicts in arch/x86/pci/{init.c,pci.h} due to OLPC support manually.	2008-04-29 08:26:51 -07:00
Linus Torvalds	867a89e0b7	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: [RAPIDIO] Change RapidIO doorbell source and target ID field to 16-bit [RAPIDIO] Add RapidIO connection info print out and re-training for broken connections [RAPIDIO] Add serial RapidIO controller support, which includes MPC8548, MPC8641 [RAPIDIO] Add RapidIO node probing into MPC86xx_HPCN board id table [RAPIDIO] Add RapidIO node into MPC8641HPCN dts file [RAPIDIO] Auto-probe the RapidIO system size [RAPIDIO] Add OF-tree support to RapidIO controller driver [RAPIDIO] Add RapidIO multi mport support [RAPIDIO] Move include/asm-ppc/rio.h to asm-powerpc [RAPIDIO] Add RapidIO option to kernel configuration [RAPIDIO] Change RIO function mpc85xx_ to fsl_ [POWERPC] Provide walk_memory_resource() for powerpc [POWERPC] Update lmb data structures for hotplug memory add/remove [POWERPC] Hotplug memory remove notifications for powerpc [POWERPC] windfarm: Add PowerMac 12,1 support [POWERPC] Fix building of pmac32 when CONFIG_NVRAM=m [POWERPC] Add IRQSTACKS support on ppc32 [POWERPC] Use __always_inline for xchg* and cmpxchg* [POWERPC] Add fast little-endian switch system call	2008-04-29 08:19:14 -07:00
Linus Torvalds	44473d9913	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq * git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq: [CPUFREQ] state info wrong after resume [CPUFREQ] allow use of the powersave governor as the default one [CPUFREQ] document the currently undocumented parts of the sysfs interface [CPUFREQ] expose cpufreq coordination requirements regardless of coordination mechanism	2008-04-29 08:18:49 -07:00
Linus Torvalds	bd5d435a96	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: Skip I/O merges when disabled block: add large command support block: replace sizeof(rq->cmd) with BLK_MAX_CDB ide: use blk_rq_init() to initialize the request block: use blk_rq_init() to initialize the request block: rename and export rq_init() block: no need to initialize rq->cmd with blk_get_request block: no need to initialize rq->cmd in prepare_flush_fn hook block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline block/elevator.c:elv_rq_merge_ok() mustn't be inline block: make queue flags non-atomic block: add dma alignment and padding support to blk_rq_map_kern unexport blk_max_pfn ps3disk: Remove superfluous cast block: make rq_init() do a full memset() relay: fix splice problem	2008-04-29 08:18:03 -07:00
Thomas Gleixner	fee4b19fb3	bitops: remove "optimizations" The mapsize optimizations which were moved from x86 to the generic code in commit `64970b68d2` increased the binary size on non x86 architectures. Looking into the real effects of the "optimizations" it turned out that they are not used in find_next_bit() and find_next_zero_bit(). The ones in find_first_bit() and find_first_zero_bit() are used in a couple of places but none of them is a real hot path. Remove the "optimizations" all together and call the library functions unconditionally. Boot-tested on x86 and compile tested on every cross compiler I have. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:11:16 -07:00
Christoph Lameter	37487a5652	Add kbuild.h that contains common definitions for kbuild users The same definitions are used for the bounds logic and the asm-offsets.h generation by kbuild. Put them into include/linux/kbuild.h file. Also add a new feature COMMENT("text") which can be used to insert lines of comments into asm-offsets.h and bounds.h. Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Jay Estabrook <jay.estabrook@hp.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Richard Henderson <rth@twiddle.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Chris Zankel <chris@zankel.net> Cc: David S. Miller <davem@davemloft.net> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Bryan Wu <bryan.wu@analog.com> Cc: Mike Frysinger <vapier.adi@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Miles Bader <miles@gnu.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:29 -07:00
Harvey Harrison	064106a91b	kernel: add common infrastructure for unaligned access Create a linux/unaligned directory similar in spirit to the linux/byteorder folder to hold generic implementations collected from various arches. Currently there are five implementations: 1) packed_struct.h: C-struct based, from asm-generic/unaligned.h 2) le_byteshift.h: Open coded byte-swapping, heavily based on asm-arm 3) be_byteshift.h: Open coded byte-swapping, heavily based on asm-arm 4) memmove.h: taken from multiple implementations in tree 5) access_ok.h: taken from x86 and others, unaligned access is ok. All of the new implementations check for sizes not equal to 1,2,4,8 and will fail to link. API additions: get_unaligned_{le16|le32|le64|be16|be32|be64}(p) which is meant to replace code of the form: le16_to_cpu(get_unaligned((__le16 *)p)); put_unaligned_{le16|le32|le64|be16|be32|be64}(val, pointer) which is meant to replace code of the form: put_unaligned(cpu_to_le16(val), (__le16 *)p); The headers that arches should include from their asm/unaligned.h: access_ok.h : Wrappers of the byteswapping functions in asm/byteorder Choose a particular implementation for little-endian access: le_byteshift.h le_memmove.h (arch must be LE) le_struct.h (arch must be LE) Choose a particular implementation for big-endian access: be_byteshift.h be_memmove.h (arch must be BE) be_struct.h (arch must be BE) After including as needed from the above, include unaligned/generic.h and define your arch's get/put_unaligned as (for LE): Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:27 -07:00
Robert P. J. Day	dddfbaf8f8	sysv fs: remove superfluous check for __GNUC__ compiler Since <linux/sysv_fs.h> isn't exported to userspace, there is little point checking that this is a GNU-compatible compiler. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:27 -07:00
Hitoshi Mitake	c3c52bce69	edac: fix module initialization on several modules 2nd time I implemented opstate_init() as an inline function in linux/edac.h. Added calls to opstate_init() to: i82443bxgx_edac.c i82860_edac.c i82875p_edac.c i82975x_edac.c I wrote a fixed version of edac-fix-module-initialization-on-several-modules.patch and tested building 2.6.25-rc7 with it applied. The build succeeded. I think the patch is now correct. Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com> Signed-off-by: Doug Thompson <dougthompson@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:26 -07:00
Akinobu Mita	199f0ca514	idr: create idr_layer_cache at boot time Avoid a possible kmem_cache_create() failure by creating idr_layer_cache unconditionally at boot time rather than creating it on-demand when idr_init() is called the first time. This change also enables us to eliminate the check every time idr_init() is called. [akpm@linux-foundation.org: rename init_id_cache() to idr_init_cache()] [akpm@linux-foundation.org: fix alpha build] Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:25 -07:00
Robert P. J. Day	098ef1c0ea	nbd: delete superfluous test for __GNUC__ Since <linux/compiler.h> already tests for __GNUC__, there's no point in nbd.h repeating that test. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Cc: Paul Clements <paul.clements@steeleye.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:24 -07:00
Laurent Vivier	48cf6061b3	NBD: allow nbd to be used locally This patch allows Network Block Device to be mounted locally (nbd-client to nbd-server over 127.0.0.1). It creates a kthread to avoid the deadlock described in NBD tools documentation. So, if nbd-client hangs waiting for pages, the kblockd thread can continue its work and free pages. I have tested the patch to verify that it avoids the hang that always occurs when writing to a localhost nbd connection. I have also tested to verify that no performance degradation results from the additional thread and queue. Patch originally from Laurent Vivier. Signed-off-by: Paul Clements <paul.clements@steeleye.com> Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:23 -07:00
Pavel Emelyanov	d7321cd624	sysctl: add the ->permissions callback on the ctl_table_root When reading from/writing to some table, a root, which this table came from, may affect this table's permissions, depending on who is working with the table. The core hunk is at the bottom of this patch. All the rest is just pushing the ctl_table_root argument up to the sysctl_perm() function. This will be mostly (only?) used in the net sysctls. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Denis V. Lunev <den@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:23 -07:00
Pavel Emelyanov	2c4c7155f2	sysctl: clean from unneeded extern and forward declarations The do_sysctl_strategy isn't used outside kernel/sysctl.c, so this can be static and without a prototype in header. Besides, move this one and parse_table() above their callers and drop the forward declarations of the latter call. One more "besides" - fix two checkpatch warnings: space before a ( and an extra space at the end of a line. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Denis V. Lunev <den@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:23 -07:00
Adrian Bunk	1a46674b99	include/linux/sysctl.h: remove empty #else Remove an empty #else. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:23 -07:00
Denis V. Lunev	59b7435149	proc: introduce proc_create_data to setup de->data This set of patches fixes a proc ->open'less usage due to ->proc_fops flip in the most part of the kernel code. The original OOPS is described in the commit `2d3a4e3666`: Typical PDE creation code looks like: pde = create_proc_entry("foo", 0, NULL); if (pde) pde->proc_fops = &foo_proc_fops; Notice that PDE is first created, only then ->proc_fops is set up to final value. This is a problem because right after creation a) PDE is fully visible in /proc , and b) ->proc_fops are proc_file_operations which do not have ->open callback. So, it's possible to ->read without ->open (see one class of oopses below). The fix is new API called proc_create() which makes sure ->proc_fops are set up before gluing PDE to main tree. Typical new code looks like: pde = proc_create("foo", 0, NULL, &foo_proc_fops); if (!pde) return -ENOMEM; Fix most networking users for a start. In the long run, create_proc_entry() for regular files will go. In addition to this, proc_create_data is introduced to fix reading from proc without PDE->data. The race is basically the same as above. create_proc_entries is replaced in the entire kernel code as new method is also simply better. This patch: The problem is the same as for de->proc_fops. Right now PDE becomes visible without data set. So, the entry could be looked up without data. This, in most cases, will simply OOPS. proc_create_data call is created to address this issue. proc_create now becomes a wrapper around it. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Chris Mason <chris.mason@oracle.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jaroslav Kysela <perex@suse.cz> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Karsten Keil <kkeil@suse.de> Cc: Kyle McMartin <kyle@parisc-linux.org> Cc: Len Brown <lenb@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Mikael Starvik <starvik@axis.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Neil Brown <neilb@suse.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Osterlund <petero2@telia.com> Cc: Pierre Peiffer <peifferp@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Takashi Iwai <tiwai@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:20 -07:00
Alexey Dobriyan	8731f14d37	proc: remove ->get_info infrastructure Now that the last dozen or so users of ->get_info were removed, ditch it too. Everyone sane should have switched to seq_file interface long ago. P.S.: Co-existing 3 interfaces (->get_info/->read_proc/->proc_fops) for proc is long-standing crap, BTW, thus a) put ->read_proc/->write_proc/read_proc_entry() users on death row, b) new such users should be rejected, c) everyone is encouraged to convert his favourite ->read_proc user or I'll do it, lazy bastards. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:19 -07:00
Alexey Dobriyan	c74c120a21	proc: remove proc_root from drivers Remove proc_root export. Creation and removal works well if parent PDE is supplied as NULL -- it worked always that way. So, one useless export removed and consistency added, some drivers created PDEs with &proc_root as parent but removed them as NULL and so on. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:18 -07:00
Alexey Dobriyan	928b4d8c89	proc: remove proc_root_driver Use creation by full path: "driver/foo". Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:18 -07:00
Alexey Dobriyan	36a5aeb878	proc: remove proc_root_fs Use creation by full path instead: "fs/foo". Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:18 -07:00
Alexey Dobriyan	9c37066d88	proc: remove proc_bus Remove proc_bus export and variable itself. Using pathnames works fine and is slightly more understandable and greppable. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:18 -07:00
Matt Helsley	925d1c401f	procfs task exe symlink The kernel implements readlink of /proc/pid/exe by getting the file from the first executable VMA. Then the path to the file is reconstructed and reported as the result. Because of the VMA walk the code is slightly different on nommu systems. This patch avoids separate /proc/pid/exe code on nommu systems. Instead of walking the VMAs to find the first executable file-backed VMA we store a reference to the exec'd file in the mm_struct. That reference would prevent the filesystem holding the executable file from being unmounted even after unmapping the VMAs. So we track the number of VM_EXECUTABLE VMAs and drop the new reference when the last one is unmapped. This avoids pinning the mounted filesystem. [akpm@linux-foundation.org: improve comments] [yamamoto@valinux.co.jp: fix dup_mmap] Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:17 -07:00
David Howells	7249db2c28	keys: make key_serial() a function if CONFIG_KEYS=y Make key_serial() an inline function rather than a macro if CONFIG_KEYS=y. This prevents double evaluation of the key pointer and also provides better type checking. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:17 -07:00
David Howells	0b77f5bfb4	keys: make the keyring quotas controllable through /proc/sys Make the keyring quotas controllable through /proc/sys files: (*) /proc/sys/kernel/keys/root_maxkeys /proc/sys/kernel/keys/root_maxbytes Maximum number of keys that root may have and the maximum total number of bytes of data that root may have stored in those keys. (*) /proc/sys/kernel/keys/maxkeys /proc/sys/kernel/keys/maxbytes Maximum number of keys that each non-root user may have and the maximum total number of bytes of data that each of those users may have stored in their keys. Also increase the quotas as a number of people have been complaining that it's not big enough. I'm not sure that it's big enough now either, but on the other hand, it can now be set in /etc/sysctl.conf. Signed-off-by: David Howells <dhowells@redhat.com> Cc: <kwc@citi.umich.edu> Cc: <arunsr@cse.iitk.ac.in> Cc: <dwalsh@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:17 -07:00
David Howells	69664cf16a	keys: don't generate user and user session keyrings unless they're accessed Don't generate the per-UID user and user session keyrings unless they're explicitly accessed. This solves a problem during a login process whereby set*uid() is called before the SELinux PAM module, resulting in the per-UID keyrings having the wrong security labels. This also cures the problem of multiple per-UID keyrings sometimes appearing due to PAM modules (including pam_keyinit) setuiding and causing user_structs to come into and go out of existence whilst the session keyring pins the user keyring. This is achieved by first searching for extant per-UID keyrings before inventing new ones. The serial bound argument is also dropped from find_keyring_by_name() as it's not currently made use of (setting it to 0 disables the feature). Signed-off-by: David Howells <dhowells@redhat.com> Cc: <kwc@citi.umich.edu> Cc: <arunsr@cse.iitk.ac.in> Cc: <dwalsh@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:17 -07:00
Arun Raghavan	6b79ccb514	keys: allow clients to set key perms in key_create_or_update() The key_create_or_update() function provided by the keyring code has a default set of permissions that are always applied to the key when created. This might not be desirable to all clients. Here's a patch that adds a "perm" parameter to the function to address this, which can be set to KEY_PERM_UNDEF to revert to the current behaviour. Signed-off-by: Arun Raghavan <arunsr@cse.iitk.ac.in> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Satyam Sharma <ssatyam@cse.iitk.ac.in> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:16 -07:00
David Howells	70a5bb72b5	keys: add keyctl function to get a security label Add a keyctl() function to get the security label of a key. The following is added to Documentation/keys.txt: (*) Get the LSM security context attached to a key. long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer, size_t buflen) This function returns a string that represents the LSM security context attached to a key in the buffer provided. Unless there's an error, it always returns the amount of data it could produce, even if that's too big for the buffer, but it won't copy more than requested to userspace. If the buffer pointer is NULL then no copy will take place. A NUL character is included at the end of the string if the buffer is sufficiently big. This is included in the returned count. If no LSM is in force then an empty string will be returned. A process must have view permission on the key for this function to be successful. [akpm@linux-foundation.org: declare keyctl_get_security()] Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: Paul Moore <paul.moore@hp.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Kevin Coffman <kwc@citi.umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:16 -07:00
David Howells	4a38e122e2	keys: allow the callout data to be passed as a blob rather than a string Allow the callout data to be passed as a blob rather than a string for internal kernel services that call any request_key_*() interface other than request_key(). request_key() itself still takes a NUL-terminated string. The functions that change are: request_key_with_auxdata() request_key_async() request_key_async_with_auxdata() Signed-off-by: David Howells <dhowells@redhat.com> Cc: Paul Moore <paul.moore@hp.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Kevin Coffman <kwc@citi.umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:16 -07:00
Cyrill Gorcunov	eb6900fbfa	ELF: Use EI_NIDENT instead of numeric value Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:16 -07:00
Robert P. J. Day	66ec2d7786	ipmi: make comment match actual preprocessor check Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:15 -07:00
Alexey Dobriyan	fa68be0def	ipmi: remove ->write_proc code IPMI code theoretically allows ->write_proc users, but nobody uses this thus far. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Corey Minyard <minyard@acm.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:15 -07:00
Corey Minyard	c70d749986	ipmi: style fixes in the base code Lots of style fixes for the base IPMI driver. No functional changes. Basically fixes everything reported by checkpatch and fixes the comment style. Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:15 -07:00
Corey Minyard	bda4c30aa6	ipmi: run to completion fixes The "run_to_completion" mode was somewhat broken. Locks need to be avoided in run_to_completion mode, and it shouldn't be used by normal users, just internally for panic situations. This patch removes locks in run_to_completion mode and removes the user call for setting the mode. The only user was the poweroff code, but it was easily converted to use the polling interface. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:14 -07:00
Zhang, Yanmin	44f564a4bf	ipc: add definitions of USHORT_MAX and others Add definitions of USHORT_MAX and others into kernel. ipc uses it and slub implementation might also use it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com> Reviewed-by: Christoph Lameter <clameter@sgi.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: "Pierre Peiffer" <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:14 -07:00
Nadia Derbey	6546bc4279	ipc: re-enable msgmni automatic recomputing msgmni if set to negative The enhancement as asked for by Yasunori: if msgmni is set to a negative value, register it back into the ipcns notifier chain. A new interface has been added to the notification mechanism: notifier_chain_cond_register() registers a notifier block only if not already registered. With that new interface we avoid taking care of the states changes in procfs. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:13 -07:00
Nadia Derbey	e2c284d8a8	ipc: recompute msgmni on ipc namespace creation/removal Introduce a notification mechanism that aims at recomputing msgmni each time an ipc namespace is created or removed. The ipc namespace notifier chain already defined for memory hotplug management is used for that purpose too. Each time a new ipc namespace is allocated or an existing ipc namespace is removed, the ipcns notifier chain is notified. The callback routine for each registered ipc namespace is then activated in order to recompute msgmni for that namespace. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:12 -07:00
Nadia Derbey	b6b337ad1c	ipc: recompute msgmni on memory add / remove Introduce the registration of a callback routine that recomputes msg_ctlmni upon memory add / remove. A single notifier block is registered in the hotplug memory chain for all the ipc namespaces. Since the ipc namespaces are not linked together, they have their own notification chain: one notifier_block is defined per ipc namespace. Each time an ipc namespace is created (removed) it registers (unregisters) its notifier block in (from) the ipcns chain. The callback routine registered in the memory chain invokes the ipcns notifier chain with the IPCNS_LOWMEM event. Each callback routine registered in the ipcns namespace, in turn, recomputes msgmni for the owning namespace. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:12 -07:00
Nadia Derbey	0c40ba4fd6	ipc: define the slab_memory_callback priority as a constant This is a trivial patch that defines the priority of slab_memory_callback in the callback chain as a constant. This is to prepare for next patch in the series. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:12 -07:00
Nadia Derbey	4d89dc6ab2	ipc: scale msgmni to the number of ipc namespaces Since all the namespaces see the same amount of memory (the total one) this patch introduces a new variable that counts the ipc namespaces and divides msg_ctlmni by this counter. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:12 -07:00
Nadia Derbey	f7bf3df8be	ipc: scale msgmni to the amount of lowmem On large systems we'd like to allow a larger number of message queues. In some cases up to 32K. However simply setting MSGMNI to a larger value may cause problems for smaller systems. The first patch of this series introduces a default maximum number of message queue ids that scales with the amount of lowmem. Since msgmni is per namespace and there is no amount of memory dedicated to each namespace so far, the second patch of this series scales msgmni to the number of ipc namespaces too. Since msgmni depends on the amount of memory, it becomes necessary to recompute it upon memory add/remove. In the 4th patch, memory hotplug management is added: a notifier block is registered into the memory hotplug notifier chain for the ipc subsystem. Since the ipc namespaces are not linked together, they have their own notification chain: one notifier_block is defined per ipc namespace. Each time an ipc namespace is created (removed) it registers (unregisters) its notifier block in (from) the ipcns chain. The callback routine registered in the memory chain invokes the ipcns notifier chain with the IPCNS_MEMCHANGE event. Each callback routine registered in the ipcns namespace, in turn, recomputes msgmni for the owning namespace. The 5th patch makes it possible to keep the memory hotplug notifier chain's lock for a lesser amount of time: instead of directly notifying the ipcns notifier chain upon memory add/remove, a work item is added to the global workqueue. When activated, this work item is the one that notifies the ipcns notifier chain. Since msgmni depends on the number of ipc namespaces, it becomes necessary to recompute it upon ipc namespace creation / removal. The 6th patch uses the ipc namespace notifier chain for that purpose: that chain is notified each time an ipc namespace is created or removed. This makes it possible to recompute msgmni for all the namespaces each time one of them is created or removed. When msgmni is explicitly set from userspace, we should avoid recomputing it upon memory add/remove or ipcns creation/removal. This is what the 7th patch does: it simply unregisters the ipcns callback routine as soon as msgmni has been changed from procfs or sysctl(). Even if msgmni is set by hand, it should be possible to have it automatically recomputed again upon memory add/remove or ipcns creation/removal. This is what is achieved in patch 8: if set to a negative value, msgmni is added back to the ipcns notifier chain, making it automatically recomputed again. This patch: Compute msg_ctlmni to make it scale with the amount of lowmem. msg_ctlmni is now set to make the message queues occupy 1/32 of the available lowmem. Some cleaning has also been done for the MSGPOOL constant: the msgctl man page says it's not used, but it also defines it as a size in bytes (the code expresses it in Kbytes). Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:12 -07:00
Arthur Kepner	74bc7ceebf	dma: add dma_map_attrs() interfaces Introduce new interfaces, dma_map_attrs(), for passing architecture-specific attributes when memory is mapped and unmapped for DMA. Give the interfaces default implementations which ignore attributes. Also introduce the dma_{set|get}_attr() interfaces for setting and retrieving individual attributes. Define one attribute, DMA_ATTR_WRITE_BARRIER, in anticipation of its use by ia64/sn. Select whether architectures implement arch-specific versions of the dma_map_attrs() interfaces via HAVE_DMA_ATTRS in Kconfig. [markn@au1.ibm.com: dma_{set,get}_attr() have to be static inline] Signed-off-by: Arthur Kepner <akepner@sgi.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Jes Sorensen <jes@sgi.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: David Miller <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Mark Nelson <markn@au1.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:11 -07:00
Pavel Emelyanov	29f2a4dac8	memcgroup: implement failcounter reset This is a very common requirement from people using the resource accounting facilities (not only memcgroup but also OpenVZ beancounters). They want to put the cgroup in an initial state without re-creating it. For example after re-configuring a group people want to observe how this new configuration fits the group needs without saving the previous failcnt value. Merge two resets into one mem_cgroup_reset() function to demonstrate how multiplexing works. Besides, I have plans to move the files that correspond to res_counter to the res_counter.c file and somehow "import" them into the controller. I don't know how to do it gracefully yet, but merging resets of max_usage and failcnt in one function will be there for sure. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Pavel Emelyanov	faebe9fdf3	memcgroups: add a document describing the resource counter abstraction The resource counter is supposed to facilitate the resource accounting of arbitrary resource (and it already does this for memory controller). However, it is about to be used in other resources controllers (swap, kernel memory, networking, etc), so provide a doc describing how to work with it. This will eliminate all the possible future duplications in the appropriate controllers' docs. Fixed errors pointed out by Randy. [akpm@linux-foundation.org: fix documentation tpyo] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Pavel Emelyanov	c84872e168	memcgroup: add the max_usage member on the res_counter This field is the maximal value of the usage one since the counter creation (or since the latest reset). To reset this to the usage value simply write anything to the appropriate cgroup file. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Balbir Singh	cf475ad28a	cgroups: add an owner to the mm_struct Remove the mem_cgroup member from mm_struct and instead add an owner. This approach was suggested by Paul Menage. The advantage of this approach is that, once the mm->owner is known, using the subsystem id, the cgroup can be determined. It also allows several control groups that are virtually grouped by mm_struct, to exist independent of the memory controller i.e., without adding mem_cgroup's for each controller, to mm_struct. A new config option CONFIG_MM_OWNER is added and the memory resource controller selects this config option. This patch also adds cgroup callbacks to notify subsystems when mm->owner changes. The mm_cgroup_changed callback is called with the task_lock() of the new task held and is called just prior to changing the mm->owner. I am indebted to Paul Menage for the several reviews of this patchset and helping me make it lighter and simpler. This patch was tested on a powerpc box, it was compiled with both the MM_OWNER config turned on and off. After the thread group leader exits, it's moved to init_css_set by cgroup_exit(), thus all future charges from running threads would be redirected to the init_css_set's subsystem. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Serge E. Hallyn	29486df325	cgroups: introduce cft->read_seq() Introduce a read_seq() helper in cftype, which uses seq_file to print out lists. Use it in the devices cgroup. Also split devices.allow into two files, so now devices.deny and devices.allow are the ones to use to manipulate the whitelist, while devices.list outputs the cgroup's current whitelist. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Li Zefan	28fd5dfc12	cgroups: remove the css_set linked-list Now we can run through the hash table instead of running through the linked-list. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Li Zefan	472b1053f3	cgroups: use a hash table for css_set finding When we attach a process to a different cgroup, the css_set linked-list will be run through to find a suitable existing css_set to use. This patch implements a hash table for better performance. The following benchmarks have been tested: For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping task in each, and then move an additional task through each cgroup in turn. Here is a test result (columns: N, Loop, orig Time(s), hash Time(s)): 1, 10000, 1.201231728, 1.196311177; 5, 2000, 1.065743872, 1.040566424; 10, 1000, 0.991054735, 0.986876440; 50, 200, 0.976554203, 0.969608733; 100, 100, 0.998504680, 0.969218270; 500, 20, 1.157347764, 0.962602963; 1000, 10, 1.619521852, 1.085140172. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Serge E. Hallyn	08ce5f16ee	cgroups: implement device whitelist Implement a cgroup to track and enforce open and mknod restrictions on device files. A device cgroup associates a device access whitelist with each cgroup. A whitelist entry has 4 fields. 'type' is a (all), c (char), or b (block). 'all' means it applies to all types and all major and minor numbers. Major and minor are either an integer or * for all. Access is a composition of r (read), w (write), and m (mknod). The root device cgroup starts with rwm to 'all'. A child devcg gets a copy of the parent. Admins can then remove devices from the whitelist or add new entries. A child cgroup can never receive a device access which is denied its parent. However when a device access is removed from a parent it will not also be removed from the child(ren). An entry is added using devices.allow, and removed using devices.deny. For instance echo 'c 1:3 mr' > /cgroups/1/devices.allow allows cgroup 1 to read and mknod the device usually known as /dev/null. Doing echo a > /cgroups/1/devices.deny will remove the default 'a : mrw' entry. CAP_SYS_ADMIN is needed to change permissions or move another task to a new cgroup. A cgroup may not be granted more permissions than the cgroup's parent has. Any task can move itself between cgroups. This won't be sufficient, but we can decide the best way to adequately restrict movement later. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix may-be-used-uninitialized warning] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: James Morris <jmorris@namei.org> Looks-good-to: Pavel Emelyanov <xemul@openvz.org> Cc: Daniel Hokka Zakrisson <daniel@hozac.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Pavel Emelyanov	d447ea2f30	cgroups: add the trigger callback to struct cftype Trigger callback can be used to receive a kick-up from the user space. The string written is ignored. The cftype->private is used for multiplexing events. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Paul Menage <menage@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Paul Menage	e73d2c61d1	CGroups _s64 files: add cgroups read_s64/write_s64 file methods These patches add cgroups read_s64 and write_s64 control file methods (the signed equivalent of read_u64/write_u64) and use them to implement the cpu.rt_runtime_us control file in the CFS cgroup subsystem. This patch: These are the signed equivalents of the read_u64/write_u64 methods Signed-off-by: Paul Menage <menage@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Paul Menage	3116f0e3df	CGroup API files: move "releasable" to cgroup_debug subsystem The "releasable" control file provided by the cgroup framework exports the state of a per-cgroup flag that's related to the notify-on-release feature. This isn't really generally useful, unless you're trying to debug this particular feature of cgroups. This patch moves the "releasable" file to the cgroup_debug subsystem. Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Paul Menage	9179656961	CGroup API files: add cgroup map data type Adds a new type of supported control file representation, a map from strings to u64 values. Each map entry is printed as a line in a similar format to /proc/vmstat, i.e. "$key $value\n" Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:08 -07:00
Paul Menage	2c7eabf376	CGroup API files: add res_counter_read_u64() Adds a function for returning the value of a resource counter member, in a form suitable for use in a cgroup read_u64 control file method. Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:08 -07:00
Paul Menage	f4c753b7ea	CGroup API files: rename read/write_uint methods to read_write_u64 Several people have justifiably complained that the "_uint" suffix is inappropriate for functions that handle u64 values, so this patch just renames all these functions and their users to have the suffix _u64. [peterz@infradead.org: build fix] Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:07 -07:00
Jan Engelhardt	c9e587abfd	vt: fix background color on line feed A command that causes a line feed while a background color is active, such as perl -e 'print "x" x 60, "\e[44m", "x" x 40, "\e[0m\n"' and perl -e 'print "x" x 40, "\e[44m\n", "x" x 40, "\e[0m\n"' causes the line that was started as a result of the line feed to be completely filled with the currently active background color instead of the default color. When scrolling, part of the current screen is memcpy'd/memmove'd to the new region, and the new line(s) that will appear as a result are cleared using memset. However, the lines are cleared with vc->vc_video_erase_char, causing them to be colored with the currently active background color. This is different from X11 terminal emulators which always paint the new lines with the default background color (e.g. `xterm -bg black`). The clear operations (\e[1J and \e[2J) also use vc_video_erase_char, so a new vc->vc_scrl_erase_char is introduced which contains the erase character used for scrolling, which is built from vc->vc_def_color instead of vc->vc_color. Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Cc: "Antonino A. Daplas" <adaplas@pol.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:06 -07:00
Dave Young	5f97a5a879	isolate ratelimit from printk.c for other use Because the rcupreempt.h WARN_ON triggered, I got a 2G syslog file. For some serious kernel complaints we need to repeat the warnings, so here I isolate the ratelimit part of printk.c into a standalone file. Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:06 -07:00
David Howells	8f0cfa52a1	xattr: add missing consts to function arguments Add missing consts to xattr function arguments. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:06 -07:00
Ilpo Järvinen	76308da189	smb.h: uses struct timespec but didn't include linux/time.h Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:05 -07:00
Robert P. J. Day	95d8c365b2	lists: add "const" qualifier to first arg of list_splice() operations Since neither the list_splice() nor __list_splice() routines modify their first argument, might as well declare them "const". [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:04 -07:00
Robert P. J. Day	8673511845	kbuild: move files that don't check __KERNEL__ Move files that don't check __KERNEL__ from unifdef-y to header-y. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:04 -07:00
Robert P. J. Day	1a6924f93d	kbuild: remove duplicate, conflicting entry for oom.h oom.h is already tagged for unifdef'ing, so its entry as a simple exportable header should be deleted. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:04 -07:00
Robert P. J. Day	aab3c3b01d	Remove superfluous include of string.h from percpu.h There's nothing in percpu.h that requires an explicit inclusion of string.h. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:04 -07:00
Pavel Emelyanov	3a2e7f47d7	binfmt_misc.c: avoid potential kernel stack overflow This can be triggered with root help only, but... Register the ':text:E::txt::/root/cat.txt:' rule in binfmt_misc (by root) and try launching the cat.txt file (by anyone) :) The result is endless recursion in the load_misc_binary -> open_exec -> load_misc_binary chain and a stack overflow. There's a similar problem with binfmt_script, and there's a sh_bang member in the linux_binprm structure to handle this, but simply raising this in binfmt_misc may break some setups when the interpreter of some misc binaries is a script. So the proposal is to turn sh_bang into a bit, add a new one (the misc_bang) and raise it in load_misc_binary. After this, even if we set up the misc -> script -> misc loop for binfmts one of them will step on its own bang and exit. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:04 -07:00
Adrian Bunk	7d195a5409	proper extern for late_time_init Add a proper extern for late_time_init in include/linux/init.h Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:03 -07:00
Tetsuo Handa	175a06ae30	exec: remove argv_len from struct linux_binprm I noticed that 2.6.24.2 calculates bprm->argv_len at do_execve(). But it doesn't update bprm->argv_len after "remove_arg_zero() + copy_strings_kernel()" at load_script() etc. audit_bprm() is called from search_binary_handler() and search_binary_handler() is called from load_script() etc. Thus, I think the condition check if (bprm->argv_len > (audit_argv_kb << 10)) return -E2BIG; in audit_bprm() might return wrong result when strlen(removed_arg) != strlen(spliced_args). Why not update bprm->argv_len at load_script() etc. ? By the way, 2.6.25-rc3 seems to not doing the condition check. Is the field bprm->argv_len no longer needed? Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Ollie Wild <aaw@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:03 -07:00
WANG Cong	ecd0fa9825	Remove the macro get_personality Remove the macro get_personality, use ->personality instead. Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Bryan Wu <bryan.wu@analog.com> Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:02 -07:00
jan sonnek	6e5e8c5085	Misc: phantom, consistent whitespace Make it consistent with the rest of the header. Signed-off-by: jan sonnek <xsonnek@gmail.com> Cc: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:02 -07:00
Jiri Slaby	7e4e8e689f	Misc: phantom, add compat ioctl Openhaptics uses pointers in _IOC() macros, implement compat for them. Also add _IOC alternatives which are not 32/64 bit dependent (structures passed through aren't yet) -- libphantom will use them. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:02 -07:00
Adrian Bunk	eb0f1c442d	proper __do_softirq() prototype Add a proper prototype for __do_softirq() in include/linux/interrupt.h Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:02 -07:00
Adrian Bunk	58b250daff	remove mca_is_adapter_used() Remove the no longer used mca_is_adapter_used(). Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:01 -07:00
Adrian Bunk	946a57b526	remove generic_commit_write() Remove the obsolete and no longer used generic_commit_write(). Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:01 -07:00
Adrian Bunk	d5470b596a	fs/aio.c: make 3 functions static Make the following needlessly global functions static: __put_ioctx(), lookup_ioctx(), io_submit_one(). Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:00 -07:00
Adrian Bunk	07d45da616	fs/drop_caches.c: make 2 functions static Make the following needlessly global functions static: drop_pagecache(), drop_slab(). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:00 -07:00
Adrian Bunk	f11b00f3bd	fs/fs-writeback.c: make 2 functions static Make the following needlessly global functions static: writeback_acquire(), writeback_release(). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:00 -07:00
Adrian Bunk	67cde59537	make vfs_ioctl() static Make the needlessly global vfs_ioctl() static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:00 -07:00
Adrian Bunk	6b09ae6692	make __put_super() static Make the needlessly global __put_super() static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:00 -07:00
Sam Ravnborg	f718e31819	cpu: fix section mismatch warnings in hotcpu_register Fix the following warnings: WARNING: vmlinux.o(.data+0x5020): Section mismatch in reference from the variable cpu_vsyscall_notifier_nb.12876 to the function .cpuinit.text:cpu_vsyscall_notifier() WARNING: vmlinux.o(.data+0x9ce0): Section mismatch in reference from the variable profile_cpu_callback_nb.17654 to the function .devinit.text:profile_cpu_callback() WARNING: vmlinux.o(.data+0xd380): Section mismatch in reference from the variable workqueue_cpu_callback_nb.15004 to the function .devinit.text:workqueue_cpu_callback() WARNING: vmlinux.o(.data+0x11d00): Section mismatch in reference from the variable relay_hotcpu_callback_nb.19626 to the function .cpuinit.text:relay_hotcpu_callback() WARNING: vmlinux.o(.data+0x12970): Section mismatch in reference from the variable cpu_callback_nb.24694 to the function .devinit.text:cpu_callback() WARNING: vmlinux.o(.data+0x3fee0): Section mismatch in reference from the variable percpu_counter_hotcpu_callback_nb.10903 to the function .cpuinit.text:percpu_counter_hotcpu_callback() WARNING: vmlinux.o(.data+0x74ce0): Section mismatch in reference from the variable topology_cpu_callback_nb.12506 to the function .cpuinit.text:topology_cpu_callback() Functions used as arguments are by definition only used in HOTPLUG_CPU situations, so they are annotated __cpuinit. Annotate the static variable used by hotcpu_register with __cpuinitdata to match this definition. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:05:59 -07:00
Sripathi Kodi	679c9cd4ac	add RUSAGE_THREAD Add the RUSAGE_THREAD option for the getrusage system call. This is essentially Roland's patch from http://lkml.org/lkml/2008/1/18/589, but the line about RUSAGE_LWP has been removed, as suggested by Ulrich and Christoph. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:05:59 -07:00
Nur Hussein	95b570c9ce	Taint kernel after WARN_ON(condition) The kernel is set to tainted within the warn_on_slowpath() function, and whenever a warning occurs the new taint flag 'W' is set. This is useful to know if a warning occurred before a BUG by preserving the warning as a flag in the taint state. This does not work on architectures where WARN_ON has its own definition. These archs are: 1. s390 2. superh 3. avr32 4. parisc The maintainers of these architectures have been added in the Cc: list in this email to alert them to the situation. The documentation in oops-tracing.txt has been updated to include the new flag. Signed-off-by: Nur Hussein <nurhussein@gmail.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:05:59 -07:00
Ilpo Järvinen	bd3feb13e1	fs/coda: remove static inline forward declarations They're defined later on in the same file with bodies and nothing in between needs them. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:05:59 -07:00
Eric Dumazet	ede9c697bc	Avoid divides in BITS_TO_LONGS BITS_PER_LONG is a signed value (32 or 64). DIV_ROUND_UP(nr, BITS_PER_LONG) performs signed arithmetic if "nr" is signed too. Converting BITS_TO_LONGS(nr) to DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long)) makes sure the compiler can perform a right shift, even if "nr" is a signed value, instead of an expensive integer divide. Applying this patch saves 141 bytes on x86 when CONFIG_CC_OPTIMIZE_FOR_SIZE=y and speeds up bitmap operations. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:05:59 -07:00
Nishanth Aravamudan	ab857d0938	mm: fix misleading __GFP_REPEAT related comments The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the core VM have somewhat differing comments as to their actual semantics. Annoyingly, the flags definition has inline and header comments, which might be interpreted as not being equivalent. Just add references to the header comments in the inline ones so they don't go out of sync in the future. In their use in __alloc_pages() clarify that the current implementation treats low-order allocations and __GFP_REPEAT allocations as distinct cases. To clarify, the flags' semantics are: __GFP_NORETRY means try no harder than one run through __alloc_pages; __GFP_REPEAT means __GFP_NOFAIL; __GFP_NOFAIL means repeat forever; order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:05:58 -07:00
Alan D. Brunelle	ac9fafa124	block: Skip I/O merges when disabled The block I/O + elevator + I/O scheduler code spend a lot of time trying to merge I/Os -- rightfully so under "normal" circumstances. However, if one were to know that the incoming I/O stream was /very/ random in nature, the cycles are wasted. This patch adds a per-request_queue tunable that (when set) disables merge attempts (beyond the simple one-hit cache check), thus freeing up a non-trivial amount of CPU cycles. Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 14:48:55 +02:00
FUJITA Tomonori	d7e3c3249e	block: add large command support This patch changes rq->cmd from the static array to a pointer to support large commands. We rarely handle large commands. So for optimization, a struct request still has a static array for a command. rq_init sets rq->cmd pointer to the static array. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 14:48:55 +02:00
FUJITA Tomonori	2a4aa30c5f	block: rename and export rq_init() This renames rq_init() to blk_rq_init() and exports it. Any path that hands the request to the block layer needs to call it to initialize the request. This is a preparation for large command support, which needs to initialize the request in a proper way (that is, just doing a memset() will not work). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 14:48:55 +02:00
Nick Piggin	75ad23bc0f	block: make queue flags non-atomic We can save some atomic ops in the IO path, if we clearly define the rules of how to modify the queue flags. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 14:48:33 +02:00
Zhang Wei	61b269179d	[RAPIDIO] Add serial RapidIO controller support, which includes MPC8548, MPC8641 Signed-off-by: Zhang Wei <wei.zhang@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-04-29 19:40:29 +10:00
Zhang Wei	e042323607	[RAPIDIO] Auto-probe the RapidIO system size The RapidIO system size will auto probe in RIO setup. The route table and rionet_active in rionet.c are changed to be allocated dynamically according to the size of the system. Signed-off-by: Zhang Wei <wei.zhang@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-04-29 19:40:28 +10:00
Zhang Wei	ad1e9380b1	[RAPIDIO] Add RapidIO multi mport support The original RapidIO driver supposes there is only one mpc85xx RIO controller in the system. So, some data structures are defined as mpc85xx_rio globals, such as 'regs_win', 'dbell_ring', 'msg_tx_ring'. Now, I changed them to mport's private members. And you can define multiple RIO OF-nodes in the dts file for multiple RapidIO controllers in one processor, such as PCI/PCI-Ex host controllers in Freescale's silicon. And the mport operation function declarations should be changed to know which RapidIO controller is the target. Signed-off-by: Zhang Wei <wei.zhang@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-04-29 19:40:28 +10:00
FUJITA Tomonori	68154e90c9	block: add dma alignment and padding support to blk_rq_map_kern This patch adds bio_copy_kern similar to bio_copy_user. blk_rq_map_kern uses bio_copy_kern instead of bio_map_kern if necessary. bio_copy_kern uses temporary pages and the bi_end_io callback frees these pages. bio_copy_kern saves the original kernel buffer at bio->bi_private; it doesn't use something like struct bio_map_data to store the information about the caller. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 09:50:34 +02:00
Bjorn Helgaas	dfd2e1b4e6	PNPBIOS: remove include/linux/pnpbios.h The contents of include/linux/pnpbios.h are used only inside the PNPBIOS backend, so this file doesn't need to be visible outside PNP. This patch moves the contents into an existing PNPBIOS-specific file, drivers/pnp/pnpbios/pnpbios.h. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:30 -04:00
Bjorn Helgaas	261b20da4b	ISAPNP: remove unused pnp_dev->regs field The "regs" field in struct pnp_dev is set but never read, so remove it. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:30 -04:00
Bjorn Helgaas	62cfb298b9	PNP: make interfaces private to the PNP core The interfaces for registering protocols, devices, cards, and resource options should only be used inside the PNP core. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:30 -04:00
Bjorn Helgaas	02d83b5da3	PNP: make pnp_resource_table private to PNP core There are no remaining references to the PNP_MAX_* constants or the pnp_resource_table structure outside of the PNP core. Make them private to the PNP core. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:27 -04:00
Bjorn Helgaas	13575e81bb	PNP: convert resource accessors to use pnp_get_resource(), not pnp_resource_table This removes more direct references to pnp_resource_table. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:23 -04:00
Bjorn Helgaas	b90eca0a61	PNP: add pnp_get_resource() interface This adds a pnp_get_resource() that works the same way as platform_get_resource(). This will enable us to consolidate many pnp_resource_table references in one place, which will make it easier to make the table dynamic. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:23 -04:00
Bjorn Helgaas	2cd1393098	PNP: remove unused interfaces using pnp_resource_table Rene Herman <rene.herman@gmail.com> recently removed the only in-tree driver uses of pnp_init_resource_table(), pnp_manual_config_dev(), and pnp_resource_change() in this change: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=109c53f840e551d6e99ecfd8b0131a968332c89f These are no longer used in the PNP core either, so we can just remove them completely. It's possible that there are out-of-tree drivers that use these interfaces. They should be changed to either (1) use PNP quirks to work around broken hardware or firmware, or (2) use the sysfs interfaces to control resource usage from userspace. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:22 -04:00
Bjorn Helgaas	f449000209	PNP: add pnp_init_resources(struct pnp_dev *) interface Add pnp_init_resources(struct pnp_dev *) to replace pnp_init_resource_table(), which takes a pointer to the pnp_resource_table itself. Passing only the pnp_dev * reduces the possibility for error in the caller and removes the pnp_resource_table implementation detail from the interface. Even though pnp_init_resource_table() is exported, I did not export pnp_init_resources() because it is used only by the PNP core. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:22 -04:00
Bjorn Helgaas	59284cb409	PNP: remove pnp_resource_table from internal get/set interfaces When we call protocol->get() and protocol->set() methods, we currently supply pointers to both the pnp_dev and the pnp_resource_table even though the pnp_resource_table should always be the one associated with the pnp_dev. This removes the pnp_resource_table arguments to make it clear that these methods only operate on the specified pnp_dev. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:21 -04:00
Bjorn Helgaas	c1caf06ccf	PNP: add debug output to option registration Add debug output to resource option registration functions (enabled by CONFIG_PNP_DEBUG). This uses dev_printk, so I had to add pnp_dev arguments at the same time. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:20 -04:00
Bjorn Helgaas	048825deea	PNP: make pnp_add_card_id() internal to PNP core pnp_add_card_id() doesn't need to be exposed outside the PNP core, so move the declaration to an internal header file. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:17 -04:00
Bjorn Helgaas	1692b27bf3	PNP: make pnp_add_id() internal to PNP core pnp_add_id() doesn't need to be exposed outside the PNP core, so move the declaration to an internal header file. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:15 -04:00
Bjorn Helgaas	ca0e8b6fd2	ISAPNP: move config register addresses out of isapnp.h These are used only in drivers/pnp/isapnp/core.c, so no need to expose them to the world. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-By: Rene Herman <rene.herman@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 03:22:15 -04:00
Zhang Rui	e68b16abd9	thermal: add hwmon sysfs I/F Add hwmon sys I/F for generic thermal driver. Note: we have one hwmon class device for EACH TYPE of the thermal zone device. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 02:48:01 -04:00
Zhang, Rui	9ec732ff80	thermal: add new get_crit_temp callback Add a new callback so that the generic thermal can get the critical trip point info of a thermal zone, which is needed for building the tempX_crit hwmon sysfs attribute. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 02:45:49 -04:00
Zhang Rui	63c4ec905d	thermal: add the support for building the generic thermal as a module Build the generic thermal driver as module "thermal_sys". Make ACPI thermal, video, processor and fan SELECT the generic thermal driver, as these drivers rely on it to build the sysfs I/F. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-29 02:44:00 -04:00
Badari Pulavarty	9d88a2eb6e	[POWERPC] Provide walk_memory_resource() for powerpc Provide walk_memory_resource() for 64-bit powerpc. PowerPC maintains logical memory region mapping in the lmb.memory structure. Walk through these structures and do the callbacks for the contiguous chunks. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-04-29 15:57:53 +10:00
Badari Pulavarty	98d5c21c81	[POWERPC] Update lmb data structures for hotplug memory add/remove The powerpc kernel maintains information about logical memory blocks in the lmb.memory structure, which is initialized and updated at boot time, but not when memory is added or removed while the kernel is running. This adds a hotplug memory notifier which updates lmb.memory when memory is added or removed. This information is useful for eHEA driver to find out the memory layout and holes. NOTE: No special locking is needed for lmb_add() and lmb_remove(). Calls to these are serialized by caller. (pSeries_reconfig_chain). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2008-04-29 15:57:53 +10:00
Lennert Buytenhek	ce4e2e4558	mv643xx_eth: inter-mv643xx SMI port sharing There exist chips with up to four mv643xx_eth silicon blocks but only one external SMI (MII management) interface -- the SMI logic of the first block is shared by all the blocks. Handle this by allowing a per-port override of which mv643xx_eth_shared's SMI registers (and spinlock) to use. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Acked-by: Nicolas Pitre <nico@marvell.com> Signed-off-by: Dale Farnsworth <dale@farnsworth.org>	2008-04-28 21:17:07 -07:00
Lennert Buytenhek	240e4419e0	mv643xx_eth: shorten shared platform driver name Change the MV643XX_ETH_SHARED_NAME platform driver name to something shorter than 19 characters, so that we can register multiple (otherwise we end up with sysfs conflicts since all instances will map to "mv643xx_eth_shared." as there is a 20-char sysfs file name limit.) Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Acked-by: Nicolas Pitre <nico@marvell.com> Signed-off-by: Dale Farnsworth <dale@farnsworth.org>	2008-04-28 21:17:07 -07:00
Lennert Buytenhek	c416a41f99	mv643xx_eth: configurable t_clk Make t_clk configurable via platform device data (with the current hardcoded value, 133 MHz, being the default), as it varies across different chip families. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Acked-by: Nicolas Pitre <nico@marvell.com> Signed-off-by: Dale Farnsworth <dale@farnsworth.org>	2008-04-28 21:17:07 -07:00
Lennert Buytenhek	f2ce825d2a	mv643xx_eth: mbus decode window support Make it possible to pass mbus_dram_target_info to the mv643xx_eth driver via the platform data, and make the mv643xx_eth driver program the window registers based on this data if it is passed in. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Reviewed-by: Tzachi Perelstein <tzachi@marvell.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Dale Farnsworth <dale@farnsworth.org>	2008-04-28 21:17:07 -07:00
Lennert Buytenhek	fa3959f457	mv643xx_eth: get rid of static variables, allow multiple instances Move mv643xx_eth's static state (ethernet register block base address and MII management interface spinlock) into a struct hanging off the shared platform device. This is necessary to support chips that contain multiple mv643xx_eth silicon blocks. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Acked-by: Nicolas Pitre <nico@marvell.com> Signed-off-by: Dale Farnsworth <dale@farnsworth.org>	2008-04-28 21:17:07 -07:00
Linus Torvalds	8ab68ab420	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (35 commits) siimage: coding style cleanup (take 2) ide-cd: clean up cdrom_analyze_sense_data() ide-cd: fix test unsigned var < 0 ide: add TSSTcorp CDDVDW SH-S202H to ivb_list[] piix: add Asus Eee 701 controller to short cable list ARM: always select HAVE_IDE remove the broken ETRAX_IDE driver ide: remove ->dma_prdtable field from ide_hwif_t ide: remove ->dma_vendor{1,3} fields from ide_hwif_t scc_pata: add ->dma_host_set and ->dma_start methods ide: skip "VLB sync" if host uses MMIO ide: add ide_pad_transfer() helper ide: remove ->INW and ->OUTW methods ide: use IDE I/O helpers directly in ide_tf_{load,read}() ns87415: add ->tf_read method scc_pata: add ->tf_{load,read} methods ide-h8300: add ->tf_{load,read} methods ide-cris: add ->tf_{load,read} methods ide: add ->tf_load and ->tf_read methods ide: move ide_tf_{load,read} to ide-iops.c ...	2008-04-28 17:30:26 -07:00
Bartlomiej Zolnierkiewicz	55224bc86a	ide: remove ->dma_prdtable field from ide_hwif_t * Use 'hwif->dma_base + {4,8}' instead of hwif->dma_prdtable in {ide,scc}_dma_setup(). * Remove no longer needed ->dma_prdtable field from ide_hwif_t. While at it: * Use ATA_DMA_TABLE_OFS define. Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:42 +02:00
Bartlomiej Zolnierkiewicz	41051a141d	ide: remove ->dma_vendor{1,3} fields from ide_hwif_t * Use 'hwif->dma_base + {1,3}' instead of hwif->dma_vendor{1,3} in pdc202xx_new host driver. * Remove no longer needed ->dma_vendor{1,3} fields from ide_hwif_t. Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:42 +02:00
Bartlomiej Zolnierkiewicz	9f87abe892	ide: add ide_pad_transfer() helper * Add ide_pad_transfer() helper (which uses ->{in,out}put_data methods internally so the transfer is also padded to drive+host requirements) and use it instead of ide_atapi_{write_zeros,discard_data}(). * Remove no longer needed ide_atapi_{write_zeros,discard_data}(). Cc: Borislav Petkov <petkovbb@gmail.com> Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:41 +02:00
Bartlomiej Zolnierkiewicz	7c0daf2681	ide: remove ->INW and ->OUTW methods * Remove no longer used ->INW and ->OUTW methods. While at it: * scc_pata.c: scc_ide_{out,in}w() is called only in scc_tf_{load,read}() so inline it there. Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:41 +02:00
Bartlomiej Zolnierkiewicz	94cd5b62ff	ide: add ->tf_load and ->tf_read methods * Add ->tf_load and ->tf_read methods to ide_hwif_t and set the default methods in default_hwif_transport(). * Use ->tf_{load,read} instead of calling ide_tf_{load,read}() directly. * Make ide_tf_{load,read}() static. There should be no functional changes caused by this patch. Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:40 +02:00
Bartlomiej Zolnierkiewicz	089c5c7e00	ide: factor out debugging code from ide_tf_load() Factor out debugging code from ide_tf_load() to ide_tf_dump() helper and update ide_tf_load() users accordingly. Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:39 +02:00
Bartlomiej Zolnierkiewicz	1fc142589e	ide: add ide_execute_pkt_cmd() helper Add ide_execute_pkt_cmd() helper for executing PACKET command, then convert ATAPI device drivers to use it. As a nice side-effect this fixes ide-{floppy,tape,scsi} w.r.t. ide_lock taking (ide-cd was OK). Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:39 +02:00
Bartlomiej Zolnierkiewicz	16bb69c14a	ide: remove ->INS{W,L} and ->OUTS{W,L} methods * Use ins{w,l}()/outs{w,l}() and __ide_mm_ins{w,l}()/__ide_mm_outs{w,l}() directly in ata_{in,out}put_data() (by using IDE_HFLAG_MMIO host flag to decide which I/O ops are required). * Remove no longer needed ->INS{W,L} and ->OUTS{W,L} methods (ide-h8300, au1xxx-ide and scc_pata implement their own ->{in,out}put_data methods). There should be no functional changes caused by this patch. Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:37 +02:00
Bartlomiej Zolnierkiewicz	c5dd43ec65	ide: add IDE_HFLAG_MMIO host flag (take 2) * Add IDE_HFLAG_MMIO host flag and set it for hosts which use default_hwif_mmiops(). v2: * Fix kernel panic in pmac host driver (',' should be '|'). Thanks to Kamalesh for reporting it + testing the fix and to Andrew for hinting me about the source of the issue. Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:37 +02:00
Bartlomiej Zolnierkiewicz	9567b349f7	ide: merge ->atapi_put_bytes and ->ata_put_data methods * Merge ->atapi_{in,out}put_bytes and ->ata_{in,out}put_data methods into new ->{in,out}put_data methods which take number of bytes to transfer as an argument and always do padding. While at it: * Use 'hwif' or 'drive->hwif' instead of 'HWIF(drive)'. There should be no functional changes caused by this patch (all users of ->ata_{in,out}put_data methods were using multiple-of-4 word counts). Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:36 +02:00
Bartlomiej Zolnierkiewicz	92d3ab27e8	falconide/q40ide: add ->atapi_put_bytes and ->ata_put_data methods (take 2) * Add ->atapi_{in,out}put_bytes and ->ata_{in,out}put_data methods to falconide and q40ide host drivers (->ata_* methods are implemented on top of ->atapi_* methods so they also do byte-swapping now). * Cleanup atapi_{in,out}put_bytes(). v2: * Add 'struct request *rq' argument to ->ata_{in,out}put_data methods and don't byte-swap disk fs requests (we shouldn't un-swap fs requests because fs itself is stored byte-swapped on the disk) - this is how things were done before the patch (ideally device mapper should be used instead but it would break existing setups and would have some performance impact). Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michael Schmitz <schmitz@debian.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Richard Zidlicky <rz@linux-m68k.org> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-04-28 23:44:36 +02:00
Linus Torvalds	e97e386b12	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: pack objects denser slub: Calculate min_objects based on number of processors. slub: Drop DEFAULT_MAX_ORDER / DEFAULT_MIN_OBJECTS slub: Simplify any_slab_object checks slub: Make the order configurable for each slab cache slub: Drop fallback to page allocator method slub: Fallback to minimal order during slab page allocation slub: Update statistics handling for variable order slabs slub: Add kmem_cache_order_objects struct slub: for_each_object must be passed the number of objects in a slab slub: Store max number of objects in the page struct. slub: Dump list of objects not freed on kmem_cache_close() slub: free_list() cleanup slub: improve kmem_cache_destroy() error message slob: fix bug - when slob allocates "struct kmem_cache", it does not force alignment.	2008-04-28 14:08:56 -07:00
Alessandro Guido	30d221db44	[CPUFREQ] allow use of the powersave governor as the default one Allow use of the powersave cpufreq governor as the default one for EMBEDDED configs. Signed-off-by: Alessandro Guido <alessandro.guido@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Jones <davej@redhat.com>	2008-04-28 16:27:08 -04:00
Darrick J. Wong	e8628dd06d	[CPUFREQ] expose cpufreq coordination requirements regardless of coordination mechanism Currently, affected_cpus shows which CPUs need to have their frequency coordinated in software. When hardware coordination is in use, the contents of this file appear the same as when no coordination is required. This can lead to some confusion among user-space programs, for example, that do not know that extra coordination is required to force a CPU core to a particular speed to control power consumption. To fix this, create a "related_cpus" attribute that always displays the coordination map regardless of whatever coordination strategy the cpufreq driver uses (sw or hw). If the cpufreq driver does not provide a value, fall back to policy->cpus. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Jones <davej@redhat.com>	2008-04-28 16:27:08 -04:00
Jesse Barnes	ee69439cc1	PCI: don't expose struct pci_vpd to userspace We just need to forward declare it for struct pci_dev, not expose it outside of __KERNEL__. Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2008-04-28 12:30:35 -07:00
Linus Torvalds	cfd299dffe	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6: SELinux: Fix a RCU free problem with the netport cache SELinux: Made netnode cache adds faster SELinux: include/security.h whitespace, syntax, and other cleanups SELinux: policydb.h whitespace, syntax, and other cleanups SELinux: mls_types.h whitespace, syntax, and other cleanups SELinux: mls.h whitespace, syntax, and other cleanups SELinux: hashtab.h whitespace, syntax, and other cleanups SELinux: context.h whitespace, syntax, and other cleanups SELinux: ss/conditional.h whitespace, syntax, and other cleanups SELinux: selinux/include/security.h whitespace, syntax, and other cleanups SELinux: objsec.h whitespace, syntax, and other cleanups SELinux: netlabel.h whitespace, syntax, and other cleanups SELinux: avc_ss.h whitespace, syntax, and other cleanups Fixed up conflict in include/linux/security.h manually	2008-04-28 10:08:49 -07:00
Al Viro	01d7b36988	usbhid endianness annotations and fixes usb_control_msg() converts arguments to little-endian itself, doing that in caller means breakage on big-endian boxen. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 10:03:31 -07:00
Andrew Morton	73f20e58b1	FAT_VALID_MEDIA(): remove pointless test The on-disk media specification field in FAT is only 8-bits, so testing for <=0xff is pointless, and can generate a "comparison is always true due to limited range of data type" warning. While we're there, convert FAT_VALID_MEDIA() into a C function - the present implementation is buggy: it generates either one or two references to its argument. Cc: Frank Seidel <fseidel@suse.de> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:47 -07:00
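The macro hazard described in the FAT_VALID_MEDIA() change above can be illustrated with a small sketch. The macro body below is an assumed reconstruction of the old form, and `read_media_byte()` and `media_reads` are hypothetical helpers that exist only to count evaluations. The function version takes an 8-bit argument, so the `<= 0xff` test is dropped and the argument is evaluated exactly once.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shape of the old macro: its argument appears three times. */
#define FAT_VALID_MEDIA_MACRO(x) \
    ((0xF8 <= (x) && (x) <= 0xFF) || (x) == 0xF0)

/* Function form: one evaluation, and no always-true upper-bound test. */
static inline int fat_valid_media(uint8_t media)
{
    return 0xf8 <= media || media == 0xf0;
}

/* Hypothetical helper with a side effect, to count evaluations. */
static int media_reads;

static inline int read_media_byte(int value)
{
    media_reads++;
    return value;
}

/* Self-check: the macro evaluates its argument twice, the function once. */
static inline void fat_valid_media_selfcheck(void)
{
    media_reads = 0;
    assert(FAT_VALID_MEDIA_MACRO(read_media_byte(0xF8)));
    assert(media_reads == 2);

    media_reads = 0;
    assert(fat_valid_media((uint8_t)read_media_byte(0xF8)));
    assert(media_reads == 1);
}
```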
OGAWA Hirofumi	606e423e43	fat: Update free_clusters even if it is untrusted Currently, free_clusters is not updated until it is trusted, because Windows doesn't update it correctly. But if the user is using the Linux FAT driver, it updates free_clusters correctly. Instead, this updates it even if it's untrusted, so if free_clusters is correct, it now stays correct. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:47 -07:00
OGAWA Hirofumi	1ae43f826b	fat: Add allow_utime option Normally utime(2) checks that the current process is the owner of the file or has the CAP_FOWNER capability. But the FAT filesystem doesn't store uid/gid on disk, so the normal check is too inflexible. With this option you can relax it. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:47 -07:00
OGAWA Hirofumi	1278fdd34b	fat: fat_notify_change() and check_mode() cleanup - Rename fat_notify_change() to fat_setattr() - check_mode() cleanup - Change layout of code Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:47 -07:00
Jan Kara	d5dee5c395	reiserfs: unpack tails on quota files Quota files cannot have tails because quota_write and quota_read functions do not support them. So far when quota files did have tail, we just refused to turn quotas on it. Sadly this check has been wrong and so there are now plenty of installations where quota files don't have the NOTAIL flag set, and so now after fixing the check, they suddenly fail to turn quotas on. Since it's easy to unpack the tail from kernel, do this from reiserfs_quota_on() which solves the problem and is generally nicer to users anyway. Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: <urhausen@urifabi.net> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:46 -07:00
Dan Williams	8b3e6cdc53	md: introduce get_priority_stripe() to improve raid456 write performance Improve write performance by preventing the delayed_list from dumping all its stripes onto the handle_list in one shot. Delayed stripes are now further delayed by being held on the 'hold_list'. The 'hold_list' is bypassed when: * a STRIPE_IO_STARTED stripe is found at the head of 'handle_list' * 'handle_list' is empty and i/o is being done to satisfy full stripe-width write requests * 'bypass_count' is less than 'bypass_threshold'. By default the threshold is 1, i.e. every other stripe handled is a preread stripe provided the top two conditions are false. Benchmark data: System: 2x Xeon 5150, 4x SATA, mem=1GB Baseline: 2.6.24-rc7 Configuration: mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean Test1: dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 * patched: +33% (stripe_cache_size = 256), +25% (stripe_cache_size = 512) Test2: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 (XFS) * patched: +13% * patched + preread_bypass_threshold = 0: +37% Changes since v1: * reduce bypass_threshold from (chunk_size / sectors_per_chunk) to (1) and make it configurable. This defaults to fairness and modest performance gains out of the box. Changes since v2: * [neilb@suse.de]: kill STRIPE_PRIO_HI and preread_needed as they are not necessary, the important change was clearing STRIPE_DELAYED in add_stripe_bio and this has been moved out to make_request for the hang fix. * [neilb@suse.de]: simplify get_priority_stripe * [dan.j.williams@intel.com]: reset the bypass_count when ->hold_list is sampled empty (+11%) * [dan.j.williams@intel.com]: decrement the bypass_count at the detection of stripes being naturally promoted off of hold_list +2%. Note, resetting bypass_count instead of decrementing on these events yields +4% but that is probably too aggressive. Changes since v3: * cosmetic fixups Tested-by: James W. Laferriere <babydr@baby-dragons.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Andres Salomon	b6f448e99c	PM/gxfb: add hook to PM console layer that allows disabling of suspend VT switch Prior to suspend, we allocate and switch to a new VT; after suspend, we switch back to the original VT. This can be slow, and is completely unnecessary if the framebuffer we're using can restore video properly. This adds a hook that allows drivers to select whether or not to do this vt switch, and changes the gxfb driver to call this hook. It also adds a module param to gxfb to allow controlling of the vt switch (defaulting to no switch). (Note: I'm not convinced that console_sem is the best way to protect this, but we should probably have some form of locking..) [akpm@linux-foundation.org: build fix] Signed-off-by: Andres Salomon <dilinger@debian.org> Cc: Jordan Crouse <jordan.crouse@amd.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:36 -07:00
Anton Vorontsov	e4c690e061	fb: add support for foreign endianness Add support for the framebuffers with non-native endianness. This is done via FBINFO_FOREIGN_ENDIAN flag that will be used by the drivers. Depending on the host endianness this flag will be overwritten by FBINFO_BE_MATH internal flag, or cleared. Tested to work on MPC8360E-RDK (BE) + Fujitsu MINT framebuffer (LE). Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: <Valdis.Kletnieks@vt.edu> Cc: Clemens Koller <clemens.koller@anagramm.de> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:35 -07:00
Ilpo Järvinen	73fcdc9e15	i2o: remove static inline forward declarations Nothing in between of them and the later declaration with body needs them. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:34 -07:00
Andrew Morton	50f8c370e7	quota: convert stub functions from macros into inlines Fixes things like this: fs/super.c: In function `deactivate_super': fs/super.c:182: warning: statement with no effect fs/super.c: In function `do_remount_sb': fs/super.c:644: warning: statement with no effect Cc: Jan Kara <jack@ucw.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:33 -07:00
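The "statement with no effect" warnings quoted in the quota change above come from stub macros that expand to a bare constant when the feature is compiled out. A minimal sketch of the difference, using hypothetical names (`QUOTA_OFF_STUB_MACRO`, `quota_off_stub()`) rather than the kernel's actual stubs:

```c
#include <assert.h>
#include <stddef.h>

struct super_block;	/* opaque here; only pointers to it are passed around */

/*
 * Macro stub: "QUOTA_OFF_STUB_MACRO(sb);" expands to "(0);", which
 * compilers may flag as a statement with no effect. The argument is
 * also discarded unchecked, so type errors in it go unnoticed.
 */
#define QUOTA_OFF_STUB_MACRO(sb) (0)

/*
 * Inline stub: a call is an ordinary function call, so using it as a
 * statement is fine, and the argument type is still checked.
 */
static inline int quota_off_stub(struct super_block *sb)
{
    (void)sb;	/* unused in the stubbed-out configuration */
    return 0;
}

/* Self-check: both forms yield 0 when the value is actually used. */
static inline void quota_stub_selfcheck(void)
{
    struct super_block *sb = NULL;

    quota_off_stub(sb);	/* no warning when used as a statement */
    assert(quota_off_stub(sb) == 0);
    assert(QUOTA_OFF_STUB_MACRO(sb) == 0);
}
```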
Jan Kara	0ff5af8340	quota: quota core changes for quotaon on remount Currently, we just turn quotas off on remount of filesystem to read-only state. The patch below adds necessary framework so that we can turn quotas off on remount RO but we are able to automatically reenable them again when filesystem is remounted to RW state. All we need to do is to keep references to inodes of quota files when remounting RO and using these references to reenable quotas when remounting RW. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:33 -07:00
Jan Kara	03f6e92bdd	quota: various style cleanups Cleanups in quota code: Change __inline__ to inline. Change some macros to inline functions. Remove vfs_quota_off_mount() macro. DQUOT_OFF() should be (0) if CONFIG_QUOTA is disabled. Move declaration of mark_dquot_dirty and dirty_dquot from quota.h to dquot.c [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:33 -07:00
Andrew Perepechko	338bf9afda	quota: do not allow setting of quota limits to too high values We should check whether quota limits set via Q_SETQUOTA are not exceeding limits which quota format is able to handle. Signed-off-by: Andrew Perepechko <andrew.perepechko@sun.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Masami Hiramatsu	26b31c1908	kprobes: add (un)register_jprobes for batch registration Introduce unregister_/register_jprobes() for jprobe batch registration. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: David Miller <davem@davemloft.net> Cc: "Frank Ch. Eigler" <fche@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Masami Hiramatsu	4a296e07c3	kprobes: add (un)register_kretprobes for batch registration Introduce unregister_/register_kretprobes() for kretprobe batch registration. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: David Miller <davem@davemloft.net> Cc: "Frank Ch. Eigler" <fche@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Masami Hiramatsu	9861668f74	kprobes: add (un)register_kprobes for batch registration Introduce unregister_/register_kprobes() for kprobe batch registration. This can reduce waiting time for synchronize_sched() when a lot of probes have to be unregistered at once. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: David Miller <davem@davemloft.net> Cc: "Frank Ch. Eigler" <fche@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Masami Hiramatsu	9960257281	list.h: add list_is_singular() Add list_is_singular() to check a list has just one entry. list_is_singular() is useful to check whether a list_head which have been temporarily allocated for listing objects can be released or not. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
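The check that list_is_singular() performs can be sketched outside the kernel as follows. The `struct list_head` here is a minimal stand-in for the kernel's circular doubly linked list, and `example_list_init()`, `example_list_add_tail()` and `example_list_empty()` are simplified helpers written for this illustration. A list has exactly one entry when it is non-empty and its first and last entries are the same node.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's circular, doubly linked list head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list points back at its own head in both directions. */
static inline void example_list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link entry in just before head, i.e. at the tail of the list. */
static inline void example_list_add_tail(struct list_head *entry,
                                         struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

static inline bool example_list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Exactly one entry: non-empty, and first entry == last entry. */
static inline bool list_is_singular(const struct list_head *head)
{
    return !example_list_empty(head) && head->next == head->prev;
}

/* Self-check covering the empty, one-entry and two-entry cases. */
static inline void list_is_singular_selfcheck(void)
{
    struct list_head head, a, b;

    example_list_init(&head);
    assert(!list_is_singular(&head));
    example_list_add_tail(&a, &head);
    assert(list_is_singular(&head));
    example_list_add_tail(&b, &head);
    assert(!list_is_singular(&head));
}
```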
Srinivasa Ds	3d8d996e0c	kprobes: prevent probing of preempt_schedule() Prohibit users from probing preempt_schedule(). One way of prohibiting the user from probing functions is by marking such functions with __kprobes. But this method doesn't work for those functions, which are already marked to different section like preempt_schedule() (belongs to __sched section). So we use blacklist approach to refuse user from probing these functions. In blacklist approach we populate the blacklisted function's starting address and its size in kprobe_blacklist structure. Then we verify the user specified address against start and end of the blacklisted function. So any attempt to register probe on blacklisted functions will be rejected. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Jim Keniston <jkenisto@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
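The blacklist check described above amounts to a range test of the requested probe address against each blacklisted function's start and size. A minimal sketch, assuming a table of start address and size per function; the struct and function names (`example_blackpoint`, `example_addr_blacklisted()`) and the addresses in the self-check are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical blacklist entry: a function's start address and size. */
struct example_blackpoint {
    const char *name;
    unsigned long start_addr;
    unsigned long range;
};

/* True if addr falls inside [start_addr, start_addr + range) of any entry. */
static inline bool
example_addr_blacklisted(const struct example_blackpoint *table, size_t n,
                         unsigned long addr)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (addr >= table[i].start_addr &&
            addr < table[i].start_addr + table[i].range)
            return true;
    }
    return false;
}

/* Self-check with made-up addresses for one blacklisted function. */
static inline void example_blacklist_selfcheck(void)
{
    const struct example_blackpoint table[] = {
        { "preempt_schedule", 0x1000UL, 0x40UL },
    };

    assert(example_addr_blacklisted(table, 1, 0x1000UL));
    assert(example_addr_blacklisted(table, 1, 0x103fUL));
    assert(!example_addr_blacklisted(table, 1, 0x1040UL));
}
```

A registration path would run this test first and reject the probe if it returns true.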
Karl Dahlke	0341a4d0fd	VT notifier extension for accessibility Some accessibility modules need to be able to catch the output on the console before the VT interpretation, and possibly swallow it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Magnus Damm	61711f8fd8	sm501: add uart support This patch extends the sm501 mfd with 8250 uart support. We're currently doing this in the board specific r2d-1 code already, but it would be nice to move things into the mfd since it's more chip specific than board specific. Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Ben Dooks <ben-linux@fluff.org> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Thomas Petazzoni	7ae9392c0a	x86: configurable DMI scanning code Turn CONFIG_DMI into a selectable option if EMBEDDED is defined, in order to be able to remove the DMI table scanning code if it's not needed, and then reduce the kernel code size. With CONFIG_DMI (i.e before) : text data bss dec hex filename 1076076 128656 98304 1303036 13e1fc vmlinux Without CONFIG_DMI (i.e after) : text data bss dec hex filename 1068092 126308 98304 1292704 13b9a0 vmlinux Result: text data bss dec hex filename -7984 -2348 0 -10332 -285c vmlinux The new option appears in "Processor type and features", only when CONFIG_EMBEDDED is defined. This patch is part of the Linux Tiny project, and is based on previous work done by Matt Mackall <mpm@selenic.com>. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Anvin" <hpa@zytor.com> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:30 -07:00
Joe Perches	0fab6de09c	synclink drivers bool conversion Remove more TRUE/FALSE defines and uses Remove == TRUE tests Convert BOOLEAN to bool Convert int to bool where appropriate Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Paul Fulghum <paulkf@microgate.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:29 -07:00
Harvey Harrison	cdf8803768	ncpfs: add prototypes to ncp_fs.h Removes some externs from C files, noticed from the sparse warnings: fs/ncpfs/dir.c:90:26: warning: symbol 'ncp_root_dentry_operations' was not declared. Should it be static? fs/ncpfs/symlink.c:107:5: warning: symbol 'ncp_symlink' was not declared. Should it be static? fs/ncpfs/symlink.c:101:39: warning: symbol 'ncp_symlink_aops' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Petr Vandrovec <VANDROVE@vc.cvut.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:29 -07:00
Andrew G. Morgan	3898b1b4eb	capabilities: implement per-process securebits Filesystem capability support makes it possible to do away with (set)uid-0 based privilege and use capabilities instead. That is, with filesystem support for capabilities but without this present patch, it is (conceptually) possible to manage a system with capabilities alone and never need to obtain privilege via (set)uid-0. Of course, conceptually isn't quite the same as currently possible since few user applications, certainly not enough to run a viable system, are currently prepared to leverage capabilities to exercise privilege. Further, many applications exist that may never get upgraded in this way, and the kernel will continue to want to support their setuid-0 base privilege needs. Where pure-capability applications evolve and replace setuid-0 binaries, it is desirable that there be a mechanism by which they can contain their privilege. In addition to leveraging the per-process bounding and inheritable sets, this should include suppressing the privilege of the uid-0 superuser from the process' tree of children. The feature added by this patch can be leveraged to suppress the privilege associated with (set)uid-0. This suppression requires CAP_SETPCAP to initiate, and only immediately affects the 'current' process (it is inherited through fork()/exec()). This reimplementation differs significantly from the historical support for securebits which was system-wide, unwieldy and which has ultimately withered to a dead relic in the source of the modern kernel. With this patch applied a process, that is capable(CAP_SETPCAP), can now drop all legacy privilege (through uid=0) for itself and all subsequently fork()'d/exec()'d children with: prctl(PR_SET_SECUREBITS, 0x2f); This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is enabled at configure time. 
[akpm@linux-foundation.org: fix uninitialised var warning] [serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Paul Moore <paul.moore@hp.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:26 -07:00
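The value 0x2f passed to prctl(PR_SET_SECUREBITS, ...) in this commit message can be decomposed into individual securebits. The sketch below assumes the bit positions used by `<linux/securebits.h>` (NOROOT at bit 0, NO_SETUID_FIXUP at bit 2, KEEP_CAPS at bit 4, each followed by its lock bit); the constants are redefined with a `SKETCH_` prefix so the example compiles without kernel headers, and the helper name is hypothetical.

```c
#include <assert.h>

/* Bit values assumed to mirror <linux/securebits.h>; redefined locally. */
enum {
    SKETCH_NOROOT                 = 1 << 0, /* uid 0 gains no capabilities on exec */
    SKETCH_NOROOT_LOCKED          = 1 << 1, /* NOROOT can no longer be changed */
    SKETCH_NO_SETUID_FIXUP        = 1 << 2, /* no capability fixup on setuid() */
    SKETCH_NO_SETUID_FIXUP_LOCKED = 1 << 3,
    SKETCH_KEEP_CAPS              = 1 << 4, /* deliberately left clear below */
    SKETCH_KEEP_CAPS_LOCKED       = 1 << 5  /* locks KEEP_CAPS in its cleared state */
};

/* Compose the mask that suppresses legacy uid-0 privilege and locks it. */
static unsigned int sketch_drop_legacy_root_bits(void)
{
    return SKETCH_NOROOT | SKETCH_NOROOT_LOCKED |
           SKETCH_NO_SETUID_FIXUP | SKETCH_NO_SETUID_FIXUP_LOCKED |
           SKETCH_KEEP_CAPS_LOCKED;
}
```

A process holding CAP_SETPCAP would pass the resulting mask as the second argument of prctl(PR_SET_SECUREBITS, ...); the sketch only computes the mask and does not make the call.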
KAMEZAWA Hiroyuki	8cece85ec7	mm: fix broken gfp_zone with __GFP_THISNODE This hack, "base = MAX_NR_ZONES", at __GFP_THISNODE was used for the old zonelists. Now the new zonelist[] has a list for __GFP_THISNODE, so this hack is incorrect and should be removed. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:26 -07:00
Yasunori Goto	e70260aabe	memory hotplug: make alloc_bootmem_section() alloc_bootmem_section() can allocate memory within a specified section. A later patch uses this to keep the usemap in the same section as the pgdat. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:25 -07:00
Yasunori Goto	0475327876	memory hotplug: register section/node id to free This patch set is to free pages which are allocated by bootmem for memory-hotremove. Some structures of memory management are allocated by bootmem, e.g. memmap. To remove memory physically, some of them must be freed according to circumstance. This patch set provides the basis for freeing those pages and for freeing memmaps. My basic idea is to use the remaining members of struct page to remember information about the users of bootmem (section number or node id). When a section is being removed, the kernel can check it. With this information, some issues can be solved. 1) When the memmap of the section being removed is allocated on another section by bootmem, it should/can be freed. 2) When the memmap of the section being removed is allocated on the same section, it shouldn't be freed, because the section has to be logically offlined already and all pages must be isolated from the page allocator. If it is freed, the page allocator may use memory which will be removed physically soon. 3) When the section being removed holds another section's memmap, the kernel will be able to show the user easily which section should be removed before it. (Not implemented yet) 4) In case 2) above, the page isolation will be able to check and skip the memmap's pages during logical memory offline (offline_pages()). The current page isolation code fails in this case because such a page is just a reserved page and it can't tell whether the page can be removed or not. But it will be able to with this patch. (Not implemented yet.) 5) The node information like pgdat has similar issues. But this can be solved too by this approach. (Not implemented yet, but the node id is remembered in the pages.) Fortunately, the current bootmem allocator just keeps PageReserved flags, and doesn't use any other members of page struct. The users of bootmem don't use them either. This patch: This registers the node or section id information. 
The kernel can then distinguish which node/section uses the pages allocated by bootmem. This is the basis for hot-removing sections or nodes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:25 -07:00
Gerald Schaefer	6d779079bf	hugetlbfs: architecture header cleanup This patch moves all architecture functions for hugetlb to architecture header files (include/asm-foo/hugetlb.h) and converts all macros to inline functions. It also removes (!) ARCH_HAS_HUGEPAGE_ONLY_RANGE, ARCH_HAS_HUGETLB_FREE_PGD_RANGE, ARCH_HAS_PREPARE_HUGEPAGE_RANGE, ARCH_HAS_SETCLEAR_HUGE_PTE and ARCH_HAS_HUGETLB_PREFAULT_HOOK. Getting rid of the ARCH_HAS_xxx #ifdef and macro fugliness should increase readability and maintainability, at the price of some code duplication. An asm-generic common part would have reduced the loc, but we would end up with new ARCH_HAS_xxx defines eventually. Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:25 -07:00
Lee Schermerhorn	71fe804b6d	mempolicy: use struct mempolicy pointer in shmem_sb_info This patch replaces the mempolicy mode, mode_flags, and nodemask in the shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL. This removes dependency on the details of mempolicy from shmem.c and hugetlbfs inode.c and simplifies the interfaces. mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context' argument that causes the input nodemask to be stored in the w.user_nodemask of the created mempolicy for use when the mempolicy is installed in a tmpfs inode shared policy tree. At that time, any cpuset contextualization is applied to the original input nodemask. This preserves the previous behavior where the input nodemask was stored in the superblock. We can think of the returned mempolicy as "context free". Because mpol_parse_str() is now calling mpol_new(), we can remove from mpol_to_str() the semantic checks that mpol_new() already performs. Add 'no_context' parameter to mpol_to_str() to specify that it should format the nodemask in w.user_nodemask for 'bind' and 'interleave' policies. Change mpol_shared_policy_init() to take a pointer to a "context free" struct mempolicy and to create a new, "contextualized" mempolicy using the mode, mode_flags and user_nodemask from the input mempolicy. Note: we know that the mempolicy passed to mpol_to_str() or mpol_shared_policy_init() from a tmpfs superblock is "context free". This is currently the only instance thereof. However, if we found more uses for this concept, and introduced any ambiguity as to whether a mempolicy was context free or not, we could add another internal mode flag to identify context free mempolicies. Then, we could remove the 'no_context' argument from mpol_to_str(). 
Added shmem_get_sbmpol() to return a reference counted superblock mempolicy, if one exists, to pass to mpol_shared_policy_init(). We must add the reference under the sb stat_lock to prevent races with replacement of the mpol by remount. This reference is removed in mpol_shared_policy_init(). [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: another build fix] [akpm@linux-foundation.org: yet another build fix] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:25 -07:00
Lee Schermerhorn	095f1fc4eb	mempolicy: rework shmem mpol parsing and display mm/shmem.c currently contains functions to parse and display memory policy strings for the tmpfs 'mpol' mount option. Move this to mm/mempolicy.c with the rest of the mempolicy support. With subsequent patches, we'll be able to remove knowledge of the details [mode, flags, policy, ...] completely from shmem.c 1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in mm/mempolicy.c. Rework to use the policy_types[] array [used by mpol_to_str()] to look up mode by name. 2) use mpol_to_str() to format policy for shmem_show_mpol(). mpol_to_str() expects a pointer to a struct mempolicy, so temporarily construct one. This will be replaced with a reference to a struct mempolicy in the tmpfs superblock in a subsequent patch. NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal sign '=' as the nodemask delimiter to match mpol_parse_str() and the tmpfs/shmem mpol mount option formatting that now uses mpol_to_str(). This is a user visible change to numa_maps, but then the addition of the mode flags already changed the display. It makes sense to me to have the mounts and numa_maps display the policy in the same format. However, if anyone objects strongly, I can pass the desired nodemask delimiter as an arg to mpol_to_str(). Note 2: Like show_numa_map(), I don't check the return code from mpol_to_str(). I do use a longer buffer than the one provided by show_numa_map(), which seems to have sufficed so far. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:24 -07:00
Lee Schermerhorn	fc36b8d3d8	mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy Now that we're using "preferred local" policy for system default, we need to make this as fast as possible. Because of the variable size of the mempolicy structure [based on size of nodemasks], the preferred_node may be in a different cacheline from the mode. This can result in accessing an extra cacheline in the normal case of system default policy. Suspect this is the cause of an observed 2-3% slowdown in page fault testing relative to kernel without this patch series. To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy flags member which is guaranteed [?] to be in the same cacheline as the mode itself. Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1 for both anon and shmem segments with system default and vma [preferred local] policy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:24 -07:00
Lee Schermerhorn	52cd3b0740	mempolicy: rework mempolicy Reference Counting [yet again] After further discussion with Christoph Lameter, it has become clear that my earlier attempts to clean up the mempolicy reference counting were a bit of overkill in some areas, resulting in superfluous ref/unref in what are usually fast paths. In other areas, further inspection reveals that I botched the unref for interleave policies. A separate patch, suitable for upstream/stable trees, fixes up the known errors in the previous attempt to fix reference counting. This patch reworks the memory policy reference counting and, one hopes, simplifies the code. Maybe I'll get it right this time. See the update to the numa_memory_policy.txt document for a discussion of memory policy reference counting that motivates this patch. Summary: Lookup of mempolicy, based on (vma, address) need only add a reference for shared policy, and we need only unref the policy when finished for shared policies. So, this patch backs out all of the unneeded extra reference counting added by my previous attempt. It then unrefs only shared policies when we're finished with them, using the mpol_cond_put() [conditional put] helper function introduced by this patch. Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma containing just the policy. read_swap_cache_async() can call alloc_page_vma() multiple times, so we can't let alloc_page_vma() unref the shared policy in this case. To avoid this, we make a copy of any non-null shared policy and remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading a page [or multiple pages] from swap, so the overhead should not be an issue here. I introduced a new static inline function "mpol_cond_copy()" to copy the shared policy to an on-stack policy and remove the flags that would require a conditional free. 
The current implementation of mpol_cond_copy() assumes that the struct mempolicy contains no pointers to dynamically allocated structures that must be duplicated or reference counted during copy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:24 -07:00
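The conditional put and conditional copy described in this commit can be modelled in a few lines. This is only a sketch under stated assumptions: the `sketch_` types and functions are hypothetical stand-ins, the flag value is made up, the reference count is a plain int with no freeing at zero, and the handling of the lookup reference on the original policy after a copy is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MPOL_F_SHARED 0x1u /* hypothetical value for the shared flag */

/* Cut-down policy object: a reference count and internal mode flags. */
struct sketch_mempolicy {
    int refcnt;
    unsigned int flags;
};

/* Unconditional put: drop one reference (the real code frees at zero). */
static void sketch_mpol_put(struct sketch_mempolicy *pol)
{
    if (pol)
        pol->refcnt--;
}

/* Conditional put: only shared policies took an extra ref on lookup,
 * so only those are unreferenced when the caller is finished. */
static void sketch_mpol_cond_put(struct sketch_mempolicy *pol)
{
    if (pol && (pol->flags & SKETCH_MPOL_F_SHARED))
        sketch_mpol_put(pol);
}

/* Conditional copy: duplicate a shared policy into caller-provided
 * (e.g. on-stack) storage and clear the shared flag, so that later
 * conditional puts on the copy are no-ops. Non-shared policies are
 * returned unchanged. */
static struct sketch_mempolicy *
sketch_mpol_cond_copy(struct sketch_mempolicy *to, struct sketch_mempolicy *from)
{
    if (from == NULL || !(from->flags & SKETCH_MPOL_F_SHARED))
        return from;
    *to = *from;
    to->flags &= ~SKETCH_MPOL_F_SHARED;
    return to;
}
```

The plain structure assignment in the copy only works because the sketch, like the implementation described above, assumes the policy holds no pointers to dynamically allocated data.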
Lee Schermerhorn	a6020ed759	mempolicy: document {set|get}_policy() vm_ops APIs Document mempolicy return value reference semantics assumed by the rest of the mempolicy code for the set_ and get_policy vm_ops in <linux/mm.h>--where the prototypes are defined--to inform any future mempolicy vm_op writers what the rest of the subsystem expects of them. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:24 -07:00
Lee Schermerhorn	aab0b1029f	mempolicy: mark shared policies for unref As part of yet another rework of mempolicy reference counting, we want to be able to identify shared policies efficiently, because they have an extra ref taken on lookup that needs to be removed when we're finished using the policy. Note: the extra ref is required because the policies are shared between tasks/processes and can be changed/freed by one task while another task is using them--e.g., for page allocation. Building on David Rientjes mempolicy "mode flags" enhancement, this patch indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags member of the struct mempolicy added by David. MPOL_F_SHARED, and any future "internal mode flags" are reserved from bit zero up, as they will never be passed in the upper bits of the mode argument of a mempolicy API. I set the MPOL_F_SHARED flag when the policy is installed in the shared policy rb-tree. Don't need/want to clear the flag when removing from the tree as the mempolicy is freed [unref'd] internally to the sp_delete() function. However, a task could hold another reference on this mempolicy from a prior lookup. We need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will unref, eventually freeing, the mempolicy. A later patch in this series will introduce a function to conditionally unref [mpol_free] a policy. The MPOL_F_SHARED flag is one reason [currently the only reason] to unref/free a policy via the conditional free. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:24 -07:00
Lee Schermerhorn	45c4745af3	mempolicy: rename struct mempolicy 'policy' member to 'mode' The terms 'policy' and 'mode' are both used in various places to describe the semantics of the value stored in the 'policy' member of struct mempolicy. Furthermore, the term 'policy' is used to refer to that member, to the entire struct mempolicy and to the more abstract concept of the tuple consisting of a "mode" and an optional node or set of nodes. Recently, we have added "mode flags" that are passed in the upper bits of the 'mode' [or sometimes, 'policy'] member of the numa APIs. I'd like to resolve this confusion, which perhaps only exists in my mind, by renaming the 'policy' member to 'mode' throughout, and fixing up the Documentation. Man pages will be updated separately. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:24 -07:00
Lee Schermerhorn	846a16bf0f	mempolicy: rename mpol_copy to mpol_dup This patch renames mpol_copy() to mpol_dup() because, well, that's what it does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an existing mempolicy, allocates a new one and copies the contents. In a later patch, I want to use the name mpol_copy() to copy the contents from one mempolicy to another like, e.g., strcpy() does for strings. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Lee Schermerhorn	f0be3d32b0	mempolicy: rename mpol_free to mpol_put This is a change that was requested some time ago by Mel Gorman. Makes sense to me, so here it is. Note: I retain the name "mpol_free_shared_policy()" because it actually does free the shared_policy, which is NOT a reference counted object. However, ... The mempolicy object[s] referenced by the shared_policy are reference counted, so mpol_put() is used to release the reference held by the shared_policy. The mempolicy might not be freed at this time, because some task attached to the shared object associated with the shared policy may be in the process of allocating a page based on the mempolicy. In that case, the task performing the allocation will hold a reference on the mempolicy, obtained via mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks holding such a reference have called mpol_put() for the mempolicy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Adam Litke	3b11630063	hugetlb: vmstat events for huge page allocations Allocating huge pages directly from the buddy allocator is not guaranteed to succeed. Success depends on several factors (such as the amount of physical memory available and the level of fragmentation). With the addition of dynamic hugetlb pool resizing, allocations can occur much more frequently. For these reasons it is desirable to keep track of huge page allocation successes and failures. Add two new vmstat entries to track huge page allocations that succeed and fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE being enabled. [akpm@linux-foundation.org: reduced ifdeffery] Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Eric Munson <ebmunson@us.ibm.com> Tested-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andy Whitcroft <apw@shadowen.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Nick Piggin	70688e4dd1	xip: support non-struct page backed memory Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for the user mappings. This requires the get_xip_page API to be changed to an address based one. Improve the API layering a little bit too, while we're here. This is required in order to support XIP filesystems on memory that isn't backed with struct page (but memory with struct page is still supported too). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Carsten Otte <cotte@de.ibm.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Jared Hulbert	30afcb4bd2	return pfn from direct_access, for XIP Alter the block device ->direct_access() API to work with the new get_xip_mem() API (which requires that both kaddr and pfn be returned). Some architectures will not do the right thing in their virt_to_page() for use by XIP (to translate from the kernel virtual address returned by direct_access(), to a user mappable pfn in XIP's page fault handler). However, we can't switch it to just return the pfn and not the kaddr, because we have no good way to get a kva from a pfn, and XIP requires the kva for its read(2) and write(2) handlers. So we have to return both. Signed-off-by: Jared Hulbert <jaredeh@gmail.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-mm@kvack.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Nick Piggin	423bad6004	mm: add vm_insert_mixed vm_insert_mixed will insert either a raw pfn or a refcounted struct page into the page tables, depending on whether vm_normal_page() will return the page or not. With the introduction of the new pte bit, this is now too tricky for drivers to be doing themselves. filemap_xip uses this in a subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Nick Piggin	7e675137a8	mm: introduce pte_special pte bit s390 for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory model (which is more dynamic than most). Instead, they had proposed to implement it with an additional path through vm_normal_page(), using a bit in the pte to determine whether or not the page should be refcounted: vm_normal_page() { ... if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { #ifdef s390 if (!mixedmap_refcount_pte(pte)) return NULL; #else if (!pfn_valid(pfn)) return NULL; #endif goto out; } ... } This is fine, however if we are allowed to use a bit in the pte to determine refcountedness, we can use that to _completely_ replace all the vma based schemes. So instead of adding more cases to the already complex vma-based scheme, we can have a clearly separate and simple pte-based scheme (and get slightly better code generation in the process): vm_normal_page() { #ifdef s390 if (!mixedmap_refcount_pte(pte)) return NULL; return pte_page(pte); #else ... #endif } And finally, we may rather make this concept usable by any architecture rather than making it s390 only, so implement a new type of pte state for this. Unfortunately the old vma based code must stay, because some architectures may not be able to spare pte bits. This makes vm_normal_page a little bit more ugly than we would like, but the 2 cases are clearly separate. So introduce a pte_special pte state, and use it in mm/memory.c. It is currently a noop for all architectures, so this doesn't actually result in any compiled code changes to mm/memory.o. BTW: I haven't put vm_normal_page() into arch code as-per an earlier suggestion. The reason is that, regardless of where vm_normal_page is actually implemented, the abstraction is still exactly the same. 
Also, while it depends on whether the architecture has pte_special or not, those are the only two possible cases, and it really isn't an arch specific function -- the role of the arch code should be to provide primitive functions and accessors with which to build the core code; pte_special does that. We do not want architectures to know or care about vm_normal_page itself, and we definitely don't want them being able to invent something new there out of sight of mm/ code. If we made vm_normal_page an arch function, then we have to make vm_insert_mixed (next patch) an arch function too. So I don't think moving it to arch code fundamentally improves any abstractions, while it does practically make the code more difficult to follow, for both mm and arch developers, and easier to misuse. [akpm@linux-foundation.org: build fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Carsten Otte <cotte@de.ibm.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Jared Hulbert	b379d79019	mm: introduce VM_MIXEDMAP This series introduces some important infrastructure work. The overall result is that: 1. We now support XIP backed filesystems using memory that have no struct page allocated to them. And patches 6 and 7 actually implement this for s390. This is pretty important in a number of cases. As far as I understand, in the case of virtualisation (eg. s390), each guest may mount a readonly copy of the same filesystem (eg. the distro). Currently, guests need to allocate struct pages for this image. So if you have 100 guests, you already need to allocate more memory for the struct pages than the size of the image. I think. (Carsten?) For other (eg. embedded) systems, you may have a very large non- volatile filesystem. If you have to have struct pages for this, then your RAM consumption will go up proportionally to fs size. Even though it is just a small proportion, the RAM can be much more costly eg in terms of power, so every KB less that Linux uses makes it more attractive to a lot of these guys. 2. VM_MIXEDMAP allows us to support mappings where you actually do want to refcount _some_ pages in the mapping, but not others, and support COW on arbitrary (non-linear) mappings. Jared needs this for his NVRAM filesystem in progress. Future iterations of this filesystem will most likely want to migrate pages between pagecache and XIP backing, which is where the requirement for mixed (some refcounted, some not) comes from. 3. pte_special also has a peripheral usage that I need for my lockless get_user_pages patch. That was shown to speed up "oltp" on db2 by 10% on a 2 socket system, which is kind of significant because they scrounge for months to try to find 0.1% improvement on these workloads. I'm hoping we might finally be faster than AIX on pSeries with this :). 
My reference to lockless get_user_pages is not meant to justify this patchset (which doesn't include lockless gup), but just to show that pte_special is not some s390 specific thing that should be hidden in arch code or xip code: I definitely want to use it on at least x86 and powerpc as well. This patch: Introduce a new type of mapping, VM_MIXEDMAP. This is unlike VM_PFNMAP in that it can support COW mappings of arbitrary ranges including ranges without struct page and ranges with a struct page that we actually want to refcount (PFNMAP can only support COW in those cases where the un-COW-ed translations are mapped linearly in the virtual address, and can only support non refcounted ranges). VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not refcounting !pfn_valid pages (which is not an option for VM_PFNMAP, because it needs to avoid refcounting pfn_valid pages eg. for /dev/mem mappings). Signed-off-by: Jared Hulbert <jaredeh@gmail.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Carsten Otte <cotte@de.ibm.com> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
Christoph Lameter	e20b8cca76	PAGEFLAGS_EXTENDED and separate page flags for Head and Tail Having separate page flags for the head and the tail of a compound page allows the compiler to use bitops instead of operations on a word to check for a tail page. That is f.e. important for virt_to_head_page() which is used in various critical code paths (kfree for example): Code for PageTail(page) Before: mov (%rdi),%rdx page->flags mov %rdx,%rax 3 bytes and $0x12000,%eax 5 bytes cmp $0x12000,%rax 6 bytes je 897 <kfree+0xa7> After: mov (%rdi),%rax test $0x40,%ah (3 bytes) jne 887 <kfree+0x97> So we go from 14 bytes to 3 bytes and from 3 instructions to one. From the use of 2 registers we go to none. We can only use page flags for this if we have page flags available. This patch introduces CONFIG_PAGEFLAGS_EXTENDED that is set if pageflags are not scarce due to SPARSEMEM using page flags for its sectionid on 32 bit NUMA platforms. Additional page flag definitions can be added to the CONFIG_PAGEFLAGS_EXTENDED section in page-flags.h if the functionality depends on PAGEFLAGS_EXTENDED or if more page flag overlapping tricks are used for the !PAGEFLAGS_EXTENDED fallback (the upcoming virtual compound patch may hook in here and Rik's/Lee's additional page flags to solve the reclaim issues could also be added there [hint... hint... where are these patchsets?]). Avoiding the overlaying of Pg_reclaim also clears the way for possible use of compound pages for the pagecache or on the LRU. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
Christoph Lameter	97965478a6	mm: Get rid of __ZONE_COUNT It was used to compensate because MAX_NR_ZONES was not available to the #ifdefs. Export MAX_NR_ZONES via the new mechanism and get rid of __ZONE_COUNT. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
Christoph Lameter	ec7cade8c1	page flags: add PAGEFLAGS_FALSE for flags that are always false Turns out that there are a number of times that a flag is simply always returning 0. Define a macro for that. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
Christoph Lameter	602c4d112f	page flags: handle PG_uncached like all other flags Remove the special setup for PG_uncached and simply make it part of the enum. The page flag will only be allocated when the kernel build includes the uncached allocator. Acked-by: Dean Nelson <dcn@sgi.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
Christoph Lameter	0a128b2b1a	pageflags: eliminate PG_xxx aliases Remove aliases of PG_xxx. We can easily drop those now and alias by specifying the PG_xxx flag in the macro that generates the functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
Christoph Lameter	d60cd46bbd	pageflags: use proper page flag functions in Xen Xen uses bitops to manipulate page flags. Make it use proper page flag functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
Christoph Lameter	6a1e7f777f	pageflags: convert to the use of new macros Replace explicit definitions of page flags through the use of macros. Significantly reduces the size of the definitions and removes a lot of opportunity for errors. Additional page flags can typically be generated with a single line. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
Christoph Lameter	f94a62e910	pageflags: introduce macros to generate page flag functions Introduce a set of macros that generate functions to handle page flags. A page flag function group typically starts with either SETPAGEFLAG(<part of function name>,<part of PG_ flagname>) to create a set of page flag operations that are atomic. Or __SETPAGEFLAG(<part of function name>,<part of PG_ flagname>) to create a set of page flag operations that are not atomic. Then additional operations can be added using the following macros: TESTSCFLAG (create additional atomic test-and-set and test-and-clear functions); TESTSETFLAG (create additional test and set function); TESTCLEARFLAG (create additional test and clear function); SETPAGEFLAG (create additional atomic set function); CLEARPAGEFLAG (create additional atomic clear function); __TESTPAGEFLAG (create additional non atomic set function); __SETPAGEFLAG (create additional non atomic clear function). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
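The generator-macro idea in the entry above can be sketched in user space roughly as follows. This is only an illustration: `struct demo_page`, the `DEMO_PG_*` enum and every `DEMO_*` macro are hypothetical names, not the kernel's, and the generated functions use plain (non-atomic) bit arithmetic rather than the kernel's bitops.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the flags word matters here. */
struct demo_page {
	unsigned long flags;
};

/* Hypothetical flag numbers kept in an enum, so a new flag is one line. */
enum demo_pageflags {
	DEMO_PG_locked,
	DEMO_PG_dirty,
	DEMO_NR_PAGEFLAGS,
};

/* The flag count is a compile-time constant, so it can be checked early. */
static_assert(DEMO_NR_PAGEFLAGS <= 8 * sizeof(unsigned long),
	      "too many demo page flags for one flags word");

/* Each generator emits one small inline function for the named flag. */
#define DEMO_TESTPAGEFLAG(uname, lname)					\
static inline bool Demo##uname(const struct demo_page *page)		\
{ return (page->flags >> DEMO_PG_##lname) & 1UL; }

#define DEMO_SETPAGEFLAG(uname, lname)					\
static inline void DemoSet##uname(struct demo_page *page)		\
{ page->flags |= 1UL << DEMO_PG_##lname; }

#define DEMO_CLEARPAGEFLAG(uname, lname)				\
static inline void DemoClear##uname(struct demo_page *page)		\
{ page->flags &= ~(1UL << DEMO_PG_##lname); }

/* Group generator: test, set and clear functions from a single line. */
#define DEMO_PAGEFLAG(uname, lname)					\
	DEMO_TESTPAGEFLAG(uname, lname)					\
	DEMO_SETPAGEFLAG(uname, lname)					\
	DEMO_CLEARPAGEFLAG(uname, lname)

DEMO_PAGEFLAG(Locked, locked)
DEMO_PAGEFLAG(Dirty, dirty)
```

With this layout, adding a flag means one enum entry plus one `DEMO_PAGEFLAG(...)` line, which is the maintenance benefit the commit message describes.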
Christoph Lameter	9223b4190f	pageflags: get rid of FLAGS_RESERVED NR_PAGEFLAGS specifies the number of page flags we are using. From that we can calculate the number of bits leftover that can be used for zone, node (and maybe the sections id). There is no need anymore for FLAGS_RESERVED if we use NR_PAGEFLAGS. Use the new methods to make NR_PAGEFLAGS available via the preprocessor. NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields. These field widths have to be available to the preprocessor. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: David Miller <davem@davemloft.net> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:21 -07:00
Christoph Lameter	e268318149	pageflags: use an enum for the flags Use an enum to ease the maintenance of page flags. This is going to change the numbering from 0 to 18. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:21 -07:00
Andrew Morton	726b801272	page_mapping(): add ifdef around reference to swapper_space This fixes the superh build when the pageflags patches are applied. But it shouldn't unless it's a gcc bug. Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:21 -07:00
Christoph Lameter	308c05e35e	sparsemem: vmemmap does not need section bits A set of patches that attempts to improve page flag handling. First of all a method is introduced to generate the page flag functions using macros. Then the number of page flags used by sparsemem is reduced. All page flag operations will no longer be macros. All flags will use inline function. Then we add a way to export enum constants to the preprocessor which allows us to get rid of __ZONE_COUNT and use the NR_PAGEFLAGS for the dynamic calculation of actually available page flags for fields. This patch: Sparsemem vmemmap does not need any section bits. This patch has the effect of reducing the number of bits used in page->flags by at least 6. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:21 -07:00
Christoph Lameter	2301696932	vmallocinfo: add caller information Add caller information so that /proc/vmallocinfo shows where the allocation request for a slice of vmalloc memory originated. Results in output like this: 0xffffc20000000000-0xffffc20000801000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages 0xffffc20000801000-0xffffc20000806000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc 0xffffc20000806000-0xffffc20000c07000 4198400 alloc_large_system_hash+0x127/0x246 pages=1024 vmalloc vpages 0xffffc20000c07000-0xffffc20000c0a000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc 0xffffc20000c0a000-0xffffc20000c0c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c0c000-0xffffc20000c0f000 12288 acpi_os_map_memory+0x13/0x1c phys=cff64000 ioremap 0xffffc20000c10000-0xffffc20000c15000 20480 acpi_os_map_memory+0x13/0x1c phys=cff65000 ioremap 0xffffc20000c16000-0xffffc20000c18000 8192 acpi_os_map_memory+0x13/0x1c phys=cff69000 ioremap 0xffffc20000c18000-0xffffc20000c1a000 8192 acpi_os_map_memory+0x13/0x1c phys=fed1f000 ioremap 0xffffc20000c1a000-0xffffc20000c1c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c1c000-0xffffc20000c1e000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c1e000-0xffffc20000c20000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c20000-0xffffc20000c22000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c22000-0xffffc20000c24000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap 0xffffc20000c24000-0xffffc20000c26000 8192 acpi_os_map_memory+0x13/0x1c phys=e0081000 ioremap 0xffffc20000c26000-0xffffc20000c28000 8192 acpi_os_map_memory+0x13/0x1c phys=e0080000 ioremap 0xffffc20000c28000-0xffffc20000c2d000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc 0xffffc20000c2d000-0xffffc20000c31000 16384 tcp_init+0xd5/0x31c pages=3 vmalloc 0xffffc20000c31000-0xffffc20000c34000 12288 
alloc_large_system_hash+0x127/0x246 pages=2 vmalloc 0xffffc20000c34000-0xffffc20000c36000 8192 init_vdso_vars+0xde/0x1f1 0xffffc20000c36000-0xffffc20000c38000 8192 pci_iomap+0x8a/0xb4 phys=d8e00000 ioremap 0xffffc20000c38000-0xffffc20000c3a000 8192 usb_hcd_pci_probe+0x139/0x295 [usbcore] phys=d8e00000 ioremap 0xffffc20000c3a000-0xffffc20000c3e000 16384 sys_swapon+0x509/0xa15 pages=3 vmalloc 0xffffc20000c40000-0xffffc20000c61000 135168 e1000_probe+0x1c4/0xa32 phys=d8a20000 ioremap 0xffffc20000c61000-0xffffc20000c6a000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20000c6a000-0xffffc20000c73000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20000c73000-0xffffc20000c7c000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20000c7c000-0xffffc20000c7f000 12288 e1000e_setup_tx_resources+0x29/0xbe pages=2 vmalloc 0xffffc20000c80000-0xffffc20001481000 8392704 pci_mmcfg_arch_init+0x90/0x118 phys=e0000000 ioremap 0xffffc20001481000-0xffffc20001682000 2101248 alloc_large_system_hash+0x127/0x246 pages=512 vmalloc 0xffffc20001682000-0xffffc20001e83000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages 0xffffc20001e83000-0xffffc20002204000 3674112 alloc_large_system_hash+0x127/0x246 pages=896 vmalloc vpages 0xffffc20002204000-0xffffc2000220d000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc2000220d000-0xffffc20002216000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20002216000-0xffffc2000221f000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc2000221f000-0xffffc20002228000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20002228000-0xffffc20002231000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap 0xffffc20002231000-0xffffc20002234000 12288 e1000e_setup_rx_resources+0x35/0x122 pages=2 vmalloc 0xffffc20002240000-0xffffc20002261000 135168 e1000_probe+0x1c4/0xa32 phys=d8a60000 ioremap 0xffffc20002261000-0xffffc2000270c000 4894720 sys_swapon+0x509/0xa15 pages=1194 vmalloc vpages 0xffffffffa0000000-0xffffffffa0022000 139264 module_alloc+0x4f/0x55 pages=33 vmalloc 
0xffffffffa0022000-0xffffffffa0029000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc 0xffffffffa002b000-0xffffffffa0034000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc 0xffffffffa0034000-0xffffffffa003d000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc 0xffffffffa003d000-0xffffffffa0049000 49152 module_alloc+0x4f/0x55 pages=11 vmalloc 0xffffffffa0049000-0xffffffffa0050000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:21 -07:00
Christoph Lameter	a10aa57987	vmalloc: show vmalloced areas via /proc/vmallocinfo Implement a new proc file that allows the display of the currently allocated vmalloc memory. It allows to see the users of vmalloc. That is important if vmalloc space is scarce (i386 for example). And it's going to be important for the compound page fallback to vmalloc. Many of the current users can be switched to use compound pages with fallback. This means that the number of users of vmalloc is reduced and page tables no longer necessary to access the memory. /proc/vmallocinfo allows to review how that reduction occurs. If memory becomes fragmented and larger order allocations are no longer possible then /proc/vmallocinfo allows to see which compound page allocations fell back to virtual compound pages. That is important for new users of virtual compound pages. Such as order 1 stack allocation etc that may fallback to virtual compound pages in the future. /proc/vmallocinfo permissions are made readable-only-by-root to avoid possible information leakage. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: CONFIG_MMU=n build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:21 -07:00
Andrew Morton	b454456841	mm: make early_pfn_to_nid() a C function Fix this (sparc64) mm/sparse-vmemmap.c: In function `vmemmap_verify': mm/sparse-vmemmap.c:64: warning: unused variable `pfn' by switching to a C function which touches its arg. (reason 3,555 why macros are bad) Also, the `nid' arg was misnamed. Reviewed-by: Christoph Lameter <clameter@sgi.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:20 -07:00
Miklos Szeredi	ac6aadb24b	mm: rotate_reclaimable_page() cleanup Clean up messy conditional calling of test_clear_page_writeback() from both rotate_reclaimable_page() and end_page_writeback(). The only user of rotate_reclaimable_page() is end_page_writeback() so this is OK. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:20 -07:00
Andi Kleen	7edf85aa3c	mm: save some bytes in mm_struct by filling holes on 64bit Save some bytes in mm_struct by filling holes Putting int values together for better packing on 64bit shrinks sizeof(struct mm_struct) from 776 bytes to 764 bytes. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:20 -07:00
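The hole-filling technique from the entry above, shown on a toy structure. The two `demo_*` structs are hypothetical and unrelated to the real layout of `mm_struct`; the point is only that grouping the `int` members removes alignment padding.

```c
#include <assert.h>
#include <stddef.h>

/* Each int is followed by a long, so the compiler may pad after it. */
struct demo_with_holes {
	long a;
	int b;
	long c;
	int d;
};

/* Same members, with the two ints placed next to each other. */
struct demo_packed {
	long a;
	long c;
	int b;
	int d;
};

/* Reordering must never make the structure larger. */
static_assert(sizeof(struct demo_packed) <= sizeof(struct demo_with_holes),
	      "grouping the ints should not grow the struct");

/* Bytes saved by the reordering on the ABI this is compiled for. */
static size_t demo_saved_bytes(void)
{
	return sizeof(struct demo_with_holes) - sizeof(struct demo_packed);
}
```

On a typical LP64 ABI the first layout occupies 32 bytes and the second 24; the exact figures depend on the target's alignment rules.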
David Rientjes	3842b46de6	mempolicy: small header file cleanup Removes forward definition of vm_area_struct in linux/mempolicy.h. We already get it from the linux/slab.h -> linux/gfp.h include. Removes the unused mpol_set_vma_default() macro from linux/mempolicy.h. Removes the extern definition of default_policy since it is only referenced, as it should be, in mm/mempolicy.c. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:20 -07:00
David Rientjes	4c50bc0116	mempolicy: add MPOL_F_RELATIVE_NODES flag Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies nodemasks passed via set_mempolicy() or mbind() should be considered relative to the current task's mems_allowed. When the mempolicy is created, the passed nodemask is folded and mapped onto the current task's mems_allowed. For example, consider a task using set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a nodemask of 1-3. If current's mems_allowed is 4-7, the effected nodemask is 5-7 (the second, third, and fourth node of mems_allowed). If the same task is attached to a cpuset, the mempolicy nodemask is rebound each time the mems are changed. Some possible rebinds and results (mems -> result) are: 1-3 -> 1-3; 1-7 -> 2-4; 1,5-6 -> 1,5-6; 1,5-7 -> 5-7. Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned to the resultant nodemask from the relative remap. In the MPOL_PREFERRED case, the preferred node is remapped from the currently effected nodemask to the relative nodemask. This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:19 -07:00
Paul Jackson	7ea931c9fc	mempolicy: add bitmap_onto() and bitmap_fold() operations The following adds two more bitmap operators, bitmap_onto() and bitmap_fold(), with the usual cpumask and nodemask wrappers. The bitmap_onto() operator computes one bitmap relative to another. If the n-th bit in the origin mask is set, then the m-th bit of the destination mask will be set, where m is the position of the n-th set bit in the relative mask. The bitmap_fold() operator folds a bitmap into a second that has bit m set iff the input bitmap has some bit n set, where m == n mod sz, for the specified sz value. There are two substantive changes between this patch and its predecessor bitmap_relative: 1) Renamed bitmap_relative() to be bitmap_onto(). 2) Added bitmap_fold(). The essential motivation for bitmap_onto() is to provide a mechanism for converting a cpuset-relative CPU or Node mask to an absolute mask. Cpuset relative masks are written as if the current task were in a cpuset whose CPUs or Nodes were just the consecutive ones numbered 0..N-1, for some N. The bitmap_onto() operator is provided in anticipation of adding support for the first such cpuset relative mask, by the mbind() and set_mempolicy() system calls, using a planned flag of MPOL_F_RELATIVE_NODES. These bitmap operators (and their nodemask wrappers, in particular) will be used in code that converts the user specified cpuset relative memory policy to a specific system node numbered policy, given the current mems_allowed of the task's cpuset. Such cpuset relative mempolicies will address two deficiencies of the existing interface between cpusets and mempolicies: 1) A task cannot at present reliably establish a cpuset relative mempolicy because there is an essential race condition, in that the task's cpuset may be changed in between the time the task can query its cpuset placement, and the time the task can issue the applicable mbind or set_mempolicy system call. 
2) A task cannot at present establish what cpuset relative mempolicy it would like to have, if it is in a smaller cpuset than it might have mempolicy preferences for, because the existing interface only allows specifying mempolicies for nodes currently allowed by the cpuset. Cpuset relative mempolicies are useful for tasks that don't distinguish particularly between one CPU or Node and another, but only between how many of each are allowed, and the proper placement of threads and memory pages on the various CPUs and Nodes available. The motivation for the added bitmap_fold() can be seen in the following example. Let's say an application has specified some mempolicies that presume 16 memory nodes, including say a mempolicy that specified MPOL_F_RELATIVE_NODES (cpuset relative) nodes 12-15. Then lets say that application is crammed into a cpuset that only has 8 memory nodes, 0-7. If one just uses bitmap_onto(), this mempolicy, mapped to that cpuset, would ignore the requested relative nodes above 7, leaving it empty of nodes. That's not good; better to fold the higher nodes down, so that some nodes are included in the resulting mapped mempolicy. In this case, the mempolicy nodes 12-15 are taken modulo 8 (the weight of the mems_allowed of the confining cpuset), resulting in a mempolicy specifying nodes 4-7. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: <kosaki.motohiro@jp.fujitsu.com> Cc: <ray-lk@madrabbit.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:19 -07:00
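A minimal sketch of the fold and onto operators described above, restricted to a single 64-bit word. The `demo_*` helpers are hypothetical stand-ins written from the definitions in the commit message; they are not the kernel's `bitmap_fold()` and `bitmap_onto()`, which take bitmap pointers and a bit count.

```c
#include <assert.h>
#include <stdint.h>

/* Fold: bit m of the result is set iff some bit n of orig has n % sz == m. */
static uint64_t demo_fold(uint64_t orig, unsigned int sz)
{
	uint64_t dst = 0;

	assert(sz > 0 && sz <= 64);
	for (unsigned int n = 0; n < 64; n++)
		if (orig & (UINT64_C(1) << n))
			dst |= UINT64_C(1) << (n % sz);
	return dst;
}

/* Onto: if bit n of orig is set, set bit m, the n-th set bit of relmap. */
static uint64_t demo_onto(uint64_t orig, uint64_t relmap)
{
	uint64_t dst = 0;
	unsigned int n = 0;	/* ordinal of the current set bit in relmap */

	for (unsigned int m = 0; m < 64; m++) {
		if (!(relmap & (UINT64_C(1) << m)))
			continue;
		if (orig & (UINT64_C(1) << n))
			dst |= UINT64_C(1) << m;
		n++;
	}
	return dst;
}

/* Number of set bits, i.e. the weight of a mask. */
static unsigned int demo_weight(uint64_t mask)
{
	unsigned int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Relative remap: fold the user mask by the weight of allowed, then map onto it. */
static uint64_t demo_relative(uint64_t user, uint64_t allowed)
{
	if (allowed == 0)
		return 0;
	return demo_onto(demo_fold(user, demo_weight(allowed)), allowed);
}
```

For the example in the text, relative nodes 12-15 (mask `0xF000`) confined to nodes 0-7 (mask `0xFF`) fold modulo 8 to nodes 4-7 (mask `0xF0`). Likewise a relative mask of 1-3 applied to mems_allowed 4-7 yields nodes 5-7.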
David Rientjes	f5b087b52f	mempolicy: add MPOL_F_STATIC_NODES flag Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the node remap when the policy is rebound. Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of a union with cpuset_mems_allowed: struct mempolicy { ... union { nodemask_t cpuset_mems_allowed; nodemask_t user_nodemask; } w; } that stores the nodemask that the user passed when he or she created the mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's passed nodemask intersected with the VMA or task's allowed nodes is always used when determining the preferred node, setting the MPOL_BIND zonelist, or creating the interleave nodemask. This happens whenever the policy is rebound, including when a task's cpuset assignment changes or the cpuset's mems are changed. This creates an interesting side-effect in that it allows the mempolicy "intent" to lie dormant and uneffected until it has access to the node(s) that it desires. For example, if you currently ask for an interleaved policy over a set of nodes that you do not have access to, the mempolicy is not created and the task continues to use the previous policy. With this change, however, it is possible to create the same mempolicy; it is only effected when access to nodes in the nodemask is acquired. It is also possible to mount tmpfs with the static nodemask behavior when specifying a node or nodemask. To do this, simply add "=static" immediately following the mempolicy mode at mount time: mount -o remount mpol=interleave=static:1-3 Also removes mpol_check_policy() and folds its logic into mpol_new() since it is now obsoleted. The unused vma_mpol_equal() is also removed. 
Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:19 -07:00
David Rientjes	028fec414d	mempolicy: support optional mode flags With the evolution of mempolicies, it is necessary to support mempolicy mode flags that specify how the policy shall behave in certain circumstances. The most immediate need for mode flag support is to suppress remapping the nodemask of a policy at the time of rebind. Both the mempolicy mode and flags are passed by the user in the 'int policy' formal of either the set_mempolicy() or mbind() syscall. A new constant, MPOL_MODE_FLAGS, represents the union of legal optional flags that may be passed as part of this int. Mempolicies that include illegal flags as part of their policy are rejected as invalid. An additional member to struct mempolicy is added to support the mode flags: struct mempolicy { ... unsigned short policy; unsigned short flags; } The splitting of the 'int' actual passed by the user is done in sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if there are additional flags, and storing it in the new 'flags' member of struct mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in the 'policy' member of the struct and all current users of pol->policy remain unchanged. The union of the policy mode and optional mode flags is passed back to the user in get_mempolicy(). This combination of mode and flags within the same actual does not break userspace code that relies on get_mempolicy(&policy, ...) and either switch (policy) { case MPOL_BIND: ... case MPOL_INTERLEAVE: ... }; statements or if (policy == MPOL_INTERLEAVE) { ... } statements. Such applications would need to use optional mode flags when calling set_mempolicy() or mbind() for these previously implemented statements to stop working. If an application does start using optional mode flags, it will need to mask the optional flags off the policy in switch and conditional statements that only test mode. 
An additional member is also added to struct shmem_sb_info to store the optional mode flags. [hugh@veritas.com: shmem mpol: fix build warning] Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:19 -07:00
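The mode/flags split described in the entry above might look roughly like this. All `DEMO_*` names and numeric values are assumptions chosen for illustration (a single flag in bit 15), and `demo_split_policy()` is a hypothetical helper, not the code in `sys_set_mempolicy()` or `sys_mbind()`.

```c
#include <assert.h>

/* Assumed mode numbering for illustration only. */
enum {
	DEMO_MPOL_DEFAULT,
	DEMO_MPOL_PREFERRED,
	DEMO_MPOL_BIND,
	DEMO_MPOL_INTERLEAVE,
	DEMO_MPOL_MAX,
};

/* Assumed flag bit; the union of legal optional flags is just this one. */
#define DEMO_MPOL_F_STATIC_NODES	(1 << 15)
#define DEMO_MPOL_MODE_FLAGS		(DEMO_MPOL_F_STATIC_NODES)

struct demo_mempolicy {
	unsigned short policy;	/* mode with the optional flags masked off */
	unsigned short flags;	/* optional mode flags only */
};

/* Split the user's int into mode and flags; return -1 if it is invalid. */
static int demo_split_policy(int arg, struct demo_mempolicy *pol)
{
	int flags = arg & DEMO_MPOL_MODE_FLAGS;
	int mode = arg & ~DEMO_MPOL_MODE_FLAGS;

	/* Any bit outside the legal flags ends up in mode and is rejected. */
	if (mode < 0 || mode >= DEMO_MPOL_MAX)
		return -1;
	pol->policy = (unsigned short)mode;
	pol->flags = (unsigned short)flags;
	return 0;
}

/* What a get_mempolicy()-style call would hand back: mode and flags combined. */
static int demo_join_policy(const struct demo_mempolicy *pol)
{
	return pol->policy | pol->flags;
}
```

A caller that starts passing optional flags should compare `value & ~DEMO_MPOL_MODE_FLAGS` against the mode constants, which is the masking step the commit message asks of userspace.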
David Rientjes	a3b51e0142	mempolicy: convert MPOL constants to enum The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and MPOL_INTERLEAVE, are better declared as part of an enum since they are sequentially numbered and cannot be combined. The policy member of struct mempolicy is also converted from type short to type unsigned short. A negative policy does not have any legitimate meaning, so it is possible to change its type in preparation for adding optional mode flags later. The equivalent member of struct shmem_sb_info is also changed from int to unsigned short. For compatibility, the policy formal to get_mempolicy() remains as a pointer to an int: int get_mempolicy(int *policy, unsigned long *nmask, unsigned long maxnode, unsigned long addr, unsigned long flags); although the only possible values are in the range of type unsigned short. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:19 -07:00
Pekka Enberg	1b27d05b6e	mm: move cache_line_size() to <linux/cache.h> Not all architectures define cache_line_size() so as suggested by Andrew move the private implementations in mm/slab.c and mm/slob.c to <linux/cache.h>. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:19 -07:00
Mel Gorman	19770b3260	mm: filter based on a nodemask as well as a gfp_mask The MPOL_BIND policy creates a zonelist that is used for allocations controlled by that mempolicy. As the per-node zonelist is already being filtered based on a zone id, this patch adds a version of __alloc_pages() that takes a nodemask for further filtering. This eliminates the need for MPOL_BIND to create a custom zonelist. A positive benefit of this is that allocations using MPOL_BIND now use the local node's distance-ordered zonelist instead of a custom node-id-ordered zonelist. I.e., pages will be allocated from the closest allowed node with available memory. [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments] [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask] [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:19 -07:00
Mel Gorman	dd1a239f6f	mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a subtraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:18 -07:00
Mel Gorman	54a6eb5c47	mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that iterates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:18 -07:00
Mel Gorman	18ea7e710d	mm: remember what the preferred zone is for zone_statistics On NUMA, zone_statistics() is used to record events like numa hit, miss and foreign. It assumes that the first zone in a zonelist is the preferred zone. When multiple zonelists are replaced by one that is filtered, this is no longer the case. This patch records what the preferred zone is rather than assuming the first zone in the zonelist is it. This simplifies the reading of later patches in this set. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:18 -07:00
Mel Gorman	0e88460da6	mm: introduce node_zonelist() for accessing the zonelist for a GFP mask Introduce a node_zonelist() helper function. It is used to lookup the appropriate zonelist given a node and a GFP mask. The patch on its own is a cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If necessary, it can be merged with the next patch in this set without problems. Reviewed-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:18 -07:00
Mel Gorman	dac1d27bc8	mm: use zonelists instead of zones when direct reclaiming pages The following patches replace multiple zonelists per node with two zonelists that are filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, the MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered as an alternative fix for the MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters only custom zonelists. The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask which simplifies patches later in the set. The third patch defines/remembers the "preferred zone" for numa statistics, as it is no longer always the first zone in a zonelist. The fourth patch replaces multiple zonelists with two zonelists that are filtered. The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fifth patch introduces helper macros for retrieving the zone and node indices of entries in a zonelist. The final patch introduces filtering of the zonelists based on a nodemask. Two zonelists exist per node, one for normal allocations and one for __GFP_THISNODE. Performance results varied depending on the machine configuration. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and ppc64 both NUMA and non-NUMA.
loss to gain Total CPU time on Kernbench: -0.86% to 1.13% Elapsed time on Kernbench: -0.79% to 0.76% page_test from aim9: -4.37% to 0.79% brk_test from aim9: -0.71% to 4.07% fork_test from aim9: -1.84% to 4.60% exec_test from aim9: -0.71% to 1.08% This patch: The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:18 -07:00
Nick Piggin	3c18ddd160	mm: remove nopage Nothing in the tree uses nopage any more. Remove support for it in the core mm code and documentation (and a few stray references to it in comments). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:18 -07:00
Harvey Harrison	ddc81ed2c5	remove sparse warning for mmzone.h include/linux/mmzone.h:640:22: warning: potentially expensive pointer subtraction Calculate the offset into the node_zones array rather than the index using casts to (char *) and comparing against the index * sizeof(struct zone). On X86_32 this saves a sar, but code size increases by one byte per is_highmem() use due to 32-bit cmps rather than 16 bit cmps. Before: 207: 2b 80 8c 07 00 00 sub 0x78c(%eax),%eax 20d: c1 f8 0b sar $0xb,%eax 210: 83 f8 02 cmp $0x2,%eax 213: 74 16 je 22b <kmap_atomic_prot+0x144> 215: 83 f8 03 cmp $0x3,%eax 218: 0f 85 8f 00 00 00 jne 2ad <kmap_atomic_prot+0x1c6> 21e: 83 3d 00 00 00 00 02 cmpl $0x2,0x0 225: 0f 85 82 00 00 00 jne 2ad <kmap_atomic_prot+0x1c6> 22b: 64 a1 00 00 00 00 mov %fs:0x0,%eax After: 207: 2b 80 8c 07 00 00 sub 0x78c(%eax),%eax 20d: 3d 00 10 00 00 cmp $0x1000,%eax 212: 74 18 je 22c <kmap_atomic_prot+0x145> 214: 3d 00 18 00 00 cmp $0x1800,%eax 219: 0f 85 8f 00 00 00 jne 2ae <kmap_atomic_prot+0x1c7> 21f: 83 3d 00 00 00 00 02 cmpl $0x2,0x0 226: 0f 85 82 00 00 00 jne 2ae <kmap_atomic_prot+0x1c7> 22c: 64 a1 00 00 00 00 mov %fs:0x0,%eax [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:17 -07:00
Christoph Lameter	488514d179	Remove set_migrateflags() Migrate flags must be set on slab creation as agreed upon when the antifrag logic was reviewed. Otherwise some slabs of a slabcache will end up in the unmovable and others in the reclaimable section depending on which flag was active when a new slab page was allocated. This likely slid in somehow when antifrag was merged. Remove it. The buffer_heads are always allocated with __GFP_RECLAIMABLE because the SLAB_RECLAIM_ACCOUNT option is set. The set_migrateflags() never had any effect there. Radix tree allocations are not directly reclaimable but they are allocated with __GFP_RECLAIMABLE set on each allocation. We now set SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix tree slabs are consistently placed in the reclaimable section. Radix tree slabs will also be accounted as such. There is then no user left of set_migrateflags(). So remove it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:17 -07:00
Badari Pulavarty	ea01ea937d	hotplug memory remove: generic __remove_pages() support Generic helper function to remove section mappings and sysfs entries for the section of the memory we are removing. offline_pages() correctly adjusted zone and marked the pages reserved. TODO: Yasunori Goto is working on patches to free up allocations from bootmem. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:17 -07:00
Al Viro	8b67dca942	[PATCH] new predicate - AUDIT_FILETYPE Argument is S_IF... | <index>, where index is normally 0 or 1. Triggers if chosen element of ctx->names[] is present and the mode of object in question matches the upper bits of argument. I.e. for things like "is the argument of that chmod a directory", etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:28:37 -04:00
Miloslav Trmac	41126226e1	[patch 1/2] audit: let userspace fully control TTY input auditing Remove the code that automatically disables TTY input auditing in processes that open TTYs when they have no other TTY open; this heuristic was intended to automatically handle daemons, but it has false positives (e.g. with sshd) that make it impossible to control TTY input auditing from a PAM module. With this patch, TTY input auditing is controlled from user-space only. On the other hand, not even for daemons does it make sense to audit "input" from PTY masters; this data was produced by a program writing to the PTY slave, and does not represent data entered by the user. Signed-off-by: Miloslav Trmac <mitr@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:28:24 -04:00
Eric Paris	a42da93c86	Audit: increase the maximum length of the key field Key lengths were arbitrarily limited to 32 characters. If userspace is going to start using the single kernel key field as multiple virtual key fields (example key=key1,key2,key3,key4) we should give them enough room to work. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:19:29 -04:00
Eric Paris	b556f8ad58	Audit: standardize string audit interfaces This patch standardized the string auditing interfaces. No userspace changes will be visible and this is all just cleanup and consistency work. We have the following string audit interfaces to use: void audit_log_n_hex(struct audit_buffer *ab, const unsigned char *buf, size_t len); void audit_log_n_string(struct audit_buffer *ab, const char *buf, size_t n); void audit_log_string(struct audit_buffer *ab, const char *buf); void audit_log_n_untrustedstring(struct audit_buffer *ab, const char *string, size_t n); void audit_log_untrustedstring(struct audit_buffer *ab, const char *string); This may be the first step to possibly fixing some of the issues that people have with the string output from the kernel audit system. But we still don't have an agreed upon solution to that problem. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:19:22 -04:00
Eric Paris	2532386f48	Audit: collect sessionid in netlink messages Previously I added sessionid output to all audit messages where it was available but we still didn't know the sessionid of the sender of netlink messages. This patch adds that information to netlink messages so we can audit who sent netlink messages. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:18:03 -04:00
Eric Paris	7b41b1733c	SELinux: include/security.h whitespace, syntax, and other cleanups This patch changes include/security.h to fix whitespace and syntax issues. Things that are fixed may include (does not have to include): whitespace at end of lines; spaces followed by tabs; spaces used instead of tabs; spacing around parenthesis; location of { around structs and else clauses; location of * in pointer declarations; removal of initialization of static data to keep it in the right section; useless {} in if statements; useless checking for NULL before kfree; fixing of the indentation depth of switch statements; no assignments in if statements; include spaces around , in function calls; and any number of other things I forgot to mention. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2008-04-28 09:29:08 +10:00
Linus Torvalds	064922a805	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (40 commits) [SCSI] jazz_esp, sgiwd93, sni_53c710, sun3x_esp: fix platform driver hotplug/coldplug [SCSI] aic7xxx: add const [SCSI] aic7xxx: add static [SCSI] aic7xxx: Update _shipped files [SCSI] aic7xxx: teach aicasm to not emit unused debug code/data [SCSI] qla2xxx: Update version number to 8.02.01-k2. [SCSI] qla2xxx: Correct regression in relogin code. [SCSI] qla2xxx: Correct misc. endian and byte-ordering issues. [SCSI] qla2xxx: make qla2x00_issue_iocb_timeout() static [SCSI] qla2xxx: qla_os.c, make 2 functions static [SCSI] qla2xxx: Re-register FDMI information after a LIP. [SCSI] qla2xxx: Correct SRB usage-after-completion/free issues. [SCSI] qla2xxx: Correct ISP84XX verify-chip response handling. [SCSI] qla2xxx: Wakeup DPC thread to process any deferred-work requests. [SCSI] qla2xxx: Collapse RISC-RAM retrieval code during a firmware-dump. [SCSI] m68k: new mac_esp scsi driver [SCSI] zfcp: Add some statistics provided by the FCP adapter to the sysfs [SCSI] zfcp: Print some messages only during ERP [SCSI] zfcp: Wait for free SBAL during exchange config [SCSI] scsi_transport_fc: fc_user_scan correction ...	2008-04-27 11:25:00 -07:00
Linus Torvalds	42cadc8600	Merge branch 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (147 commits) KVM: kill file->f_count abuse in kvm KVM: MMU: kvm_pv_mmu_op should not take mmap_sem KVM: SVM: remove selective CR0 comment KVM: SVM: remove now obsolete FIXME comment KVM: SVM: disable CR8 intercept when tpr is not masking interrupts KVM: SVM: sync V_TPR with LAPIC.TPR if CR8 write intercept is disabled KVM: export kvm_lapic_set_tpr() to modules KVM: SVM: sync TPR value to V_TPR field in the VMCB KVM: ppc: PowerPC 440 KVM implementation KVM: Add MAINTAINERS entry for PowerPC KVM KVM: ppc: Add DCR access information to struct kvm_run ppc: Export tlb_44x_hwater for KVM KVM: Rename debugfs_dir to kvm_debugfs_dir KVM: x86 emulator: fix lea to really get the effective address KVM: x86 emulator: fix smsw and lmsw with a memory operand KVM: x86 emulator: initialize src.val and dst.val for register operands KVM: SVM: force a new asid when initializing the vmcb KVM: fix kvm_vcpu_kick vs __vcpu_run race KVM: add ioctls to save/store mpstate KVM: Rename VCPU_MP_STATE_* to KVM_MP_STATE_* ...	2008-04-27 10:13:52 -07:00

10808 commits