This is a minor bug fix in kexec to resolve the problem of loading panic
kernel with initrd.
o Problem: Loading a capture kenrel fails if initrd is also being loaded.
This has been observed for vmlinux image for kexec on panic case.
o This patch fixes the problem. In segment location and size verification
logic, minor correction has been done. Segment memory end (mend) should be
mstart + memsz - 1. This one byte offset was source of failure for initrd
loading which was being loaded at hole boundary.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the architecture independent implementation the
sys_kexec_load, the compat_sys_kexec_load system calls.
Kexec on panic support has been integrated into the core patch and is
relatively clean.
In addition the hopefully architecture independent option
crashkernel=size@location has been docuemented. It's purpose is to reserve
space for the panic kernel to live, and where no DMA transfer will ever be
setup to access.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a new preemption model: 'Voluntary Kernel Preemption'. The
3 models can be selected from a new menu:
(X) No Forced Preemption (Server)
( ) Voluntary Kernel Preemption (Desktop)
( ) Preemptible Kernel (Low-Latency Desktop)
we still default to the stock (Server) preemption model.
Voluntary preemption works by adding a cond_resched()
(reschedule-if-needed) call to every might_sleep() check. It is lighter
than CONFIG_PREEMPT - at the cost of not having as tight latencies. It
represents a different latency/complexity/overhead tradeoff.
It has no runtime impact at all if disabled. Here are size stats that show
how the various preemption models impact the kernel's size:
text data bss dec hex filename
3618774 547184 179896 4345854 424ffe vmlinux.stock
3626406 547184 179896 4353486 426dce vmlinux.voluntary +0.2%
3748414 548640 179896 4476950 445016 vmlinux.preempt +3.5%
voluntary-preempt is +0.2% of .text, preempt is +3.5%.
This feature has been tested for many months by lots of people (and it's
also included in the RHEL4 distribution and earlier variants were in Fedora
as well), and it's intended for users and distributions who dont want to
use full-blown CONFIG_PREEMPT for one reason or another.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The only sane way to clean up the current 3 lock_kernel() variants seems to
be to remove the spinlock-based BKL implementations altogether, and to keep
the semaphore-based one only. If we dont want to do that for whatever
reason then i'm afraid we have to live with the current complexity. (but
i'm open for other cleanup suggestions as well.)
To explore this possibility we'll (at a minimum) have to know whether the
semaphore-based BKL works fine on plain SMP too. The patch below enables
this.
The patch may make sense in isolation as well, as it might bring
performance benefits: code that would formerly spin on the BKL spinlock
will now schedule away and give up the CPU. It might introduce performance
regressions as well, if any performance-critical code uses the BKL heavily
and gets overscheduled due to the semaphore. I very much hope there is no
such performance-critical codepath left though.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch consolidates the CONFIG_PREEMPT and CONFIG_PREEMPT_BKL
preemption options into kernel/Kconfig.preempt. This, besides reducing
source-code, also enables more centralized tweaking of preemption related
options.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adds the core update_cpu_domains code and updated cpusets documentation
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patches add dynamic sched domains functionality that was
extensively discussed on lkml and lse-tech. I would like to see this added to
-mm
o The main advantage with this feature is that it ensures that the scheduler
load balacing code only balances against the cpus that are in the sched
domain as defined by an exclusive cpuset and not all of the cpus in the
system. This removes any overhead due to load balancing code trying to
pull tasks outside of the cpu exclusive cpuset only to be prevented by
the tasks' cpus_allowed mask.
o cpu exclusive cpusets are useful for servers running orthogonal
workloads such as RT applications requiring low latency and HPC
applications that are throughput sensitive
o It provides a new API partition_sched_domains in sched.c
that makes dynamic sched domains possible.
o cpu_exclusive cpusets sets are now associated with a sched domain.
Which means that the users can dynamically modify the sched domains
through the cpuset file system interface
o ia64 sched domain code has been updated to support this feature as well
o Currently, this does not support hotplug. (However some of my tests
indicate hotplug+preempt is currently broken)
o I have tested it extensively on x86.
o This should have very minimal impact on performance as none of
the fast paths are affected
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Presently, a process without the capability CAP_SYS_NICE can not change
its own policy, which is OK.
But it can also not decrease its RT priority (if scheduled with policy
SCHED_RR or SCHED_FIFO), which is what this patch changes.
The rationale is the same as for the nice value: a process should be
able to require less priority for itself. Increasing the priority is
still not allowed.
This is for example useful if you give a multithreaded user process a RT
priority, and the process would like to organize its internal threads
using priorities also. Then you can give the process the highest
priority needed N, and the process starts its threads with lower
priorities: N-1, N-2...
The POSIX norm says that the permissions are implementation specific, so
I think we can do that.
In a sense, it makes the permissions consistent whatever the policy is:
with this patch, process scheduled by SCHED_FIFO, SCHED_RR and
SCHED_OTHER can all decrease their priority.
From: Ingo Molnar <mingo@elte.hu>
cleaned up and merged to -mm.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The maximum rebalance interval allowed by the multiprocessor balancing
backoff is often not large enough to handle corner cases where there are
lots of tasks pinned on a CPU. Suresh reported:
I see system livelock's if for example I have 7000 processes
pinned onto one cpu (this is on the fastest 8-way system I
have access to).
After this patch, the machine is reported to go well above this number.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Consolidate balance-on-exec with balance-on-fork. This is made easy by the
sched-domains RCU patches.
As well as the general goodness of code reduction, this allows the runqueues
to be unlocked during balance-on-fork.
schedstats is a problem. Maybe just have balance-on-event instead of
distinguishing fork and exec?
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
One of the problems with the multilevel balance-on-fork/exec is that it needs
to jump through hoops to satisfy sched-domain's locking semantics (that is,
you may traverse your own domain when not preemptable, and you may traverse
others' domains when holding their runqueue lock).
balance-on-exec had to potentially migrate between more than one CPU before
finding a final CPU to migrate to, and balance-on-fork needed to potentially
take multiple runqueue locks.
So bite the bullet and make sched-domains go completely RCU. This actually
simplifies the code quite a bit.
From: Ingo Molnar <mingo@elte.hu>
schedstats RCU fix, and a nice comment on for_each_domain, from Ingo.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The fundamental problem that Suresh has with balance on exec and fork is that
it only tries to balance the top level domain with the flag set.
This was worked around by removing degenerate domains, but is still a problem
if people want to start using more complex sched-domains, especially
multilevel NUMA that ia64 is already using.
This patch makes balance on fork and exec try balancing over not just the top
most domain with the flag set, but all the way down the domain tree.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove degenerate scheduler domains during the sched-domain init.
For example on x86_64, we always have NUMA configured in. On Intel EM64T
systems, top most sched domain will be of NUMA and with only one sched_group
in it.
With fork/exec balances(recent Nick's fixes in -mm tree), we always endup
taking wrong decisions because of this topmost domain (as it contains only one
group and find_idlest_group always returns NULL). We will endup loading HT
package completely first, letting active load balance kickin and correct it.
In general, this patch also makes sense with out recent Nick's fixes in -mm.
From: Nick Piggin <nickpiggin@yahoo.com.au>
Modified to account for more than just sched_groups when scanning for
degenerate domains by Nick Piggin. And allow a runqueue's sd to go NULL
rather than keep a single degenerate domain around (this happens when you run
with maxcpus=1).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the last 2 places that directly access a runqueue's sched-domain and
assume it cannot be NULL.
That allows the use of NULL for domain, instead of a dummy domain, to signify
no balancing is to happen. No functional changes.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of requiring architecture code to interact with the scheduler's
locking implementation, provide a couple of defines that can be used by the
architecture to request runqueue unlocked context switches, and ask for
interrupts to be enabled over the context switch.
Also replaces the "switch_lock" used by these architectures with an oncpu
flag (note, not a potentially slow bitflag). This eliminates one bus
locked memory operation when context switching, and simplifies the
task_running function.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reimplement the balance on exec balancing to be sched-domains aware. Use this
to also do balance on fork balancing. Make x86_64 do balance on fork over the
NUMA domain.
The problem that the non sched domains aware blancing became apparent on dual
core, multi socket opterons. What we want is for the new tasks to be sent to
a different socket, but more often than not, we would first load up our
sibling core, or fill two cores of a single remote socket before selecting a
new one.
This gives large improvements to STREAM on such systems.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the very aggressive idle stuff that has recently gone into 2.6 - it is
going against the direction we are trying to go. Hopefully we can regain
performance through other methods.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do less affine wakeups. We're trying to reduce dbt2-pgsql idle time
regressions here... make sure we don't don't move tasks the wrong way in an
imbalance condition. Also, remove the cache coldness requirement from the
calculation - this seems to induce sharp cutoff points where behaviour will
suddenly change on some workloads if the load creeps slightly over or under
some point. It is good for periodic balancing because in that case have
otherwise have no other context to determine what task to move.
But also make a minor tweak to "wake balancing" - the imbalance tolerance is
now set at half the domain's imbalance, so we get the opportunity to do wake
balancing before the more random periodic rebalancing gets preformed.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do CPU load averaging over a number of different intervals. Allow each
interval to be chosen by sending a parameter to source_load and target_load.
0 is instantaneous, idx > 0 returns a decaying average with the most recent
sample weighted at 2^(idx-1). To a maximum of 3 (could be easily increased).
So generally a higher number will result in more conservative balancing.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the special casing for idle CPU balancing. Things like this are
hurting for example on SMT, where are single sibling being idle doesn't really
warrant a really aggressive pull over the NUMA domain, for example.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These conditions should now be impossible, and we need to fix them if they
happen.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SMT balancing has a couple of problems. Firstly, active_load_balance is too
complex - basically it should be a dumb helper for when the periodic balancer
has determined there is an imbalance, but gets stuck because the task is
running.
So rip out all its "smarts", and just make it move one task to the target CPU.
Second, the busy CPU's sched-domain tree was being used for active balancing.
This means that it may not see that nr_balance_failed has reached a critical
level. So use the target CPU's sched-domain tree for this. We can do this
because we hold its runqueue lock.
Lastly, reset nr_balance_failed to a point where we allow cache hot migration.
This will help ensure active load balancing is successful.
Thanks to Suresh Siddha for pointing out these issues.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix up active load balancing a bit so it doesn't get called when it shouldn't.
Reset the nr_balance_failed counter at more points where we have found
conditions to be balanced. This reduces too aggressive active balancing seen
on some workloads.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
John Hawkes explained the problem best:
A large number of processes that are pinned to a single CPU results
in every other CPU's load_balance() seeing this overloaded CPU as
"busiest", yet move_tasks() never finds a task to pull-migrate. This
condition occurs during module unload, but can also occur as a
denial-of-service using sys_sched_setaffinity(). Several hundred
CPUs performing this fruitless load_balance() will livelock on the
busiest CPU's runqueue lock. A smaller number of CPUs will livelock
if the pinned task count gets high.
Expanding slightly on John's patch, this one attempts to work out whether the
balancing failure has been due to too many tasks pinned on the runqueue. This
allows it to be basically invisible to the regular blancing paths (ie. when
there are no pinned tasks). We can use this extra knowledge to shut down the
balancing faster, and ensure the migration threads don't start running which
is another problem observed in the wild.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
New sched-domains code means we don't get spans with offline CPUs in
them.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Show swsuspend only on .config where it can compile. I got this on PPC32 &&
SMP:
kernel/power/smp.c:24: error: storage size of `ctxt' isn't known
Also mark swsusp as no longer experimental.
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the cpu hotplug case, per-cpu data possibly isn't initialized even the
system state is 'running'. As the comments say in the original code, some
console drivers assume per-cpu resources have been allocated. radeon fb is
one such driver, which uses kmalloc. After a CPU is down, the per-cpu data
of slab is freed, so the system crashed when printing some info.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch moves the recalculation of nr_copy_pages so that the right
number is used in the calculation of the size of memory and swap needed.
It prevents swsusp from attempting to suspend if there is not enough memory
and/or swap (which is unlikely anyway).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch cleans up whitespace in swsusp.c (a bit):
- removes any trailing whitespace
- adds spaces after if, for, for_each_pbe, for_each_zone etc., wherever
necessary.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch removes the unnecessary function does_collide_order().
This function is no longer necessary, as currently there are only 0-order
allocations in swsusp, and the use of it is confusing.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Without this patch, Linux provokes emergency disk shutdowns and
similar nastiness. It was in SuSE kernels for some time, IIRC.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Using CPU hotplug to support suspend/resume SMP. Both S3 and S4 use
disable/enable_nonboot_cpus API. The S4 part is based on Pavel's original S4
SMP patch.
Signed-off-by: Li Shaohua<shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
(The i386 CPU hotplug patch provides infrastructure for some work which Pavel
is doing as well as for ACPI S3 (suspend-to-RAM) work which Li Shaohua
<shaohua.li@intel.com> is doing)
The following provides i386 architecture support for safely unregistering and
registering processors during runtime, updated for the current -mm tree. In
order to avoid dumping cpu hotplug code into kernel/irq/* i dropped the
cpu_online check in do_IRQ() by modifying fixup_irqs(). The difference being
that on cpu offline, fixup_irqs() is called before we clear the cpu from
cpu_online_map and a long delay in order to ensure that we never have any
queued external interrupts on the APICs. There are additional changes to s390
and ppc64 to account for this change.
1) Add CONFIG_HOTPLUG_CPU
2) disable local APIC timer on dead cpus.
3) Disable preempt around irq balancing to prevent CPUs going down.
4) Print irq stats for all possible cpus.
5) Debugging check for interrupts on offline cpus.
6) Hacky fixup_irqs() to redirect irqs when cpus go off/online.
7) play_dead() for offline cpus to spin inside.
8) Handle offline cpus set in flush_tlb_others().
9) Grab lock earlier in smp_call_function() to prevent CPUs going down.
10) Implement __cpu_disable() and __cpu_die().
11) Enable local interrupts in cpu_enable() after fixup_irqs()
12) Don't fiddle with NMI on dead cpu, but leave intact on other cpus.
13) Program IRQ affinity whilst cpu is still in cpu_online_map on offline.
Signed-off-by: Zwane Mwaikambo <zwane@linuxpower.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Another rollup of patches which give various symbols static scope
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds version and srcversion files to
/sys/module/${modulename} containing the version and srcversion fields
of the module's modinfo section (if present).
/sys/module/e1000
|-- srcversion
`-- version
This patch differs slightly from the version posted in January, as it
now uses the new kstrdup() call in -mm.
Why put this in sysfs?
a) Tools like DKMS, which deal with changing out individual kernel
modules without replacing the whole kernel, can behave smarter if they
can tell the version of a given module. The autoinstaller feature, for
example, which determines if your system has a "good" version of a
driver (i.e. if the one provided by DKMS has a newer verson than that
provided by the kernel package installed), and to automatically compile
and install a newer version if DKMS has it but your kernel doesn't yet
have that version.
b) Because sysadmins manually, or with tools like DKMS, can switch out
modules on the file system, you can't count on 'modinfo foo.ko', which
looks at /lib/modules/${kernelver}/... actually matching what is loaded
into the kernel already. Hence asking sysfs for this.
c) as the unbind-driver-from-device work takes shape, it will be
possible to rebind a driver that's built-in (no .ko to modinfo for the
version) to a newly loaded module. sysfs will have the
currently-built-in version info, for comparison.
d) tech support scripts can then easily grab the version info for what's
running presently - a question I get often.
There has been renewed interest in this patch on linux-scsi by driver
authors.
As the idea originated from GregKH, I leave his Signed-off-by: intact,
though the implementation is nearly completely new. Compiled and run on
x86 and x86_64.
From: Matthew Dobson <colpatch@us.ibm.com>
build fix
From: Thierry Vignaud <tvignaud@mandriva.com>
build fix
From: Matthew Dobson <colpatch@us.ibm.com>
warning fix
Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
Signed-off-by: Matt Domsch <Matt_Domsch@dell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attached patch makes the following changes:
(1) There's a new special key type called ".request_key_auth".
This is an authorisation key for when one process requests a key and
another process is started to construct it. This type of key cannot be
created by the user; nor can it be requested by kernel services.
Authorisation keys hold two references:
(a) Each refers to a key being constructed. When the key being
constructed is instantiated the authorisation key is revoked,
rendering it of no further use.
(b) The "authorising process". This is either:
(i) the process that called request_key(), or:
(ii) if the process that called request_key() itself had an
authorisation key in its session keyring, then the authorising
process referred to by that authorisation key will also be
referred to by the new authorisation key.
This means that the process that initiated a chain of key requests
will authorise the lot of them, and will, by default, wind up with
the keys obtained from them in its keyrings.
(2) request_key() creates an authorisation key which is then passed to
/sbin/request-key in as part of a new session keyring.
(3) When request_key() is searching for a key to hand back to the caller, if
it comes across an authorisation key in the session keyring of the
calling process, it will also search the keyrings of the process
specified therein and it will use the specified process's credentials
(fsuid, fsgid, groups) to do that rather than the calling process's
credentials.
This allows a process started by /sbin/request-key to find keys belonging
to the authorising process.
(4) A key can be read, even if the process executing KEYCTL_READ doesn't have
direct read or search permission if that key is contained within the
keyrings of a process specified by an authorisation key found within the
calling process's session keyring, and is searchable using the
credentials of the authorising process.
This allows a process started by /sbin/request-key to read keys belonging
to the authorising process.
(5) The magic KEY_SPEC_*_KEYRING key IDs when passed to KEYCTL_INSTANTIATE or
KEYCTL_NEGATE will specify a keyring of the authorising process, rather
than the process doing the instantiation.
(6) One of the process keyrings can be nominated as the default to which
request_key() should attach new keys if not otherwise specified. This is
done with KEYCTL_SET_REQKEY_KEYRING and one of the KEY_REQKEY_DEFL_*
constants. The current setting can also be read using this call.
(7) request_key() is partially interruptible. If it is waiting for another
process to finish constructing a key, it can be interrupted. This permits
a request-key cycle to be broken without recourse to rebooting.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-Off-By: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attached patch makes it possible to pass a session keyring through to the
process spawned by call_usermodehelper(). This allows patch 3/3 to pass an
authorisation key through to /sbin/request-key, thus permitting better access
controls when doing just-in-time key creation.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the upcoming aio_down patch, it is useful to store a private data
pointer in the kiocb's wait_queue. Since we provide our own wake up
function and do not require the task_struct pointer, it makes sense to
convert the task pointer into a generic private pointer.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Avoid taking the tasklist_lock in sys_times if the process is single
threaded. In a NUMA system taking the tasklist_lock may cause a bouncing
cacheline if multiple independent processes continually call sys_times to
measure their performance.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes recalc_sigpending() to work correctly with tasks which are
being freezed.
The problem is that freeze_processes() sets PF_FREEZE and TIF_SIGPENDING
flags on tasks, but recalc_sigpending() called from e.g.
sys_rt_sigtimedwait or any other kernel place will clear TIF_SIGPENDING due
to no pending signals queued and the tasks won't be freezed until it
recieves a real signal or freezed_processes() fail due to timeout.
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new `suid_dumpable' sysctl:
This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are
0 - (default) - traditional behaviour. Any process which has changed
privilege levels or is execute only will not be dumped
1 - (debug) - all processes dump core when possible. The core dump is
owned by the current user and no security is applied. This is intended
for system debugging situations only. Ptrace is unchecked.
2 - (suidsafe) - any binary which normally would not be dumped is dumped
readable by root only. This allows the end user to remove such a dump but
not access it directly. For security reasons core dumps in this mode will
not overwrite one another or other files. This mode is appropriate when
adminstrators are attempting to debug problems in a normal environment.
(akpm:
> > +EXPORT_SYMBOL(suid_dumpable);
>
> EXPORT_SYMBOL_GPL?
No problem to me.
> > if (current->euid == current->uid && current->egid == current->gid)
> > current->mm->dumpable = 1;
>
> Should this be SUID_DUMP_USER?
Actually the feedback I had from last time was that the SUID_ defines
should go because its clearer to follow the numbers. They can go
everywhere (and there are lots of places where dumpable is tested/used
as a bool in untouched code)
> Maybe this should be renamed to `dump_policy' or something. Doing that
> would help us catch any code which isn't using the #defines, too.
Fair comment. The patch was designed to be easy to maintain for Red Hat
rather than for merging. Changing that field would create a gigantic
diff because it is used all over the place.
)
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Presently either multiple kprobes or only one jprobe could be inserted.
This patch removes the above limitation and allows one jprobe and multiple
kprobes to coexist at the same address. However multiple jprobes cannot
coexist with multiple kprobes. Currently I am working on the prototype to
allow multiple jprobes coexist with multiple kprobes.
Signed-off-by: Ananth N Mavinakayanhalli <amavin@redhat.com>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In situations where a kprobes handler calls a routine which has a probe on it,
then kprobes_handler() disarms the new probe forever. This patch removes the
above limitation by temporarily disarming the new probe. When the another
probe hits while handling the old probe, the kprobes_handler() saves previous
kprobes state and handles the new probe without calling the new kprobes
registered handlers. kprobe_post_handler() restores back the previous kprobes
state and the normal execution continues.
However on x86_64 architecture, re-rentrancy is provided only through
pre_handler(). If a routine having probe is referenced through
post_handler(), then the probes on that routine are disarmed forever, since
the exception stack is gets changed after the processor single steps the
instruction of the new probe.
This patch includes generic changes to support temporary disarming on
reentrancy of probes.
Signed-of-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch moves the lock/unlock of the arch specific kprobe_flush_task()
to the non-arch specific kprobe_flusk_task().
Signed-off-by: Hien Nguyen <hien@us.ibm.com>
Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The architecture independent code of the current kprobes implementation is
arming and disarming kprobes at registration time. The problem is that the
code is assuming that arming and disarming is a just done by a simple write
of some magic value to an address. This is problematic for ia64 where our
instructions look more like structures, and we can not insert break points
by just doing something like:
*p->addr = BREAKPOINT_INSTRUCTION;
The following patch to 2.6.12-rc4-mm2 adds two new architecture dependent
functions:
* void arch_arm_kprobe(struct kprobe *p)
* void arch_disarm_kprobe(struct kprobe *p)
and then adds the new functions for each of the architectures that already
implement kprobes (spar64/ppc64/i386/x86_64).
I thought arch_[dis]arm_kprobe was the most descriptive of what was really
happening, but each of the architectures already had a disarm_kprobe()
function that was really a "disarm and do some other clean-up items as
needed when you stumble across a recursive kprobe." So... I took the
liberty of changing the code that was calling disarm_kprobe() to call
arch_disarm_kprobe(), and then do the cleanup in the block of code dealing
with the recursive kprobe case.
So far this patch as been tested on i386, x86_64, and ppc64, but still
needs to be tested in sparc64.
Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds function-return probes to kprobes for the i386
architecture. This enables you to establish a handler to be run when a
function returns.
1. API
Two new functions are added to kprobes:
int register_kretprobe(struct kretprobe *rp);
void unregister_kretprobe(struct kretprobe *rp);
2. Registration and unregistration
2.1 Register
To register a function-return probe, the user populates the following
fields in a kretprobe object and calls register_kretprobe() with the
kretprobe address as an argument:
kp.addr - the function's address
handler - this function is run after the ret instruction executes, but
before control returns to the return address in the caller.
maxactive - The maximum number of instances of the probed function that
can be active concurrently. For example, if the function is non-
recursive and is called with a spinlock or mutex held, maxactive = 1
should be enough. If the function is non-recursive and can never
relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should
be enough. maxactive is used to determine how many kretprobe_instance
objects to allocate for this particular probed function. If maxactive <=
0, it is set to a default value (if CONFIG_PREEMPT maxactive=max(10, 2 *
NR_CPUS) else maxactive=NR_CPUS)
For example:
struct kretprobe rp;
rp.kp.addr = /* entrypoint address */
rp.handler = /*return probe handler */
rp.maxactive = /* e.g., 1 or NR_CPUS or 0, see the above explanation */
register_kretprobe(&rp);
The following field may also be of interest:
nmissed - Initialized to zero when the function-return probe is
registered, and incremented every time the probed function is entered but
there is no kretprobe_instance object available for establishing the
function-return probe (i.e., because maxactive was set too low).
2.2 Unregister
To unregiter a function-return probe, the user calls
unregister_kretprobe() with the same kretprobe object as registered
previously. If a probed function is running when the return probe is
unregistered, the function will return as expected, but the handler won't
be run.
3. Limitations
3.1 This patch supports only the i386 architecture, but patches for
x86_64 and ppc64 are anticipated soon.
3.2 Return probes operates by replacing the return address in the stack
(or in a known register, such as the lr register for ppc). This may
cause __builtin_return_address(0), when invoked from the return-probed
function, to return the address of the return-probes trampoline.
3.3 This implementation uses the "Multiprobes at an address" feature in
2.6.12-rc3-mm3.
3.4 Due to a limitation in multi-probes, you cannot currently establish
a return probe and a jprobe on the same function. A patch to remove
this limitation is being tested.
This feature is required by SystemTap (http://sourceware.org/systemtap),
and reflects ideas contributed by several SystemTap developers, including
Will Cohen and Ananth Mavinakayanahalli.
Signed-off-by: Hien Nguyen <hien@us.ibm.com>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@laposte.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Prevent recursive faults in do_exit() by leaving the task alone and wait
for reboot. This may allow a more graceful shutdown and possibly save the
original oops.
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Various filesystem drivers have grown a get_dentry() function that's a
duplicate of lookup_one_len, except that it doesn't take a maximum length
argument and doesn't check for \0 or / in the passed in filename.
Switch all these places to use lookup_one_len.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Greg KH <greg@kroah.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In kernel/sched.c the return value from preempt_count() is cast to an int.
That made sense when preempt_count was defined as different types on is not
needed and should go away. The patch removes the cast.
In kernel/timer.c the return value from preempt_count() is assigned to a
variable of type u32 and then that unsigned value is later compared to
preempt_count(). Since preempt_count() returns an int, an int is what
should be used to store its return value. Storing the result in an
unsigned 32bit integer made a tiny bit of sense back when preempt_count was
different types on different archs, but no more - let's not play signed vs
unsigned comparison games when we don't have to. The patch modifies the
code to use an int to hold the value. While I was around that bit of code
I also made two changes to a nearby (related) printk() - I modified it to
specify the loglevel explicitly and also broke the line into a few pieces
to avoid it being longer than 80 chars and clarified the text a bit.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
According to include/linux/console.h, CON_CONSDEV flag should be set on
the last console specified on the boot command line:
86 #define CON_PRINTBUFFER (1)
87 #define CON_CONSDEV (2) /* Last on the command line */
88 #define CON_ENABLED (4)
89 #define CON_BOOT (8)
This does not currently happen if there is more than one console specified
on the boot commandline. Instead, it gets set on the first console on the
command line. This can cause problems for things like kdb that look for
the CON_CONSDEV flag to see if the console is valid.
Additionaly, it doesn't look like CON_CONSDEV is reassigned to the next
preferred console at unregister time if the console being unregistered
currently has that bit set.
Example (from sn2 ia64):
elilo vmlinuz root=<dev> console=ttyS0 console=ttySG0
in this case, the flags on ttySG console struct will be 0x4 (should be
0x6).
Attached patch against bk fixes both issues for the cases I looked at. It
uses selected_console (which gets incremented for each console specified on
the command line) as the indicator of which console to set CON_CONSDEV on.
When adding the console to the list, if the previous one had CON_CONSDEV
set, it masks it out. Tested on ia64 and x86.
The problem with the current behavior is it breaks overriding the default from
the boot line. In the ia64 case, there may be a global append line defining
console=a in elilo.conf. Then you want to boot your kernel, and want to
override the default by passing console=b on the boot line. elilo constructs
the kernel cmdline by starting with the value of the global append line, then
tacks on whatever else you specify, which puts console=b last.
Signed-off-by: Greg Edwards <edwardsg@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_timer_settime/sys_timer_delete needs to delete k_itimer->real.timer
synchronously while holding ->it_lock, which is also locked in
posix_timer_fn.
This patch removes timer_active/set_timer_inactive which plays with
timer_list's internals in favour of using try_to_del_timer_sync(), which
was introduced in the previous patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch splits del_timer_sync() into 2 functions. The new one,
try_to_del_timer_sync(), returns -1 when it hits executing timer.
It can be used in interrupt context, or when the caller hold locks which
can prevent completion of the timer's handler.
NOTE. Currently it can't be used in interrupt context in UP case, because
->running_timer is used only with CONFIG_SMP.
Should the need arise, it is possible to kill #ifdef CONFIG_SMP in
set_running_timer(), it is cheap.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch tries to solve following problems:
1. del_timer_sync() is racy. The timer can be fired again after
del_timer_sync have checked all cpus and before it will recheck
timer_pending().
2. It has scalability problems. All cpus are scanned to determine
if the timer is running on that cpu.
With this patch del_timer_sync is O(1) and no slower than plain
del_timer(pending_timer), unless it has to actually wait for
completion of the currently running timer.
The only restriction is that the recurring timer should not use
add_timer_on().
3. The timers are not serialized wrt to itself.
If CPU_0 does mod_timer(jiffies+1) while the timer is currently
running on CPU 1, it is quite possible that local interrupt on
CPU_0 will start that timer before it finished on CPU_1.
4. The timers locking is suboptimal. __mod_timer() takes 3 locks
at once and still requires wmb() in del_timer/run_timers.
The new implementation takes 2 locks sequentially and does not
need memory barriers.
Currently ->base != NULL means that the timer is pending. In that case
->base.lock is used to lock the timer. __mod_timer also takes timer->lock
because ->base can be == NULL.
This patch uses timer->entry.next != NULL as indication that the timer is
pending. So it does __list_del(), entry->next = NULL instead of list_del()
when the timer is deleted.
The ->base field is used for hashed locking only, it is initialized
in init_timer() which sets ->base = per_cpu(tvec_bases). When the
tvec_bases.lock is locked, it means that all timers which are tied
to this base via timer->base are locked, and the base itself is locked
too.
So __run_timers/migrate_timers can safely modify all timers which could
be found on ->tvX lists (pending timers).
When the timer's base is locked, and the timer removed from ->entry list
(which means that _run_timers/migrate_timers can't see this timer), it is
possible to set timer->base = NULL and drop the lock: the timer remains
locked.
This patch adds lock_timer_base() helper, which waits for ->base != NULL,
locks the ->base, and checks it is still the same.
__mod_timer() schedules the timer on the local CPU and changes it's base.
However, it does not lock both old and new bases at once. It locks the
timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
and adds this timer. This simplifies the code, because AB-BA deadlock is not
possible. __mod_timer() also ensures that the timer's base is not changed
while the timer's handler is running on the old base.
__run_timers(), del_timer() do not change ->base anymore, they only clear
pending flag.
So del_timer_sync() can test timer->base->running_timer == timer to detect
whether it is running or not.
We don't need timer_list->lock anymore, this patch kills it.
We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
before clearing timer's pending flag. It was needed because __mod_timer()
did not lock old_base if the timer is not pending, so __mod_timer()->list_add()
could race with del_timer()->list_del(). With this patch these functions are
serialized through base->lock.
One problem. TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch
adds global
struct timer_base_s {
spinlock_t lock;
struct timer_list *running_timer;
} __init_timer_base;
which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s
struct are replaced by struct timer_base_s t_base.
It is indeed ugly. But this can't have scalability problems. The global
__init_timer_base.lock is used only when __mod_timer() is called for the first
time AND the timer was compile time initialized. After that the timer migrates
to the local CPU.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the timer frequency selectable. The timer interrupt may cause bus
and memory contention in large NUMA systems since the interrupt occurs
on each processor HZ times per second.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With Chris Wedgwood <cw@f00f.org>
As suggested by Chris, we can make the "just added" method ->release
conditional to UML only (better: to archs requesting it, i.e. only UML
currently), so that other archs don't get this unneeded crud, and if UML
won't need it any more we can kill this.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
CC: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With Chris Wedgwood <cw@f00f.org>
Currently UML must explicitly call the UML-specific
free_irq_by_irq_and_dev() for each free_irq call it's done.
This is needed because ->shutdown and/or ->disable are only called when the
last "action" for that irq is removed.
Instead, for UML shared IRQs (UML IRQs are very often, if not always,
shared), for each dev_id some setup is done, which must be cleared on the
release of that fd. For instance, for each open console a new instance
(i.e. new dev_id) of the same IRQ is requested().
Exactly, a fd is stored in an array (pollfds), which is after read by a
host thread and passed to poll(). Each event registered by poll() triggers
an interrupt. So, for each free_irq() we must remove the corresponding
host fd from the table, which we do via this -release() method.
In this patch we add an appropriate hook for this, and remove all uses of
it by pointing the hook to the said procedure; this is safe to do since the
said procedure.
Also some cosmetic improvements are included.
This is heavily based on some work by Chris Wedgwood, which however didn't
get the patch merged for something I'd call a "misunderstanding" (the need
for this patch wasn't cleanly explained, thus adding the generic hook was
felt as undesirable).
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
CC: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove part of comment on linking new vma in dup_mmap: since anon_vma rmap
came in, try_to_unmap_one knows the vma without needing find_vma. But add
a comment to note that here vma is inserted without mmap_sem.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo recently introduced a great speedup for allocating new mmaps using the
free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
causes huge performance increases in thread creation.
The downside of this patch is that it does lead to fragmentation in the
mmap-ed areas (visible via /proc/self/maps), such that some applications
that work fine under 2.4 kernels quickly run out of memory on any 2.6
kernel.
The problem is twofold:
1) the free_area_cache is used to continue a search for memory where
the last search ended. Before the change new areas were always
searched from the base address on.
So now new small areas are cluttering holes of all sizes
throughout the whole mmap-able region whereas before small holes
tended to close holes near the base leaving holes far from the base
large and available for larger requests.
2) the free_area_cache also is set to the location of the last
munmap-ed area so in scenarios where we allocate e.g. five regions of
1K each, then free regions 4 2 3 in this order the next request for 1K
will be placed in the position of the old region 3, whereas before we
appended it to the still active region 1, placing it at the location
of the old region 2. Before we had 1 free region of 2K, now we only
get two free regions of 1K -> fragmentation.
The patch addresses thes issues by introducing yet another cache descriptor
cached_hole_size that contains the largest known hole size below the
current free_area_cache. If a new request comes in the size is compared
against the cached_hole_size and if the request can be filled with a hole
below free_area_cache the search is started from the base instead.
The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
(earlier posted) leakme.c test program terminates after 50000+ iterations
with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
(as expected) with thread creation, Ingo's test_str02 with 20000 threads
requires 0.7s system time.
Taking out Ingo's patch (un-patch available per request) by basically
deleting all mentions of free_area_cache from the kernel and starting the
search for new memory always at the respective bases we observe: leakme
terminates successfully with 11 distinctive hardly fragmented areas in
/proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
time for Ingo's test_str02 with 20000 threads.
Now - drumroll ;-) the appended patch works fine with leakme: it ends with
only 7 distinct areas in /proc/self/maps and also thread creation seems
sufficiently fast with 0.71s for 20000 threads.
Signed-off-by: Wolfgang Wander <wwc@rentec.com>
Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the core of the (much simplified) early reclaim. The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.
One of the major uses of this is NUMA machines. With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.
This adds a zone tuneable to enable early zone reclaim. It is selected on a
per-zone basis and is turned on/off via syscall.
Adding some extra throttling on the reclaim was also required (patch
4/4). Without the machine would grind to a crawl when doing a "make -j"
kernel build. Even with this patch the System Time is higher on
average, but it seems tolerable. Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
wall user sys %cpu ctx sw. sleeps
---- ---- --- ---- ------ ------
No patch 1009 1384 847 258 298170 504402
w/patch, no reclaim 880 1376 667 288 254064 396745
w/patch & reclaim 1079 1385 926 252 291625 548873
These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot. Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.
I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.
Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).
The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch implements a number of smp_processor_id() cleanup ideas that
Arjan van de Ven and I came up with.
The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
spaghetti was hard to follow both on the implementational and on the
usage side.
Some of the complexity arose from picking wrong names, some of the
complexity comes from the fact that not all architectures defined
__smp_processor_id.
In the new code, there are two externally visible symbols:
- smp_processor_id(): debug variant.
- raw_smp_processor_id(): nondebug variant. Replaces all existing
uses of _smp_processor_id() and __smp_processor_id(). Defined
by every SMP architecture in include/asm-*/smp.h.
There is one new internal symbol, dependent on DEBUG_PREEMPT:
- debug_smp_processor_id(): internal debug variant, mapped to
smp_processor_id().
Also, i moved debug_smp_processor_id() from lib/kernel_lock.c into a new
lib/smp_processor_id.c file. All related comments got updated and/or
clarified.
I have build/boot tested the following 8 .config combinations on x86:
{SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}
I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT. (Other
architectures are untested, but should work just fine.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sysfs: fix the rest of the kernel so if an attribute doesn't
implement show or store method read/write will return
-EIO instead of 0 or -EINVAL or -EPERM.
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
On one path, cond_resched_lock() fails to return true if it dropped the lock.
We think this might be causing the crashes in JBD's log_do_checkpoint().
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
flush_icache_range() is used in two different situation - in binfmt_elf.c &
co for user space mappings and module.c for kernel modules. On m68k
flush_icache_range() doesn't know which data to flush, as it has separate
address spaces and the pointer argument can be valid in either address
space.
First I considered splitting flush_icache_range(), but this patch is
simpler. Setting the correct context gives flush_icache_range() enough
information to flush the correct data.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The "unhandled interrupts" catcher, note_interrupt(), increments a global
desc->irq_count and grossly damages scaling of very large systems, e.g.,
>192p ia64 Altix, because of this highly contented cacheline, especially
for timer interrupts. 384p is severely crippled, and 512p is unuseable.
All calls to note_interrupt() can be disabled by booting with "noirqdebug",
but this disables the useful interrupt checking for all interrupts.
I propose eliminating note_interrupt() for all per-CPU interrupts. This
was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a code
restructuring added a call to note_interrupt() for per-CPU interrupts.
Besides, note_interrupt() is a bit racy for concurrent CPU calls anyway, as
the desc->irq_count++ increment isn't atomic (which, if done, would make
scaling even worse).
Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is a race in the kernel cpuset code, between the code
to handle notify_on_release, and the code to remove a cpuset.
The notify_on_release code can end up trying to access a
cpuset that has been removed. In the most common case, this
causes a NULL pointer dereference from the routine cpuset_path.
However all manner of bad things are possible, in theory at least.
The existing code decrements the cpuset use count, and if the
count goes to zero, processes the notify_on_release request,
if appropriate. However, once the count goes to zero, unless we
are holding the global cpuset_sem semaphore, there is nothing to
stop another task from immediately removing the cpuset entirely,
and recycling its memory.
The obvious fix would be to always hold the cpuset_sem
semaphore while decrementing the use count and dealing with
notify_on_release. However we don't want to force a global
semaphore into the mainline task exit path, as that might create
a scaling problem.
The actual fix is almost as easy - since this is only an issue
for cpusets using notify_on_release, which the top level big
cpusets don't normally need to use, only take the cpuset_sem
for cpusets using notify_on_release.
This code has been run for hours without a hiccup, while running
a cpuset create/destroy stress test that could crash the existing
kernel in seconds. This patch applies to the current -linus
git kernel.
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Simon Derr <simon.derr@bull.net>
Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While they were all just simple blobs it made sense to just free them
as we walked through and logged them. Now that there are pointers to
other objects which need refcounting, we might as well revert to
_only_ logging them in audit_log_exit(), and put the code to free them
properly in only one place -- in audit_free_aux().
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
----------------------------------------------------------
If SIGKILL does not have priority, we cannot instantly kill task before it
makes some unexpected job. It can be critical, but we were unable to
reproduce this easily until Heiko Carstens <Heiko.Carstens@de.ibm.com>
reported this problem on LKML.
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These changes make processing of audit logs easier. Based on a patch
from Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Move audit_serial() into audit.c and use it to generate serial numbers
on messages even when there is no audit context from syscall auditing.
This allows us to disambiguate audit records when more than one is
generated in the same millisecond.
Based on a patch by Steve Grubb after he observed the problem.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
In _spin_unlock_bh(lock):
do { \
_raw_spin_unlock(lock); \
preempt_enable(); \
local_bh_enable(); \
__release(lock); \
} while (0)
there is no reason for using preempt_enable() instead of a simple
preempt_enable_no_resched()
Since we know bottom halves are disabled, preempt_schedule() will always
return at once (preempt_count!=0), and hence preempt_check_resched() is
useless here...
This fixes it by using "preempt_enable_no_resched()" instead of the
"preempt_enable()", and thus avoids the useless preempt_check_resched()
just before re-enabling bottom halves.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attached patch changes all occurrences of loginuid to auid. It also
changes everything to %u that is an unsigned type.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
The original AVC_USER message wasn't consolidated with the new range of
user messages. The attached patch fixes the kernel so the old messages
work again.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
This patch changes the SELinux AVC to defer logging of paths to the audit
framework upon syscall exit, by saving a reference to the (dentry,vfsmount)
pair in an auxiliary audit item on the current audit context for processing
by audit_log_exit.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
This patch removes the entwining of cpusets and hotplug code in the "No
more Mr. Nice Guy" case of sched.c move_task_off_dead_cpu().
Since the hotplug code is holding a spinlock at this point, we cannot take
the cpuset semaphore, cpuset_sem, as would seem to be required either to
update the tasks cpuset, or to scan up the nested cpuset chain, looking for
the nearest cpuset ancestor that still has some CPUs that are online. So
we just punt and blast the tasks cpus_allowed with all bits allowed.
This reverts these lines of code to what they were before the cpuset patch.
And it updates the cpuset Doc file, to match.
The one known alternative to this that seems to work came from Dinakar
Guniguntala, and required the hotplug code to take the cpuset_sem semaphore
much earlier in its processing. So far as we know, the increased locking
entanglement between cpusets and hot plug of this alternative approach is
not worth doing in this case.
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The limit on the number of outstanding audit messages was inadvertently
removed with the switch to queuing skbs directly for sending by a kernel
thread. Put it back again.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
netlink_unicast() will attempt to reallocate and will free messages if
the socket's rcvbuf limit is reached unless we give it an infinite
timeout. So do that, from a kernel thread which is dedicated to spewing
stuff up the netlink socket.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
* If vsnprintf returns -1, it will mess up the sk buffer space accounting.
This is fixed by not calling skb_put with bogus len values.
* audit_log_hex was a loop that called audit_log_vformat with %02X for each
character. This is very inefficient since conversion from unsigned character
to Ascii representation is essentially masking, shifting, and byte lookups.
Also, the length of the converted string is well known - it's twice the
original. Fixed by rewriting the function.
* audit_log_untrustedstring had no comments. This makes it hard for
someone to understand what the string format will be.
* audit_log_d_path was never fixed to use untrustedstring. This could mess
up user space parsers. This was fixed to make a temp buffer, call d_path,
and log temp buffer using untrustedstring.
From: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
It's silly to have to add explicit entries for new userspace messages
as we invent them. Just treat all messages in the user range the same.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
This patch includes various tweaks in the messaging that appears during
system pm state transitions:
* Warn about certain illegal calls in the device tree, like resuming
child before parent or suspending parent before child. This could
happen easily enough through sysfs, or in some cases when drivers
use device_pm_set_parent().
* Be more consistent about dev_dbg() tracing ... do it for resume() and
shutdown() too, and never if the driver doesn't have that method.
* Say which type of system sleep state is being entered.
Except for the warnings, these only affect debug messaging.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
profile=schedule parsing is not quite what it should be. First, str[7] is
'e', not ',', but then even if it did fall through, prof_on =
SCHED_PROFILING would be clobbered inside if (get_option(...)) So a small
amount of rearrangement is done in this patch to correct it.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move add_preferred_console out of CONFIG_PRINTK so serial console does the
right thing.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On my IA64 machine, after kernel 2.6.12-rc3 boots, an edge-triggered
interrupt (IRQ 46) keeps triggered over and over again. There is no IRQ 46
interrupt action handler. It has lots of impact on performance.
Kernel 2.6.10 and its prior versions have no the problem. Basically,
kernel 2.6.10 will mask the spurious edge interrupt if the interrupt is
triggered for the second time and its status includes
IRQ_DISABLE|IRQ_PENDING.
Originally, IA64 kernel has its own specific _irq_desc definitions in file
arch/ia64/kernel/irq.c. The definition initiates _irq_desc[irq].status to
IRQ_DISABLE. Since kernel 2.6.11, it was moved to architecture independent
codes, i.e. kernel/irq/handle.c, but kernel/irq/handle.c initiates
_irq_desc[irq].status to 0 instead of IRQ_DISABLE.
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Der... if you use max_t it helps if you give it a type.
Note to self: Always just apply the tested patches, don't try to port
them by hand. You're not clever enough.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
I'm going through the kernel code and have a patch that corrects
several spelling errors in comments.
From: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
This patch adds more messages types to the audit subsystem so that audit
analysis is quicker, intuitive, and more useful.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
---
I forgot one type in the big patch. I need to add one for user space
originating SE Linux avc messages. This is used by dbus and nscd.
-Steve
---
Updated to 2.6.12-rc4-mm1.
-dwmw2
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Otherwise, we will be repeatedly reallocating, even if we're only
adding a few bytes at a time. Pointed out by Steve Grubb.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Add audit_log_type to allow callers to specify type and pid when logging.
Convert audit_log to wrapper around audit_log_type. Could have
converted all audit_log callers directly, but common case is default
of type AUDIT_KERNEL and pid 0. Update audit_log_start to take type
and pid values when creating a new audit_buffer. Move sequences that
did audit_log_start, audit_log_format, audit_set_type, audit_log_end,
to simply call audit_log_type directly. This obsoletes audit_set_type
and audit_set_pid, so remove them.
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Remove code conditionally dependent on CONFIG_AUDITSYSCALL from audit.c.
Move these dependencies to audit.h with the rest.
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Audit now actually requires netlink. So make it depend on CONFIG_NET,
and remove the inline dependencies on CONFIG_NET.
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>