This patch adds a feature to dynamically replace the ftrace code
with the jmps to allow a kernel with ftrace configured to run
as fast as it can without it configured.
The way this works, is on bootup (if ftrace is enabled), a ftrace
function is registered to record the instruction pointer of all
places that call the function.
Later, if there's still any code to patch, a kthread is awoken
(rate limited to at most once a second) that performs a stop_machine,
and replaces all the code that was called with a jmp over the call
to ftrace. It only replaces what was found the previous time. Typically
the system reaches equilibrium quickly after bootup and there's no code
patching needed at all.
e.g.
call ftrace /* 5 bytes */
is replaced with
jmp 3f /* jmp is 2 bytes and we jump 3 forward */
3:
When we want to enable ftrace for function tracing, the IP recording
is removed, and stop_machine is called again to replace all the locations
of that were recorded back to the call of ftrace. When it is disabled,
we replace the code back to the jmp.
Allocation is done by the kthread. If the ftrace recording function is
called, and we don't have any record slots available, then we simply
skip that call. Once a second a new page (if needed) is allocated for
recording new ftrace function calls. A large batch is allocated at
boot up to get most of the calls there.
Because we do this via stop_machine, we don't have to worry about another
CPU executing a ftrace call as we modify it. But we do need to worry
about NMI's so all functions that might be called via nmi must be
annotated with notrace_nmi. When this code is configured in, the NMI code
will not call notrace.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add preempt off timings. A lot of kernel core code is taken from the RT patch
latency trace that was written by Ingo Molnar.
This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers
Now instead of just tracing irqs off, preemption off can be selected
to be recorded.
When this is selected, it shares the same files as irqs off timings.
One can either trace preemption off, irqs off, or one or the other off.
By echoing "preemptoff" into /debugfs/tracing/current_tracer, recording
of preempt off only is performed. "irqsoff" will only record the time
irqs are disabled, but "preemptirqsoff" will take the total time irqs
or preemption are disabled. Runtime switching of these options is now
supported by simpling echoing in the appropriate trace name into
/debugfs/tracing/current_tracer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds the latency tracer infrastructure. This patch
does not add anything that will select and turn it on, but will
be used by later patches.
If it were to be compiled, it would add the following files
to the debugfs:
The root tracing directory:
/debugfs/tracing/
This patch also adds the following files:
available_tracers
list of available tracers. Currently no tracers are
available. Looking into this file only shows
"none" which is used to unregister all tracers.
current_tracer
The trace that is currently active. Empty on start up.
To switch to a tracer simply echo one of the tracers that
are listed in available_tracers:
example: (used with later patches)
echo function > /debugfs/tracing/current_tracer
To disable the tracer:
echo disable > /debugfs/tracing/current_tracer
tracing_enabled
echoing "1" into this file starts the ftrace function tracing
(if sysctl kernel.ftrace_enabled=1)
echoing "0" turns it off.
latency_trace
This file is readonly and holds the result of the trace.
trace
This file outputs a easier to read version of the trace.
iter_ctrl
Controls the way the output of traces look.
So far there's two controls:
echoing in "symonly" will only show the kallsyms variables
without the addresses (if kallsyms was configured)
echoing in "verbose" will change the output to show
a lot more data, but not very easy to understand by
humans.
echoing in "nosymonly" turns off symonly.
echoing in "noverbose" turns off verbose.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If CONFIG_FTRACE is selected and /proc/sys/kernel/ftrace_enabled is
set to a non-zero value the ftrace routine will be called everytime
we enter a kernel function that is not marked with the "notrace"
attribute.
The ftrace routine will then call a registered function if a function
happens to be registered.
[ This code has been highly hacked by Steven Rostedt and Ingo Molnar,
so don't blame Arnaldo for all of this ;-) ]
Update:
It is now possible to register more than one ftrace function.
If only one ftrace function is registered, that will be the
function that ftrace calls directly. If more than one function
is registered, then ftrace will call a function that will loop
through the functions to call.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The tracer wants to be able to convert the state number
into a user visible character. This patch pulls that conversion
string out the scheduler into the header. This way if it were to
ever change, other parts of the kernel will know.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
add 3 lightweight callbacks to the tracer backend.
zero impact if tracing is turned off.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On kvm I have seen some rare hangs in stop_machine when I used more guest
cpus than hosts cpus. e.g. 32 guest cpus on 1 host cpu triggered the
hang quite often. I could also reproduce the problem on a 4 way z/VM host with
a 64 way guest.
It turned out that the guest was consuming all available cpus mostly for
spinning on scheduler locks like rq->lock. This is expected as the threads are
calling yield all the time.
The problem is now, that the host scheduling decisings together with the guest
scheduling decisions and spinlocks not being fair managed to create an
interesting scenario similar to a live lock. (Sometimes the hang resolved
itself after some minutes)
Changing stop_machine to yield the cpu to the hypervisor when yielding inside
the guest fixed the problem for me. While I am not completely happy with this
patch, I think it causes no harm and it really improves the situation for me.
I used cpu_relax for yielding to the hypervisor, does that work on all
architectures?
p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use
stop_machine_run and both triggered the problem after some retries.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
All uses of list_for_each_rcu() can be profitably replaced by the
easier-to-use list_for_each_entry_rcu(). This patch makes this change
for the Audit system, in preparation for removing the list_for_each_rcu()
API entirely. This time with well-formed SOB.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10663
Reporter: Daniel Marjamki <danielm77@spray.se>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and
we don't want size calculation for ->fd[] to overflow.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a common hex array in hexdump.c so everyone can use it.
Add a common hi/lo helper to avoid the shifting masking that is
done to get the upper and lower nibbles of a byte value.
Pull the pack_hex_byte helper from kgdb as it is opencoded many
places in the tree that will be consolidated.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Return type of cpu_rt_runtime_write() should be int instead of ssize_t.
Signed-off-by: Mirco Tischler <mt-ml@gmx.de>
Acked-by: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It acts exactly like a regular 'cond_resched()', but will not get
optimized away when CONFIG_PREEMPT is set.
Normal kernel code is already preemptable in the presense of
CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
02b67cc3ba "sched: do not do
cond_resched() when CONFIG_PREEMPT").
But when wanting to conditionally reschedule while holding a lock, you
need to use "cond_sched_lock(lock)", and the new function is the BKL
equivalent of that.
Also make fs/locks.c use it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generic semaphore rewrite had a huge performance regression on AIM7
(and potentially other BKL-heavy benchmarks) because the generic
semaphores had been rewritten to be simple to understand and fair. The
latter, in particular, turns a semaphore-based BKL implementation into a
mess of scheduling.
The attempt to fix the performance regression failed miserably (see the
previous commit 00b41ec261 'Revert
"semaphore: fix"'), and so for now the simple and sane approach is to
instead just go back to the old spinlock-based BKL implementation that
never had any issues like this.
This patch also has the advantage of being reported to fix the
regression completely according to Yanmin Zhang, unlike the semaphore
hack which still left a couple percentage point regression.
As a spinlock, the BKL obviously has the potential to be a latency
issue, but it's not really any different from any other spinlock in that
respect. We do want to get rid of the BKL asap, but that has been the
plan for several years.
These days, the biggest users are in the tty layer (open/release in
particular) and Alan holds out some hope:
"tty release is probably a few months away from getting cured - I'm
afraid it will almost certainly be the very last user of the BKL in
tty to get fixed as it depends on everything else being sanely locked."
so while we're not there yet, we do have a plan of action.
Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit bf726eab37, as it has
been reported to cause a regression with processes stuck in __down(),
apparently because some missing wakeup.
Quoth Sven Wegener:
"I'm currently investigating a regression that has showed up with my
last git pull yesterday. Bisecting the commits showed bf726e
"semaphore: fix" to be the culprit, reverting it fixed the issue.
Symptoms: During heavy filesystem usage (e.g. a kernel compile) I get
several compiler processes in uninterruptible sleep, blocking all i/o
on the filesystem. System is an Intel Core 2 Quad running a 64bit
kernel and userspace. Filesystem is xfs on top of lvm. See below for
the output of sysrq-w."
See
http://lkml.org/lkml/2008/5/10/45
for full report.
In the meantime, we can just fix the BKL performance regression by
reverting back to the good old BKL spinlock implementation instead,
since any sleeping lock will generally perform badly, especially if it
tries to be fair.
Reported-by: Sven Wegener <sven.wegener@stealer.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus found a logic bug: we ignore the version number in a module's
vermagic string if we have CONFIG_MODVERSIONS set, but modversions
also lets through a module with no __versions section for modprobe
--force (with tainting, but still).
We should only ignore the start of the vermagic string if the module
actually *has* crcs to check. Rather than (say) having an
entertaining hissy fit and creating a config option to work around the
buggy code.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We allow missing __versions sections, because modprobe --force strips
it. It makes less sense to allow sections where there's no version
for a specific symbol the module uses, so disallow that.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
Revert "relay: fix splice problem"
docbook: fix bio missing parameter
block: use unitialized_var() in bio_alloc_bioset()
block: avoid duplicate calls to get_part() in disk stat code
cfq-iosched: make io priorities inherit CPU scheduling class as well as nice
block: optimize generic_unplug_device()
block: get rid of likely/unlikely predictions in merge logic
vfs: splice remove_suid() cleanup
cfq-iosched: fix RCU race in the cfq io_context destructor handling
block: adjust tagging function queue bit locking
block: sysfs store function needs to grab queue_lock and use queue_flag_*()
Due to a merge conflict, the sched_relax_domain_level control file was marked
as being handled by cpuset_read/write_u64, but the code to handle it was
actually in cpuset_common_file_read/write.
Since the value being written/read is in fact a signed integer, it should be
treated as such; this patch adds cpuset_read/write_s64 functions, and uses
them to handle the sched_relax_domain_level file.
With this patch, the sched_relax_domain_level can be read and written, and the
correct contents seen/updated.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The conversion between virtual and real time is as follows:
dvt = rw/w * dt <=> dt = w/rw * dvt
Since we want the fair sleeper granularity to be in real time, we actually
need to do:
dvt = - rw/w * l
This bug could be related to the regression reported by Yanmin Zhang:
| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots
| of regressions with 2.6.26-rc1:
|
| 1) 8-core stoakley: 28%;
| 2) 16-core tigerton: 20%;
| 3) Itanium Montvale: 50%.
Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yanmin Zhang reported:
| Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th
| regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton,
| and Itanium Montecito. Bisect located the patch below:
|
| 64ac24e738 is first bad commit
| commit 64ac24e738
| Author: Matthew Wilcox <matthew@wil.cx>
| Date: Fri Mar 7 21:55:58 2008 -0500
|
| Generic semaphore implementation
|
| After I manually reverted the patch against 2.6.26-rc1 while fixing
| lots of conflicts/errors, aim7 regression became less than 2%.
i reproduced the AIM7 workload and can confirm Yanmin's findings that
-.26-rc1 regresses over .25 - by over 67% here.
Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: it was inefficient wakeup / scheduling
/ locking behavior of the new generic semaphore code, causing suboptimal
performance.
The problem comes from the following code. The new semaphore code does
this on down():
spin_lock_irqsave(&sem->lock, flags);
if (likely(sem->count > 0))
sem->count--;
else
__down(sem);
spin_unlock_irqrestore(&sem->lock, flags);
and this on up():
spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);
where __up() does:
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);
and where __down() does this in essence:
list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = 0;
for (;;) {
[...]
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);
if (waiter.up)
return 0;
}
the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter from
the wait list.
That is all and fine in principle, but as implemented in
kernel/semaphore.c it also creates a nasty, hidden source of contention!
The contention comes from the following property of the new semaphore
code: the new owner owns the semaphore exclusively, even if it is not
running yet.
So if the old owner, even if just a few instructions later, does a
down() [lock_kernel()] again, it will be blocked and will have to wait
on the new owner to eventually be scheduled (possibly on another CPU)!
Or if another task gets to lock_kernel() sooner than the "new owner"
scheduled, it will be blocked unnecessarily and for a very long time
when there are 2000 tasks running.
I.e. the implementation of the new semaphores code does wake-one and
lock ownership in a very restrictive way - it does not allow
opportunistic re-locking of the lock at all and keeps the scheduler from
picking task order intelligently.
This kind of scheduling, with 2000 AIM7 processes running, creates awful
cross-scheduling between those 2000 tasks, causes reduced parallelism, a
throttled runqueue length and a lot of idle time. With increasing number
of CPUs it causes an exponentially worse behavior in AIM7, as the chance
for a newly woken new-owner task to actually run anytime soon is less
and less likely.
Note that it takes just a tiny bit of contention for the 'new-semaphore
catastrophy' to happen: the wakeup latencies get added to whatever small
contention there is, and quickly snowball out of control!
I believe Yanmin's findings and numbers support this analysis too.
The best fix for this problem is to use the same scheduling logic that
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and
wanted because we do not want to over-schedule), but also allow
opportunistic locking of the lock even if a wakee is already "in
flight".
The patch below implements this new logic. With this patch applied the
AIM7 regression is largely fixed on my quad testbox:
# v2.6.25 vanilla:
..................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 56096.4 91 207.5 789.7 0.4675
2000 55894.4 94 208.2 792.7 0.4658
# v2.6.26-rc1-166-gc0a1811 vanilla:
...................................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 33230.6 83 350.3 784.5 0.2769
2000 31778.1 86 366.3 783.6 0.2648
# v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
...............................................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 55707.1 92 209.0 795.6 0.4642
2000 55704.4 96 209.0 796.0 0.4642
i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
performance levels and have zero idle time during the test, as expected.
Btw., interactivity also improved dramatically with the fix - for
example console-switching became almost instantaneous during this
workload (which after all is running 2000 tasks at once!), without the
patch it was stuck for a minute at times.
There's another nice side-effect of this speedup patch, the new generic
semaphore code got even smaller:
text data bss dec hex filename
1241 0 0 1241 4d9 semaphore.o.before
1207 0 0 1207 4b7 semaphore.o.after
(because the waiter.up complication got removed.)
Longer-term we should look into using the mutex code for the generic
semaphore code as well - but i's not easy due to legacies and it's
outside of the scope of v2.6.26 and outside the scope of this patch as
well.
Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this replaces the rq->clock stuff (and possibly cpu_clock()).
- architectures that have an 'imperfect' hardware clock can set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
- the 'jiffie' window might be superfulous when we update tick_gtod
before the __update_sched_clock() call in sched_clock_tick()
- cpu_clock() might be implemented as:
sched_clock_cpu(smp_processor_id())
if the accuracy proves good enough - how far can TSC drift in a
single jiffie when considering the filtering and idle hooks?
[ mingo@elte.hu: various fixes and cleanups ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
David Miller pointed it out that nothing in cpu_clock() sets
prev_cpu_time. This caused __sync_cpu_clock() to be called
all the time - against the intention of this code.
The result was that in practice we hit a global spinlock every
time cpu_clock() is called - which - even though cpu_clock()
is used for tracing and debugging, is suboptimal.
While at it, also:
- move the irq disabling to the outest layer,
this should make cpu_clock() warp-free when called with irqs
enabled.
- use long long instead of cycles_t - for platforms where cycles_t
is 32-bit.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When I echoed 0 into the "cpu.shares" file, a Div0 error occured.
We found it is caused by the following calling.
sched_group_set_shares(tg, shares)
set_se_shares(tg->se[i], shares/nr_cpu_ids)
__set_se_shares(se, shares)
div64_64((1ULL<<32), shares)
When the echoed value was less than the number of processores, the result of the
sentence "shares/nr_cpu_ids" was 0, and then the system called div64() to divide
the result, the Div0 error occured.
It is unnecessary that the shares value is divided by nr_cpu_ids, I think.
Because in the function __update_group_shares_cpu() and init_tg_cfs_entry(),
the shares value isn't divided by nr_cpu_ids when setting shares of the sched
entity.
This patch fixes this bug. And echoing ULONG_MAX value into cpu.shares also
causes Div0 error, so we set a macro MAX_SHARES to limit the max value of
shares.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Concurrent calls to detach_destroy_domains and arch_init_sched_domains
were prevented by the old scheduler subsystem cpu hotplug mutex. When
this got converted to get_online_cpus() the locking got broken.
Unlike before now several processes can concurrently enter the critical
sections that were protected by the old lock.
So use the already present doms_cur_mutex to protect these sections again.
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Revert debugging commit 7ba2e74ab5.
print_cfs_rq_tasks() can induce live-lock if a task is dequeued
during list traversal.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
http://bugzilla.kernel.org/show_bug.cgi?id=10545
sched_stats.h says that __sched_info_switch is "called when prev !=
next" in the comment. sched.c should therefore do that.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Gautham R Shenoy reported:
> While running the usual CPU-Hotplug stress tests on linux-2.6.25,
> I noticed the following in the console logs.
>
> This is a wee bit difficult to reproduce. In the past 10 runs I hit this
> only once.
>
> ------------[ cut here ]------------
>
> WARNING: at kernel/sched.c:962 hrtick+0x2e/0x65()
>
> Just wondering if we are doing a good job at handling the cancellation
> of any per-cpu scheduler timers during CPU-Hotplug.
This looks like its indeed not cancelled at all and migrates the it to
another cpu. Fix it via a proper hotplug notifier mechanism.
Reported-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We currently use an optimization to skip the overhead of wake-idle
processing if more than one task is assigned to a run-queue. The
assumption is that the system must already be load-balanced or we
wouldnt be overloaded to begin with.
The problem is that we are looking at rq->nr_running, which may include
RT tasks in addition to CFS tasks. Since the presence of RT tasks
really has no bearing on the balance status of CFS tasks, this throws
the calculation off.
This patch changes the logic to only consider the number of CFS tasks
when making the decision to optimze the wake-idle.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where we
will always evaluate to "true". You can find the thread here:
http://lkml.org/lkml/2008/4/22/296
In reality, we only want to try to push tasks away when a wake up request is
not going to preempt the current task. So lets fix it.
Note: We introduce test_tsk_need_resched() instead of open-coding the flag
check so that the merge-conflict with -rt should help remind us that we
may need to support NEEDS_RESCHED_DELAYED in the future, too.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Noticed by sparse:
kernel/sched.c:760:20: warning: symbol 'sched_feat_names' was not declared. Should it be static?
kernel/sched.c:767:5: warning: symbol 'sched_feat_open' was not declared. Should it be static?
kernel/sched_fair.c:845:3: warning: returning void-valued expression
kernel/sched.c:4386:3: warning: returning void-valued expression
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The C files are included directly in sched.c, so they are
effectively static.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Joel noticed that the !lw->inv_weight contition isn't unlikely anymore so
remove the unlikely annotation. Also, remove the two div64_u64() inv_weight
calculations, which makes them rely on the calc_delta_mine() path as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Joel Schopp <jschopp@austin.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Normalized sleeper uses calc_delta*() which requires that the rq load is
already updated, so move account_entity_enqueue() before place_entity()
Tested-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kgdb: kconfig fix xconfig/menuconfig element
kgdb: fix signedness mixmatches, add statics, add declaration to header
kgdb: 1000 loops for the single step test in kgdbts
kgdb: trivial sparse fixes in kgdb test-suite
kgdb: minor documentation fixes
Since FUTEX_FD was scheduled for removal in June 2007 lets remove it.
Google Code search found no users for it and NGPT was abandoned in 2003
according to IBM. futex.h is left untouched to make sure the id does
not get reassigned. Since queue_me() has no users left it is commented
out to avoid a warning, i didnt remove it completely since it is part of
the internal api (matching unqueue_me())
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed rest)
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Noticed by sparse:
arch/x86/kernel/kgdb.c:556:15: warning: symbol 'kgdb_arch_pc' was not declared. Should it be static?
kernel/kgdb.c:149:8: warning: symbol 'kgdb_do_roundup' was not declared. Should it be static?
kernel/kgdb.c:193:22: warning: symbol 'kgdb_arch_pc' was not declared. Should it be static?
kernel/kgdb.c:712:5: warning: symbol 'remove_all_break' was not declared. Should it be static?
Related to kgdb_hex2long:
arch/x86/kernel/kgdb.c:371:28: warning: incorrect type in argument 2 (different signedness)
arch/x86/kernel/kgdb.c:371:28: expected long *long_val
arch/x86/kernel/kgdb.c:371:28: got unsigned long *<noident>
kernel/kgdb.c:469:27: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:469:27: expected long *long_val
kernel/kgdb.c:469:27: got unsigned long *<noident>
kernel/kgdb.c:470:27: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:470:27: expected long *long_val
kernel/kgdb.c:470:27: got unsigned long *<noident>
kernel/kgdb.c:894:27: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:894:27: expected long *long_val
kernel/kgdb.c:894:27: got unsigned long *<noident>
kernel/kgdb.c:895:27: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:895:27: expected long *long_val
kernel/kgdb.c:895:27: got unsigned long *<noident>
kernel/kgdb.c:1127:28: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:1127:28: expected long *long_val
kernel/kgdb.c:1127:28: got unsigned long *<noident>
kernel/kgdb.c:1132:25: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:1132:25: expected long *long_val
kernel/kgdb.c:1132:25: got unsigned long *<noident>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The kernel module loader used to be much too happy to allow loading of
modules for the wrong kernel version by default. For example, if you
had MODVERSIONS enabled, but tried to load a module with no version
info, it would happily load it and taint the kernel - whether it was
likely to actually work or not!
Generally, such forced module loading should be considered a really
really bad idea, so make it conditional on a new config option
(MODULE_FORCE_LOAD), and make it default to off.
If somebody really wants to force module loads, that's their problem,
but we should not encourage it. Especially as it happened to me by
mistake (ie regular unversioned Fedora modules getting loaded) causing
lots of strange behavior.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no harm, when users can read the info and we ask often enough
during debugging for this kind of information.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
File permissions for
/sys/devices/system/clocksource/clocksource0/available_clocksource
are 600 which allows write access. But this is in fact a read only
file. So change permissions to 400.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The helper function hrtimer_callback_running() is used in
kernel/hrtimer.c as well as in the updated net/can/bcm.c which now
supports hrtimers. Moving the helper function to hrtimer.h removes the
duplicate definition in the C-files.
Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Force constants in kernel/timeconst.h (except shift counts) to be 64 bits,
using U64_C() constructor macros, and eliminate constants that cannot
be represented at all in 64 bits. This avoids warnings with some gcc
versions.
Drop generating 64-bit constants, since we have no real hope of
getting a full set (operation on 64-bit values requires a 128-bit
intermediate result, which gcc only supports on 64-bit platforms, and
only with libgcc support on some.) Note that the use of these
constants does not depend on if we are on a 32- or 64-bit architecture.
This resolves Bugzilla 10153.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Uwe Kleine-Koenig has some strange hardware where one of the shared
interrupts can be asserted during boot before the appropriate driver
loads. Requesting the shared irq line from another driver result in a
spurious interrupt storm which finally disables the interrupt line.
I have seen similar behaviour on resume before (the hardware does not
work anymore so I can not verify).
Change the spurious disable logic to increment the disable depth and
mark the interrupt with an extra flag which allows us to reenable the
interrupt when a new driver arrives which requests the same irq
line. In the worst case this will disable the irq again via the
spurious trap, but there is a decent chance that the new driver is the
one which can handle the already asserted interrupt and makes the box
usable again.
Eric Biederman said further: This case also happens on a regular basis
in kdump kernels where we deliberately don't shutdown the hardware
before starting the new kernel. This patch should reduce the need for
using irqpoll in that situation by a small amount.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-and-Acked-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>
With s390 the last arch switched to the generic sys_ptrace yesterday so
we can now kill the ifdef around it to enforce every new port it using
it instead of introducing new weirdo versions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
timer_stats_timer_set_start_info is invoked twice, additionally, the
invocation of this function can be moved to where it is only called when a
delay is really required.
Signed-off-by: Andrew Liu <shengping.liu@windriver.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The extended crashkernel syntax is a little confusing in the way it handles
ranges. eg:
crashkernel=512M-2G:64M,2G-:128M
Means if the machine has between 512M and 2G of memory the crash region should
be 64M, and if the machine has 2G of memory the region should be 64M. Only if
the machine has more than 2G memory will 128M be allocated.
Although that semantic is correct, it is somewhat baffling. Instead I propose
that the end of the range means the first address past the end of the range,
ie: 512M up to but not including 2G.
[bwalle@suse.de: clarify inclusive/exclusive in crashkernel commandline in documentation]
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Bernhard Walle <bwalle@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the leap second handling from second_overflow(), which doesn't have to
check for it every second anymore. With CONFIG_NO_HZ this also makes sure the
leap second is handled close to the full second. Additionally this makes it
possible to abort a leap second properly by resetting the STA_INS/STA_DEL
status bits.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
current_tick_length used to do a little more, but now it just returns
tick_length, which we can also access directly at the few places, where it's
needed.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As TICK_LENGTH_SHIFT is used for more than just the tick length, the name
isn't quite approriate anymore, so this renames it to NTP_SCALE_SHIFT.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds support for setting the TAI value (International Atomic Time). The
value is reported back to userspace via timex (as we don't have a
ntp_gettime() syscall).
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
time_offset is already a 64bit value but its resolution barely used, so this
makes better use of it by replacing SHIFT_UPDATE with TICK_LENGTH_SHIFT.
Side note: the SHIFT_HZ in SHIFT_UPDATE was incorrect for CONFIG_NO_HZ and the
primary reason for changing time_offset to 64bit to avoid the overflow.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes time_freq to a 64bit value and makes it static (the only outside
user had no real need to modify it). Intermediate values were already 64bit,
so the change isn't that big, but it saves a little in shifts by replacing
SHIFT_NSEC with TICK_LENGTH_SHIFT. PPM_SCALE is then used to convert between
user space and kernel space representation.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a few more things from the ntp nanokernel related to user space.
It's now possible to select the resolution used of some values via STA_NANO
and the kernel reports in which mode it works (pll/fll).
If some values for adjtimex() are outside the acceptable range, they are now
simply normalized instead of letting the syscall fail. I removed
MOD_CLKA/MOD_CLKB as the mapping didn't really makes any sense, the kernel
doesn't support setting the clock.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is mostly a style cleanup of ntp.c and extracts part of do_adjtimex as
ntp_update_offset(). Otherwise the functionality is still the same as before.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86 is the only arch right now, which provides an optimized for
div_long_long_rem and it has the downside that one has to be very careful that
the divide doesn't overflow.
The API is a little akward, as the arguments for the unsigned divide are
signed. The signed version also doesn't handle a negative divisor and
produces worse code on 64bit archs.
There is little incentive to keep this API alive, so this converts the few
users to the new API.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename div64_64 to div64_u64 to make it consistent with the other divide
functions, so it clearly includes the type of the divide. Move its definition
to math64.h as currently no architecture overrides the generic implementation.
They can still override it of course, but the duplicated declarations are
avoided.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This converts a few users of do_div to div_[su]64 and this demonstrates nicely
how it can reduce some expressions to one-liners.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
currently cpu hotplug (unplug) seems broken on s390 and likely others. On cpu
unplug the system starts to behave very strange and hangs.
I bisected the problem to the following commit:
commit 48f20a9a94
Author: Olof Johansson <olof@lixom.net>
Date: Tue Mar 4 15:23:25 2008 -0800
tasklets: execute tasklets in the same order they were queued
Reverting this patch seems to fix the problem. I looked into takeover_tasklet
and it seems that there is a way to corrupt the tail pointer of the current
cpu. If the tasklet list of the frozen cpu is empty, the tail pointer of the
current cpu points to the address of the head pointer of the stopped cpu and
not to the next pointer of a tasklet_struct.
This patch avoids the list splice of the list is empty and cpu hotplug seems
to work as the tail pointer is not corrupted. Olof, can you look into that
patch and ACK/NACK it so Andrew can push this to Linus, if appropriate?
Please note that some lines are longer than 80 chars, but line-wrapping looked
worse that this version.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide module unload callback. Required by the gcov profiling
infrastructure to keep track of profiling data structures.
Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Make verify_export_symbols check the modules unused, unused_gpl and
gpl_future syms.
Inspired by Jan Beulich's fix, but table-driven.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Obvious typo, but I don't know of any modules with unused GPL exports,
and then it would take someone noticing that the version shouldn't
have matched in a dependent module.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
__find_symbol() has grown over time: there are now 5 different arrays
of symbols it traverses. It also shouldn't print out a warning on
some calls (ie. verify_symbol which simply checks for name clashes,
and __symbol_put which checks for bugs).
1) Rename to find_symbol: no need for underscores.
2) Use bool and add "warn" parameter to suppress warnings.
3) Make table-driven rather than open coded.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (179 commits)
ACPI: Fix acpi_processor_idle and idle= boot parameters interaction
acpi: fix section mismatch warning in pnpacpi
intel_menlo: fix build warning
ACPI: Cleanup: Remove unneeded, multiple local dummy variables
ACPI: video - fix permissions on some proc entries
ACPI: video - properly handle errors when registering proc elements
ACPI: video - do not store invalid entries in attached_array list
ACPI: re-name acpi_pm_ops to acpi_suspend_ops
ACER_WMI/ASUS_LAPTOP: fix build bug
thinkpad_acpi: fix possible NULL pointer dereference if kstrdup failed
ACPI: check a return value correctly in acpi_power_get_context()
#if 0 acpi/bay.c:eject_removable_drive()
eeepc-laptop: add hwmon fan control
eeepc-laptop: add backlight
eeepc-laptop: add base driver
ACPI: thinkpad-acpi: bump up version to 0.20
ACPI: thinkpad-acpi: fix selects in Kconfig
ACPI: thinkpad-acpi: use a private workqueue
ACPI: thinkpad-acpi: fluff really minor fix
ACPI: thinkpad-acpi: use uppercase for "LED" on user documentation
...
Fixed conflicts in drivers/acpi/video.c and drivers/misc/intel_menlow.c
manually.
hrtimers have now dynamic users in the network code. Put them under
debugobjects surveillance as well.
Add calls to the generic object debugging infrastructure and provide fixup
functions which allow to keep the system alive when recoverable problems have
been detected by the object debugging core code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add calls to the generic object debugging infrastructure and provide fixup
functions which allow to keep the system alive when recoverable problems have
been detected by the object debugging core code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kmem_cache_zalloc(), remove large amounts of initialisation code and
ifdeffery.
Note: this assumes that memset(*atomic_t, 0) correctly initialises the
atomic_t. This is true for all present archtiectures and if it becomes false
for a future architecture then we'll need to make large changes all over the
place anyway.
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix update_console_cmdline() not to to read beyond the terminating zero of its
name argument.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a minimalistic braille screen reader support. This is meant to
be used by blind people e.g. on boot failures or when / cannot be mounted
etc and thus the userland screen readers can not work.
[akpm@linux-foundation.org: fix exports]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Jiri Kosina <jikos@jikos.cz>
Cc: Dmitry Torokhov <dtor@mail.ru>
Acked-by: Alan Cox <alan@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().
Misc cleanups:
- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almst all users will want to have them toghether
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These values represent the nesting level of a namespace and pids living in it,
and it's always non-negative.
Turning this from int to unsigned int saves some space in pid.c (11 bytes on
x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit.
E.g. on ia64 this removes the sign extension calls, which compiler adds to
optimize access to pid->nubers[ns->level].
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the needlessly global marker_debug being static gcc can optimize the
unused code away.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. sys_getpgid() needs rcu_read_lock() to derive the pgrp _nr, even if
the task is current, otherwise we can race with another thread which
does sys_setpgid().
2. Use rcu_read_lock() instead of tasklist_lock when pid != 0, make sure
that we don't use the NULL pid if the task exits right after successful
find_task_by_vpid().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. sys_getsid() needs rcu_read_lock() to derive the session _nr, even if
the task is current, otherwise we can race with another thread which
does sys_setsid().
2. The task can exit between find_task_by_vpid() and task_session_vnr(),
in that unlikely case sys_getsid() returns 0 instead of -ESRCH.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use change_pid() instead of detach_pid() + attach_pid() in
__set_special_pids().
This way task_session() is not NULL in between.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use change_pid() instead of detach_pid() + attach_pid() in sys_setpgid().
This way task_pgrp() is not NULL in between.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on Eric W. Biederman's idea.
Without tasklist_lock held task_session()/task_pgrp() can return NULL if the
caller races with setprgp()/setsid() which does detach_pid() + attach_pid().
This can happen even if task == current.
Intoduce the new helper, change_pid(), which should be used instead. This way
the caller always sees the special pid != NULL, either old or new.
Also change the prototype of attach_pid(), it always returns 0 and nobody
check the returned value.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on Eric W. Biederman's idea.
Unless task == current, without tasklist_lock held task_session()/task_pgrp()
can return NULL if the caller races with de_thread() which switches the group
leader.
Change transfer_pid() to not clear old->pids[type].pid for the old leader.
This means that its .pid can point to "nowhere", but this is already true for
sub-threads, and the old leader is not group_leader() any longer. IOW, with
or without this change we can't trust task's special pids unless it is the
group leader.
With this change the following code
rcu_read_lock();
task = find_task_by_xxx();
do_something(task_pgrp(task), task_session(task));
rcu_read_unlock();
can't race with exec and hit the NULL pid.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are some places that are known to operate on tasks'
global pids only:
* the rest_init() call (called on boot)
* the kgdb's getthread
* the create_kthread() (since the kthread is run in init ns)
So use the find_task_by_pid_ns(..., &init_pid_ns) there
and schedule the find_task_by_pid for removal.
[sukadev@us.ibm.com: Fix warning in kernel/pid.c]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pid to lookup a task by is passed inside taskstats code via genetlink
message.
Since netlink packets are now processed in the context of the sending task,
this is correct to lookup the task with find_task_by_vpid() here.
Besides, I fix the call to fill_pid() from taskstats_exit(), since the
tsk->pid is not required in fill_pid() in this case, and the pid field on
task_struct is going to be deprecated as well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Jonathan Lim <jlim@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The callers of free_pidmap() pass 2 members of "struct upid", we can just
pass "struct upid *" instead. Shaves off 10 bytes from pid.o.
Also, simplify the alloc_pid's "out_free:" error path a little bit. This
way it looks more clear which subset of pid->numbers[] we are freeing.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc :Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Operations are now a shared const function block as with most other Linux
objects
- Introduce wrappers for some optional functions to get consistent behaviour
- Wrap put_char which used to be patched by the tty layer
- Document which functions are needed/optional
- Make put_char report success/fail
- Cache the driver->ops pointer in the tty as tty->ops
- Remove various surplus lock calls we no longer need
- Remove proc_write method as noted by Alexey Dobriyan
- Introduce some missing sanity checks where certain driver/ldisc
combinations would oops as they didn't check needed methods were present
[akpm@linux-foundation.org: fix fs/compat_ioctl.c build]
[akpm@linux-foundation.org: fix isicom]
[akpm@linux-foundation.org: fix arch/ia64/hp/sim/simserial.c build]
[akpm@linux-foundation.org: fix kgdb]
Signed-off-by: Alan Cox <alan@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Afaics, currently there are no kernel problems with ptracing init, it can't
lose SIGNAL_UNKILLABLE flag and be killed/stopped by accident.
The ability to strace/debug init can be very useful if you try to figure out
why it does not work as expected.
However, admin should know what he does, "gdb /sbin/init 1" stops init, it
can't reap orphaned zombies or take care of /etc/inittab until continued. It
is even possible to crash init (and thus the whole system) if you wish,
ptracer has full control.
See also the long discussion: http://marc.info/?t=120628018600001
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody can block/ignore SIGSTOP, no need to use force_sig_specific() in
ptrace_attach. Use the "regular" send_sig_info().
With this patch stracing of /sbin/init doesn't clear its SIGNAL_UNKILLABLE,
but not that this makes ptracing of init safe.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently __ptrace_unlink() checks list_empty(->ptrace_list) to figure out
whether the child was reparented. Change the code to use ptrace_reparented()
to make this check more explicit and consistent.
No functional changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add another trivial helper for the sake of grep. It also auto-documents the
fact that ->parent != real_parent implies ->ptrace.
No functional changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a couple of small comments, it is not easy to see what this code does.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
exit.c has numerous "->exit_signal == -1" comparisons, this check is subtle
and deserves a helper. Imho makes the code more parseable for humans. At
least it's surely more greppable.
Also, a couple of whitespace cleanups. No functional changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the set_restore_sigmask() inline in <linux/thread_info.h> and
replaces every set_thread_flag(TIF_RESTORE_SIGMASK) with a call to it. No
change, but abstracts the details of the flag protocol from all the calls.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the buggy /sbin/init hangs if SIGSEGV/etc happens. The kernel sends
the signal, init dequeues it and ignores, returns from the exception, repeats
the faulting instruction, and so on forever.
Imho, such a behaviour is not good. I think that the explicit loud death of
the buggy /sbin/init is better than the silent hang.
Change force_sig_info() to clear SIGNAL_UNKILLABLE when the task should be
really killed.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global init has a lot of long standing problems with the unhandled fatal
signals.
- The "is_global_init(current)" check in get_signal_to_deliver()
protects only the main thread. Sub-thread can dequee the fatal
signal and shutdown the whole thread group except the main thread.
If it dequeues SIGSTOP /sbin/init will be stopped, this is not
right too. Note that we can't use is_global_init(->group_leader),
this breaks exec and this can't solve other problems we have.
- Even if afterwards ignored, the fatal signals sets SIGNAL_GROUP_EXIT
on delivery. This breaks exec, has other bad implications, and this
is just wrong.
Introduce the new SIGNAL_UNKILLABLE flag to fix these problems. It also helps
to solve some other problems addressed by the subsequent patches.
Currently we use this flag for the global init only, but it could also be used
by kthreads and (perhaps) by the sub-namespace inits.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that task_session() can't return a false NULL, check_kill_permission()
doesn't need tasklist_lock.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This wasn't documented, but as Atsushi Tsuji pointed out
check_kill_permission() needs tasklist_lock for task_session_nr(). I missed
this fact when removed tasklist from the callers.
Change check_kill_permission() to take tasklist_lock for the SIGCONT case.
Re-order security checks so that we take tasklist_lock only if/when it is
actually needed. This is a minimal fix for now, tasklist will be removed
later.
Also change the code to use task_session() instead of task_session_nr().
Also, remove the SIGCONT check from cap_task_kill(), it is bogus (and the
whole function is bogus. Serge, Eric, why it is still alive?).
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Atsushi Tsuji <a-tsuji@bk.jp.nec.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
send_signal() shouldn't call signalfd_notify() if it then fails with -EAGAIN.
Harmless, just a paranoid cleanup.
Also remove the comment. It is obsolete, signalfd_notify() was simplified and
does a simple wakeup.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A couple of small comments about how CLD_CONTINUED notification works.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename handle_stop_signal() to prepare_signal(), make it return a boolean, and
move the callsites of sig_ignored() into it.
No functional changes for now. But it would be nice to factor out the "should
we drop this signal" checks as much as possible, before we try to fix the bugs
with the sub-namespace init's signals (actually the global /sbin/init has some
problems with signals too).
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the callsite of print_fatal_signal() down, under "if
(sig_kernel_coredump(signr))", so we don't need to check signr != SIGKILL.
We are only interested in the sig_kernel_coredump() signals anyway, and due to
the previous changes we almost never can see other fatal signals here except
SIGKILL.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
handle_stop_signal() clears SIGNAL_STOP_DEQUEUED when sig == SIGKILL. Remove
this nasty special case. It was needed to prevent the race with group stop
and exit caused by thread-specific SIGKILL. Now that we use complete_signal()
for private signals too this is not needed, complete_signal() will notice
SIGKILL and abort the soon-to-begin group stop.
Except: the target thread is dead (has PF_EXITING). But in that case we
should not just clear SIGNAL_STOP_DEQUEUED and nothing more. We should either
kill the whole thread group, or silently ignore the signal.
I suspect we are not right wrt zombie leaders, but this is another issue which
and should be fixed separately. Note that this check can't abort the group
stop if it was already started/finished, this check only adds a subtle side
effect if we race with the thread which has already dequeued sig_kernel_stop()
signal and temporary released ->siglock.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We export send_sigqueue() and send_group_sigqueue() for the only user,
posix_timer_event(). This is a bit silly, because both are just trivial
helpers on top of do_send_sigqueue() and because the we pass the unused
.si_signo parameter.
Kill them both, rename do_send_sigqueue() to send_sigqueue(), and export it.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested by Pavel Emelyanov.
send_sigqueue/send_group_sigqueue are only differ in how they lock ->siglock.
Unify them. send_group_sigqueue() uses spin_lock() because it knows the task
can't exit, but in that case lock_task_sighand() can't fail and doesn't hurt.
Note that the "sig" argument is ignored, it is always equal to ->si_signo.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Factor out complete_signal() callsites. This change completely unifies the
helpers sending the specific/group signals.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on Pavel Emelyanov's suggestion.
Rename __group_complete_signal() to complete_signal() and use it to process
the specific signals too. To do this we simply add the "int group" argument.
This allows us to greatly simply the signal-sending code and adds a useful
behaviour change. We can avoid the unneeded wakeups for the private signals
because wants_signal() is more clever than sigismember(blocked), but more
importantly we now take into account the fatal specific signals too.
The latter allows us to kill some subtle checks in handle_stop_signal() and
makes the specific/group signal's behaviour more consistent. For example,
currently sigtimedwait(FATAL_SIGNAL) behaves differently depending on was the
signal sent by kill() or tkill() if the signal was not blocked.
And. This allows us to tweak/fix the behaviour when the specific signal is
sent to the dying/dead ->group_leader.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
send_signal() is used either with ->pending or with ->signal->shared_pending.
Change it to take "int group" instead, this argument will be re-used later.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the unchanged definition of __group_complete_signal() so that send_signal
can see it. To simplify the reading of the next patches.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested by Roland McGrath.
Initialize signal->curr_target in copy_signal(). This way ->curr_target is
never == NULL, we can kill the check in __group_complete_signal's hot path.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment in send_sig_info() is wrong, tasklist_lock can't help.
The caller must ensure the task can't go away, otherwise ->sighand can be NULL
even before we take the lock.
p->sighand could be changed by exec(), but I can't imagine how it is possible
to prevent exit(), but not exec().
Since the things seem to work, I assume all callers are correct. However,
drm_vbl_send_signals() looks broken. block_all_signals() which is solely used
by drm is definitely broken.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert do_tkill() to use rcu_read_lock() + lock_task_sighand() to avoid
taking tasklist lock.
Note that we don't return an error if lock_task_sighand() fails, we pretend
the task dies after receiving the signal. Otherwise, we should fight with the
nasty races with mt-exec without having any advantage.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move handle_stop_signal() into send_signal(). This factors out a couple of
callsites and allows us to do further unifications.
Also, with this change specific_send_sig_info() does handle_stop_signal().
Not that this is really important, we never send STOP/CONT via send_sig() and
friends, but still this looks more consistent.
The only (afaics) special case is get_signal_to_deliver(). If the traced task
dequeues SIGCONT, it can re-send it to itself after ptrace_stop() if the
signal was blocked by debugger. In that case handle_stop_signal() is
unnecessary, but hopefully not a problem.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
handle_stop_signal() was changed, now send_group_sigqueue() doesn't need
tasklist_lock.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cosmetic, cache p->signal to make the code a bit more readable.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
send_group_sigqueue() calls handle_stop_signal(), send_sigqueue() doesn't.
This is not consistent and in fact I'd say this is (minor) bug.
Move handle_stop_signal() from send_group_sigqueue() to do_send_sigqueue(),
the latter is called by send_sigqueue() too.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lock_task_sighand() was changed, send_sigqueue() doesn't need rcu_read_lock()
any longer.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cache the values of current->signal/sighand. Shrinks .text a bit and makes
the code more readable. Also, remove "sigset_t *mask", it is pointless
because in fact we save the constant offset.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cache the value of p->signal, and change the code to use while_each_thread()
helper.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that handle_stop_signal() doesn't drop ->siglock, we can't see both
->group_stop_count && SIGNAL_STOP_STOPPED. Merge two "if" branches.
As Roland pointed out, we never actually needed 2 do_notify_parent_cldstop()
calls.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously handle_stop_signal(SIGCONT) could drop ->siglock. That is why
kill_pid_info(SIGCONT) takes tasklist_lock to make sure the target task can't
go away after unlock. Not needed now.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on discussion with Jiri and Roland.
In short: currently handle_stop_signal(SIGCONT, p) sends the notification to
p->parent, with this patch p itself notifies its parent when it becomes
running.
handle_stop_signal(SIGCONT) has to drop ->siglock temporary in order to notify
the parent with do_notify_parent_cldstop(). This leads to multiple problems:
- as Jiri Kosina pointed out, the stopped task can resume without
actually seeing SIGCONT which may have a handler.
- we race with another sig_kernel_stop() signal which may come in
that window.
- we race with sig_fatal() signals which may set SIGNAL_GROUP_EXIT
in that window.
- we can't avoid taking tasklist_lock() while sending SIGCONT.
With this patch handle_stop_signal() just sets the new SIGNAL_CLD_CONTINUED
flag in p->signal->flags and returns. The notification is sent by the first
task which returns from finish_stop() (there should be at least one) or any
other signalled thread from get_signal_to_deliver().
This is a user-visible change. Say, currently kill(SIGCONT, stopped_child)
can't return without seeing SIGCHLD, with this patch SIGCHLD can be delayed
unpredictably. Another difference is that if the child is ptraced by another
process, CLD_CONTINUED may be delivered to ->real_parent after ptrace_detach()
while currently it always goes to the tracer which doesn't actually need this
notification. Hopefully not a problem.
The patch asks for the futher obvious cleanups, I'll send them separately.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every implementation of ->task_kill() does nothing when the signal comes from
the kernel. This is correct, but means that check_kill_permission() should
call security_task_kill() only for SI_FROMUSER() case, and we can remove the
same check from ->task_kill() implementations.
(sadly, check_kill_permission() is the last user of signal->session/__session
but we can't s/task_session_nr/task_session/ here).
NOTE: Eric W. Biederman pointed out cap_task_kill() should die, and I think
he is very right.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: David Quigley <dpquigl@tycho.nsa.gov>
Cc: Eric Paris <eparis@redhat.com>
Cc: Harald Welte <laforge@gnumonks.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both functions do the same thing after proper locking, but with
different sigpending structs, so move the common code into a helper.
After this we have 4 places that look very similar: send_sigqueue: calls
do_send_sigqueue and signal_wakeup send_group_sigqueue: calls
do_send_sigqueue and __group_complete_signal __group_send_sig_info:
calls send_signal and __group_complete_signal specific_send_sig_info:
calls send_signal and signal_wakeup
Besides, send_signal performs actions similar to do_send_sigqueue's
and __group_complete_signal - to signal_wakeup.
It looks like they can be consolidated gracefully.
Oleg said:
Personally, I think this change is very good. But send_sigqueue() and
send_group_sigqueue() have a very subtle difference which I was never able
to understand.
Let's suppose that sigqueue is already queued, and the signal is ignored
(the latter means we should re-schedule cpu timer or handle overrruns). In
that case send_sigqueue() returns 0, but send_group_sigqueue() returns 1.
I think this is not the problem (in fact, I think this patch makes the
behaviour more correct), but I hope Thomas can take a look and confirm.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The signr variable may be declared without initialization - it is set ro the
return value from __dequeue_signal() right at the function beginning.
Besides, after recalc_sigpending() two checks for signr to be not 0 may be
merged into one. Both if-s become easier to read.
Thanks to Oleg for pointing out mistakes in the first version of this patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both sig_ignored() and do_sigaction() check for signr to be explicitly or
implicitly ignored. Introduce a helper for them.
This patch is aimed to help handling signals by pid namespace's init, and was
derived from one of Oleg's patches
https://lists.linux-foundation.org/pipermail/containers/2007-December/009308.html
so, if he doesn't mind, he should be considered as an author.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just a trivial example, more to come.
k_getrusage() holds rcu_read_lock() because it was previously required by
lock_task_sighand(). Unneeded now.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of the callers of lock_task_sighand() doesn't actually need rcu_lock().
lock_task_sighand() needs it only to safely play with tsk->sighand, it can
take the lock itself.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_group_exit() checks SIGNAL_GROUP_EXIT to avoid taking sighand->siglock.
Since ed5d2cac11 exec() doesn't set this
flag, we should use signal_group_exit().
This is not needed for correctness, but can speedup the multithreaded exec
and makes the code more consistent.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_signal_stop() needs signal_group_exit() but checks sig->group_exit_task.
This (optimization) is correct, SIGNAL_STOP_DEQUEUED and SIGNAL_GROUP_EXIT
are mutually exclusive, but looks confusing. Use signal_group_exit(), this
is not fastpath, the code clarity is more important.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two callers for send_signal() - the specific_send_sig_info and the
__group_send_sig_info - both check for sig to be ignored or already queued.
Move these checks into send_signal() and make it return 1 to indicate that the
signal is dropped, but there's no error in this.
Besides, merge comments and spell-check them.
[oleg@tv-sign.ru: simplifications]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes the code more readable, due to less brackets and small letters in
name.
I also move it above the send_signal() as a preparation for the 3rd patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function doesn't change the ret's value and thus always returns 0, with a
single exception of returning -EAGAIN explicitly.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'audit.b50' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
[PATCH] new predicate - AUDIT_FILETYPE
[patch 2/2] Use find_task_by_vpid in audit code
[patch 1/2] audit: let userspace fully control TTY input auditing
[PATCH 2/2] audit: fix sparse shadowed variable warnings
[PATCH 1/2] audit: move extern declarations to audit.h
Audit: MAINTAINERS update
Audit: increase the maximum length of the key field
Audit: standardize string audit interfaces
Audit: stop deadlock from signals under load
Audit: save audit_backlog_limit audit messages in case auditd comes back
Audit: collect sessionid in netlink messages
Audit: end printk with newline
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: Skip I/O merges when disabled
block: add large command support
block: replace sizeof(rq->cmd) with BLK_MAX_CDB
ide: use blk_rq_init() to initialize the request
block: use blk_rq_init() to initialize the request
block: rename and export rq_init()
block: no need to initialize rq->cmd with blk_get_request
block: no need to initialize rq->cmd in prepare_flush_fn hook
block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline
block/elevator.c:elv_rq_merge_ok() mustn't be inline
block: make queue flags non-atomic
block: add dma alignment and padding support to blk_rq_map_kern
unexport blk_max_pfn
ps3disk: Remove superfluous cast
block: make rq_init() do a full memset()
relay: fix splice problem
The same definitions are used for the bounds logic and the asm-offsets.h
generation by kbuild. Put them into include/linux/kbuild.h file.
Also add a new feature
COMMENT("text")
which can be used to insert lines of ocmments into asm-offsets.h and
bounds.h.
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jay Estabrook <jay.estabrook@hp.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Chris Zankel <chris@zankel.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Miles Bader <miles@gnu.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use vmalloc() and memset() instead of kcalloc() to allocate a page* array when
the array size is bigger than one page. This enables relayfs to support
bigger relay buffers than 64MB on 4k-page system, 512MB on 16k-page system.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: David Wilder <dwilder@us.ibm.com>
Reviewed-by: Tom Zanussi <zanussi@comcast.net>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some drivers have duplicated unlikely() macros. IS_ERR() already has
unlikely() in itself.
This patch cleans up such pointless code.
Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jeff Garzik <jeff@garzik.org>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading from/writing to some table, a root, which this table came from,
may affect this table's permissions, depending on who is working with the
table.
The core hunk is at the bottom of this patch. All the rest is just pushing
the ctl_table_root argument up to the sysctl_perm() function.
This will be mostly (only?) used in the net sysctls.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The do_sysctl_strategy isn't used outside kernel/sysctl.c, so this can be
static and without a prototype in header.
Besides, move this one and parse_table() above their callers and drop the
forward declarations of the latter call.
One more "besides" - fix two checkpatch warnings: space before a ( and an
extra space at the end of a line.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Disable sysctl_check.c for embedded targets. This saves about about 11 kB
in .text and another 11 kB in .data on a PXA255 embedded platform.
Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
be setup before gluing PDE to main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove proc_root export. Creation and removal works well if parent PDE is
supplied as NULL -- it worked always that way.
So, one useless export removed and consistency added, some drivers created
PDEs with &proc_root as parent but removed them as NULL and so on.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel implements readlink of /proc/pid/exe by getting the file from
the first executable VMA. Then the path to the file is reconstructed and
reported as the result.
Because of the VMA walk the code is slightly different on nommu systems.
This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
walking the VMAs to find the first executable file-backed VMA we store a
reference to the exec'd file in the mm_struct.
That reference would prevent the filesystem holding the executable file
from being unmounted even after unmapping the VMAs. So we track the number
of VM_EXECUTABLE VMAs and drop the new reference when the last one is
unmapped. This avoids pinning the mounted filesystem.
[akpm@linux-foundation.org: improve comments]
[yamamoto@valinux.co.jp: fix dup_mmap]
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: David Howells <dhowells@redhat.com>
Cc:"Eric W. Biederman" <ebiederm@xmission.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the keyring quotas controllable through /proc/sys files:
(*) /proc/sys/kernel/keys/root_maxkeys
/proc/sys/kernel/keys/root_maxbytes
Maximum number of keys that root may have and the maximum total number of
bytes of data that root may have stored in those keys.
(*) /proc/sys/kernel/keys/maxkeys
/proc/sys/kernel/keys/maxbytes
Maximum number of keys that each non-root user may have and the maximum
total number of bytes of data that each of those users may have stored in
their keys.
Also increase the quotas as a number of people have been complaining that it's
not big enough. I'm not sure that it's big enough now either, but on the
other hand, it can now be set in /etc/sysctl.conf.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <kwc@citi.umich.edu>
Cc: <arunsr@cse.iitk.ac.in>
Cc: <dwalsh@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't generate the per-UID user and user session keyrings unless they're
explicitly accessed. This solves a problem during a login process whereby
set*uid() is called before the SELinux PAM module, resulting in the per-UID
keyrings having the wrong security labels.
This also cures the problem of multiple per-UID keyrings sometimes appearing
due to PAM modules (including pam_keyinit) setuiding and causing user_structs
to come into and go out of existence whilst the session keyring pins the user
keyring. This is achieved by first searching for extant per-UID keyrings
before inventing new ones.
The serial bound argument is also dropped from find_keyring_by_name() as it's
not currently made use of (setting it to 0 disables the feature).
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <kwc@citi.umich.edu>
Cc: <arunsr@cse.iitk.ac.in>
Cc: <dwalsh@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CLONE_NEWIPC|CLONE_SYSVSEM interaction isn't handled properly. This can cause
a kernel memory corruption. CLONE_NEWIPC must detach from the existing undo
lists.
Fix, part 3: refuse clone(CLONE_SYSVSEM|CLONE_NEWIPC).
With unshare, specifying CLONE_SYSVSEM means unshare the sysvsem. So it seems
reasonable that CLONE_NEWIPC without CLONE_SYSVSEM would just imply
CLONE_SYSVSEM.
However with clone, specifying CLONE_SYSVSEM means *share* the sysvsem. So
calling clone(CLONE_SYSVSEM|CLONE_NEWIPC) is explicitly asking for something
we can't allow. So return -EINVAL in that case.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can
cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing
undo lists.
Fix, part 2: perform an implicit CLONE_SYSVSEM in CLONE_NEWIPC. CLONE_NEWIPC
creates a new IPC namespace, the task cannot access the existing semaphore
arrays after the unshare syscall. Thus the task can/must detach from the
existing undo list entries, too.
This fixes the kernel corruption, because it makes it impossible that
undo records from two different namespaces are in sysvsem.undo_list.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can
cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing
undo lists.
Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM)
The original reason to not support it was the potential (inevitable?)
confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the
inverse meaning of clone(CLONE_SYSVSEM).
Our two most reasonable options then appear to be (1) fully support
CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM,
but always do it anyway on unshare(CLONE_SYSVSEM). This patch does
(1).
Changelog:
Apr 16: SEH: switch to Manfred's alternative patch which
removes the unshare_semundo() function which
always refused CLONE_SYSVSEM.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The enhancement as asked for by Yasunori: if msgmni is set to a negative
value, register it back into the ipcns notifier chain.
A new interface has been added to the notification mechanism:
notifier_chain_cond_register() registers a notifier block only if not already
registered. With that new interface we avoid taking care of the states
changes in procfs.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cpu_hotplug_begin() must be always called under cpu_add_remove_lock, this
means that only one process can be cpu_hotplug.active_writer. So we don't
need the cpu_hotplug.writer_queue, we can wake up the ->active_writer
directly.
Also, fix the comment.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Acked-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cleanup_workqueue_thread() doesn't need the second argument, remove it.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When cpu_populated_map was introduced, it was supposed that cwq->thread can
survive after CPU_DEAD, that is why we never shrink cpu_populated_map.
This is not very nice, we can safely remove the already dead CPU from the map.
The only required change is that destroy_workqueue() must hold the hotplug
lock until it destroys all cwq->thread's, to protect the cpu_populated_map.
We could make the local copy of cpu mask and drop the lock, but
sizeof(cpumask_t) may be very large.
Also, fix the comment near queue_work(). Unless _cpu_down() happens we do
guarantee the cpu-affinity of the work_struct, and we have users which rely on
this.
[akpm@linux-foundation.org: repair comment]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This flag provides the hardwalling properties of mem_exclusive, without
enforcing the exclusivity. Either mem_hardwall or mem_exclusive is sufficient
to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned
nodes.
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the cpusets mem_exclusive flag is overloaded to mean both
"no-overlapping" and "no GFP_KERNEL allocations outside this cpuset".
These patches add a new mem_hardwall flag with just the allocation restriction
part of the mem_exclusive semantics, without breaking backwards-compatibility
for those who continue to use just mem_exclusive. Additionally, the cgroup
control file registration for cpusets is cleaned up to reduce boilerplate.
This patch:
This change tidies up the cpusets control file definitions, and reduces the
amount of boilerplate required to add/change control files in the future.
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the following needlessly global functions static:
- cpuset_test_cpumask()
- cpuset_change_cpumask()
- cpuset_do_move_task()
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This field is the maximal value of the usage one since the counter creation
(or since the latest reset).
To reset this to the usage value simply write anything to the appropriate
cgroup file.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the mem_cgroup member from mm_struct and instead adds an owner.
This approach was suggested by Paul Menage. The advantage of this approach
is that, once the mm->owner is known, using the subsystem id, the cgroup
can be determined. It also allows several control groups that are
virtually grouped by mm_struct, to exist independent of the memory
controller i.e., without adding mem_cgroup's for each controller, to
mm_struct.
A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.
This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_cgroup_changed callback is called with the task_lock() of
the new task held and is called just prior to changing the mm->owner.
I am indebted to Paul Menage for the several reviews of this patchset and
helping me make it lighter and simpler.
This patch was tested on a powerpc box, it was compiled with both the
MM_OWNER config turned on and off.
After the thread group leader exits, it's moved to init_css_state by
cgroup_exit(), thus all future charges from runnings threads would be
redirected to the init_css_set's subsystem.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>,
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a read_seq() helper in cftype, which uses seq_file to print out
lists. Use it in the devices cgroup. Also split devices.allow into two
files, so now devices.deny and devices.allow are the ones to use to manipulate
the whitelist, while devices.list outputs the cgroup's current whitelist.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we can run through the hash table instead of running through the
linked-list.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are at system boot and there is only 1 cgroup group (i,e, init_css_set), so
we don't need to run through the css_set linked list. Neither do we need to
run through the task list, since no processes have been created yet.
Also referring to a comment in cgroup.h:
struct css_set
{
...
/*
* Set of subsystem states, one for each subsystem. This array
* is immutable after creation apart from the init_css_set
* during subsystem registration (at boot time).
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
}
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we attach a process to a different cgroup, the css_set linked-list will
be run through to find a suitable existing css_set to use. This patch
implements a hash table for better performance.
The following benchmarks have been tested:
For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping
task in each, and then move an additional task through each cgroup in
turn.
Here is a test result:
N Loop orig - Time(s) hash - Time(s)
----------------------------------------------
1 10000 1.201231728 1.196311177
5 2000 1.065743872 1.040566424
10 1000 0.991054735 0.986876440
50 200 0.976554203 0.969608733
100 100 0.998504680 0.969218270
500 20 1.157347764 0.962602963
1000 10 1.619521852 1.085140172
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trigger callback can be used to receive a kick-up from the user space. The
string written is ignored.
The cftype->private is used for multiplexing events.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race between create_proc_entry() and the assignment of file ops.
proc_create() is invented to fix it.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is called by cgroup_init() and cgroup_init_early() only, which are
annotated with __init.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes some filesystem boilerplate from the CFS cgroup subsystem.
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These patches add cgroups read_s64 and write_s64 control file methods (the
signed equivalent of read_u64/write_u64) and use them to implement the
cpu.rt_runtime_us control file in the CFS cgroup subsystem.
This patch:
These are the signed equivalents of the read_u64/write_u64 methods
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "releasable" control file provided by the cgroup framework exports the
state of a per-cgroup flag that's related to the notify-on-release feature.
This isn't really generally useful, unless you're trying to debug this
particular feature of cgroups.
This patch moves the "releasable" file to the cgroup_debug subsystem.
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds a new type of supported control file representation, a map from strings
to u64 values.
Each map entry is printed as a line in a similar format to /proc/vmstat, i.e.
"$key $value\n"
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many of the cpusets control files are simple integer values, which don't
require the overhead of memory allocations for reads and writes.
Move the handlers for these control files into cpuset_read_u64() and
cpuset_write_u64().
[akpm@linux-foundation.org: ad dmissing `break']
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the need for people to remember to pass the -n flag to echo when
writing values to cgroup control files.
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds a function for returning the value of a resource counter member, in a
form suitable for use in a cgroup read_u64 control file method.
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several people have justifiably complained that the "_uint" suffix is
inappropriate for functions that handle u64 values, so this patch just renames
all these functions and their users to have the suffic _u64.
[peterz@infradead.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every file should include the headers containing the externs its global
functions (in this case for ns_cgroup_clone()).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a code warning: symbol 'p' shadows an earlier one
This is a reincarnation of Harvey Harrison's patch:
cpuset: sparse warnings in cpuset.c
Independently, Cliff Wickman moved the affected code,
from kernel/cpuset.c to kernel/cgroup.c, in his patch:
cpusets: update_cpumask revision
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Cliff Wickman <cpw@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the needlessly global cgroup_enable_task_cg_lists() static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make eCryptfs key module subsystem respect namespaces.
Since I will be removing the netlink interface in a future patch, I just made
changes to the netlink.c code so that it will not break the build. With my
recent patches, the kernel module currently defaults to the device handle
interface rather than the netlink interface.
[akpm@linux-foundation.org: export free_user_ns()]
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to the rcupreempt.h WARN_ON trigged, I got 2G syslog file. For some
serious complaining of kernel, we need repeat the warnings, so here I isolate
the ratelimit part of printk.c to a standalone file.
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following an experimental deletion of the unnecessary directive
#include <linux/slab.h>
from the header file <linux/percpu.h>, these files under kernel/ were exposed
as needing to include one of <linux/slab.h> or <linux/gfp.h>, so explicit
includes were added where necessary.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From the POV of synchronization, there should be no need to call
wake_up_process() with the 'kthread_create_lock' being held.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warnings:
WARNING: vmlinux.o(.text+0xc60): Section mismatch in reference from the function kvm_init() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0x33869a): Section mismatch in reference from the function xfs_icsb_init_counters() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0x5556a1): Section mismatch in reference from the function acpi_processor_install_hotplug_notify() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0xfe6b28): Section mismatch in reference from the function cpufreq_register_driver() to the function .cpuinit.text:register_cpu_notifier()
register_cpu_notifier() are only really defined when HOTPLUG_CPU is enabled.
So references to the function are OK.
Annotate it with __ref so we do not get warnings from callers and do not get
warnings for the functions/data used by register_cpu_notifier().
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warnings:
WARNING: vmlinux.o(.text+0x75c8d): Section mismatch in reference from the function take_cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75d2a): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75d4d): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75de4): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75e33): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
cpu_down is only used from code surrounded by HOTPLUG_CPU so any references to
__cpuinit is OK.
Add a few __ref to tech modpost to ignore the references.
This is just papering over the fact that the cpu hotplug code is fragile with
respect to use of HOTPLUG_CPU and in many cases rely on __cpuinit to get rid
of code when HOTPLUG_CPU is not enabled. For now this is the least invasive
change.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warning:
WARNING: vmlinux.o(.text+0x75f4e): Section mismatch in reference from the function unregister_cpu_notifier() to the variable .cpuinit.data:cpu_chain
We know that unregister_cpu_notifier is using HOTPLUG_CPU
stuff - so ignore these references.
Annotating unregister_cpu_notifier had been another option
but this caused far more warnings since not all callers were
annotated __cpuinit.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the RUSAGE_THREAD option for the getrusage system call. This is
essentially Roland's patch from http://lkml.org/lkml/2008/1/18/589, but the
line about RUSAGE_LWP line has been removed, as suggested by Ulrich and
Christoph.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel is sent to tainted within the warn_on_slowpath() function, and
whenever a warning occurs the new taint flag 'W' is set. This is useful to
know if a warning occurred before a BUG by preserving the warning as a flag
in the taint state.
This does not work on architectures where WARN_ON has its own definition.
These archs are:
1. s390
2. superh
3. avr32
4. parisc
The maintainers of these architectures have been added in the Cc: list
in this email to alert them to the situation.
The documentation in oops-tracing.txt has been updated to include the
new flag.
Signed-off-by: Nur Hussein <nurhussein@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>