Commit graph

449434 commits

Author SHA1 Message Date
John Stultz
8195460626 printk: disable preemption for printk_sched
An earlier change in -mm (printk: remove separate printk_sched
buffers...), removed the printk_sched irqsave/restore lines since it was
safe for current users.  Since we may be expanding usage of
printk_sched(), disable preepmtion for this function to make it more
generally safe to call.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:17 -07:00
Steven Rostedt
458df9fd48 printk: remove separate printk_sched buffers and use printk buf instead
To prevent deadlocks with doing a printk inside the scheduler,
printk_sched() was created.  The issue is that printk has a console_sem
that it can grab and release.  The release does a wake up if there's a
task pending on the sem, and this wake up grabs the rq locks that is
held in the scheduler.  This leads to a possible deadlock if the wake up
uses the same rq as the one with the rq lock held already.

What printk_sched() does is to save the printk write in a per cpu buffer
and sets the PRINTK_PENDING_SCHED flag.  On a timer tick, if this flag is
set, the printk() is done against the buffer.

There's a couple of issues with this approach.

1) If two printk_sched()s are called before the tick, the second one
   will overwrite the first one.

2) The temporary buffer is 512 bytes and is per cpu.  This is a quite a
   bit of space wasted for something that is seldom used.

In order to remove this, the printk_sched() can use the printk buffer
instead, and delay the console_trylock()/console_unlock() to the queued
work.

Because printk_sched() would then be taking the logbuf_lock, the
logbuf_lock must not be held while doing anything that may call into the
scheduler functions, which includes wake ups.  Unfortunately, printk()
also has a console_sem that it uses, and on release, the up(&console_sem)
may do a wake up of any pending waiters.  This must be avoided while
holding the logbuf_lock.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:17 -07:00
Jan Kara
939f04bec1 printk: enable interrupts before calling console_trylock_for_printk()
We need interrupts disabled when calling console_trylock_for_printk()
only so that cpu id we pass to can_use_console() remains valid (for
other things console_sem provides all the exclusion we need and
deadlocks on console_sem due to interrupts are impossible because we use
down_trylock()).  However if we are rescheduled, we are guaranteed to
run on an online cpu so we can easily just get the cpu id in
can_use_console().

We can lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock() but OTOH
it can somewhat reduce interrupt latency caused by console_unlock()
especially since later in the patch series we will want to spin on
console_sem in console_trylock_for_printk().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:17 -07:00
Jan Kara
bd8d7cf5b8 printk: fix lockdep instrumentation of console_sem
Printk calls mutex_acquire() / mutex_release() by hand to instrument
lockdep about console_sem.  However in some corner cases the
instrumentation is missing.  Fix the problem by creating helper functions
for locking / unlocking console_sem which take care of lockdep
instrumentation as well.

Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Fabio Estevam <festevam@gmail.com>
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
Tested-By: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:16 -07:00
Jan Kara
608873cacb printk: release lockbuf_lock before calling console_trylock_for_printk()
There's no reason to hold lockbuf_lock when entering
console_trylock_for_printk().

The first thing this function does is to call down_trylock(console_sem)
and if that fails it immediately unlocks lockbuf_lock.  So lockbuf_lock
isn't needed for that branch.  When down_trylock() succeeds, the rest of
console_trylock() is OK without lockbuf_lock (it is called without it
from other places), and the only remaining thing in
console_trylock_for_printk() is can_use_console() call.  For that call
console_sem is enough (it iterates all consoles and checks CON_ANYTIME
flag).

So we drop logbuf_lock before entering console_trylock_for_printk() which
simplifies the code.

[akpm@linux-foundation.org: fix have_callable_console() comment]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:16 -07:00
Jan Kara
ca1d432ad8 printk: remove outdated comment
Comment about interesting interlocking between lockbuf_lock and
console_sem is outdated.

It was added in 2002 by commit a880f45a48be during conversion of
console_lock to console_sem + lockbuf_lock.

At that time release_console_sem() (today's equivalent is
console_unlock()) was indeed using lockbuf_lock to avoid races between
trylock on console_sem in printk() and unlock of console_sem.  However
these days the interlocking is gone and the races are avoided by
rechecking logbuf state after releasing console_sem.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:16 -07:00
Petr Mladek
034633ccb2 printk: return really stored message length
I wonder if anyone uses printk return value but it is there and should be
counted correctly.

This patch modifies log_store() to return the number of really stored
bytes from the 'text' part.  Also it handles the return value in
vprintk_emit().

Note that log_store() is used also in cont_flush() but we could ignore the
return value there.  The function works with characters that were already
counted earlier.  In addition, the store could newer fail here because the
length of the printed text is limited by the "cont" buffer and "dict" is
NULL.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:16 -07:00
Petr Mladek
55bd53a4eb printk: shrink too long messages
We might want to print at least part of too long messages and add some
warning for debugging purpose.

The question is how long the shrunken message should be.  If we use the
whole buffer, it might get rotated too soon.  Let's try to use only 1/4 of
the buffer for now.

Also shrink the whole dictionary.  We do not want to parse it or break it
in the middle of some pair of values.  It would not cause any real harm
but still.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:16 -07:00
Petr Mladek
85c8704302 printk: split message size computation
We will want to recompute the message size when shrinking too long
messages.  Let's put the code into separate function.

The side effect of setting "pad_len" is not nice but it is worth removing
the code duplication.  Note that I will probably have one more usage for
this function when handling messages safe way in NMI context.

This patch does not change the existing behavior.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:16 -07:00
Petr Mladek
f40e4b9f70 printk: ignore too long messages
There was no check for too long messages.  The check for free space always
passed when first_seq and next_seq were equal.  Enough free space was not
guaranteed, though.

log_store() might be called to store messages up to 64kB + 64kB + 16B.
This is sum of maximal text_len, dict_len values, and the size of the
structure printk_log.

On the other hand, the minimal size for the main log buffer currently is
4kB and it is enforced only by Kconfig.

The good news is that the usage looks safe right now.  log_store() is
called only from vprintk_emit() and cont_flush().  Here the "text" part is
always passed via a static buffer and the length is limited to
LOG_LINE_MAX which is 1024.  The "dict" part is NULL in most cases.  The
only exceptions is when vprintk_emit() is called from printk_emit() and
dev_vprintk_emit().  But printk_emit() is currently used only in
devkmsg_writev() and here "dict" is NULL as well.  In dev_vprintk_emit(),
"dict" is limited by the static buffer "hdr" of the size 128 bytes.  It
meas that the current maximal printed text is 1024B + 128B + 16B and it
always fit the log buffer.

But it is only matter of time when someone calls printk_emit() with unsafe
parameters, especially the "dict" one.

This patch adds a check for the free space when the buffer is empty.  It
reuses the already existing log_has_space() function but it has to add an
extra parameter.  It defines whether the buffer is empty.  Note that the
same values of "first_idx" and "next_idx" might also mean that the buffer
is full.

If the buffer is empty, we must respect the current position of the
indexes.  We cannot reset them to the beginning of the buffer.  Otherwise,
the functions reading the buffer would get crazy.

The question is what to do when the message is too long.  This patch uses
the easiest solution and just ignores the problematic message.  Let's do
something better in a followup patch.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:16 -07:00
Petr Mladek
0a581694ab printk: split code for making free space in the log buffer
The check for free space in the log buffer always passes when "first_seq"
and "next_seq" are equal.  In theory, it might cause writing outside of
the log buffer.

Fortunately, the current usage looks safe because the used "text" and
"dict" buffers are quite limited.  See the second patch for more details.

Anyway, it is better to be on the safe side and add a check.  An easy
solution is done in the 2nd patch and it is improved in the 4th patch.

5th patch fixes the computation of the printed message length.

1st and 3rd patches just do some code refactoring to make the other
patches easier.

This patch (of 5):

There will be needed some fixes in the check for free space.  They will be
easier if the code is moved outside of the quite long log_store()
function.

This patch does not change the existing behavior.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:16 -07:00
Kirill A. Shutemov
b300a4ea66 kernel/user.c: drop unused field 'files' from user_struct
Nobody seems uses it for a long time. Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:16 -07:00
Fabian Frederick
b51dbec68c kernel/hung_task.c: convert simple_strtoul to kstrtouint
sysctl_hung_task_panic has been changed to unsigned int.  use kstrtouint
instead of obsolete simple_strtoul

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:15 -07:00
Fabian Frederick
95583e4ab5 kernel/utsname_sysctl.c: replace obsolete __initcall by device_initcall
Also fixes checkpatch warnings on proc_dostring function parameters

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:15 -07:00
Fabian Frederick
616feab753 kernel/reboot.c: convert simple_strtoul to kstrtoint
Replace obsolete function.
kstrtoint is used as reboot_cpu is an integer.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:15 -07:00
Fabian Frederick
6c5a53c670 kernel/res_counter.c: replace simple_strtoull by kstrtoull
[akpm@linux-foundation.org: don't overwrite kstrtoull()'s errno]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:15 -07:00
Fabian Frederick
cac92ba74f kernel/tracepoint.c: kernel-doc fixes
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:15 -07:00
Fabian Frederick
cf25004069 kernel/stop_machine.c: kernel-doc warning fix
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:15 -07:00
Fabian Frederick
eaa1809b90 kernel/latencytop.c: convert seq_printf to seq_puts
This patch also fixes one function declaration over 80 characters.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:15 -07:00
Fabian Frederick
b9e5db6d2b kernel/exec_domain.c: code clean-up
Fix checkpatch warnings about EXPORT_SYMBOL and return()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:15 -07:00
Fabian Frederick
a6c8c6902c kernel/capability.c: code clean-up
- EXPORT_SYMBOL

- typo: unexpectidly->unexpectedly

- function prototype over 80 characters

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:15 -07:00
Fabian Frederick
462b29b856 kernel/backtracetest.c: replace no level printk by pr_info()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:14 -07:00
Fabian Frederick
84117da5b7 kernel/cpu.c: convert printk to pr_foo()
no level printk converted to pr_warn (if err)
no level printk converted to pr_info (disabling non-boot cpus)
Other printk converted to respective level.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:14 -07:00
James Hogan
2c0d259e0e compiler.h: avoid sparse errors in __compiletime_error_fallback()
Usually, BUG_ON and friends aren't even evaluated in sparse, but recently
compiletime_assert_atomic_type() was added, and that now results in a
sparse warning every time it is used.

The reason turns out to be the temporary variable, after it sparse no
longer considers the value to be a constant, and results in a warning and
an error.  The error is the more annoying part of this as it suppresses
any further warnings in the same file, hiding other problems.

Unfortunately the condition cannot be simply expanded out to avoid the
temporary variable since it breaks compiletime_assert on old versions of
GCC such as GCC 4.2.4 which the latest metag compiler is based on.

Therefore #ifndef __CHECKER__ out the __compiletime_error_fallback which
uses the potentially negative size array to trigger a conditional compiler
error, so that sparse doesn't see it.

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Luciano Coelho <luciano.coelho@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:14 -07:00
Fabian Frederick
00f01791e1 fs/exportfs/expfs.c: kernel-doc warning fixes
Fixing 2 typo in function comments.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:14 -07:00
Fabian Frederick
e37dcbfbb2 fs/efivarfs/super.c: use static const for dentry_operations
...like other filesystems.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Matthew Garrett <matthew.garrett@nebula.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:14 -07:00
Fabian Frederick
f6187769da sys_sgetmask/sys_ssetmask: add CONFIG_SGETMASK_SYSCALL
sys_sgetmask and sys_ssetmask are obsolete system calls no longer
supported in libc.

This patch replaces architecture related __ARCH_WANT_SYS_SGETMAX by expert
mode configuration.That option is enabled by default for those
architectures.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:14 -07:00
Eric Dumazet
72d09633c9 mm/zswap: NUMA aware allocation for zswap_dstmem
zswap_dstmem is a percpu block of memory, which should be allocated using
kmalloc_node(), to get better NUMA locality.

Without it, all the blocks are allocated from a single node.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Seth Jennings <sjennings@variantweb.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:14 -07:00
Minchan Kim
d867f203b9 mm/zsmalloc: make zsmalloc module-buildable
Now, we can build zsmalloc as module because unmap_kernel_range was
exported.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:14 -07:00
Minchan Kim
93ef6d6ca1 mm/vmalloc.c: export unmap_kernel_range()
zsmalloc needs exported unmap_kernel_range for building as a module.  See
https://lkml.org/lkml/2013/1/18/487

I didn't send a patch to make unmap_kernel_range exportable at that time
because zram was staging stuff and I thought VM function exporting for
staging stuff makes no sense.

Now zsmalloc was promoted.  If we can't build zsmalloc as module, it means
we can't build zram as module, either.  Additionally, buddy map_vm_area is
already exported so let's export unmap_kernel_range to help his buddy.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:14 -07:00
Weijie Yang
7eb52512a9 zsmalloc: fixup trivial zs size classes value in comments
According to calculation, ZS_SIZE_CLASSES value is 255 on systems with 4K
page size, not 254.  The old value may forget count the ZS_MIN_ALLOC_SIZE
in.

This patch fixes this trivial issue in the comments.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:13 -07:00
Fabian Frederick
50417c5556 mm/zbud.c: make size unsigned like unique callsite
zbud_alloc is only called by zswap_frontswap_store with unsigned int len.
Change function parameter + update >= 0 check.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:13 -07:00
Weijie Yang
38515c7339 zram: correct offset usage in zram_bio_discard
We want to skip the physical block(PAGE_SIZE) which is partially covered
by the discard bio, so we check the remaining size and subtract it if
there is a need to goto the next physical block.

The current offset usage in zram_bio_discard is incorrect, it will cause
its upper filesystem breakdown.  Consider the following scenario:

On some architecture or config, PAGE_SIZE is 64K for example, filesystem
is set up on zram disk without PAGE_SIZE aligned, a discard bio leads to a
offset = 4K and size=72K, normally, it should not really discard any
physical block as it partially cover two physical blocks.  However, with
the current offset usage, it will discard the second physical block and
free its memory, which will cause filesystem breakdown.

This patch corrects the offset usage in zram_bio_discard.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:13 -07:00
Hugh Dickins
2a7a0e0fdc mm, memcg: periodically schedule when emptying page list
mem_cgroup_force_empty_list() can iterate a large number of pages on an
lru and mem_cgroup_move_parent() doesn't return an errno unless certain
criteria, none of which indicate that the iteration may be taking too
long, is met.

We have encountered the following stack trace many times indicating
"need_resched set for > 51000020 ns (51 ticks) without schedule", for
example:

	scheduler_tick()
	<timer irq>
	mem_cgroup_move_account+0x4d/0x1d5
	mem_cgroup_move_parent+0x8d/0x109
	mem_cgroup_reparent_charges+0x149/0x2ba
	mem_cgroup_css_offline+0xeb/0x11b
	cgroup_offline_fn+0x68/0x16b
	process_one_work+0x129/0x350

If this iteration is taking too long, we still need to do cond_resched()
even when an individual page is not busy.

[rientjes@google.com: changelog]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:13 -07:00
Denys Vlasenko
4a0da71b96 Documentation/sysctl/vm.txt: clarify vfs_cache_pressure description
Existing description is worded in a way which almost encourages setting of
vfs_cache_pressure above 100, possibly way above it.

Users are left in a dark what this numeric value is - an int?  a
percentage?  what the scale is?

As a result, we are getting reports about noticeable performance
degradation from users who have set vfs_cache_pressure to ridiculously
high values - because they thought there is no downside to it.

Via code inspection it's obvious that this value is treated as a
percentage.  This patch changes text to reflect this fact, and adds a
cautionary paragraph advising against setting vfs_cache_pressure sky high.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:13 -07:00
Naoya Horiguchi
3ba08129e3 mm/memory-failure.c: support use of a dedicated thread to handle SIGBUS(BUS_MCEERR_AO)
Currently memory error handler handles action optional errors in the
deferred manner by default.  And if a recovery aware application wants
to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
However, such signal can be sent only to the main thread, so it's
problematic if the application wants to have a dedicated thread to
handler such signals.

So this patch adds dedicated thread support to memory error handler.  We
have PF_MCE_EARLY flags for each thread separately, so with this patch
AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
thread.  If you want to implement a dedicated thread, you call prctl()
to set PF_MCE_EARLY on the thread.

Memory error handler collects processes to be killed, so this patch lets
it check PF_MCE_EARLY flag on each thread in the collecting routines.

No behavioral change for all non-early kill cases.

Tony said:

: The old behavior was crazy - someone with a multithreaded process might
: well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
: that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
: that thread wasn't the main thread for the process.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Kamil Iskra <iskra@mcs.anl.gov>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Cc: <stable@vger.kernel.org>	[3.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:13 -07:00
Tony Luck
74614de17d mm/memory-failure.c: don't let collect_procs() skip over processes for MF_ACTION_REQUIRED
When Linux sees an "action optional" machine check (where h/w has reported
an error that is not in the current execution path) we generally do not
want to signal a process, since most processes do not have a SIGBUS
handler - we'd just prematurely terminate the process for a problem that
they might never actually see.

task_early_kill() decides whether to consider a process - and it checks
whether this specific process has been marked for early signals with
"prctl", or if the system administrator has requested early signals for
all processes using /proc/sys/vm/memory_failure_early_kill.

But for MF_ACTION_REQUIRED case we must not defer.  The error is in the
execution path of the current thread so we must send the SIGBUS
immediatley.

Fix by passing a flag argument through collect_procs*() to
task_early_kill() so it knows whether we can defer or must take action.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Cc: <stable@vger.kernel.org>	[3.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:13 -07:00
Tony Luck
a70ffcac74 mm/memory-failure.c-failure: send right signal code to correct thread
When a thread in a multi-threaded application hits a machine check because
of an uncorrectable error in memory - we want to send the SIGBUS with
si.si_code = BUS_MCEERR_AR to that thread.  Currently we fail to do that
if the active thread is not the primary thread in the process.
collect_procs() just finds primary threads and this test:

	if ((flags & MF_ACTION_REQUIRED) && t == current) {

will see that the thread we found isn't the current thread and so send a
si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
thread at this time).

We can fix this by checking whether "current" shares the same mm with the
process that collect_procs() said owned the page.  If so, we send the
SIGBUS to current (with code BUS_MCEERR_AR).

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Cc: <stable@vger.kernel.org>	[3.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:13 -07:00
Jianyu Zhan
d2f3102838 mm/page-writeback.c: remove outdated comment
There is an orphaned prehistoric comment , which used to be against
get_dirty_limits(), the dawn of global_dirtyable_memory().

Back then, the implementation of get_dirty_limits() is complicated and
full of magic numbers, so this comment is necessary.  But we now use the
clear and neat global_dirtyable_memory(), which renders this comment
ambiguous and useless.  Remove it.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:13 -07:00
Chen Yucong
50088c4409 mm/swapfile.c: delete the "last_in_cluster < scan_base" loop in the body of scan_swap_map()
Via commit ebc2a1a691 ("swap: make cluster allocation per-cpu"), we
can find that all SWP_SOLIDSTATE "seek is cheap"(SSD case) has already
gone to si->cluster_info scan_swap_map_try_ssd_cluster() route.  So that
the "last_in_cluster < scan_base" loop in the body of scan_swap_map()
has already become a dead code snippet, and it should have been deleted.

This patch is to delete the redundant loop as Hugh and Shaohua
suggested.

[hughd@google.com: fix comment, simplify code]
Signed-off-by: Chen Yucong <slaoub@gmail.com>
Cc: Shaohua Li <shli@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:12 -07:00
Naoya Horiguchi
100873d7a7 hugetlb: rename hugepage_migration_support() to ..._supported()
We already have a function named hugepages_supported(), and the similar
name hugepage_migration_support() is a bit unconfortable, so let's rename
it hugepage_migration_supported().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:12 -07:00
Kirill A. Shutemov
1fdb412bd8 mm: document do_fault_around() feature
Some clarification on how faultaround works.

[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:12 -07:00
Kirill A. Shutemov
a9b0f8618d mm: nominate faultaround area in bytes rather than page order
There is evidencs that the faultaround feature is less relevant on
architectures with page size bigger then 4k.  Which makes sense since page
fault overhead per byte of mapped area should be less there.

Let's rework the feature to specify faultaround area in bytes instead of
page order.  It's 64 kilobytes for now.

The patch effectively disables faultaround on architectures with page size
>= 64k (like ppc64).

It's possible that some other size of faultaround area is relevant for a
platform.  We can expose `fault_around_bytes' variable to arch-specific
code once such platforms will be found.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:12 -07:00
Zhang Zhen
7d018176e6 mm/page_alloc.c: cleanup add_active_range() related comments
add_active_range() has been repalced by memblock_set_node().  Clean up the
comments to comply with that change.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:12 -07:00
Konstantin Khlebnikov
daa5ba768b mm/rmap.c: cleanup ttu_flags
Transform action part of ttu_flags into individiual bits.  These flags
aren't part of any uses-space visible api or even trace events.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:12 -07:00
Konstantin Khlebnikov
3d92860f97 mm/rmap.c: don't call mmu_notifier_invalidate_page() during munlock
In its munmap mode, try_to_unmap_one() searches other mlocked vmas, it
never unmaps pages.  There is no reason for invalidation because ptes are
left unchanged.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:12 -07:00
Konstantin Khlebnikov
226b4ccdcb mm/process_vm_access: move config option into init/Kconfig
CONFIG_CROSS_MEMORY_ATTACH adds couple syscalls: process_vm_readv and
process_vm_writev, it's a kind of IPC for copying data between processes.
Currently this option is placed inside "Processor type and features".

This patch moves it into "General setup" (where all other arch-independed
syscalls and ipc features are placed) and changes prompt string to less
cryptic.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:12 -07:00
Mel Gorman
1a501907bb mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY
Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
ensured that file/anon lists were scanned proportionally for reclaim from
kswapd but ignored it for direct reclaim.  The intent was to minimse
direct reclaim latency but Yuanhan Liu pointer out that it substitutes one
long stall for many small stalls and distorts aging for normal workloads
like streaming readers/writers.  Hugh Dickins pointed out that a
side-effect of the same commit was that when one LRU list dropped to zero
that the entirety of the other list was shrunk leading to excessive
reclaim in memcgs.  This patch scans the file/anon lists proportionally
for direct reclaim to similarly age page whether reclaimed by kswapd or
direct reclaim but takes care to abort reclaim if one LRU drops to zero
after reclaiming the requested number of pages.

Based on ext4 and using the Intel VM scalability test

                                              3.15.0-rc5            3.15.0-rc5
                                                shrinker            proportion
Unit  lru-file-readonce    elapsed      5.3500 (  0.00%)      5.4200 ( -1.31%)
Unit  lru-file-readonce time_range      0.2700 (  0.00%)      0.1400 ( 48.15%)
Unit  lru-file-readonce time_stddv      0.1148 (  0.00%)      0.0536 ( 53.33%)
Unit lru-file-readtwice    elapsed      8.1700 (  0.00%)      8.1700 (  0.00%)
Unit lru-file-readtwice time_range      0.4300 (  0.00%)      0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv      0.1650 (  0.00%)      0.0971 ( 41.16%)

The test cases are running multiple dd instances reading sparse files. The results are within
the noise for the small test machine. The impact of the patch is more noticable from the vmstats

                            3.15.0-rc5  3.15.0-rc5
                              shrinker  proportion
Minor Faults                     35154       36784
Major Faults                       611        1305
Swap Ins                           394        1651
Swap Outs                         4394        5891
Allocation stalls               118616       44781
Direct pages scanned           4935171     4602313
Kswapd pages scanned          15921292    16258483
Kswapd pages reclaimed        15913301    16248305
Direct pages reclaimed         4933368     4601133
Kswapd efficiency                  99%         99%
Kswapd velocity             670088.047  682555.961
Direct efficiency                  99%         99%
Direct velocity             207709.217  193212.133
Percentage direct scans            23%         22%
Page writes by reclaim        4858.000    6232.000
Page writes file                   464         341
Page writes anon                  4394        5891

Note that there are fewer allocation stalls even though the amount
of direct reclaim scanning is very approximately the same.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:12 -07:00
Tim Chen
d23da150a3 fs/superblock: avoid locking counting inodes and dentries before reclaiming them
We remove the call to grab_super_passive in call to super_cache_count.
This becomes a scalability bottleneck as multiple threads are trying to do
memory reclamation, e.g.  when we are doing large amount of file read and
page cache is under pressure.  The cached objects quickly got reclaimed
down to 0 and we are aborting the cache_scan() reclaim.  But counting
creates a log jam acquiring the sb_lock.

We are holding the shrinker_rwsem which ensures the safety of call to
list_lru_count_node() and s_op->nr_cached_objects.  The shrinker is
unregistered now before ->kill_sb() so the operation is safe when we are
doing unmount.

The impact will depend heavily on the machine and the workload but for a
small machine using postmark tuned to use 4xRAM size the results were

                                  3.15.0-rc5            3.15.0-rc5
                                     vanilla         shrinker-v1r1
Ops/sec Transactions         21.00 (  0.00%)       24.00 ( 14.29%)
Ops/sec FilesCreate          39.00 (  0.00%)       44.00 ( 12.82%)
Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
Ops/sec DataRead/MB          25.97 (  0.00%)       29.10 ( 12.05%)
Ops/sec DataWrite/MB         49.99 (  0.00%)       56.02 ( 12.06%)

ffsb running in a configuration that is meant to simulate a mail server showed

                                 3.15.0-rc5             3.15.0-rc5
                                    vanilla          shrinker-v1r1
Ops/sec readall           9402.63 (  0.00%)      9567.97 (  1.76%)
Ops/sec create            4695.45 (  0.00%)      4735.00 (  0.84%)
Ops/sec delete             173.72 (  0.00%)       179.83 (  3.52%)
Ops/sec Transactions     14271.80 (  0.00%)     14482.81 (  1.48%)
Ops/sec Read                37.00 (  0.00%)        37.60 (  1.62%)
Ops/sec Write               18.20 (  0.00%)        18.30 (  0.55%)

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:11 -07:00
Dave Chinner
28f2cd4f6d fs/superblock: unregister sb shrinker before ->kill_sb()
This series is aimed at regressions noticed during reclaim activity.  The
first two patches are shrinker patches that were posted ages ago but never
merged for reasons that are unclear to me.  I'm posting them again to see
if there was a reason they were dropped or if they just got lost.  Dave?
Time?  The last patch adjusts proportional reclaim.  Yuanhan Liu, can you
retest the vm scalability test cases on a larger machine?  Hugh, does this
work for you on the memcg test cases?

Based on ext4, I get the following results but unfortunately my larger
test machines are all unavailable so this is based on a relatively small
machine.

postmark
                                  3.15.0-rc5            3.15.0-rc5
                                     vanilla       proportion-v1r4
Ops/sec Transactions         21.00 (  0.00%)       25.00 ( 19.05%)
Ops/sec FilesCreate          39.00 (  0.00%)       45.00 ( 15.38%)
Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
Ops/sec DataRead/MB          25.97 (  0.00%)       30.02 ( 15.59%)
Ops/sec DataWrite/MB         49.99 (  0.00%)       57.78 ( 15.58%)

ffsb (mail server simulator)
                                 3.15.0-rc5             3.15.0-rc5
                                    vanilla        proportion-v1r4
Ops/sec readall           9402.63 (  0.00%)      9805.74 (  4.29%)
Ops/sec create            4695.45 (  0.00%)      4781.39 (  1.83%)
Ops/sec delete             173.72 (  0.00%)       177.23 (  2.02%)
Ops/sec Transactions     14271.80 (  0.00%)     14764.37 (  3.45%)
Ops/sec Read                37.00 (  0.00%)        38.50 (  4.05%)
Ops/sec Write               18.20 (  0.00%)        18.50 (  1.65%)

dd of a large file
                                3.15.0-rc5            3.15.0-rc5
                                   vanilla       proportion-v1r4
WallTime DownloadTar       75.00 (  0.00%)       61.00 ( 18.67%)
WallTime DD               423.00 (  0.00%)      401.00 (  5.20%)
WallTime Delete             2.00 (  0.00%)        5.00 (-150.00%)

stutter (times mmap latency during large amounts of IO)

                            3.15.0-rc5            3.15.0-rc5
                               vanilla       proportion-v1r4
Unit >5ms Delays  80252.0000 (  0.00%)  81523.0000 ( -1.58%)
Unit Mmap min         8.2118 (  0.00%)      8.3206 ( -1.33%)
Unit Mmap mean       17.4614 (  0.00%)     17.2868 (  1.00%)
Unit Mmap stddev     24.9059 (  0.00%)     34.6771 (-39.23%)
Unit Mmap max      2811.6433 (  0.00%)   2645.1398 (  5.92%)
Unit Mmap 90%        20.5098 (  0.00%)     18.3105 ( 10.72%)
Unit Mmap 93%        22.9180 (  0.00%)     20.1751 ( 11.97%)
Unit Mmap 95%        25.2114 (  0.00%)     22.4988 ( 10.76%)
Unit Mmap 99%        46.1430 (  0.00%)     43.5952 (  5.52%)
Unit Ideal  Tput     85.2623 (  0.00%)     78.8906 (  7.47%)
Unit Tput min        44.0666 (  0.00%)     43.9609 (  0.24%)
Unit Tput mean       45.5646 (  0.00%)     45.2009 (  0.80%)
Unit Tput stddev      0.9318 (  0.00%)      1.1084 (-18.95%)
Unit Tput max        46.7375 (  0.00%)     46.7539 ( -0.04%)

This patch (of 3):

We will like to unregister the sb shrinker before ->kill_sb().  This will
allow cached objects to be counted without call to grab_super_passive() to
update ref count on sb.  We want to avoid locking during memory
reclamation especially when we are skipping the memory reclaim when we are
out of cached objects.

This is safe because grab_super_passive does a try-lock on the
sb->s_umount now, and so if we are in the unmount process, it won't ever
block.  That means what used to be a deadlock and races we were avoiding
by using grab_super_passive() is now:

        shrinker                        umount

        down_read(shrinker_rwsem)
                                        down_write(sb->s_umount)
                                        shrinker_unregister
                                          down_write(shrinker_rwsem)
                                            <blocks>
        grab_super_passive(sb)
          down_read_trylock(sb->s_umount)
            <fails>
        <shrinker aborts>
        ....
        <shrinkers finish running>
        up_read(shrinker_rwsem)
                                          <unblocks>
                                          <removes shrinker>
                                          up_write(shrinker_rwsem)
                                        ->kill_sb()
                                        ....

So it is safe to deregister the shrinker before ->kill_sb().

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:11 -07:00