Commit graph

16149 commits

Author SHA1 Message Date
Alex Shi
333bb864f1 sched/debug: Remove CONFIG_FAIR_GROUP_SCHED mask
Now that we are using runnable load avg in sched balance, we don't
need to keep it under CONFIG_FAIR_GROUP_SCHED.

Also align the code style to #ifdef instead of #if defined() and
reorder the tg output info.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Cc: pjt@google.com
Cc: kamalesh@linux.vnet.ibm.com
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1372417835-4698-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-28 13:17:17 +02:00
Rafael J. Wysocki
e52cff8bdd Merge branch 'pm-assorted'
* pm-assorted:
  PM / QoS: Add pm_qos and dev_pm_qos to events-power.txt
  PM / QoS: Add dev_pm_qos_request tracepoints
  PM / QoS: Add pm_qos_request tracepoints
  PM / QoS: Add pm_qos_update_target/flags tracepoints
  PM / QoS: Update Documentation/power/pm_qos_interface.txt
  PM / Sleep: Print last wakeup source on failed wakeup_count write
  PM / QoS: correct the valid range of pm_qos_class
  PM / wakeup: Adjust messaging for wake events during suspend
  PM / Runtime: Update .runtime_idle() callback documentation
  PM / Runtime: Rework the "runtime idle" helper routine
  PM / Hibernate: print physical addresses consistently with other parts of kernel
2013-06-28 13:01:40 +02:00
Rafael J. Wysocki
207bc1181b Merge branch 'freezer'
* freezer:
  af_unix: use freezable blocking calls in read
  sigtimedwait: use freezable blocking call
  nanosleep: use freezable blocking call
  futex: use freezable blocking call
  select: use freezable blocking call
  epoll: use freezable blocking call
  binder: use freezable blocking calls
  freezer: add new freezable helpers using freezer_do_not_count()
  freezer: convert freezable helpers to static inline where possible
  freezer: convert freezable helpers to freezer_do_not_count()
  freezer: skip waking up tasks with PF_FREEZER_SKIP set
  freezer: shorten freezer sleep time using exponential backoff
  lockdep: check that no locks held at freeze time
  lockdep: remove task argument from debug_check_no_locks_held
  freezer: add unsafe versions of freezable helpers for CIFS
  freezer: add unsafe versions of freezable helpers for NFS
2013-06-28 13:00:53 +02:00
Thomas Gleixner
ccc414f839 genirq: Add the generic chip to the genirq docbook
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
2013-06-28 12:56:04 +02:00
Fabio Estevam
d55f0cc4c9 genirq: generic-chip: Export some irq_gc_ functions
When building imx_v6_v7_defconfig with imx-drm drivers selected as
modules, we get the following build errors:

ERROR: "irq_gc_mask_clr_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
ERROR: "irq_gc_mask_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
ERROR: "irq_gc_ack_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!

Export the required functions to avoid this problem.

Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Cc: shawn.guo@linaro.org
Cc: kernel@pengutronix.de
Link: http://lkml.kernel.org/r/1372389789-7048-1-git-send-email-festevam@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-28 12:56:04 +02:00
Ben Hutchings
2779db8d37 genirq: Fix can_request_irq() for IRQs without an action
Commit 02725e7471 ('genirq: Use irq_get/put functions'),
inadvertently changed can_request_irq() to return 0 for IRQs that have
no action.  This causes pcibios_lookup_irq() to select only IRQs that
already have an action with IRQF_SHARED set, or to fail if there are
none.  Change can_request_irq() to return 1 for IRQs that have no
action (if the first two conditions are met).

Reported-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is>
Tested-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is> (against 3.2)
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: 709647@bugs.debian.org
Cc: stable@vger.kernel.org # 2.6.39+
Link: http://bugs.debian.org/709647
Link: http://lkml.kernel.org/r/1372383630.23847.40.camel@deadeye.wl.decadent.org.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-28 12:56:03 +02:00
Kamalesh Babulal
add332a152 sched/debug: Fix formatting of /proc/<PID>/sched
This patch alters format string's width, to align all statistics
at par with the longest struct sched_statistic member name under
/proc/<PID>/sched.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/20130627165005.GA15583@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-28 11:13:35 +02:00
Tejun Heo
0ce6cba357 cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options
1672d04070 ("cgroup: fix cgroupfs_root early destruction path")
introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of
subsys binding on a new root; however, this broke remounts.
cgroup_remount() doesn't allow changing root options via remount and
CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots,
makes the function reject all remounts.

Fix it by putting the options part in the lower 16 bits of root->flags
and masking the comparions.  While at it, make cgroup_remount() emit
an error message explaining why it's rejecting a remount request, so
that it's less of a mystery.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-27 19:37:26 -07:00
Tejun Heo
e2bd416f62 cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts()
eb178d0633 ("cgroup: grab cgroup_mutex in
drop_parsed_module_refcounts()") made drop_parsed_module_refcounts()
grab cgroup_mutex to make lockdep assertion in for_each_subsys()
happy.  Unfortunately, cgroup_remount() calls the function while
holding cgroup_mutex in its failure path leading to the following
deadlock.

# mount -t cgroup -o remount,memory,blkio cgroup blkio

 cgroup: option changes via remount are deprecated (pid=525 comm=mount)

 =============================================
 [ INFO: possible recursive locking detected ]
 3.10.0-rc4-work+ #1 Not tainted
 ---------------------------------------------
 mount/525 is trying to acquire lock:
  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0

 but task is already holding lock:
  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200

 other info that might help us debug this:
  Possible unsafe locking scenario:

	CPU0
	----
   lock(cgroup_mutex);
   lock(cgroup_mutex);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by mount/525:
  #0:  (&type->s_umount_key#30){+.+...}, at: [<ffffffff811e9a0d>] do_mount+0x2bd/0xa30
  #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff8110e4d3>] cgroup_remount+0x43/0x200
  #2:  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
  #3:  (cgroup_root_mutex){+.+.+.}, at: [<ffffffff8110e4ef>] cgroup_remount+0x5f/0x200

 stack backtrace:
 CPU: 2 PID: 525 Comm: mount Not tainted 3.10.0-rc4-work+ #1
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  ffffffff829651f0 ffff88000ec2fc28 ffffffff81c24bb1 ffff88000ec2fce8
  ffffffff810f420d 0000000000000006 0000000000000001 0000000000000056
  ffff8800153b4640 ffff880000000000 ffffffff81c2e468 ffff8800153b4640
 Call Trace:
  [<ffffffff81c24bb1>] dump_stack+0x19/0x1b
  [<ffffffff810f420d>] __lock_acquire+0x15dd/0x1e60
  [<ffffffff810f531c>] lock_acquire+0x9c/0x1f0
  [<ffffffff81c2a805>] mutex_lock_nested+0x65/0x410
  [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
  [<ffffffff8110e63e>] cgroup_remount+0x1ae/0x200
  [<ffffffff811c9bb2>] do_remount_sb+0x82/0x190
  [<ffffffff811e9d41>] do_mount+0x5f1/0xa30
  [<ffffffff811ea203>] SyS_mount+0x83/0xc0
  [<ffffffff81c2fb82>] system_call_fastpath+0x16/0x1b

Fix it by moving the drop_parsed_module_refcounts() invocation outside
cgroup_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-27 19:37:23 -07:00
Shuah Khan
9dceefe483 PM / Sleep: Warn about system time after resume with pm_trace
pm_trace uses the system's Real Time Clock (RTC) to save the magic
number.  The reason for this is that the RTC is the only reliably
available piece of hardware during resume operations where a value
can be set that will survive a reboot.

Consequence is that after a resume (even if it is successful) your
system clock will have a value corresponding to the magic number
instead of the correct date/time!  It is therefore advisable to use
a program like ntp-date or rdate to reset the correct date/time from
an external time source when using this trace option.

There is no run-time message to warn users of the consequences of
enabling pm_trace.  Adding a warning message to pm_trace_store()
will serve as a reminder to users to set the system date and time
after resume.

Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-27 22:14:46 +02:00
Kamalesh Babulal
0fc576d592 sched/fair: Fix typo describing flags in enqueue_entity
Fix spelling of 'calling' in description of se flags in
enqueue_entity().

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/20130627055418.GA18582@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:18:25 +02:00
Kamalesh Babulal
939fd731eb sched/debug: Add load-tracking statistics to task
At present we print per-entity load-tracking statistics for
cfs_rq of cgroups/runqueues. Given that per task statistics
is maintained, it can be used to know the contribution made
by the task to its parenting cfs_rq level.

This patch adds per-task load-tracking statistics to /proc/<PID>/sched.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130625080336.GA20175@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:46 +02:00
Alex Shi
a9dc5d0e33 sched: Change get_rq_runnable_load() to static and inline
Based-on-patch-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-14-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:44 +02:00
Alex Shi
a9cef46a10 sched/tg: Remove tg.load_weight
Since no one use it.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-13-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:43 +02:00
Alex Shi
2509940fd7 sched/cfs_rq: Change atomic64_t removed_load to atomic_long_t
Similar to runnable_load_avg, blocked_load_avg variable, long type is
enough for removed_load in 64 bit or 32 bit machine.

Then we avoid the expensive atomic64 operations on 32 bit machine.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-12-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:41 +02:00
Alex Shi
bf5b986ed4 sched/tg: Use 'unsigned long' for load variable in task group
Since tg->load_avg is smaller than tg->load_weight, we don't need a
atomic64_t variable for load_avg in 32 bit machine.
The same reason for cfs_rq->tg_load_contrib.

The atomic_long_t/unsigned long variable type are more efficient and
convenience for them.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-11-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:40 +02:00
Alex Shi
72a4cf20cb sched: Change cfs_rq load avg to unsigned long
Since the 'u64 runnable_load_avg, blocked_load_avg' in cfs_rq struct are
smaller than 'unsigned long' cfs_rq->load.weight. We don't need u64
vaiables to describe them. unsigned long is more efficient and convenience.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-10-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:38 +02:00
Alex Shi
a003a25b22 sched: Consider runnable load average in move_tasks()
Aside from using runnable load average in background, move_tasks is
also the key function in load balance. We need consider the runnable
load average in it in order to make it an apple to apple load
comparison.

Morten had caught a div u64 bug on ARM, thanks!

Thanks-to: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-8-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:36 +02:00
Alex Shi
b92486cbf2 sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task
They are the base values in load balance, update them with rq runnable
load average, then the load balance will consider runnable load avg
naturally.

We also try to include the blocked_load_avg as cpu load in balancing,
but that cause kbuild performance drop 6% on every Intel machine, and
aim7/oltp drop on some of 4 CPU sockets machines.
Or only add blocked_load_avg into get_rq_runable_load, hackbench still
drop a little on NHM EX.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-7-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:35 +02:00
Alex Shi
83dfd5235e sched: Update cpu load after task_tick
To get the latest runnable info, we need do this cpuload update after
task_tick.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-6-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:33 +02:00
Alex Shi
282cf499f0 sched: Fix sleep time double accounting in enqueue entity
The woken migrated task will __synchronize_entity_decay(se); in
migrate_task_rq_fair, then it needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20' before
update_entity_load_avg, in order to avoid sleep time is updated twice
for se.avg.load_avg_contrib in both __syncchronize and
update_entity_load_avg.

However if the sleeping task is woken up from the same cpu, it miss
the last_runnable_update before update_entity_load_avg(se, 0, 1), then
the sleep time was used twice in both functions.  So we need to remove
the double sleep time accounting.

Paul also contributed some code comments in this commit.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-5-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:32 +02:00
Alex Shi
a75cdaa915 sched: Set an initial value of runnable avg for new forked task
We need to initialize the se.avg.{decay_count, load_avg_contrib} for a
new forked task. Otherwise random values of above variables cause a
mess when a new task is enqueued:

    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

and make fork balancing imbalance due to incorrect load_avg_contrib.

Further more, Morten Rasmussen notice some tasks were not launched at
once after created. So Paul and Peter suggest giving a start value for
new task runnable avg time same as sched_slice().

PeterZ said:

> So the 'problem' is that our running avg is a 'floating' average; ie. it
> decays with time. Now we have to guess about the future of our newly
> spawned task -- something that is nigh impossible seeing these CPU
> vendors keep refusing to implement the crystal ball instruction.
>
> So there's two asymptotic cases we want to deal well with; 1) the case
> where the newly spawned program will be 'nearly' idle for its lifetime;
> and 2) the case where its cpu-bound.
>
> Since we have to guess, we'll go for worst case and assume its
> cpu-bound; now we don't want to make the avg so heavy adjusting to the
> near-idle case takes forever. We want to be able to quickly adjust and
> lower our running avg.
>
> Now we also don't want to make our avg too light, such that it gets
> decremented just for the new task not having had a chance to run yet --
> even if when it would run, it would be more cpu-bound than not.
>
> So what we do is we make the initial avg of the same duration as that we
> guess it takes to run each task on the system at least once -- aka
> sched_slice().
>
> Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
> case it should be good enough.

Paul also contributed most of the code comments in this commit.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Paul Turner <pjt@google.com>
[peterz; added explanation of sched_slice() usage]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-4-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:30 +02:00
Alex Shi
fa6bddeb14 sched: Move a few runnable tg variables into CONFIG_SMP
The following 2 variables are only used under CONFIG_SMP, so its
better to move their definiation into CONFIG_SMP too.

        atomic64_t load_avg;
        atomic_t runnable_avg;

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-3-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:29 +02:00
Alex Shi
141965c749 Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
we can use runnable load variables.

Also remove 2 CONFIG_FAIR_GROUP_SCHED setting which is not in reverted
patch(introduced in 9ee474f), but also need to revert.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51CA76A3.3050207@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27 10:07:22 +02:00
Linus Torvalds
54faf77d06 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Three small fixlets"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot()
  hw_breakpoint: Fix cpu check in task_bp_pinned(cpu)
  kprobes: Fix arch_prepare_kprobe to handle copy insn failures
2013-06-26 08:51:44 -10:00
Tejun Heo
a4ea1cc906 cgroup: always use RCU accessors for protected accesses
kernel/cgroup.c still has places where a RCU pointer is set and
accessed directly without going through RCU_INIT_POINTER() or
rcu_dereference_protected().  They're all properly protected accesses
so nothing is broken but it leads to spurious sparse RCU address space
warnings.

Substitute direct accesses with RCU_INIT_POINTER() and
rcu_dereference_protected().  Note that %true is specified as the
extra condition for all derference updates.  This isn't ideal as all
it does is suppressing warning without actually policing
synchronization rules; however, most are scheduled to be removed
pretty soon along with css_id itself, so no reason to be more
elaborate.

Combined with the previous changes, this removes all RCU related
sparse warnings from cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by; Li Zefan <lizefan@huawei.com>
2013-06-26 10:48:48 -07:00
Tejun Heo
a8ad805cfd cgroup: fix RCU accesses around task->cgroups
There are several places in kernel/cgroup.c where task->cgroups is
accessed and modified without going through proper RCU accessors.
None is broken as they're all lock protected accesses; however, this
still triggers sparse RCU address space warnings.

* Consistently use task_css_set() for task->cgroups dereferencing.

* Use RCU_INIT_POINTER() to clear task->cgroups to &init_css_set on
  exit.

* Remove unnecessary rcu_dereference_raw() from cset->subsys[]
  dereference in cgroup_exit().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-26 10:48:38 -07:00
Tejun Heo
eb178d0633 cgroup: grab cgroup_mutex in drop_parsed_module_refcounts()
This isn't strictly necessary as all subsystems specified in
@subsys_mask are guaranteed to be pinned; however, it does spuriously
trigger lockdep warning.  Let's grab cgroup_mutex around it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-26 10:40:10 -07:00
Tejun Heo
1672d04070 cgroup: fix cgroupfs_root early destruction path
cgroupfs_root used to have ->actual_subsys_mask in addition to
->subsys_mask.  a8a648c4ac ("cgroup: remove
cgroup->actual_subsys_mask") removed it noting that the subsys_mask is
essentially temporary and doesn't belong in cgroupfs_root; however,
the patch made it impossible to tell whether a cgroupfs_root actually
has the subsystems bound or just have the bits set leading to the
following BUG when trying to mount with subsystems which are already
mounted elsewhere.

 kernel BUG at kernel/cgroup.c:1038!
 invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 ...
 CPU: 1 PID: 7973 Comm: mount Tainted: G        W    3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105
 task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000
 RIP: 0010:[<ffffffff81249b29>]  [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0
 ...
 Call Trace:
  [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210
  [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90
  [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0
  [<ffffffff81257169>] cpuset_mount+0xd9/0x110
  [<ffffffff813d2580>] mount_fs+0xb0/0x2d0
  [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180
  [<ffffffff814070b5>] do_new_mount+0x145/0x2c0
  [<ffffffff814085d6>] do_mount+0x356/0x3c0
  [<ffffffff8140873d>] SyS_mount+0xfd/0x140
  [<ffffffff854eb600>] tracesys+0xdd/0xe2

We still want rebind_subsystems() to take added/removed masks, so
let's fix it by marking whether a cgroupfs_root has finished binding
or not.  Also, document what's going on around ->subsys_mask
initialization so that similar mistakes aren't repeated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-26 10:39:46 -07:00
Daniel Vetter
2301002769 mutex: Add w/w mutex slowpath debugging
Injects EDEADLK conditions at pseudo-random interval, with
exponential backoff up to UINT_MAX (to ensure that every lock
operation still completes in a reasonable time).

This way we can test the wound slowpath even for ww mutex users
where contention is never expected, and the ww deadlock
avoidance algorithm is only needed for correctness against
malicious userspace. An example would be protecting kernel
modesetting properties, which thanks to single-threaded X isn't
really expected to contend, ever.

I've looked into using the CONFIG_FAULT_INJECTION
infrastructure, but decided against it for two reasons:

- EDEADLK handling is mandatory for ww mutex users and should
  never affect the outcome of a syscall. This is in contrast to -ENOMEM
  injection. So fine configurability isn't required.

- The fault injection framework only allows to set a simple
  probability for failure. Now the probability that a ww mutex acquire
  stage with N locks will never complete (due to too many injected
  EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock
  operations for the completely uncontended case would be O(exp(N)).
  The per-acuiqire ctx exponential backoff solution choosen here only
  results in O(log N) overhead due to injection and so O(log N * N)
  lock operations. This way we can fail with high probability (and so
  have good test coverage even for fancy backoff and lock acquisition
  paths) without running into patalogical cases.

Note that EDEADLK will only ever be injected when we managed to
acquire the lock. This prevents any behaviour changes for users
which rely on the EALREADY semantics.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: rostedt@goodmis.org
Cc: daniel@ffwll.ch
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-26 12:10:56 +02:00
Maarten Lankhorst
040a0a3710 mutex: Add support for wound/wait style locks
Wound/wait mutexes are used when other multiple lock
acquisitions of a similar type can be done in an arbitrary
order. The deadlock handling used here is called wait/wound in
the RDBMS literature: The older tasks waits until it can acquire
the contended lock. The younger tasks needs to back off and drop
all the locks it is currently holding, i.e. the younger task is
wounded.

For full documentation please read Documentation/ww-mutex-design.txt.

References: https://lwn.net/Articles/548909/
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Rob Clark <robdclark@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: rostedt@goodmis.org
Cc: daniel@ffwll.ch
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/51C8038C.9000106@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-26 12:10:56 +02:00
Maarten Lankhorst
a41b56efa7 arch: Make __mutex_fastpath_lock_retval return whether fastpath succeeded or not
This will allow me to call functions that have multiple
arguments if fastpath fails. This is required to support ticket
mutexes, because they need to be able to pass an extra argument
to the fail function.

Originally I duplicated the functions, by adding
__mutex_fastpath_lock_retval_arg. This ended up being just a
duplication of the existing function, so a way to test if
fastpath was called ended up being better.

This also cleaned up the reservation mutex patch some by being
able to call an atomic_set instead of atomic_xchg, and making it
easier to detect if the wrong unlock function was previously
used.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: robclark@gmail.com
Cc: rostedt@goodmis.org
Cc: daniel@ffwll.ch
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130620113105.4001.83929.stgit@patser
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-26 12:10:55 +02:00
JunweiZhang
d0667186eb kernel: remove unnecessary head file
ip_vs.h is not necessary for sysctl_binary.c.

prepare for the next patch to avoid compile issue.

Signed-off-by: JunweiZhang <junwei.zhang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26 18:01:46 +09:00
Ingo Molnar
ca02c21674 Better comments so we understand our existing machine check
bank bitmaps - prelude to adding another bitmap soon.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRygQkAAoJEKurIx+X31iB5RsP/R6w/X4wbKlXD8K8e2qhSXU2
 oYVtaMN46M+xLRdNDdE1Y+clVKDQTuapfrlOtBLZVHIkNfqF5jAtu2O1wTERpx+F
 7lDHuWuFbddbqV2vqE6RJ6ZqsUTsrlzrqLSLWGSoHn36DlnkSm+L5vDDG75QzV+w
 fPpGpJTM3kRxPUHnmwvfniJxTiBjsli6FDZ1vGq4Ingu+xxOJDU6+FdWts08hn4Y
 +KLjs1JMJSdo3di05T6t4ARKBgW3B1lJnrIS6fGYMmax+kHQqO/zvzHs3A86LZyN
 7L0toEoZHa8VRMN6C1mw0XsFwPmyMOsrHrRpBlFgHpC7QLAzx6LkxchyDBqPL0mo
 W4bnoyu1QZUvblq1mrsnJa9yeyMjSHAeq2XTBj8Pbv20AKzmCbNl3aJ5tCDhPj4Y
 A+e8vk/tq8zWCjdf3SV/D+wjgEtkeWALZZj70maufu9AKtKXy5repa6DZCKEhyIC
 yh72c3gZ25IkxjwcHmJh+CYaKll9MgdW5toZkftBin2iOdWSaHS9QX2r4I5Awvbi
 m0Iwq5Fg8DSs3AhBLxiJi8L7N+RNa+7GoTH0z6UDtfAY3ExvepfwjPzLxAfTbPw6
 oK2wLfuPiNizl2DX1yNlizGD2YSRiWKoap3Iog84A3IMTIe1nJ+97GHvsVcGna8I
 zXjnZDNQThAgBswtNY1s
 =U/Kg
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-mce-bitmap-comment' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras

Pull MCE updates from Tony Luck:

 "Better comments so we understand our existing machine check
  bank bitmaps - prelude to adding another bitmap soon."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-26 10:53:45 +02:00
Colin Cross
88c8004fd3 futex: Use freezable blocking call
Avoid waking up every thread sleeping in a futex_wait call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Signed-off-by: Colin Cross <ccross@android.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: arve@android.com
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-25 23:11:19 +02:00
Zhang Yi
13d60f4b6a futex: Take hugepages into account when generating futex_key
The futex_keys of process shared futexes are generated from the page
offset, the mapping host and the mapping index of the futex user space
address. This should result in an unique identifier for each futex.

Though this is not true when futexes are located in different subpages
of an hugepage. The reason is, that the mapping index for all those
futexes evaluates to the index of the base page of the hugetlbfs
mapping. So a futex at offset 0 of the hugepage mapping and another
one at offset PAGE_SIZE of the same hugepage mapping have identical
futex_keys. This happens because the futex code blindly uses
page->index.

Steps to reproduce the bug:

1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
   and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
   mapping.

   The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
   PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
   their keys solely depend on the user space address.

2. Lock mutex1 and mutex2

3. Create thread1 and in the thread function lock mutex1, which
   results in thread1 blocking on the locked mutex1.

4. Create thread2 and in the thread function lock mutex2, which
   results in thread2 blocking on the locked mutex2.

5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
   still blocks on mutex2 because the futex_key points to mutex1.

To solve this issue we need to take the normal page index of the page
which contains the futex into account, if the futex is in an hugetlbfs
mapping. In other words, we calculate the normal page mapping index of
the subpage in the hugetlbfs mapping.

Mappings which are not based on hugetlbfs are not affected and still
use page->index.

Thanks to Mel Gorman who provided a patch for adding proper evaluation
functions to the hugetlbfs code to avoid exposing hugetlbfs specific
details to the futex code.

[ tglx: Massaged changelog ]

Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn>
Reviewed-by: 'Mel Gorman' <mgorman@suse.de>
Acked-by: 'Darren Hart' <dvhart@linux.intel.com>
Cc: 'Peter Zijlstra' <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-25 23:11:19 +02:00
Tejun Heo
fc76df7061 cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchy
Before 1a57423166 ("cgroup: make hierarchy_id use cyclic idr"),
hierarchy IDs were allocated from 0.  As the dummy hierarchy was
always the one first initialized, it got assigned 0 and all other
hierarchies from 1.  The patch accidentally changed the minimum
useable ID to 2.

Let's restore ID 0 for dummy_root and while at it reserve 1 for
unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
2013-06-25 11:53:37 -07:00
Tejun Heo
30159ec7a9 cgroup: implement for_each_[builtin_]subsys()
There are quite a few places where all loaded [builtin] subsys are
iterated.  Implement for_each_[builtin_]subsys() and replace manual
iterations with those to simplify those places a bit.  The new
iterators automatically skip NULL subsystems.  This shouldn't cause
any functional difference.

Iteration loops which scan all subsystems and then skipping modular
ones explicitly are converted to use for_each_builtin_subsys().

While at it, reorder variable declarations and adjust whitespaces a
bit in the affected functions.

v2: Add lockdep_assert_held() in for_each_subsys() and add comments
    about synchronization as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-25 11:53:37 -07:00
Tejun Heo
82fe9b0da0 cgroup: move init_css_set initialization inside cgroup_mutex
cgroup_init() was doing init_css_set initialization outside
cgroup_mutex, which is fine but we want to add lockdep annotation on
subsystem iterations and cgroup_init() will trigger it spuriously.
Move init_css_set initialization inside cgroup_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-25 11:53:37 -07:00
Javier Martinez Canillas
fbab62c5cd irqdomain: Use irq_get_trigger_type() to get IRQ flags
Use irq_get_trigger_type() to get the IRQ trigger type flags
instead calling irqd_get_trigger_type(irq_desc_get_irq_data(virq))

Signed-off-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@linux-mips.org
Link: http://lkml.kernel.org/r/1371228049-27080-8-git-send-email-javier.martinez@collabora.co.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-25 11:48:25 +02:00
Tejun Heo
5549c49791 cgroup: s/for_each_subsys()/for_each_root_subsys()/
for_each_subsys() walks over subsystems attached to a hierarchy and
we're gonna add iterators which walk over all available subsystems.
Rename for_each_subsys() to for_each_root_subsys() so that it's more
appropriately named and for_each_subsys() can be used to iterate all
subsystems.

While at it, remove unnecessary underbar prefix from macro arguments,
put them inside parentheses, and adjust indentation for the two
for_each_*() macros.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24 15:21:48 -07:00
Tejun Heo
b326f9d0db cgroup: clean up find_css_set() and friends
find_css_set() passes uninitialized on-stack template[] array to
find_existing_css_set() which sets the entries for all subsystems.
Passing around an uninitialized array is a bit icky and we want to
introduce an iterator which only iterates loaded subsystems.  Let's
initialize it on definition.

While at it, also make the following cosmetic cleanups.

* Convert to proper /** comments.

* Reorder variable declarations.

* Replace comment on synchronization with lockdep_assert_held().

This patch doesn't make any functional differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24 15:21:48 -07:00
Tejun Heo
a8a648c4ac cgroup: remove cgroup->actual_subsys_mask
cgroup curiously has two subsystem masks, ->subsys_mask and
->actual_subsys_mask.  The latter only exists because the new target
subsys_mask is passed into rebind_subsystems() via @root>subsys_mask.
rebind_subsystems() needs to know what the current mask is to decide
how to reach the target mask so ->actual_subsys_mask is used as the
temp location to remember the current state.

Adding a temporary field to a permanent data structure is rather silly
and can be misleading.  Update rebind_subsystems() to take @added_mask
and @removed_mask instead and remove @root->actual_subsys_mask.

This patch shouldn't introduce any behavior changes.

v2: Comment and description updated as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24 15:21:47 -07:00
Tejun Heo
9871bf9550 cgroup: prefix global variables with "cgroup_"
Global variable names in kernel/cgroup.c are asking for trouble -
subsys, roots, rootnode and so on.  Rename them to have "cgroup_"
prefix.

* s/subsys/cgroup_subsys/

* s/rootnode/cgroup_dummy_root/

* s/dummytop/cgroup_cummy_top/

* s/roots/cgroup_roots/

* s/root_count/cgroup_root_count/

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24 15:21:47 -07:00
Greg Kroah-Hartman
b5aef682e0 Merge 3.10-rc7 into driver-core-next
We want the firmware merge fixes, and other bits, in here now.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-24 15:14:43 -07:00
Stephen Boyd
70e5975d3a clockevents: Prefer CPU local devices over global devices
On an SMP system with only one global clockevent and a dummy
clockevent per CPU we run into problems. We want the dummy
clockevents to be registered as the per CPU tick devices, but
we can only achieve that if we register the dummy clockevents
before the global clockevent or if we artificially inflate the
rating of the dummy clockevents to be higher than the rating
of the global clockevent. Failure to do so leads to boot
hangs when the dummy timers are registered on all other CPUs
besides the CPU that accepted the global clockevent as its tick
device and there is no broadcast timer to poke the dummy
devices.

If we're registering multiple clockevents and one clockevent is
global and the other is local to a particular CPU we should
choose to use the local clockevent regardless of the rating of
the device. This way, if the clockevent is a dummy it will take
the tick device duty as long as there isn't a higher rated tick
device and any global clockevent will be bumped out into
broadcast mode, fixing the problem described above.

Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Tested-by: soren.brinkmann@xilinx.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20130613183950.GA32061@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-24 22:27:36 +02:00
James Hogan
6fff831404 genirq: Irqchip: document gcflags arg of irq_alloc_domain_generic_chips
Commit 088f40b7b0 ("genirq: Generic chip:
Add linear irq domain support") missed kerneldoc for the gcflags
argument of irq_alloc_domain_generic_chips(). Add it now.

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Link: http://lkml.kernel.org/r/1371564513-4327-1-git-send-email-james.hogan@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-24 15:38:12 +02:00
Kefeng Wang
798f0fd188 irq: fix checkpatch error
ERROR: space required before the open parenthesis '('
WARNING: Prefer pr_warn(... to pr_warning(...
Just fix above 2 issue.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
2013-06-24 14:02:43 +01:00
Grant Likely
c12d2f42a9 irqdomain: Include hwirq number in /proc/interrupts
Add the hardware interrupt number to the output of /proc/interrupts.
It is often important to have access to the hardware interrupt number because
it identifies exactly how an interrupt signal is wired up to the interrupt
controller.  This is especially important when using irq_domains since irq
numbers get dynamically allocated in that case, and have no relation to the
actual hardware number.

Note: This output is currently conditional on whether or not the irq_domain
pointer is set; however hwirq could still be used without irq_domain.  It
may be worthwhile to always output the hwirq number regardless of the
domain pointer.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Tested-by: Olof Johansson <olof@lixom.net>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-06-24 14:02:42 +01:00
Grant Likely
d3dcb436f6 irqdomain: make irq_linear_revmap() a fast path again
Over the years, irq_linear_revmap() gained tests and checks to make sure
callers were using it safely, which while important, also make it less
of a fast path. After the irqdomain refactoring done recently, it is now
possible to make irq_linear_revmap() a fast path again. This patch moves
irq_linear_revmap() to the header file and makes it a static inline so
that interrupt controller drivers using a linear mapping can decode the
virq from a hwirq in just a couple of instructions.

Signed-off-by: Grant Likely <grant.likely@linaro.org>
2013-06-24 14:02:41 +01:00
Grant Likely
56a3d5ac77 irqdomain: remove irq_domain_generate_simple()
Nobody calls it; remove the function

Signed-off-by: Grant Likely <grant.likely@linaro.org>
2013-06-24 14:02:41 +01:00
Grant Likely
ddaf144c61 irqdomain: Refactor irq_domain_associate_many()
Originally, irq_domain_associate_many() was designed to unwind the
mapped irqs on a failure of any individual association. However, that
proved to be a problem with certain IRQ controllers. Some of them only
support a subset of irqs, and will fail when attempting to map a
reserved IRQ. In those cases we want to map as many IRQs as possible, so
instead it is better for irq_domain_associate_many() to make a
best-effort attempt to map irqs, but not fail if any or all of them
don't succeed. If a caller really cares about how many irqs got
associated, then it should instead go back and check that all of the
irqs is cares about were mapped.

The original design open-coded the individual association code into the
body of irq_domain_associate_many(), but with no longer needing to
unwind associations, the code becomes simpler to split out
irq_domain_associate() to contain the bulk of the logic, and
irq_domain_associate_many() to be a simple loop wrapper.

This patch also adds a new error check to the associate path to make
sure it isn't called for an irq larger than the controller can handle,
and adds locking so that the irq_domain_mutex is held while setting up a
new association.

v3: Fixup missing change to irq_domain_add_tree()
v2: Fixup x86 warning. irq_domain_associate_many() no longer returns an
    error code, but reports errors to the printk log directly. In the
    majority of cases we don't actually want to fail if there is a
    problem, but rather log it and still try to boot the system.

Signed-off-by: Grant Likely <grant.likely@linaro.org>

irqdomain: Fix flubbed irq_domain_associate_many refactoring

commit d39046ec72, "irqdomain: Refactor irq_domain_associate_many()" was
missing the following hunk which causes a boot failure on anything using
irq_domain_add_tree() to allocate an irq domain.

Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
2013-06-24 14:01:42 +01:00
Sahara
ae8822b842 PM / QoS: Add pm_qos_request tracepoints
Adds tracepoints to pm_qos_add_request, pm_qos_update_request,
pm_qos_remove_request, and pm_qos_update_request_timeout.
It's useful for checking pm_qos_class, value, and timeout_us.

Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-24 13:09:03 +02:00
Sahara
247e9ee034 PM / QoS: Add pm_qos_update_target/flags tracepoints
This patch adds tracepoints to pm_qos_update_target and
pm_qos_update_flags. It's useful for checking pm qos action,
previous value and current value.

Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-24 13:09:03 +02:00
Dave Hansen
14c63f17b1 perf: Drop sample rate when sampling is too slow
This patch keeps track of how long perf's NMI handler is taking,
and also calculates how many samples perf can take a second.  If
the sample length times the expected max number of samples
exceeds a configurable threshold, it drops the sample rate.

This way, we don't have a runaway sampling process eating up the
CPU.

This patch can tend to drop the sample rate down to level where
perf doesn't work very well.  *BUT* the alternative is that my
system hangs because it spends all of its time handling NMIs.

I'll take a busted performance tool over an entire system that's
busted and undebuggable any day.

BTW, my suspicion is that there's still an underlying bug here.
Using the HPET instead of the TSC is definitely a contributing
factor, but I suspect there are some other things going on.
But, I can't go dig down on a bug like that with my machine
hanging all the time.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Cc: Dave Hansen <dave@sr71.net>
[ Prettified it a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-23 11:52:57 +02:00
Linus Torvalds
f71194a7d4 Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
 "This series fixes a couple of build failures, and fixes MTRR cleanup
  and memory setup on very specific memory maps.

  Finally, it fixes triggering backtraces on all CPUs, which was
  inadvertently disabled on x86."

* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/efi: Fix dummy variable buffer allocation
  x86: Fix trigger_all_cpu_backtrace() implementation
  x86: Fix section mismatch on load_ucode_ap
  x86: fix build error and kconfig for ia32_emulation and binfmt
  range: Do not add new blank slot with add_range_with_merge
  x86, mtrr: Fix original mtrr range get for mtrr_cleanup
2013-06-21 06:33:48 -10:00
Daniel Lezcano
ea8deb8dfa tick: Fix tick_broadcast_pending_mask not cleared
The recent modification in the cpuidle framework consolidated the
timer broadcast code across the different drivers by setting a new
flag in the idle state. It tells the cpuidle core code to enter/exit
the broadcast mode for the cpu when entering a deep idle state. The
broadcast timer enter/exit is no longer handled by the back-end
driver.

This change made the local interrupt to be enabled *before* calling
CLOCK_EVENT_NOTIFY_EXIT.

On a tegra114, a four cores system, when the flag has been introduced
in the driver, the following warning appeared:

WARNING: at kernel/time/tick-broadcast.c:578 tick_broadcast_oneshot_control
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.10.0-rc3-next-20130529+ #15
[<c00667f8>] (tick_broadcast_oneshot_control+0x1a4/0x1d0) from [<c0065cd0>] (tick_notify+0x240/0x40c)
[<c0065cd0>] (tick_notify+0x240/0x40c) from [<c0044724>] (notifier_call_chain+0x44/0x84)
[<c0044724>] (notifier_call_chain+0x44/0x84) from [<c0044828>] (raw_notifier_call_chain+0x18/0x20)
[<c0044828>] (raw_notifier_call_chain+0x18/0x20) from [<c00650cc>] (clockevents_notify+0x28/0x170)
[<c00650cc>] (clockevents_notify+0x28/0x170) from [<c033f1f0>] (cpuidle_idle_call+0x11c/0x168)
[<c033f1f0>] (cpuidle_idle_call+0x11c/0x168) from [<c000ea94>] (arch_cpu_idle+0x8/0x38)
[<c000ea94>] (arch_cpu_idle+0x8/0x38) from [<c005ea80>] (cpu_startup_entry+0x60/0x134)
[<c005ea80>] (cpu_startup_entry+0x60/0x134) from [<804fe9a4>] (0x804fe9a4)

I don't have the hardware, so I wasn't able to reproduce the warning
but after looking a while at the code, I deduced the following:

 1. the CPU2 enters a deep idle state and sets the broadcast timer

 2. the timer expires, the tick_handle_oneshot_broadcast function is
    called, setting the tick_broadcast_pending_mask and waking up the
    idle cpu CPU2

 3. the CPU2 exits idle handles the interrupt and then invokes
    tick_broadcast_oneshot_control with CLOCK_EVENT_NOTIFY_EXIT which
    runs the following code:

    [...]
    if (dev->next_event.tv64 == KTIME_MAX)
            goto out;

    if (cpumask_test_and_clear_cpu(cpu,
                                 tick_broadcast_pending_mask))
            goto out;
    [...]

    So if there is no next event scheduled for CPU2, we fulfil the
    first condition and jump out without clearing the
    tick_broadcast_pending_mask.

 4. CPU2 goes to deep idle again and calls
    tick_broadcast_oneshot_control with CLOCK_NOTIFY_EVENT_ENTER but
    with the tick_broadcast_pending_mask set for CPU2, triggering the
    warning.

The issue only surfaced due to the modifications of the cpuidle
framework, which resulted in interrupts being enabled before the call
to the clockevents code. If the call happens before interrupts have
been enabled, the warning cannot trigger, because there is still the
event pending which caused the broadcast timer expiry.

Move the check for the next event below the check for the pending bit,
so the pending bit gets cleared whether an event is scheduled on the
cpu or not.

[ tglx: Massaged changelog ]

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reported-and-tested-by: Joseph Lo <josephl@nvidia.com>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/1371485735-31249-1-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-21 13:10:34 +02:00
Julius Werner
bb177fedd3 PM / Sleep: Print last wakeup source on failed wakeup_count write
Commit a938da06 introduced a useful little log message to tell
users/debuggers which wakeup source aborted a suspend.  However,
this message is only printed if the abort happens during the
in-kernel suspend path (after writing /sys/power/state).

The full specification of the /sys/power/wakeup_count facility
allows user-space power managers to double-check if wakeups have
already happened before it actually tries to suspend (e.g. while it
was running user-space pre-suspend hooks), by writing the last known
wakeup_count value to /sys/power/wakeup_count.  This patch changes
the sysfs handler for that node to also print said log message if
that write fails, so that we can figure out the offending wakeup
source for both kinds of suspend aborts.

Signed-off-by: Julius Werner <jwerner@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-21 00:35:12 +02:00
Sahara
d24c2a4f91 PM / QoS: correct the valid range of pm_qos_class
The valid start index for pm_qos_array is not 0, but
PM_QOS_CPU_DMA_LATENCY. There is a null_pm_qos at index 0 of
pm_qos_array.  However, null_pm_qos is not created as misc device so
that inclusion of 0 index for checking pm_qos_class especially for
file operations is not proper here.

[rjw: Changelog, a bit]
Signed-off-by: Sahara <keun-o.park@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-21 00:24:01 +02:00
Linus Torvalds
a3d5c3460a Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two smaller fixes - plus a context tracking tracing fix that is a bit
  bigger"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracing/context-tracking: Add preempt_schedule_context() for tracing
  sched: Fix clear NOHZ_BALANCE_KICK
  sched/x86: Construct all sibling maps if smt
2013-06-20 08:18:35 -10:00
Linus Torvalds
86c76676cf Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Four fixes.  The mmap ones are unfortunately larger than desired -
  fuzzing uncovered bugs that needed perf context life time management
  changes to fix properly"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix broken PEBS-LL support on SNB-EP/IVB-EP
  perf: Fix mmap() accounting hole
  perf: Fix perf mmap bugs
  kprobes: Fix to free gone and unused optprobes
2013-06-20 08:17:36 -10:00
Linus Torvalds
805e318548 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu idle fixes from Thomas Gleixner:
 - Add a missing irq enable. Fallout of the idle conversion
 - Fix stackprotector wreckage caused by the idle conversion

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  idle: Enable interrupts in the weak arch_cpu_idle() implementation
  idle: Add the stack canary init to cpu_startup_entry()
2013-06-20 08:16:07 -10:00
Linus Torvalds
4db88eb4c3 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 - Fix inconstinant clock usage in virtual time accounting
 - Fix a build error in KVM caused by the NOHZ work
 - Remove a pointless timekeeping duty assignment which breaks NOHZ
 - Use a proper notifier return value to avoid random behaviour

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: Remove useless timekeeping duty attribution to broadcast source
  nohz: Fix notifier return val that enforce timekeeping
  kvm: Move guest entry/exit APIs to context_tracking
  vtime: Use consistent clocks among nohz accounting
2013-06-20 08:15:13 -10:00
Oleg Nesterov
bde96030f4 hw_breakpoint: Introduce "struct bp_cpuinfo"
This patch simply moves all per-cpu variables into the new
single per-cpu "struct bp_cpuinfo".

To me this looks more logical and clean, but this can also
simplify the further potential changes. In particular, I do not
think this memory should be per-cpu, it is never used "locally".
After this change it is trivial to turn it into, say,
bootmem[nr_cpu_ids].

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155020.GA6350@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:58:57 +02:00
Oleg Nesterov
e12cbc10cb hw_breakpoint: Simplify *register_wide_hw_breakpoint()
1. register_wide_hw_breakpoint() can use unregister_ if failure,
   no need to duplicate the code.

2. "struct perf_event **pevent" adds the unnecesary lever of
   indirection and complication, use per_cpu(*cpu_events, cpu).

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155018.GA6347@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:58:57 +02:00
Oleg Nesterov
1c10adbb92 hw_breakpoint: Introduce cpumask_of_bp()
Add the trivial helper which simply returns cpumask_of() or
cpu_possible_mask depending on bp->cpu.

Change fetch_bp_busy_slots() and toggle_bp_slot() to always do
for_each_cpu(cpumask_of_bp) to simplify the code and avoid the
code duplication.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155015.GA6340@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:58:56 +02:00
Oleg Nesterov
7ab71f3244 hw_breakpoint: Simplify the "weight" usage in toggle_bp_slot() paths
Change toggle_bp_slot() to make "weight" negative if !enable.
This way we can always use "+ weight" without additional "if
(enable)" check and toggle_bp_task_slot() no longer needs this
arg.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155013.GA6337@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:58:55 +02:00
Oleg Nesterov
e1ebe86203 hw_breakpoint: Simplify list/idx mess in toggle_bp_slot() paths
The enable/disable logic in toggle_bp_slot() is not symmetrical
and imho very confusing. "old_count" in toggle_bp_task_slot() is
actually new_count because this bp was already removed from the
list.

Change toggle_bp_slot() to always call list_add/list_del after
toggle_bp_task_slot(). This way old_idx is task_bp_pinned() and
this entry should be decremented, new_idx is +/-weight and we
need to increment this element. The code/logic looks obvious.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155011.GA6330@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:58:55 +02:00
Ingo Molnar
f070a4dba9 Merge branch 'perf/urgent' into perf/core
Merge in two hw_breakpoint fixes, before applying another 5.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:57:40 +02:00
Oleg Nesterov
c790b0ad23 hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot()
fetch_bp_busy_slots() and toggle_bp_slot() use
for_each_online_cpu(), this is obviously wrong wrt cpu_up() or
cpu_down(), we can over/under account the per-cpu numbers.

For example:

	# echo 0 >> /sys/devices/system/cpu/cpu1/online
	# perf record -e mem:0x10 -p 1 &
	# echo 1 >> /sys/devices/system/cpu/cpu1/online
	# perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10 -C1 -a &
	# taskset -p 0x2 1

triggers the same WARN_ONCE("Can't find any breakpoint slot") in
arch_install_hw_breakpoint().

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20130620155009.GA6327@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:57:01 +02:00
Oleg Nesterov
8b4d801b2b hw_breakpoint: Fix cpu check in task_bp_pinned(cpu)
trinity fuzzer triggered WARN_ONCE("Can't find any breakpoint
slot") in arch_install_hw_breakpoint() but the problem is not
arch-specific.

The problem is, task_bp_pinned(cpu) checks "cpu == iter->cpu"
but this doesn't account the "all cpus" events with iter->cpu <
0.

This means that, say, register_user_hw_breakpoint(tsk) can
happily create the arbitrary number > HBP_NUM of breakpoints
which can not be activated. toggle_bp_task_slot() is equally
wrong by the same reason and nr_task_bp_pinned[] can have
negative entries.

Simple test:

	# perl -e 'sleep 1 while 1' &
	# perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10,mem:0x10 -p `pidof perl`

Before this patch this triggers the same problem/WARN_ON(),
after the patch it correctly fails with -ENOSPC.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20130620155006.GA6324@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20 17:57:00 +02:00
Juri Lelli
52d85d7630 ftrace: Fix stddev calculation in function profiler
When FUNCTION_GRAPH_TRACER is enabled, ftrace can profile kernel functions
and print basic statistics about them. Unfortunately, running stddev
calculation is wrong. This patch corrects it implementing Welford’s method:

        s^2 = 1 / (n * (n-1)) * (n * \Sum (x_i)^2 - (\Sum x_i)^2) .
Link: http://lkml.kernel.org/r/1371031398-24048-1-git-send-email-juri.lelli@gmail.com

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-19 23:32:09 -04:00
zhangwei(Jovi)
195a84d91e tracing/kprobes: Remove unnecessary checking of trace_probe_is_enabled
Since tp->flags assignment was moved into function enable_trace_probe(),
there is no need to use trace_probe_is_enabled to check flags
in the same function.

Remove the unnecessary checking.

Link: http://lkml.kernel.org/r/51BA7B9E.3040807@huawei.com

Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-19 23:32:08 -04:00
Steven Rostedt (Red Hat)
de7edd3145 tracing: Disable tracing on warning
Add a traceoff_on_warning option in both the kernel command line as well
as a sysctl option. When set, any WARN*() function that is hit will cause
the tracing_on variable to be cleared, which disables writing to the
ring buffer.

This is useful especially when tracing a bug with function tracing. When
a warning is hit, the print caused by the warning can flood the trace with
the functions that producing the output for the warning. This can make the
resulting trace useless by either hiding where the bug happened, or worse,
by overflowing the buffer and losing the trace of the bug totally.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-19 23:32:07 -04:00
Steven Rostedt
1a891cf19c tracing: Add binary '&' filter for events
There are some cases when filtering on a set flag of a field of a tracepoint
is useful. But currently the only filtering commands for numbered fields
is ==, !=, <, <=, >, >=. This does not help when you just want to trace if
a specific flag is set. For example:

 > # sudo trace-cmd record -e brcmfmac:brcmf_dbg -f 'level & 0x40000'
 > disable all
 > enable brcmfmac:brcmf_dbg
 > path = /sys/kernel/debug/tracing/events/brcmfmac/brcmf_dbg/enable
 > (level & 0x40000)
 > ^
 > parse_error: Invalid operator
 >

When trying to trace brcmf_dbg when level has its 1 << 18 bit set, the
filter fails to perform.

By allowing a binary '&' operation, this gives the user the ability to
test a bit.

Note, a binary '|' is not added, as it doesn't make sense as fields must
be compared to constants (for now), and ORing a constant will always return
true.

Link: http://lkml.kernel.org/r/1371057385.9844.261.camel@gandalf.local.home

Suggested-by: Arend van Spriel <arend@broadcom.com>
Tested-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-19 23:30:40 -04:00
Ingo Molnar
d908e1ebbc Miscellaneous fixes for ACPI EINJ (error injection) code.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRsQ4tAAoJEKurIx+X31iBAaIP+wf/lGnTtpR6vwRHbrCV2ZXl
 EC/qnxiqoy8qKe3wn0yVtmORNsBDWnMEsl6ubyHFIhkOL9zrTTQAQRWWzTYu5MWO
 4Vl/Z0zgnBmj66TFVCzDMJLOIsx6o/UHg7uiiCawUEQshWd1jYVZ8bEn+iyj36+y
 mNL74yJkkR9s5AJIQWr7lAKcgeTR8tXHCBuuW5BoY/ThKKtajDKqg8/q4HytPhwi
 u18OpTUv01jK9oS2t6wMzf4Hl0jeHt3FySSgQFPBwfcMJxDlPXnbzlO98Out9h12
 VEye9Ya1CIcVYfPhRjMFgdIvCEbK1afXkk2SRivjOzRFokwk9XR2VTMm5E+KeGZc
 bh8JGxZwGqTxZQlWgqAdZusL928LxPPF631+ZmaJUfaJMspf4/25Y2tAFWXuAycr
 foi1kw/KepE956C0cImS5chRaXPzD1+MJa1lIolD2DUZlTVHRx7rlcPIapWLizyd
 Ir/TNzYUY14oGcQNZMTNcj3sWIUaGn9iEGB+ZtHynW813Sby4bDtvhe8h6UHzADV
 mQnvpz/qcKYwEjg8HmtpmGYEioLz/6RxUU1I1VuAyOCIgxFwVHkyWf41+IPBsTqc
 a4cBF0aLXFn4rcmIVlUGACbrog83XLUNrhXfnmLoFgUcA9ljkmY2ydurcQ1GIhcP
 5WAzisHQor+F3GmOFfaD
 =Pywj
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-einj' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras

Pull miscellaneous fixes for ACPI EINJ (error injection) code, from Tony Luck.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 13:54:04 +02:00
Ingo Molnar
b1fe9987b7 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

"The major changes for this series are:

 1.      Simplify RCU's grace-period and callback processing based on
         the new numbering for callbacks.  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/330.

 2.      Documentation updates.  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/348.

 3.      Miscellaneous fixes, including converting a few remaining printk()
         calls to pr_*().  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/324.

 4.      SRCU-related changes and fixes.  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/425.

 5.      Removal of TINY_PREEMPT_RCU in favor of TREE_PREEMPT_RCU for
         single-CPU low-latency systems.  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/427."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 13:48:57 +02:00
Joe Perches
be7002e6c6 sched: Don't mix use of typedef ctl_table and struct ctl_table
Just use struct ctl_table.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371063336.2069.22.camel@joe-AO722
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:48 +02:00
Viresh Kumar
94c95ba69f sched: Remove WARN_ON(!sd) from init_sched_groups_power()
sd can't be NULL in init_sched_groups_power() and so checking it for NULL isn't
useful. In case it is required, then also we need to rearrange the code a bit as
we already accessed invalid pointer sd to get sg: sg = sd->groups.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/2bbe633cd74b431c05253a8ce61fdfd5066a531b.1370948150.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:47 +02:00
Viresh Kumar
cd08e9234c sched: Fix memory leakage in build_sched_groups()
In build_sched_groups() we don't need to call get_group() for cpus
which are already covered in previous iterations. Calling get_group()
would mark the group used and eventually leak it since we wouldn't
connect it and not find it again to free it.

This will happen only in cases where sg->cpumask contained more than
one cpu (For any topology level). This patch would free sg's memory
for all cpus leaving the group leader as the group isn't marked used
now.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/7a61e955abdcbb1dfa9fe493f11a5ec53a11ddd3.1370948150.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:46 +02:00
Viresh Kumar
0936629f01 sched: Use cached value of span instead of calling sched_domain_span()
In the beginning of build_sched_groups() we called sched_domain_span() and
cached its return value in span. Few statements later we are calling it again to
get the same pointer.

Lets use the cached value instead as it hasn't changed in between.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/834ecd507071ad88aff039352dbc7e063dd996a7.1370948150.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:46 +02:00
Viresh Kumar
27723a68ca sched: Create for_each_sd_topology()
For loop for traversing sched_domain_topology was used at multiple placed in
core.c. This patch removes code redundancy by creating for_each_sd_topology().

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/e0e04542f54e9464bd9da54f5ccfe62ec6c4c0bc.1370861520.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:45 +02:00
Viresh Kumar
c75e01288c sched: Don't set sd->child to NULL when it is already NULL
Memory for sd is allocated with kzalloc_node() which will initialize its fields
with zero. In build_sched_domain() we are setting sd->child to child even if
child is NULL, which isn't required.

Lets do it only if child isn't NULL.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/f4753a1730051341003ad2ad29a3229c7356678e.1370861520.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:45 +02:00
Viresh Kumar
1c63216940 sched: Don't initialize alloc_state in build_sched_domains()
alloc_state will be overwritten by __visit_domain_allocation_hell() and so we
don't actually need to initialize alloc_state.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/df57734a075cc5ad130e1ae498702e24f2529ab8.1370861520.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:44 +02:00
Viresh Kumar
22da956953 sched: Optimize build_sched_domains() for saving first SD node for a cpu
We are saving first scheduling domain for a cpu in build_sched_domains() by
iterating over the nested sd->child list. We don't actually need to do it this
way.

tl will be equal to sched_domain_topology for the first iteration and so we can
set *per_cpu_ptr(d.sd, i) based on that.  So, save pointer to first SD while
running the iteration loop over tl's.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/fc473527cbc4dfa0b8eeef2a59db74684eb59a83.1370436120.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:43 +02:00
Viresh Kumar
4a850cbefa sched: Remove unused params of build_sched_domain()
build_sched_domain() never uses parameter struct s_data *d and so passing it is
useless.

Remove it.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/545e0b4536166a15b4475abcafe5ed0db4ad4a2c.1370436120.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:42 +02:00
Viresh Kumar
0a0fca9d83 sched: Rename sched.c as sched/core.c in comments and Documentation
Most of the stuff from kernel/sched.c was moved to kernel/sched/core.c long time
back and the comments/Documentation never got updated.

I figured it out when I was going through sched-domains.txt and so thought of
fixing it globally.

I haven't crossed check if the stuff that is referenced in sched/core.c by all
these files is still present and hasn't changed as that wasn't the motive behind
this patch.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/cdff76a265326ab8d71922a1db5be599f20aad45.1370329560.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:42 +02:00
Michael Wang
8404c90d05 sched: Femove the useless declaration in kernel/sched/fair.c
default_cfs_period(), do_sched_cfs_period_timer(), do_sched_cfs_slack_timer()
already defined previously, no need to declare again.

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51AD8808.7020608@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:41 +02:00
Michael Wang
22b958d8cc sched: Refine the code in unthrottle_cfs_rq()
Directly use rq to save some code.

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51AD87EB.1070605@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:41 +02:00
Kirill Tkhai
e23ee74777 sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list
[ Peter, this is based off of some of my work, I ran it though a few
  tests and it passed. I also reviewed it, and added my SOB as I am
  somewhat a co-author to it. ]

Based on the patch by Steven Rostedt from previous year:

https://lkml.org/lkml/2012/4/18/517

1)Simplify pull_rt_task() logic: search in pushable tasks of dest runqueue.
The only pullable tasks are the tasks which are pushable in their local rq,
and no others.

2)Remove .leaf_rt_rq_list member of struct rt_rq and functions connected
with it: nobody uses it since now.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/287571370557898@web7d.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:58:40 +02:00
Ingo Molnar
d81344c508 Merge branch 'sched/urgent' into sched/core
Merge in fixes before applying ongoing new work.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:55:31 +02:00
Steven Rostedt
29bb9e5a75 tracing/context-tracking: Add preempt_schedule_context() for tracing
Dave Jones hit the following bug report:

 ===============================
 [ INFO: suspicious RCU usage. ]
 3.10.0-rc2+ #1 Not tainted
 -------------------------------
 include/linux/rcupdate.h:771 rcu_read_lock() used illegally while idle!
 other info that might help us debug this:
 RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0
 RCU used illegally from extended quiescent state!
 2 locks held by cc1/63645:
  #0:  (&rq->lock){-.-.-.}, at: [<ffffffff816b39fd>] __schedule+0xed/0x9b0
  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8109d645>] cpuacct_charge+0x5/0x1f0

 CPU: 1 PID: 63645 Comm: cc1 Not tainted 3.10.0-rc2+ #1 [loadavg: 40.57 27.55 13.39 25/277 64369]
 Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
  0000000000000000 ffff88010f78fcf8 ffffffff816ae383 ffff88010f78fd28
  ffffffff810b698d ffff88011c092548 000000000023d073 ffff88011c092500
  0000000000000001 ffff88010f78fd60 ffffffff8109d7c5 ffffffff8109d645
 Call Trace:
  [<ffffffff816ae383>] dump_stack+0x19/0x1b
  [<ffffffff810b698d>] lockdep_rcu_suspicious+0xfd/0x130
  [<ffffffff8109d7c5>] cpuacct_charge+0x185/0x1f0
  [<ffffffff8109d645>] ? cpuacct_charge+0x5/0x1f0
  [<ffffffff8108dffc>] update_curr+0xec/0x240
  [<ffffffff8108f528>] put_prev_task_fair+0x228/0x480
  [<ffffffff816b3a71>] __schedule+0x161/0x9b0
  [<ffffffff816b4721>] preempt_schedule+0x51/0x80
  [<ffffffff816b4800>] ? __cond_resched_softirq+0x60/0x60
  [<ffffffff816b6824>] ? retint_careful+0x12/0x2e
  [<ffffffff810ff3cc>] ftrace_ops_control_func+0x1dc/0x210
  [<ffffffff816be280>] ftrace_call+0x5/0x2f
  [<ffffffff816b681d>] ? retint_careful+0xb/0x2e
  [<ffffffff816b4805>] ? schedule_user+0x5/0x70
  [<ffffffff816b4805>] ? schedule_user+0x5/0x70
  [<ffffffff816b6824>] ? retint_careful+0x12/0x2e
 ------------[ cut here ]------------

What happened was that the function tracer traced the schedule_user() code
that tells RCU that the system is coming back from userspace, and to
add the CPU back to the RCU monitoring.

Because the function tracer does a preempt_disable/enable_notrace() calls
the preempt_enable_notrace() checks the NEED_RESCHED flag. If it is set,
then preempt_schedule() is called. But this is called before the user_exit()
function can inform the kernel that the CPU is no longer in user mode and
needs to be accounted for by RCU.

The fix is to create a new preempt_schedule_context() that checks if
the kernel is still in user mode and if so to switch it to kernel mode
before calling schedule. It also switches back to user mode coming back
from schedule in need be.

The only user of this currently is the preempt_enable_notrace(), which is
only used by the tracing subsystem.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1369423420.6828.226.camel@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:55:10 +02:00
Vincent Guittot
873b4c65b5 sched: Fix clear NOHZ_BALANCE_KICK
I have faced a sequence where the Idle Load Balance was sometime not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 2 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 2 sends a reschedule IPI to CPU 2

 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
performed.

We must wait for the sched softirq to be raised on CPU 2 thanks to another
part the kernel to come back to clear NOHZ_BALANCE_KICK.

The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if
we can't raise the sched_softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1370419991-13870-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:55:09 +02:00
Mischa Jonker
03d8e80beb perf: Add const qualifier to perf_pmu_register's 'name' arg
This allows us to use pdev->name for registering a PMU device.
IMO the name is not supposed to be changed anyway.

Signed-off-by: Mischa Jonker <mjonker@synopsys.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1370339148-5566-1-git-send-email-mjonker@synopsys.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:50:23 +02:00
Stephane Eranian
e712209a9e perf: Fix hypervisor branch sampling permission check
Commit 2b923c8 perf/x86: Check branch sampling priv level in generic code
was missing the check for the hypervisor (HV) priv level, so add it back.

With this patch, we get the following correct behavior:

  # echo 2 >/proc/sys/kernel/perf_event_paranoid

  $ perf record -j any,k noploop 1
  Error:
  You may not have permission to collect stats.
  Consider tweaking /proc/sys/kernel/perf_event_paranoid:
   -1 - Not paranoid at all
    0 - Disallow raw tracepoint access for unpriv
    1 - Disallow cpu events for unpriv
    2 - Disallow kernel profiling for unpriv

   $ perf record -j any,hv noploop 1
   Error:
   You may not have permission to collect stats.
   Consider tweaking /proc/sys/kernel/perf_event_paranoid:
    -1 - Not paranoid at all
     0 - Disallow raw tracepoint access for unpriv
     1 - Disallow cpu events for unpriv
     2 - Disallow kernel profiling for unpriv

Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130606090204.GA3725@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:50:21 +02:00
Ingo Molnar
eff2108f02 Merge branch 'perf/urgent' into perf/core
Merge in the latest fixes, to avoid conflicts with ongoing work.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:44:41 +02:00
Peter Zijlstra
9bb5d40cd9 perf: Fix mmap() accounting hole
Vince's fuzzer once again found holes. This time it spotted a leak in
the locked page accounting.

When an event had redirected output and its close() was the last
reference to the buffer we didn't have a vm context to undo accounting.

Change the code to destroy the buffer on the last munmap() and detach
all redirected events at that time. This provides us the right context
to undo the vm accounting.

Reported-and-tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130604084421.GI8923@twins.programming.kicks-ass.net
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:44:13 +02:00
Li Zefan
03c78cbebb cgroup: rename cont to cgrp
Cont is short for container. control group was named process container
at first, but then people found container already has a meaning in
linux kernel.

Clean up the leftover variable name @cont.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-19 01:22:50 -07:00
Tejun Heo
00356bd5f0 cgroup: clean up cgroup_serial_nr_cursor
cgroup_serial_nr_cursor was created atomic64_t because I thought it
was never gonna used for anything other than assigning unique numbers
to cgroups and didn't want to worry about synchronization; however,
now we're using it as an event-stamp to distinguish cgroups created
before and after certain point which assumes that it's protected by
cgroup_mutex.

Let's make it clear by making it a u64.  Also, rename it to
cgroup_serial_nr_next and make it point to the next nr to allocate so
that where it's pointing to is clear and more conventional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
2013-06-18 11:17:32 -07:00
Li Zefan
e8c82d20a9 cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre()
We used root->allcg_list to iterate cgroup hierarchy because at that time
cgroup_for_each_descendant_pre() hasn't been invented.

tj: In cgroup_cfts_commit(), s/@serial_nr/@update_upto/, move the
    assignment right above releasing cgroup_mutex and explain what's
    going on there.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18 11:14:22 -07:00
Li Zefan
794611a1df cgroup: make serial_nr_cursor available throughout cgroup.c
The next patch will use it to determine if a cgroup is newly created
while we're iterating the cgroup hierarchy.

tj: Rephrased the comment on top of cgroup_serial_nr_cursor.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18 11:14:21 -07:00
Yinghai Lu
0541881502 range: Do not add new blank slot with add_range_with_merge
Joshua reported: Commit cd7b304dfaf1 (x86, range: fix missing merge
during add range) broke mtrr cleanup on his setup in 3.9.5.
corresponding commit in upstream is fbe06b7bae.

The reason is add_range_with_merge could generate blank spot.

We could avoid that by searching new expanded start/end, that
new range should include all connected ranges in range array.
At last add the new expanded start/end to the range array.
Also move up left array so do not add new blank slot in the
range array.

-v2: move left array to avoid enhance add_range()
-v3: include fix from Joshua about memmove declaring when
     DYN_DEBUG is used.

Reported-by: Joshua Covington <joshuacov@googlemail.com>
Tested-by: Joshua Covington <joshuacov@googlemail.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371154622-8929-3-git-send-email-yinghai@kernel.org
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-06-18 11:32:10 -05:00
Li Zefan
f57947d277 cgroup: fix memory leak in cgroup_rm_cftypes()
The memory allocated in cgroup_add_cftypes() should be freed. The
effect of this bug is we leak a bit memory everytime we unload
cfq-iosched module if blkio cgroup is enabled.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18 09:04:30 -07:00
Li Zefan
1c8158eeae cgroup: fix umount vs cgroup_event_remove() race
commit 5db9a4d99b
 Author: Tejun Heo <tj@kernel.org>
 Date:   Sat Jul 7 16:08:18 2012 -0700

     cgroup: fix cgroup hierarchy umount race

This commit fixed a race caused by the dput() in css_dput_fn(), but
the dput() in cgroup_event_remove() can also lead to the same BUG().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2013-06-18 09:04:30 -07:00
Li Zefan
084457f284 cgroup: fix umount vs cgroup_cfts_commit() race
cgroup_cfts_commit() uses dget() to keep cgroup alive after cgroup_mutex
is dropped, but dget() won't prevent cgroupfs from being umounted. When
the race happens, vfs will see some dentries with non-zero refcnt while
umount is in process.

Keep running this:
  mount -t cgroup -o blkio xxx /cgroup
  umount /cgroup

And this:
  modprobe cfq-iosched
  rmmod cfs-iosched

After a while, the BUG() in shrink_dcache_for_umount_subtree() may
be triggered:

  BUG: Dentry xxx{i=0,n=blkio.yyy} still in use (1) [umount of cgroup cgroup]

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2013-06-18 09:04:30 -07:00
Tejun Heo
6db8e85c5c cgroup: disallow rename(2) if sane_behavior
cgroup's rename(2) isn't a proper migration implementation - it can't
move the cgroup to a different parent in the hierarchy.  All it can do
is swapping the name string for that cgroup.  This isn't useful and
can mislead users to think that cgroup supports proper cgroup-level
migration.  Disallow rename(2) if sane_behavior.

v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs
    return value when ->rename is not implemented as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-18 08:14:23 -07:00
Uwe Kleine-König
37074c5a1b irq/generic-chip: fix a few kernel-doc entries
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-06-18 13:38:34 +02:00
Greg Kroah-Hartman
bb07b00be7 Merge 3.10-rc6 into driver-core-next
We want these fixes here too.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-17 16:57:20 -07:00
Stephen Boyd
336ae1180d ARM: sched_clock: Load cycle count after epoch stabilizes
There is a small race between when the cycle count is read from
the hardware and when the epoch stabilizes. Consider this
scenario:

 CPU0                           CPU1
 ----                           ----
 cyc = read_sched_clock()
 cyc_to_sched_clock()
                                 update_sched_clock()
                                  ...
                                  cd.epoch_cyc = cyc;
  epoch_cyc = cd.epoch_cyc;
  ...
  epoch_ns + cyc_to_ns((cyc - epoch_cyc)

The cyc on cpu0 was read before the epoch changed. But we
calculate the nanoseconds based on the new epoch by subtracting
the new epoch from the old cycle count. Since epoch is most likely
larger than the old cycle count we calculate a large number that
will be converted to nanoseconds and added to epoch_ns, causing
time to jump forward too much.

Fix this problem by reading the hardware after the epoch has
stabilized.

Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-06-17 15:56:11 -07:00
Linus Torvalds
d0ff934881 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
 "Several fixes + obvious cleanup (you've missed a couple of open-coded
  can_lookup() back then)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  snd_pcm_link(): fix a leak...
  use can_lookup() instead of direct checks of ->i_op->lookup
  move exit_task_namespaces() outside of exit_notify()
  fput: task_work_add() can fail if the caller has passed exit_task_work()
  ncpfs: fix rmdir returns Device or resource busy
2013-06-14 19:18:56 -10:00
Oleg Nesterov
8aac62706a move exit_task_namespaces() outside of exit_notify()
exit_notify() does exit_task_namespaces() after
forget_original_parent(). This was needed to ensure that ->nsproxy
can't be cleared prematurely, an exiting child we are going to
reparent can do do_notify_parent() and use the parent's (ours) pid_ns.

However, after 32084504 "pidns: use task_active_pid_ns in
do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
on task_active_pid_ns().

Move exit_task_namespaces() from exit_notify() to do_exit(), after
exit_fs() and before exit_task_work().

This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
does fput() which needs task_work_add().

Note: this particular problem can be fixed if we change fput(), and
that change makes sense anyway. But there is another reason to move
the callsite. The original reason for exit_task_namespaces() from
the middle of exit_notify() was subtle and it has already gone away,
now this looks confusing. And this allows us do simplify exit_notify(),
we can avoid unlock/lock(tasklist) and we can use ->exit_state instead
of PF_EXITING in forget_original_parent().

Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-15 05:39:08 +04:00
James Bottomley
29ce3785b2 idle: Enable interrupts in the weak arch_cpu_idle() implementation
PARISC bootup triggers the warning at kernel/cpu/idle.c:96. That's
caused by the weak arch_cpu_idle() implementation, which is provided
to avoid that architectures implement idle_poll over and over.

The switchover to polling mode happens in the first call of the weak
arch_cpu_idle() implementation, but that code fails to reenable
interrupts and therefor triggers the warning.

Fix this by enabling interrupts in the weak arch_cpu_idle() code.

[ tglx: Made the changelog match the patch ]

Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1371236142.2726.43.camel@dabdike
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-14 23:01:05 +02:00
Li Zefan
c9e5fe66f5 cpuset: rename @cont to @cgrp
Cont is short for container. control group was named process container
at first, but then people found container already has a meaning in
linux kernel.

Clean up the leftover variable name @cont.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 20:48:19 -07:00
Tejun Heo
d3daf28da1 cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.

In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.

This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.

The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.

This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.

v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().

    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-13 19:43:12 -07:00
Tejun Heo
2b0e53a7c8 Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.11
This is to receive percpu_refcount which will replace atomic_t
reference count in cgroup_subsys_state.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 19:42:22 -07:00
Tejun Heo
ea15f8ccdb cgroup: split cgroup destruction into two steps
Split cgroup_destroy_locked() into two steps and put the latter half
into cgroup_offline_fn() which is executed from a work item.  The
latter half is responsible for offlining the css's, removing the
cgroup from internal lists, and propagating release notification to
the parent.  The separation is to allow using percpu refcnt for css.

Note that this allows for other cgroup operations to happen between
the first and second halves of destruction, including creating a new
cgroup with the same name.  As the target cgroup is marked DEAD in the
first half and cgroup internals don't care about the names of cgroups,
this should be fine.  A comment explaining this will be added by the
next patch which implements the actual percpu refcnting.

As RCU freeing is guaranteed to happen after the second step of
destruction, we can use the same work item for both.  This patch
renames cgroup->free_work to ->destroy_work and uses it for both
purposes.  INIT_WORK() is now performed right before queueing the work
item.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 19:27:42 -07:00
Tejun Heo
455050d23e cgroup: reorder the operations in cgroup_destroy_locked()
This patch reorders the operations in cgroup_destroy_locked() such
that the userland visible parts happen before css offlining and
removal from the ->sibling list.  This will be used to make css use
percpu refcnt.

While at it, split out CGRP_DEAD related comment from the refcnt
deactivation one and correct / clarify how different guarantees are
met.

While this patch changes the specific order of operations, it
shouldn't cause any noticeable behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 19:27:41 -07:00
Linus Torvalds
cb7e9704d5 Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fixes from Paul McKenney:
 "I must confess that this past merge window was not RCU's best showing.
  This series contains three more fixes for RCU regressions:

   1.   A fix to __DECLARE_TRACE_RCU() that causes it to act as an
        interrupt from idle rather than as a task switch from idle.
        This change is needed due to the recent use of _rcuidle()
        tracepoints that can be invoked from interrupt handlers as well
        as from idle.  Without this fix, invoking _rcuidle() tracepoints
        from interrupt handlers results in splats and (more seriously)
        confusion on RCU's part as to whether a given CPU is idle or not.
        This confusion can in turn result in too-short grace periods and
        therefore random memory corruption.

   2.   A fix to a subtle deadlock that could result due to RCU doing
        a wakeup while holding one of its rcu_node structure's locks.
        Although the probability of occurrence is low, it really
        does happen.  The fix, courtesy of Steven Rostedt, uses
        irq_work_queue() to avoid the deadlock.

   3.   A fix to a silent deadlock (invisible to lockdep) due to the
        interaction of timeouts posted by RCU debug code enabled by
        CONFIG_PROVE_RCU_DELAY=y, grace-period initialization, and CPU
        hotplug operations.  This will not occur in production kernels,
        but really does occur in randconfig testing.  Diagnosis courtesy
        of Steven Rostedt"

* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration
  rcu: Don't call wakeup() with rcu_node structure ->lock held
  trace: Allow idle-safe tracepoints to be called from irq
2013-06-13 12:36:42 -07:00
Tejun Heo
6f3d828f0f cgroup: remove cgroup->count and use
cgroup->count tracks the number of css_sets associated with the cgroup
and used only to verify that no css_set is associated when the cgroup
is being destroyed.  It's superflous as the destruction path can
simply check whether cgroup->cset_links is empty instead.

Drop cgroup->count and check ->cset_links directly from
cgroup_destroy_locked().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:18 -07:00
Tejun Heo
ddd69148bd cgroup: drop unnecessary RCU dancing from __put_css_set()
__put_css_set() does RCU read access on @cgrp across dropping
@cgrp->count so that it can continue accessing @cgrp even if the count
reached zero and destruction of the cgroup commenced.  Given that both
sides - __css_put() and cgroup_destroy_locked() - are cold paths, this
is unnecessary.  Just making cgroup_destroy_locked() grab css_set_lock
while checking @cgrp->count is enough.

Remove the RCU read locking from __put_css_set() and make
cgroup_destroy_locked() read-lock css_set_lock when checking
@cgrp->count.  This will also allow removing @cgrp->count.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:18 -07:00
Tejun Heo
54766d4a1d cgroup: rename CGRP_REMOVED to CGRP_DEAD
We will add another flag indicating that the cgroup is in the process
of being killed.  REMOVING / REMOVED is more difficult to distinguish
and cgroup_is_removing()/cgroup_is_removed() are a bit awkward.  Also,
later percpu_ref usage will involve "kill"ing the refcnt.

 s/CGRP_REMOVED/CGRP_DEAD/
 s/cgroup_is_removed()/cgroup_is_dead()

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:18 -07:00
Tejun Heo
f4f4be2bd2 cgroup: use kzalloc() instead of kmalloc()
There's no point in using kmalloc() instead of the clearing variant
for trivial stuff.  We can live dangerously elsewhere.  Use kzalloc()
instead and drop 0 inits.

While at it, do trivial code reorganization in cgroup_file_open().

This patch doesn't introduce any functional changes.

v2: I was caught in the very distant past where list_del() didn't
    poison and the initial version converted list_del()s to
    list_del_init()s too.  Li and Kent took me out of the stasis
    chamber.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:17 -07:00
Tejun Heo
69d0206c79 cgroup: bring some sanity to naming around cg_cgroup_link
cgroups and css_sets are mapped M:N and this M:N mapping is
represented by struct cg_cgroup_link which forms linked lists on both
sides.  The naming around this mapping is already confusing and struct
cg_cgroup_link exacerbates the situation quite a bit.

>From cgroup side, it starts off ->css_sets and runs through
->cgrp_link_list.  From css_set side, it starts off ->cg_links and
runs through ->cg_link_list.  This is rather reversed as
cgrp_link_list is used to iterate css_sets and cg_link_list cgroups.
Also, this is the only place which is still using the confusing "cg"
for css_sets.  This patch cleans it up a bit.

* s/cgroup->css_sets/cgroup->cset_links/
  s/css_set->cg_links/css_set->cgrp_links/
  s/cgroup_iter->cg_link/cgroup_iter->cset_link/

* s/cg_cgroup_link/cgrp_cset_link/

* s/cgrp_cset_link->cg/cgrp_cset_link->cset/
  s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
  s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/

* s/init_css_set_link/init_cgrp_cset_link/
  s/free_cg_links/free_cgrp_cset_links/
  s/allocate_cg_links/allocate_cgrp_cset_links/

* s/cgl[12]/link[12]/ in compare_css_sets()

* s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar
  adustments.

* Comment and whiteline adjustments.

After the changes, we have

	list_for_each_entry(link, &cont->cset_links, cset_link) {
		struct css_set *cset = link->cset;

instead of

	list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
		struct css_set *cset = link->cg;

This patch is purely cosmetic.

v2: Fix broken sentences in the patch description.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:17 -07:00
Tejun Heo
5abb885573 cgroup: consistently use @cset for struct css_set variables
cgroup.c uses @cg for most struct css_set variables, which in itself
could be a bit confusing, but made much worse by the fact that there
are places which use @cg for struct cgroup variables.
compare_css_sets() epitomizes this confusion - @[old_]cg are struct
css_set while @cg[12] are struct cgroup.

It's not like the whole deal with cgroup, css_set and cg_cgroup_link
isn't already confusing enough.  Let's give it some sanity by
uniformly using @cset for all struct css_set variables.

* s/cg/cset/ for all css_set variables.

* s/oldcg/old_cset/ s/oldcgrp/old_cgrp/.  The same for the ones
  prefixed with "new".

* s/cg/cgrp/ for cgroup variables in compare_css_sets().

* s/css/cset/ for the cgroup variable in task_cgroup_from_root().

* Whiteline adjustments.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:17 -07:00
Tejun Heo
3fc3db9a3a cgroup: remove now unused css_depth()
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:17 -07:00
Li Zefan
f047cecf2c cpuset: fix to migrate mm correctly in a corner case
Before moving tasks out of empty cpusets, update_tasks_nodemask()
is called, which calls do_migrate_pages(xx, from, to). Then those
tasks are moved to an ancestor, and do_migrate_pages() is called
again.

The first time: from = node_to_be_offlined, to = empty.
The second time: from = empty, to = ancestor's nodemask.

so looks like no pages will be migrated.

Fix this by:

- Don't call update_tasks_nodemask() on empty cpusets.
- Pass cs->old_mems_allowed to do_migrate_pages().

v4: added comment in cpuset_hotplug_update_tasks() and rephased comment
    in cpuset_attach().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 10:51:22 -07:00
Li Zefan
88fa523bff cpuset: allow to move tasks to empty cpusets
Currently some cpuset behaviors are not friendly when cpuset is co-mounted
with other cgroup controllers.

Now with this patchset if cpuset is mounted with sane_behavior option,
it behaves differently:

- Tasks will be kept in empty cpusets when hotplug happens and take
  masks of ancestors with non-empty cpus/mems, instead of being moved to
  an ancestor.

- A task can be moved into an empty cpuset, and again it takes masks of
  ancestors, so the user can drop a task into a newly created cgroup without
  having to do anything for it.

As tasks can reside in empy cpusets, here're some rules:

- They can be moved to another cpuset, regardless it's empty or not.

- Though it takes masks from ancestors, it takes other configs from the
  empty cpuset.

- If the ancestors' masks are changed, those tasks will also be updated
  to take new masks.

v2: add documentation in include/linux/cgroup.h

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 10:48:33 -07:00
Li Zefan
5c5cc62321 cpuset: allow to keep tasks in empty cpusets
To achieve this:

- We call update_tasks_cpumask/nodemask() for empty cpusets when
hotplug happens, instead of moving tasks out of them.

- When a cpuset's masks are changed by writing cpuset.cpus/mems,
we also update tasks in child cpusets which are empty.

v3:
- do propagation work in one place for both hotplug and unplug

v2:
- drop rcu_read_lock before calling update_task_nodemask() and
  update_task_cpumask(), instead of using workqueue.
- add documentation in include/linux/cgroup.h

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 10:48:32 -07:00
Li Zefan
070b57fcac cpuset: introduce effective_{cpumask|nodemask}_cpuset()
effective_cpumask_cpuset() returns an ancestor cpuset which has
non-empty cpumask.

If a cpuset is empty and the tasks in it need to update their
cpus_allowed, they take on the ancestor cpuset's cpumask.

This currently won't change any behavior, but it will later allow us
to keep tasks in empty cpusets.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 10:48:32 -07:00
Li Zefan
33ad801dfb cpuset: record old_mems_allowed in struct cpuset
When we update a cpuset's mems_allowed and thus update tasks'
mems_allowed, it's required to pass the old mems_allowed and new
mems_allowed to cpuset_migrate_mm().

Currently we save old mems_allowed in a temp local variable before
changing cpuset->mems_allowed. This patch changes it by saving
old mems_allowed in cpuset->old_mems_allowed.

This currently won't change any behavior, but it will later allow
us to keep tasks in empty cpusets.

v3: restored "cpuset_attach_nodemask_to = cs->mems_allowed"

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 10:48:32 -07:00
Linus Torvalds
a568fa1c91 Merge branch 'akpm' (updates from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "Bunch of fixes and one little addition to math64.h"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
  include/linux/math64.h: add div64_ul()
  mm: memcontrol: fix lockless reclaim hierarchy iterator
  frontswap: fix incorrect zeroing and allocation size for frontswap_map
  kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
  mm: migration: add migrate_entry_wait_huge()
  ocfs2: add missing lockres put in dlm_mig_lockres_handler
  mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
  drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
  aio: fix io_destroy() regression by using call_rcu()
  rtc-at91rm9200: use shadow IMR on at91sam9x5
  rtc-at91rm9200: add shadow interrupt mask
  rtc-at91rm9200: refactor interrupt-register handling
  rtc-at91rm9200: add configuration support
  rtc-at91rm9200: add match-table compile guard
  fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
  swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
  drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
  cciss: fix broken mutex usage in ioctl
  audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
  drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
  ...
2013-06-12 16:29:53 -07:00
Chen Gang
736f3203a0 kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
audit_add_tree_rule() must set 'rule->tree = NULL;' firstly, to protect
the rule itself freed in kill_rules().

The reason is when it is killed, the 'rule' itself may have already
released, we should not access it.  one example: we add a rule to an
inode, just at the same time the other task is deleting this inode.

The work flow for adding a rule:

    audit_receive() -> (need audit_cmd_mutex lock)
      audit_receive_skb() ->
        audit_receive_msg() ->
          audit_receive_filter() ->
            audit_add_rule() ->
              audit_add_tree_rule() -> (need audit_filter_mutex lock)
                ...
                unlock audit_filter_mutex
                get_tree()
                ...
                iterate_mounts() -> (iterate all related inodes)
                  tag_mount() ->
                    tag_trunk() ->
                      create_trunk() -> (assume it is 1st rule)
                        fsnotify_add_mark() ->
                          fsnotify_add_inode_mark() ->  (add mark to inode->i_fsnotify_marks)
                        ...
                        get_tree(); (each inode will get one)
                ...
                lock audit_filter_mutex

The work flow for deleting an inode:

    __destroy_inode() ->
     fsnotify_inode_delete() ->
       __fsnotify_inode_delete() ->
        fsnotify_clear_marks_by_inode() ->  (get mark from inode->i_fsnotify_marks)
          fsnotify_destroy_mark() ->
           fsnotify_destroy_mark_locked() ->
             audit_tree_freeing_mark() ->
               evict_chunk() ->
                 ...
                 tree->goner = 1
                 ...
                 kill_rules() ->   (assume current->audit_context == NULL)
                   call_rcu() ->   (rule->tree != NULL)
                     audit_free_rule_rcu() ->
                       audit_free_rule()
                 ...
                 audit_schedule_prune() ->  (assume current->audit_context == NULL)
                   kthread_run() ->    (need audit_cmd_mutex and audit_filter_mutex lock)
                     prune_one() ->    (delete it from prue_list)
                       put_tree(); (match the original get_tree above)

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:46 -07:00
Oleg Nesterov
f000cfdde5 audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
audit_log_start() does wait_for_auditd() in a loop until
audit_backlog_wait_time passes or audit_skb_queue has a room.

If signal_pending() is true this becomes a busy-wait loop, schedule() in
TASK_INTERRUPTIBLE won't block.

Thanks to Guy for fully investigating and explaining the problem.

(akpm: that'll cause the system to lock up on a non-preemptible
uniprocessor kernel)

(Guy: "Our customer was in fact running a uniprocessor machine, and they
reported a system hang.")

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Guy Streeter <streeter@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:45 -07:00
Kees Cook
637241a900 kmsg: honor dmesg_restrict sysctl on /dev/kmsg
The dmesg_restrict sysctl currently covers the syslog method for access
dmesg, however /dev/kmsg isn't covered by the same protections.  Most
people haven't noticed because util-linux dmesg(1) defaults to using the
syslog method for access in older versions.  With util-linux dmesg(1)
defaults to reading directly from /dev/kmsg.

To fix /dev/kmsg, let's compare the existing interfaces and what they
allow:

 - /proc/kmsg allows:
  - open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
    single-reader interface (SYSLOG_ACTION_READ).
  - everything, after an open.

 - syslog syscall allows:
  - anything, if CAP_SYSLOG.
  - SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
    dmesg_restrict==0.
  - nothing else (EPERM).

The use-cases were:
 - dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
 - sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
   destructive SYSLOG_ACTION_READs.

AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
clear the ring buffer.

Based on the comments in devkmsg_llseek, it sounds like actions besides
reading aren't going to be supported by /dev/kmsg (i.e.
SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
syslog syscall actions.

To this end, move the check as Josh had done, but also rename the
constants to reflect their new uses (SYSLOG_FROM_CALL becomes
SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
allows destructive actions after a capabilities-constrained
SYSLOG_ACTION_OPEN check.

 - /dev/kmsg allows:
  - open if CAP_SYSLOG or dmesg_restrict==0
  - reading/polling, after open

Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192

[akpm@linux-foundation.org: use pr_warn_once()]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:44 -07:00
Robin Holt
cf7df378aa reboot: rigrate shutdown/reboot to boot cpu
We recently noticed that reboot of a 1024 cpu machine takes approx 16
minutes of just stopping the cpus.  The slowdown was tracked to commit
f96972f2dc ("kernel/sys.c: call disable_nonboot_cpus() in
kernel_restart()").

The current implementation does all the work of hot removing the cpus
before halting the system.  We are switching to just migrating to the
boot cpu and then continuing with shutdown/reboot.

This also has the effect of not breaking x86's command line parameter
for specifying the reboot cpu.  Note, this code was shamelessly copied
from arch/x86/kernel/reboot.c with bits removed pertaining to the
reboot_cpu command line parameter.

Signed-off-by: Robin Holt <holt@sgi.com>
Tested-by: Shawn Guo <shawn.guo@linaro.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:44 -07:00
Srivatsa S. Bhat
16e53dbf10 CPU hotplug: provide a generic helper to disable/enable CPU hotplug
There are instances in the kernel where we would like to disable CPU
hotplug (from sysfs) during some important operation.  Today the freezer
code depends on this and the code to do it was kinda tailor-made for
that.

Restructure the code and make it generic enough to be useful for other
usecases too.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:44 -07:00
Stephen Boyd
38ff87f77a sched_clock: Make ARM's sched_clock generic for all architectures
Nothing about the sched_clock implementation in the ARM port is
specific to the architecture. Generalize the code so that other
architectures can use it by selecting GENERIC_SCHED_CLOCK.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
[jstultz: Merge minor collisions with other patches in my tree]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-06-12 14:02:13 -07:00
Marcus Gelderie
11682a4161 alarmtimer: Export symbols of functions declared in linux/alarmtimer.h
Export symbols so they can be used by
drivers/staging/android/alarm-dev.c if it is built as a module.
So far alarm-dev is built-in but module support is planned (see
drivers/staging/android/TODO).

Signed-off-by: Marcus Gelderie <redmnic@gmail.com>
[jstultz: tweaked commit message, also export newly added functions]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-06-12 14:02:12 -07:00
Linus Torvalds
45d53766b9 Yoshihiro Yunomae fixed a regression in the output format when using
one of the counter clocks. The new multibuffer code changed the trace_clock
 file to update the trace instances tr->clock_id but the actual traces still
 used the value from the obsolete global variable trace_clock_id.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJRt8hEAAoJEOdOSU1xswtM6+EH/jAEuOrXhkkDqcua/paAOngw
 XWSaF9Cr6ozh4hutFrVSBi3AnsDrVo0adZmMVvLS9a7goyIUdYLfXbNeyK6Nvcq5
 bGXR5sJNpjtQ7snrmGX2KlXIGix28adi49eACi4qsGSG/jJYORYlgXcNBeXtENsb
 PKTXdQ8XEyc/h7Q51YQPHIVunf+zJSoepuXZ0myPLUWzLlPX9qoy5kETEpGhh9xh
 Ianb4wLo8dn6JuVGBuXQhZ/VzUHwJT1jJxR2JfnZ0bNVplNilnumAxY8f2zPOmzT
 lvIiQjCMRvNExFShuFh9WNnGRi62zYE9e0JJpYL4W9moIcbbEUvXYt2/imAVe9Q=
 =H+Wb
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Yoshihiro Yunomae fixed a regression in the output format when using
  one of the counter clocks.

  The new multibuffer code changed the trace_clock file to update the
  trace instances tr->clock_id but the actual traces still used the
  value from the obsolete global variable trace_clock_id"

* tag 'trace-fixes-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix outputting formats of x86-tsc and counter when use trace_clock
2013-06-12 08:29:11 -07:00
Namhyung Kim
aaf6ac0f08 tracing: Do not call kmem_cache_free() on allocation failure
There's no point calling it when _alloc() failed.

Link: http://lkml.kernel.org/r/1370585268-29169-1-git-send-email-namhyung@kernel.org

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-11 18:38:52 -04:00
Steven Rostedt
7614c3dc74 ftrace: Use schedule_on_each_cpu() as a heavy synchronize_sched()
The function tracer uses preempt_disable/enable_notrace() for
synchronization between reading registered ftrace_ops and unregistering
them.

Most of the ftrace_ops are global permanent structures that do not
require this synchronization. That is, ops may be added and removed from
the hlist but are never freed, and wont hurt if a synchronization is
missed.

But this is not true for dynamically created ftrace_ops or control_ops,
which are used by the perf function tracing.

The problem here is that the function tracer can be used to trace
kernel/user context switches as well as going to and from idle.
Basically, it can be used to trace blind spots of the RCU subsystem.
This means that even though preempt_disable() is done, a
synchronize_sched() will ignore CPUs that haven't made it out of user
space or idle. These can include functions that are being traced just
before entering or exiting the kernel sections.

To implement the RCU synchronization, instead of using
synchronize_sched() the use of schedule_on_each_cpu() is performed. This
means that when a dynamically allocated ftrace_ops, or a control ops is
being unregistered, all CPUs must be touched and execute a ftrace_sync()
stub function via the work queues. This will rip CPUs out from idle or
in dynamic tick mode. This only happens when a user disables perf
function tracing or other dynamically allocated function tracers, but it
allows us to continue to debug RCU and context tracking with function
tracing.

Link: http://lkml.kernel.org/r/1369785676.15552.55.camel@gandalf.local.home

Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-11 18:38:50 -04:00
Wang YanQing
238ae93d69 tracing: Fix file mode of free_buffer
Commit 4f271a2a60
(tracing: Add a proc file to stop tracing and free buffer)
implement a method to free up ring buffer in kernel memory
in the release code path of free_buffer's fd.

Then we don't need read/write support for free_buffer,
indeed we just have a dummy write fop, and don't implement read fop.

So the 0200 is more reasonable file mode for free_buffer than
the current file mode 0644.

Link: http://lkml.kernel.org/r/20130526085201.GA3183@udknight

Acked-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Acked-by: David Sharp <dhsharp@google.com>
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-11 18:38:49 -04:00
Harsh Prateek Bora
8092e808a3 tracing/trivial: Consolidate error return condition
Consolidate the checks for !enabled and !param to return -EINVAL
in event_enable_func().

Link: http://lkml.kernel.org/r/1369380137-12452-1-git-send-email-harsh@linux.vnet.ibm.com

Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-11 18:38:49 -04:00
Steven Rostedt (Red Hat)
90e3c03c3a tracing: Add function probe to trigger a ftrace dump of current CPU trace
Add the "cpudump" command to have the current CPU ftrace buffer dumped
to console if a function is hit. This is useful when debugging a
tripple fault, where you have an idea of a function that is called
just before the tripple fault occurs, and can tell ftrace to dump its
content out to the console before it continues.

This differs from the "dump" command as it only dumps the content of
the ring buffer for the currently executing CPU, and does not show
the contents of the other CPUs.

Format is:

  <function>:cpudump

echo 'bad_address:cpudump' > /debug/tracing/set_ftrace_filter

To remove this:

echo '!bad_address:cpudump' > /debug/tracing/set_ftrace_filter

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-11 18:38:48 -04:00
Steven Rostedt (Red Hat)
ad71d889b8 tracing: Add function probe to trigger a ftrace dump to console
Add the "dump" command to have the ftrace buffer dumped to console if
a function is hit. This is useful when debugging a tripple fault,
where you have an idea of a function that is called just before the
tripple fault occurs, and can tell ftrace to dump its content out
to the console before it continues.

Format is:

  <function>:dump

echo 'bad_address:dump' > /debug/tracing/set_ftrace_filter

To remove this:

echo '!bad_address:dump' > /debug/tracing/set_ftrace_filter

Requested-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-11 18:38:46 -04:00
Bernie Thompson
9350de06be PM / wakeup: Adjust messaging for wake events during suspend
This adds in a new message to the wakeup code which adds an
indication to the log that suspend was cancelled due to a wake event
occouring during the suspend sequence. It also adjusts the message
printed in suspend.c to reflect the potential that a suspend was
aborted, as opposed to a device failing to suspend.

Without these message adjustments one can end up with a kernel log
that says that a device failed to suspend with no actual device
suspend failures, which can be confusing to the log examiner.

Signed-off-by: Bernie Thompson <bhthompson@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-06-11 23:53:37 +02:00
Thomas Gleixner
d7880812b3 idle: Add the stack canary init to cpu_startup_entry()
Moving x86 to the generic idle implementation (commit 7d1a9417 "x86:
Use generic idle loop") wreckaged the stack protector.

I stupidly missed that boot_init_stack_canary() must be inlined from a
function which never returns, but I put that call into
arch_cpu_idle_prepare() which of course returns.

I pondered to play tricks with arch_cpu_idle_prepare() first, but then
I noticed, that the other archs which have implemented the
stackprotector (ARM and SH) do not initialize the canary for the
non-boot cpus.

So I decided to move the boot_init_stack_canary() call into
cpu_startup_entry() ifdeffed with an CONFIG_X86 for now. This #ifdef
is just a temporary measure as I don't want to inflict the
boot_init_stack_canary() call on ARM and SH that late in the cycle.

I'll queue a patch for 3.11 which removes the #ifdef if the ARM/SH
maintainers have no objection.

Reported-by: Wouter van Kesteren <woutershep@gmail.com>
Cc: x86@kernel.org
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-11 22:04:47 +02:00
Yoshihiro YUNOMAE
58e8eedf18 tracing: Fix outputting formats of x86-tsc and counter when use trace_clock
Outputting formats of x86-tsc and counter should be a raw format, but after
applying the patch(2b6080f28c), the format was
changed to nanosec. This is because the global variable trace_clock_id was used.
When we use multiple buffers, clock_id of each sub-buffer should be used. Then,
this patch uses tr->clock_id instead of the global variable trace_clock_id.

[ Basically, this fixes a regression where the multibuffer code changed the
  trace_clock file to update tr->clock_id but the traces still use the old
  global trace_clock_id variable, negating the file's effect. The global
  trace_clock_id variable is obsolete and removed. - SR ]

Link: http://lkml.kernel.org/r/20130423013239.22334.7394.stgit@yunodevel

Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-11 13:58:46 -04:00
Ivo Sieben
ee23871389 genirq: Set irq thread to RT priority on creation
When a threaded irq handler is installed the irq thread is initially
created on normal scheduling priority. Only after the irq thread is
woken up it sets its priority to RT_FIFO MAX_USER_RT_PRIO/2 itself.

This means that interrupts that occur directly after the irq handler
is installed will be handled on a normal scheduling priority instead
of the realtime priority that one would expect.

Fix this by setting the RT priority on creation of the irq_thread.

Signed-off-by: Ivo Sieben <meltedpianoman@gmail.com>
Cc: Sebastian Andrzej Siewior  <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1370254322-17240-1-git-send-email-meltedpianoman@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-11 16:18:50 +02:00
Ben Greear
34376a50fb Fix lockup related to stop_machine being stuck in __do_softirq.
The stop machine logic can lock up if all but one of the migration
threads make it through the disable-irq step and the one remaining
thread gets stuck in __do_softirq.  The reason __do_softirq can hang is
that it has a bail-out based on jiffies timeout, but in the lockup case,
jiffies itself is not incremented.

To work around this, re-add the max_restart counter in __do_irq and stop
processing irqs after 10 restarts.

Thanks to Tejun Heo and Rusty Russell and others for helping me track
this down.

This was introduced in 3.9 by commit c10d73671a ("softirq: reduce
latencies").

It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.

The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G         C   3.9.4+ #11
    Call Trace:
     <NMI>   warn_slowpath_common+0x85/0x9f
      warn_slowpath_fmt+0x46/0x48
      watchdog_overflow_callback+0x9c/0xa7
      __perf_event_overflow+0x137/0x1cb
      perf_event_overflow+0x14/0x16
      intel_pmu_handle_irq+0x2dc/0x359
      perf_event_nmi_handler+0x19/0x1b
      nmi_handle+0x7f/0xc2
      do_nmi+0xbc/0x304
      end_repeat_nmi+0x1e/0x2e
     <<EOE>>
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last  enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last  enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G        WC   3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:
     <IRQ>
      __do_softirq+0x117/0x257
      irq_exit+0x5f/0xbb
      smp_apic_timer_interrupt+0x8a/0x98
      apic_timer_interrupt+0x72/0x80
     <EOI>
      printk+0x4d/0x4f
      stop_machine_cpu_stop+0x22c/0x274
      cpu_stopper_thread+0xae/0x162
      smpboot_thread_fn+0x258/0x260
      kthread+0xc7/0xcf
      ret_from_fork+0x7c/0xb0

Signed-off-by: Ben Greear <greearb@candelatech.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Pekka Riikonen <priikone@iki.fi>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-10 17:46:57 -07:00