Commit graph

31625 commits

Author SHA1 Message Date
Ilya Dryomov
2bd93d4d7e libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
pg_to_raw_osds() helper for computing a raw (crush) set, which can
contain non-existant and down osds.

raw_to_up_osds() helper for pruning non-existant and down osds from the
raw set, therefore transforming it into an up set, and determining up
primary.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:10 -07:00
Ilya Dryomov
63a6993f52 libceph: primary_affinity decode bits
Add two helpers to decode primary_affinity (full map, vector<u32>) and
new_primary_affinity (inc map, map<u32, u32>) and switch to them.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:04 -07:00
Ilya Dryomov
2cfa34f2d6 libceph: primary_affinity infrastructure
Add primary_affinity infrastructure.  primary_affinity values are
stored in an max_osd-sized array, hanging off ceph_osdmap, similar to
a osd_weight array.

Introduce {get,set}_primary_affinity() helpers, primarily to return
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to
abstract out osd_primary_affinity array allocation and initialization.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:02 -07:00
Ilya Dryomov
d286de796a libceph: primary_temp decode bits
Add a common helper to decode both primary_temp (full map, map<pg_t,
u32>) and new_primary_temp (inc map, same) and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:00 -07:00
Ilya Dryomov
9686f94c8c libceph: primary_temp infrastructure
Add primary_temp mappings infrastructure.  struct ceph_pg_mapping is
overloaded, primary_temp mappings are stored in an rb-tree, rooted at
ceph_osdmap, in a manner similar to pg_temp mappings.

Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'primary_temp <pgid> <osd>' per line, e.g:

    primary_temp 2.6 4

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:58 -07:00
Ilya Dryomov
35a935d75d libceph: generalize ceph_pg_mapping
In preparation for adding support for primary_temp mappings, generalize
struct ceph_pg_mapping so it can hold mappings other than pg_temp.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:57 -07:00
Ilya Dryomov
ec7af97258 libceph: introduce get_osdmap_client_data_v()
Full and incremental osdmaps are structured identically and have
identical headers.  Add a helper to decode both "old" (16-bit version,
v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap
enconding headers and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:55 -07:00
Ilya Dryomov
10db634e20 libceph: introduce decode{,_new}_pg_temp() and switch to them
Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp
(inc map, same) decoding logic into a common helper and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:53 -07:00
Ilya Dryomov
4d60351f90 libceph: switch osdmap_set_max_osd() to krealloc()
Use krealloc() instead of rolling our own.  (krealloc() with a NULL
first argument acts as a kmalloc()).  Properly initalize the new array
elements.  This is needed to make future additions to osdmap easier.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:52 -07:00
Ilya Dryomov
433fbdd31d libceph: introduce decode{,_new}_pools() and switch to them
Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc
map, same) decoding logic into a common helper and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:50 -07:00
Ilya Dryomov
0f70c7eedb libceph: rename __decode_pool{,_names}() to decode_pool{,_names}()
To be in line with all the other osdmap decode helpers.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:49 -07:00
Ilya Dryomov
53bbaba9d8 libceph: fix and clarify ceph_decode_need() sizes
Sum up sizeof(...) results instead of (incorrectly) hard-coding the
number of bytes, expressed in ints and longs.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:47 -07:00
Ilya Dryomov
9464d00862 libceph: nuke bogus encoding version check in osdmap_apply_incremental()
Only version 6 of osdmap encoding is supported, anything other than
version 6 results in an error and halts the decoding process.  Checking
if version is >= 5 is therefore bogus.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:46 -07:00
Ilya Dryomov
86f1742b94 libceph: fixup error handling in osdmap_apply_incremental()
The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro.  This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset.  Follow osdmap_decode() and fix this by adding
a special e_inval label to be used by all ceph_decode_* macros.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:44 -07:00
Ilya Dryomov
9902e682c7 libceph: fix crush_decode() call site in osdmap_decode()
The size of the memory area feeded to crush_decode() should be limited
not only by osdmap end, but also by the crush map length.  Also, drop
unnecessary dout() (dout() in crush_decode() conveys the same info) and
step past crush map only if it is decoded successfully.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:43 -07:00
Ilya Dryomov
2d88b2e081 libceph: check length of osdmap osd arrays
Check length of osd_state, osd_weight and osd_addr arrays.  They
should all have exactly max_osd elements after the call to
osdmap_set_max_osd().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:41 -07:00
Ilya Dryomov
3977058c46 libceph: safely decode max_osd value in osdmap_decode()
max_osd value is not covered by any ceph_decode_need().  Use a safe
version of ceph_decode_* macro to decode it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:40 -07:00
Ilya Dryomov
597b52f6ca libceph: fixup error handling in osdmap_decode()
The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro.  This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset.  Fix this by adding a special e_inval label to
be used by all ceph_decode_* macros.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:38 -07:00
Ilya Dryomov
a2505d63ee libceph: split osdmap allocation and decode steps
Split osdmap allocation and initialization into a separate function,
ceph_osdmap_decode().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:37 -07:00
Ilya Dryomov
38a8d56023 libceph: dump osdmap and enhance output on decode errors
Dump osdmap in hex on both full and incremental decode errors, to make
it easier to match the contents with error offset.  dout() map epoch
and max_osd value on success.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:35 -07:00
Ilya Dryomov
1c00240e00 libceph: dump pg_temp mappings to debugfs
Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g:

    pg_temp 2.6 [2,3,4]

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:34 -07:00
Ilya Dryomov
0a2800d728 libceph: do not prefix osd lines with \t in debugfs output
To save screen space in anticipation of more fields (e.g. primary
affinity).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:32 -07:00
Ilya Dryomov
35fea3a18a libceph: refer to osdmap directly in osdmap_show()
To make it more readable and save screen space.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:31 -07:00
Ilya Dryomov
d83ed858f1 crush: add SET_CHOOSELEAF_VARY_R step
This lets you adjust the vary_r tunable on a per-rule basis.

Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04 21:07:28 -07:00
Ilya Dryomov
e2b149cc4b crush: add chooseleaf_vary_r tunable
The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call.  That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.

Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.

Note that this was done from the get-go for the new crush_choose_indep
algorithm.

This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.

Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04 21:07:26 -07:00
Ilya Dryomov
6ed1002f36 crush: allow crush rules to set (re)tries counts to 0
These two fields are misnomers; they are *retry* counts.

Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04 21:07:25 -07:00
Ilya Dryomov
48a163dbb5 crush: fix off-by-one errors in total_tries refactor
Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
code to allow adjustment of the retry counts on a per-pool basis.  That
commit had an off-by-one bug: the previous "tries" counter was a *retry*
count, not a *try* count, but the new code was passing in 1 meaning
there should be no retries.

Fix the ftotal vs tries comparison to use < instead of <= to fix the
problem.  Note that the original code used <= here, which means the
global "choose_total_tries" tunable is actually counting retries.
Compensate for that by adding 1 in crush_do_rule when we pull the tunable
into the local variable.

This was noticed looking at output from a user provided osdmap.
Unfortunately the map doesn't illustrate the change in mapping behavior
and I haven't managed to construct one yet that does.  Inspection of the
crush debug output now aligns with prior versions, though.

Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04 21:07:23 -07:00
Yan, Zheng
d90deda69c libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance()
When there is no more data, ceph_msg_data_{pages,pagelist}_advance()
should not move on to the next page.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:07:15 -07:00
Ilya Dryomov
c647b8a8c6 libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
This is primarily for rbd's benefit and is supposed to combat
fragmentation:

"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks.  We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."

SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:51 +08:00
Ilya Dryomov
7b25bf5f02 libceph: encode CEPH_OSD_OP_FLAG_* op flags
Encode ceph_osd_op::flags field so that it gets sent over the wire.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:51 +08:00
Ilya Dryomov
9d521470a4 libceph: a per-osdc crush scratch buffer
With the addition of erasure coding support in the future, scratch
variable-length array in crush_do_rule_ary() is going to grow to at
least 200 bytes on average, on top of another 128 bytes consumed by
rawosd/osd arrays in the call chain.  Replace it with a buffer inside
struct osdmap and a mutex.  This shouldn't result in any contention,
because all osd requests were already serialized by request_mutex at
that point; the only unlocked caller was ceph_ioctl_get_dataloc().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:50 +08:00
Vlad Yasevich
2adb956b08 vlan: Warn the user if lowerdev has bad vlan features.
Some drivers incorrectly assign vlan acceleration features to
vlan_features thus causing issues for Q-in-Q vlan configurations.
Warn the user of such cases.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 17:16:51 -04:00
Vlad Yasevich
fc92f745f8 bridge: Fix crash with vlan filtering and tcpdump
When the vlan filtering is enabled on the bridge, but
the filter is not configured on the bridge device itself,
running tcpdump on the bridge device will result in a
an Oops with NULL pointer dereference.  The reason
is that br_pass_frame_up() will bypass the vlan
check because promisc flag is set.  It will then try
to get the table pointer and process the packet based
on the table.  Since the table pointer is NULL, we oops.
Catch this special condition in br_handle_vlan().

Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
CC: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 17:14:02 -04:00
Vlad Yasevich
53d6471cef net: Account for all vlan headers in skb_mac_gso_segment
skb_network_protocol() already accounts for multiple vlan
headers that may be present in the skb.  However, skb_mac_gso_segment()
doesn't know anything about it and assumes that skb->mac_len
is set correctly to skip all mac headers.  That may not
always be the case.  If we are simply forwarding the packet (via
bridge or macvtap), all vlan headers may not be accounted for.

A simple solution is to allow skb_network_protocol to return
the vlan depth it has calculated.  This way skb_mac_gso_segment
will correctly skip all mac headers.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 17:10:36 -04:00
Hannes Frederic Sowa
c15b1ccadb ipv6: move DAD and addrconf_verify processing to workqueue
addrconf_join_solict and addrconf_join_anycast may cause actions which
need rtnl locked, especially on first address creation.

A new DAD state is introduced which defers processing of the initial
DAD processing into a workqueue.

To get rtnl lock we need to push the code paths which depend on those
calls up to workqueues, specifically addrconf_verify and the DAD
processing.

(v2)
addrconf_dad_failure needs to be queued up to the workqueue, too. This
patch introduces a new DAD state and stop the DAD processing in the
workqueue (this is because of the possible ipv6_del_addr processing
which removes the solicited multicast address from the device).

addrconf_verify_lock is removed, too. After the transition it is not
needed any more.

As we are not processing in bottom half anymore we need to be a bit more
careful about disabling bottom half out when we lock spin_locks which are also
used in bh.

Relevant backtrace:
[  541.030090] RTNL: assertion failed at net/core/dev.c (4496)
[  541.031143] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O 3.10.33-1-amd64-vyatta #1
[  541.031145] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  541.031146]  ffffffff8148a9f0 000000000000002f ffffffff813c98c1 ffff88007c4451f8
[  541.031148]  0000000000000000 0000000000000000 ffffffff813d3540 ffff88007fc03d18
[  541.031150]  0000880000000006 ffff88007c445000 ffffffffa0194160 0000000000000000
[  541.031152] Call Trace:
[  541.031153]  <IRQ>  [<ffffffff8148a9f0>] ? dump_stack+0xd/0x17
[  541.031180]  [<ffffffff813c98c1>] ? __dev_set_promiscuity+0x101/0x180
[  541.031183]  [<ffffffff813d3540>] ? __hw_addr_create_ex+0x60/0xc0
[  541.031185]  [<ffffffff813cfe1a>] ? __dev_set_rx_mode+0xaa/0xc0
[  541.031189]  [<ffffffff813d3a81>] ? __dev_mc_add+0x61/0x90
[  541.031198]  [<ffffffffa01dcf9c>] ? igmp6_group_added+0xfc/0x1a0 [ipv6]
[  541.031208]  [<ffffffff8111237b>] ? kmem_cache_alloc+0xcb/0xd0
[  541.031212]  [<ffffffffa01ddcd7>] ? ipv6_dev_mc_inc+0x267/0x300 [ipv6]
[  541.031216]  [<ffffffffa01c2fae>] ? addrconf_join_solict+0x2e/0x40 [ipv6]
[  541.031219]  [<ffffffffa01ba2e9>] ? ipv6_dev_ac_inc+0x159/0x1f0 [ipv6]
[  541.031223]  [<ffffffffa01c0772>] ? addrconf_join_anycast+0x92/0xa0 [ipv6]
[  541.031226]  [<ffffffffa01c311e>] ? __ipv6_ifa_notify+0x11e/0x1e0 [ipv6]
[  541.031229]  [<ffffffffa01c3213>] ? ipv6_ifa_notify+0x33/0x50 [ipv6]
[  541.031233]  [<ffffffffa01c36c8>] ? addrconf_dad_completed+0x28/0x100 [ipv6]
[  541.031241]  [<ffffffff81075c1d>] ? task_cputime+0x2d/0x50
[  541.031244]  [<ffffffffa01c38d6>] ? addrconf_dad_timer+0x136/0x150 [ipv6]
[  541.031247]  [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6]
[  541.031255]  [<ffffffff8105313a>] ? call_timer_fn.isra.22+0x2a/0x90
[  541.031258]  [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6]

Hunks and backtrace stolen from a patch by Stephen Hemminger.

Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 16:54:50 -04:00
Eric Dumazet
e2a1d3e47b tcp: fix get_timewait4_sock() delay computation on 64bit
It seems I missed one change in get_timewait4_sock() to compute
the remaining time before deletion of IPV4 timewait socket.

This could result in wrong output in /proc/net/tcp for tm->when field.

Fixes: 96f817fede ("tcp: shrink tcp6_timewait_sock by one cache line")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 16:43:14 -04:00
Flavio Leitner
4f647e0a3c openvswitch: fix a possible deadlock and lockdep warning
There are two problematic situations.

A deadlock can happen when is_percpu is false because it can get
interrupted while holding the spinlock. Then it executes
ovs_flow_stats_update() in softirq context which tries to get
the same lock.

The second sitation is that when is_percpu is true, the code
correctly disables BH but only for the local CPU, so the
following can happen when locking the remote CPU without
disabling BH:

       CPU#0                            CPU#1
  ovs_flow_stats_get()
   stats_read()
 +->spin_lock remote CPU#1        ovs_flow_stats_get()
 |  <interrupted>                  stats_read()
 |  ...                       +-->  spin_lock remote CPU#0
 |                            |     <interrupted>
 |  ovs_flow_stats_update()   |     ...
 |   spin_lock local CPU#0 <--+     ovs_flow_stats_update()
 +---------------------------------- spin_lock local CPU#1

This patch disables BH for both cases fixing the deadlocks.
Acked-by: Jesse Gross <jesse@nicira.com>

=================================
[ INFO: inconsistent lock state ]
3.14.0-rc8-00007-g632b06a #1 Tainted: G          I
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/0/0 [HC0[0]:SC1[5]:HE1:SE0] takes:
(&(&cpu_stats->lock)->rlock){+.?...}, at: [<ffffffffa05dd8a1>] ovs_flow_stats_update+0x51/0xd0 [openvswitch]
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff810f973f>] __lock_acquire+0x68f/0x1c40
[<ffffffff810fb4e2>] lock_acquire+0xa2/0x1d0
[<ffffffff817d8d9e>] _raw_spin_lock+0x3e/0x80
[<ffffffffa05dd9e4>] ovs_flow_stats_get+0xc4/0x1e0 [openvswitch]
[<ffffffffa05da855>] ovs_flow_cmd_fill_info+0x185/0x360 [openvswitch]
[<ffffffffa05daf05>] ovs_flow_cmd_build_info.constprop.27+0x55/0x90 [openvswitch]
[<ffffffffa05db41d>] ovs_flow_cmd_new_or_set+0x4dd/0x570 [openvswitch]
[<ffffffff816c245d>] genl_family_rcv_msg+0x1cd/0x3f0
[<ffffffff816c270e>] genl_rcv_msg+0x8e/0xd0
[<ffffffff816c0239>] netlink_rcv_skb+0xa9/0xc0
[<ffffffff816c0798>] genl_rcv+0x28/0x40
[<ffffffff816bf830>] netlink_unicast+0x100/0x1e0
[<ffffffff816bfc57>] netlink_sendmsg+0x347/0x770
[<ffffffff81668e9c>] sock_sendmsg+0x9c/0xe0
[<ffffffff816692d9>] ___sys_sendmsg+0x3a9/0x3c0
[<ffffffff8166a911>] __sys_sendmsg+0x51/0x90
[<ffffffff8166a962>] SyS_sendmsg+0x12/0x20
[<ffffffff817e3ce9>] system_call_fastpath+0x16/0x1b
irq event stamp: 1740726
hardirqs last  enabled at (1740726): [<ffffffff8175d5e0>] ip6_finish_output2+0x4f0/0x840
hardirqs last disabled at (1740725): [<ffffffff8175d59b>] ip6_finish_output2+0x4ab/0x840
softirqs last  enabled at (1740674): [<ffffffff8109be12>] _local_bh_enable+0x22/0x50
softirqs last disabled at (1740675): [<ffffffff8109db05>] irq_exit+0xc5/0xd0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&cpu_stats->lock)->rlock);
  <Interrupt>
    lock(&(&cpu_stats->lock)->rlock);

 *** DEADLOCK ***

5 locks held by swapper/0/0:
 #0:  (((&ifa->dad_timer))){+.-...}, at: [<ffffffff810a7155>] call_timer_fn+0x5/0x320
 #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81788a55>] mld_sendpack+0x5/0x4a0
 #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8175d149>] ip6_finish_output2+0x59/0x840
 #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff8168ba75>] __dev_queue_xmit+0x5/0x9b0
 #4:  (rcu_read_lock){.+.+..}, at: [<ffffffffa05e41b5>] internal_dev_xmit+0x5/0x110 [openvswitch]

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G          I  3.14.0-rc8-00007-g632b06a #1
Hardware name:                  /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
 0000000000000000 0fcf20709903df0c ffff88042d603808 ffffffff817cfe3c
 ffffffff81c134c0 ffff88042d603858 ffffffff817cb6da 0000000000000005
 ffffffff00000001 ffff880400000000 0000000000000006 ffffffff81c134c0
Call Trace:
 <IRQ>  [<ffffffff817cfe3c>] dump_stack+0x4d/0x66
 [<ffffffff817cb6da>] print_usage_bug+0x1f4/0x205
 [<ffffffff810f7f10>] ? check_usage_backwards+0x180/0x180
 [<ffffffff810f8963>] mark_lock+0x223/0x2b0
 [<ffffffff810f96d3>] __lock_acquire+0x623/0x1c40
 [<ffffffff810f5707>] ? __lock_is_held+0x57/0x80
 [<ffffffffa05e26c6>] ? masked_flow_lookup+0x236/0x250 [openvswitch]
 [<ffffffff810fb4e2>] lock_acquire+0xa2/0x1d0
 [<ffffffffa05dd8a1>] ? ovs_flow_stats_update+0x51/0xd0 [openvswitch]
 [<ffffffff817d8d9e>] _raw_spin_lock+0x3e/0x80
 [<ffffffffa05dd8a1>] ? ovs_flow_stats_update+0x51/0xd0 [openvswitch]
 [<ffffffffa05dd8a1>] ovs_flow_stats_update+0x51/0xd0 [openvswitch]
 [<ffffffffa05dcc64>] ovs_dp_process_received_packet+0x84/0x120 [openvswitch]
 [<ffffffff810f93f7>] ? __lock_acquire+0x347/0x1c40
 [<ffffffffa05e3bea>] ovs_vport_receive+0x2a/0x30 [openvswitch]
 [<ffffffffa05e4218>] internal_dev_xmit+0x68/0x110 [openvswitch]
 [<ffffffffa05e41b5>] ? internal_dev_xmit+0x5/0x110 [openvswitch]
 [<ffffffff8168b4a6>] dev_hard_start_xmit+0x2e6/0x8b0
 [<ffffffff8168be87>] __dev_queue_xmit+0x417/0x9b0
 [<ffffffff8168ba75>] ? __dev_queue_xmit+0x5/0x9b0
 [<ffffffff8175d5e0>] ? ip6_finish_output2+0x4f0/0x840
 [<ffffffff8168c430>] dev_queue_xmit+0x10/0x20
 [<ffffffff8175d641>] ip6_finish_output2+0x551/0x840
 [<ffffffff8176128a>] ? ip6_finish_output+0x9a/0x220
 [<ffffffff8176128a>] ip6_finish_output+0x9a/0x220
 [<ffffffff8176145f>] ip6_output+0x4f/0x1f0
 [<ffffffff81788c29>] mld_sendpack+0x1d9/0x4a0
 [<ffffffff817895b8>] mld_send_initial_cr.part.32+0x88/0xa0
 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220
 [<ffffffff8178e301>] ipv6_mc_dad_complete+0x31/0x50
 [<ffffffff817690d7>] addrconf_dad_completed+0x147/0x220
 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220
 [<ffffffff8176934f>] addrconf_dad_timer+0x19f/0x1c0
 [<ffffffff810a71e9>] call_timer_fn+0x99/0x320
 [<ffffffff810a7155>] ? call_timer_fn+0x5/0x320
 [<ffffffff817691b0>] ? addrconf_dad_completed+0x220/0x220
 [<ffffffff810a76c4>] run_timer_softirq+0x254/0x3b0
 [<ffffffff8109d47d>] __do_softirq+0x12d/0x480

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 16:41:53 -04:00
Toshiaki Makita
99b192da9c bridge: Fix handling stacked vlan tags
If a bridge with vlan_filtering enabled receives frames with stacked
vlan tags, i.e., they have two vlan tags, br_vlan_untag() strips not
only the outer tag but also the inner tag.

br_vlan_untag() is called only from br_handle_vlan(), and in this case,
it is enough to set skb->vlan_tci to 0 here, because vlan_tci has already
been set before calling br_handle_vlan().

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 16:33:09 -04:00
Toshiaki Makita
12464bb8de bridge: Fix inabillity to retrieve vlan tags when tx offload is disabled
Bridge vlan code (br_vlan_get_tag()) assumes that all frames have vlan_tci
if they are tagged, but if vlan tx offload is manually disabled on bridge
device and frames are sent from vlan device on the bridge device, the tags
are embedded in skb->data and they break this assumption.
Extract embedded vlan tags and move them to vlan_tci at ingress.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 16:33:09 -04:00
Zoltan Kiss
36d5fe6a00 core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
orphan them. Also, it doesn't handle errors, so this patch takes care of that
as well, and modify the callers accordingly. skb_tx_error() is also added to
the callers so they will signal the failed delivery towards the creator of the
skb.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-27 15:29:38 -04:00
Vlad Yasevich
fc0d48b8fb vlan: Set hard_header_len according to available acceleration
Currently, if the card supports CTAG acceleration we do not
account for the vlan header even if we are configuring an
8021AD vlan.  This may not be best since we'll do software
tagging for 8021AD which will cause data copy on skb head expansion
Configure the length based on available hw offload capabilities and
vlan protocol.

CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-27 15:00:37 -04:00
Eric Dumazet
de14439167 net: unix: non blocking recvmsg() should not return -EINTR
Some applications didn't expect recvmsg() on a non blocking socket
could return -EINTR. This possibility was added as a side effect
of commit b3ca9b02b0 ("net: fix multithreaded signal handling in
unix recv routines").

To hit this bug, you need to be a bit unlucky, as the u->readlock
mutex is usually held for very small periods.

Fixes: b3ca9b02b0 ("net: fix multithreaded signal handling in unix recv routines")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-26 17:05:40 -04:00
Pravin B Shelar
fbd02dd405 ip_tunnel: Fix dst ref-count.
Commit 10ddceb22b (ip_tunnel:multicast process cause panic due
to skb->_skb_refdst NULL pointer) removed dst-drop call from
ip-tunnel-recv.

Following commit reintroduce dst-drop and fix the original bug by
checking loopback packet before releasing dst.
Original bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681

CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-26 15:18:40 -04:00
Erik Hugne
a5d0e7c037 tipc: fix spinlock recursion bug for failed subscriptions
If a topology event subscription fails for any reason, such as out
of memory, max number reached or because we received an invalid
request the correct behavior is to terminate the subscribers
connection to the topology server. This is currently broken and
produces the following oops:

[27.953662] tipc: Subscription rejected, illegal request
[27.955329] BUG: spinlock recursion on CPU#1, kworker/u4:0/6
[27.957066]  lock: 0xffff88003c67f408, .magic: dead4ead, .owner: kworker/u4:0/6, .owner_cpu: 1
[27.958054] CPU: 1 PID: 6 Comm: kworker/u4:0 Not tainted 3.14.0-rc6+ #5
[27.960230] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[27.960874] Workqueue: tipc_rcv tipc_recv_work [tipc]
[27.961430]  ffff88003c67f408 ffff88003de27c18 ffffffff815c0207 ffff88003de1c050
[27.962292]  ffff88003de27c38 ffffffff815beec5 ffff88003c67f408 ffffffff817f0a8a
[27.963152]  ffff88003de27c58 ffffffff815beeeb ffff88003c67f408 ffffffffa0013520
[27.964023] Call Trace:
[27.964292]  [<ffffffff815c0207>] dump_stack+0x45/0x56
[27.964874]  [<ffffffff815beec5>] spin_dump+0x8c/0x91
[27.965420]  [<ffffffff815beeeb>] spin_bug+0x21/0x26
[27.965995]  [<ffffffff81083df6>] do_raw_spin_lock+0x116/0x140
[27.966631]  [<ffffffff815c6215>] _raw_spin_lock_bh+0x15/0x20
[27.967256]  [<ffffffffa0008540>] subscr_conn_shutdown_event+0x20/0xa0 [tipc]
[27.968051]  [<ffffffffa000fde4>] tipc_close_conn+0xa4/0xb0 [tipc]
[27.968722]  [<ffffffffa00101ba>] tipc_conn_terminate+0x1a/0x30 [tipc]
[27.969436]  [<ffffffffa00089a2>] subscr_conn_msg_event+0x1f2/0x2f0 [tipc]
[27.970209]  [<ffffffffa0010000>] tipc_receive_from_sock+0x90/0xf0 [tipc]
[27.970972]  [<ffffffffa000fa79>] tipc_recv_work+0x29/0x50 [tipc]
[27.971633]  [<ffffffff8105dbf5>] process_one_work+0x165/0x3e0
[27.972267]  [<ffffffff8105e869>] worker_thread+0x119/0x3a0
[27.972896]  [<ffffffff8105e750>] ? manage_workers.isra.25+0x2a0/0x2a0
[27.973622]  [<ffffffff810648af>] kthread+0xdf/0x100
[27.974168]  [<ffffffff810647d0>] ? kthread_create_on_node+0x1a0/0x1a0
[27.974893]  [<ffffffff815ce13c>] ret_from_fork+0x7c/0xb0
[27.975466]  [<ffffffff810647d0>] ? kthread_create_on_node+0x1a0/0x1a0

The recursion occurs when subscr_terminate tries to grab the
subscriber lock, which is already taken by subscr_conn_msg_event.
We fix this by checking if the request to establish a new
subscription was successful, and if not we initiate termination of
the subscriber after we have released the subscriber lock.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-24 15:36:56 -04:00
Li RongQing
c27f0872a3 netpoll: fix the skb check in pkt_is_ns
Neighbor Solicitation is ipv6 protocol, so we should check
skb->protocol with ETH_P_IPV6

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Cc: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-24 15:08:40 -04:00
David S. Miller
b74d3feccc Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:

====================
Open vSwitch

Four small fixes for net/3.14. I realize that these are late in the
cycle - just got back from vacation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-20 17:29:02 -04:00
Nicolas Dichtel
f518338b16 ip6mr: fix mfc notification flags
Commit 812e44dd18 ("ip6mr: advertise new mfc entries via rtnl") reuses the
function ip6mr_fill_mroute() to notify mfc events.
But this function was used only for dump and thus was always setting the
flag NLM_F_MULTI, which is wrong in case of a single notification.

Libraries like libnl will wait forever for NLMSG_DONE.

CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-20 16:24:28 -04:00
Nicolas Dichtel
65886f439a ipmr: fix mfc notification flags
Commit 8cd3ac9f9b ("ipmr: advertise new mfc entries via rtnl") reuses the
function ipmr_fill_mroute() to notify mfc events.
But this function was used only for dump and thus was always setting the
flag NLM_F_MULTI, which is wrong in case of a single notification.

Libraries like libnl will wait forever for NLMSG_DONE.

CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-20 16:24:28 -04:00
Nicolas Dichtel
1c104a6beb rtnetlink: fix fdb notification flags
Commit 3ff661c38c ("net: rtnetlink notify events for FDB NTF_SELF adds and
deletes") reuses the function nlmsg_populate_fdb_fill() to notify fdb events.
But this function was used only for dump and thus was always setting the
flag NLM_F_MULTI, which is wrong in case of a single notification.

Libraries like libnl will wait forever for NLMSG_DONE.

CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-20 16:24:28 -04:00
Ben Pfaff
f9b8c4c8ba openvswitch: Correctly report flow used times for first 5 minutes after boot.
The kernel starts out its "jiffies" timer as 5 minutes below zero, as
shown in include/linux/jiffies.h:

  /*
   * Have the 32 bit jiffies value wrap 5 minutes after boot
   * so jiffies wrap bugs show up earlier.
   */
  #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

The loop in ovs_flow_stats_get() starts out with 'used' set to 0, then
takes any "later" time.  This means that for the first five minutes after
boot, flows will always be reported as never used, since 0 is greater than
any time already seen.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-03-20 10:45:21 -07:00