Commit graph

9264 commits

Author SHA1 Message Date
David Ahern
8c0aa3f690 ipv4: Return error for RTA_VIA attribute
[ Upstream commit b6e9e5df4ecf100f6a10ab2ade8e46d47a4b9779 ]

IPv4 currently does not support nexthops outside of the AF_INET family.
Specifically, it does not handle RTA_VIA attribute. If it is passed
in a route add request, the actual route added only uses the device
which is clearly not what the user intended:

  $ ip ro add 172.16.1.0/24 via inet6 2001:db8:1::1 dev eth0
  $ ip ro ls
  ...
  172.16.1.0/24 dev eth0

Catch this and fail the route add:
  $ ip ro add 172.16.1.0/24 via inet6 2001:db8:1::1 dev eth0
  Error: IPv4 does not support RTA_VIA attribute.

Fixes: 03c0566542 ("mpls: Netlink commands to add, remove, and dump routes")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:19 +01:00
Nazarov Sergey
125bc1e67e net: avoid use IPCB in cipso_v4_error
[ Upstream commit 3da1ed7ac398f34fff1694017a07054d69c5f5c5 ]

Extract IP options in cipso_v4_error and use __icmp_send.

Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:19 +01:00
Nazarov Sergey
f2397468fb net: Add __icmp_send helper.
[ Upstream commit 9ef6b42ad6fd7929dd1b6092cb02014e382c6a91 ]

Add __icmp_send function having ip_options struct parameter

Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:19 +01:00
Paul Moore
e3713abc42 netlabel: fix out-of-bounds memory accesses
[ Upstream commit 5578de4834fe0f2a34fedc7374be691443396d1f ]

There are two array out-of-bounds memory accesses, one in
cipso_v4_map_lvl_valid(), the other in netlbl_bitmap_walk().  Both
errors are embarassingly simple, and the fixes are straightforward.

As a FYI for anyone backporting this patch to kernels prior to v4.8,
you'll want to apply the netlbl_bitmap_walk() patch to
cipso_v4_bitmap_walk() as netlbl_bitmap_walk() doesn't exist before
Linux v4.8.

Reported-by: Jann Horn <jannh@google.com>
Fixes: 446fda4f26 ("[NetLabel]: CIPSOv4 engine")
Fixes: 3faa8f982f ("netlabel: Move bitmap manipulation functions to the NetLabel core.")
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:18 +01:00
Hangbin Liu
99ed945821 ipv4: Add ICMPv6 support when parse route ipproto
[ Upstream commit 5e1a99eae84999a2536f50a0beaf5d5262337f40 ]

For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers.
But for ip -6 route, currently we only support tcp, udp and icmp.

Add ICMPv6 support so we can match ipv6-icmp rules for route lookup.

v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to
rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: eacb9384a3 ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-10 07:17:17 +01:00
Taehee Yoo
6546e1150c netfilter: ipt_CLUSTERIP: fix sleep-in-atomic bug in clusterip_config_entry_put()
commit 2a61d8b883bbad26b06d2e6cc3777a697e78830d upstream.

A proc_remove() can sleep. so that it can't be inside of spin_lock.
Hence proc_remove() is moved to outside of spin_lock. and it also
adds mutex to sync create and remove of proc entry(config->pde).

test commands:
SHELL#1
   %while :; do iptables -A INPUT -p udp -i enp2s0 -d 192.168.1.100 \
	   --dport 9000  -j CLUSTERIP --new --hashmode sourceip \
	   --clustermac 01:00:5e:00:00:21 --total-nodes 3 --local-node 3; \
	   iptables -F; done

SHELL#2
   %while :; do echo +1 > /proc/net/ipt_CLUSTERIP/192.168.1.100; \
	   echo -1 > /proc/net/ipt_CLUSTERIP/192.168.1.100; done

[ 2949.569864] BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
[ 2949.579944] in_atomic(): 1, irqs_disabled(): 0, pid: 5472, name: iptables
[ 2949.587920] 1 lock held by iptables/5472:
[ 2949.592711]  #0: 000000008f0ebcf2 (&(&cn->lock)->rlock){+...}, at: refcount_dec_and_lock+0x24/0x50
[ 2949.603307] CPU: 1 PID: 5472 Comm: iptables Tainted: G        W         4.19.0-rc5+ #16
[ 2949.604212] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[ 2949.604212] Call Trace:
[ 2949.604212]  dump_stack+0xc9/0x16b
[ 2949.604212]  ? show_regs_print_info+0x5/0x5
[ 2949.604212]  ___might_sleep+0x2eb/0x420
[ 2949.604212]  ? set_rq_offline.part.87+0x140/0x140
[ 2949.604212]  ? _rcu_barrier_trace+0x400/0x400
[ 2949.604212]  wait_for_completion+0x94/0x710
[ 2949.604212]  ? wait_for_completion_interruptible+0x780/0x780
[ 2949.604212]  ? __kernel_text_address+0xe/0x30
[ 2949.604212]  ? __lockdep_init_map+0x10e/0x5c0
[ 2949.604212]  ? __lockdep_init_map+0x10e/0x5c0
[ 2949.604212]  ? __init_waitqueue_head+0x86/0x130
[ 2949.604212]  ? init_wait_entry+0x1a0/0x1a0
[ 2949.604212]  proc_entry_rundown+0x208/0x270
[ 2949.604212]  ? proc_reg_get_unmapped_area+0x370/0x370
[ 2949.604212]  ? __lock_acquire+0x4500/0x4500
[ 2949.604212]  ? complete+0x18/0x70
[ 2949.604212]  remove_proc_subtree+0x143/0x2a0
[ 2949.708655]  ? remove_proc_entry+0x390/0x390
[ 2949.708655]  clusterip_tg_destroy+0x27a/0x630 [ipt_CLUSTERIP]
[ ... ]

Fixes: b3e456fce9 ("netfilter: ipt_CLUSTERIP: fix a race condition of proc file creation")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27 10:09:03 +01:00
Konstantin Khlebnikov
589503cb24 inet_diag: fix reporting cgroup classid and fallback to priority
[ Upstream commit 1ec17dbd90f8b638f41ee650558609c1af63dfa0 ]

Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested
extensions has only 8 bits. Thus extensions starting from DCTCPINFO
cannot be requested directly. Some of them included into response
unconditionally or hook into some of lower 8 bits.

Extension INET_DIAG_CLASS_ID has not way to request from the beginning.

This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space
reservation, and documents behavior for other extensions.

Also this patch adds fallback to reporting socket priority. This filed
is more widely used for traffic classification because ipv4 sockets
automatically maps TOS to priority and default qdisc pfifo_fast knows
about that. But priority could be changed via setsockopt SO_PRIORITY so
INET_DIAG_TOS isn't enough for predicting class.

Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot
reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen).

So, after this patch INET_DIAG_CLASS_ID will report socket priority
for most common setup when net_cls isn't set and/or cgroup2 in use.

Fixes: 0888e372c3 ("net: inet: diag: expose sockets cgroup classid")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27 10:08:58 +01:00
Jann Horn
6a3f723787 netfilter: nf_nat_snmp_basic: add missing length checks in ASN.1 cbs
commit c4c07b4d6fa1f11880eab8e076d3d060ef3f55fc upstream.

The generic ASN.1 decoder infrastructure doesn't guarantee that callbacks
will get as much data as they expect; callbacks have to check the `datalen`
parameter before looking at `data`. Make sure that snmp_version() and
snmp_helper() don't read/write beyond the end of the packet data.

(Also move the assignment to `pdata` down below the check to make it clear
that it isn't necessarily a pointer we can use before the `datalen` check.)

Fixes: cc2d58634e ("netfilter: nf_nat_snmp_basic: use asn1 decoder library")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23 09:07:27 +01:00
Eric Dumazet
1bdb1675f3 tcp: tcp_v4_err() should be more careful
[ Upstream commit 2c4cc9712364c051b1de2d175d5fbea6be948ebf ]

ICMP handlers are not very often stressed, we should
make them more resilient to bugs that might surface in
the future.

If there is no packet in retransmit queue, we should
avoid a NULL deref.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23 09:07:26 +01:00
Eric Dumazet
df1b583c16 tcp: clear icsk_backoff in tcp_write_queue_purge()
[ Upstream commit 04c03114be82194d4a4858d41dba8e286ad1787c ]

soukjin bae reported a crash in tcp_v4_err() handling
ICMP_DEST_UNREACH after tcp_write_queue_head(sk)
returned a NULL pointer.

Current logic should have prevented this :

  if (seq != tp->snd_una  || !icsk->icsk_retransmits ||
      !icsk->icsk_backoff || fastopen)
      break;

Problem is the write queue might have been purged
and icsk_backoff has not been cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-23 09:07:26 +01:00
Lorenzo Bianconi
1764111c99 net: ipv4: use a dedicated counter for icmp_v4 redirect packets
[ Upstream commit c09551c6ff7fe16a79a42133bcecba5fc2fc3291 ]

According to the algorithm described in the comment block at the
beginning of ip_rt_send_redirect, the host should try to send
'ip_rt_redirect_number' ICMP redirect packets with an exponential
backoff and then stop sending them at all assuming that the destination
ignores redirects.
If the device has previously sent some ICMP error packets that are
rate-limited (e.g TTL expired) and continues to receive traffic,
the redirect packets will never be transmitted. This happens since
peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
and so it will never be reset even if the redirect silence timeout
(ip_rt_redirect_silence) has elapsed without receiving any packet
requiring redirects.

Fix it by using a dedicated counter for the number of ICMP redirect
packets that has been sent by the host

I have not been able to identify a given commit that introduced the
issue since ip_rt_send_redirect implements the same rate-limiting
algorithm from commit 1da177e4c3 ("Linux-2.6.12-rc2")

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-23 09:07:24 +01:00
Lorenzo Bianconi
0a198e0bb8 net: ip_gre: use erspan key field for tunnel lookup
[ Upstream commit cb73ee40b1b381eaf3749e6dbeed567bb38e5258 ]

Use ERSPAN key header field as tunnel key in gre_parse_header routine
since ERSPAN protocol sets the key field of the external GRE header to
0 resulting in a tunnel lookup fail in ip6gre_err.
In addition remove key field parsing and pskb_may_pull check in
erspan_rcv and ip6erspan_rcv

Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06 17:30:06 +01:00
Lorenzo Bianconi
897ea28bd2 net: ip_gre: always reports o_key to userspace
[ Upstream commit feaf5c796b3f0240f10d0d6d0b686715fd58a05b ]

Erspan protocol (version 1 and 2) relies on o_key to configure
session id header field. However TUNNEL_KEY bit is cleared in
erspan_xmit since ERSPAN protocol does not set the key field
of the external GRE header and so the configured o_key is not reported
to userspace. The issue can be triggered with the following reproducer:

$ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \
    key 1 seq erspan_ver 1
$ip link set erspan1 up
$ip -d link sh erspan1

erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT
  link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
  erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0

Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
ipgre_fill_info

Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06 17:30:06 +01:00
Greg Kroah-Hartman
8c763a3cf5 Fix "net: ipv4: do not handle duplicate fragments as overlapping"
ade446403bfb ("net: ipv4: do not handle duplicate fragments as
overlapping") was backported to many stable trees, but it had a problem
that was "accidentally" fixed by the upstream commit 0ff89efb5246 ("ip:
fail fast on IP defrag errors")

This is the fixup for that problem as we do not want the larger patch in
the older stable trees.

Fixes: ade446403bfb ("net: ipv4: do not handle duplicate fragments as overlapping")
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06 17:30:05 +01:00
Willem de Bruijn
2cade15d58 tcp: allow MSG_ZEROCOPY transmission also in CLOSE_WAIT state
[ Upstream commit 13d7f46386e060df31b727c9975e38306fa51e7a ]

TCP transmission with MSG_ZEROCOPY fails if the peer closes its end of
the connection and so transitions this socket to CLOSE_WAIT state.

Transmission in close wait state is acceptable. Other similar tests in
the stack (e.g., in FastOpen) accept both states. Relax this test, too.

Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg276886.html
Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg227390.html
Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Reported-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
CC: Yuchung Cheng <ycheng@google.com>
CC: Neal Cardwell <ncardwell@google.com>
CC: Soheil Hassas Yeganeh <soheil@google.com>
CC: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:33 +01:00
Xin Long
552cd931b4 erspan: build the header with the right proto according to erspan_ver
[ Upstream commit 20704bd1633dd5afb29a321d3a615c9c8e9c9d05 ]

As said in draft-foschiano-erspan-03#section4:

   Different frame variants known as "ERSPAN Types" can be
   distinguished based on the GRE "Protocol Type" field value: Type I
   and II's value is 0x88BE while Type III's is 0x22EB [ETYPES].

So set it properly in erspan_xmit() according to erspan_ver. While at
it, also remove the unused parameter 'proto' in erspan_fb_xmit().

Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:33 +01:00
Willem de Bruijn
e3fa624ee7 udp: with udp_segment release on error path
[ Upstream commit 0f149c9fec3cd720628ecde83bfc6f64c1e7dcb6 ]

Failure __ip_append_data triggers udp_flush_pending_frames, but these
tests happen later. The skb must be freed directly.

Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:33 +01:00
Ido Schimmel
adbf7e5809 net: ipv4: Fix memory leak in network namespace dismantle
[ Upstream commit f97f4dd8b3bb9d0993d2491e0f22024c68109184 ]

IPv4 routing tables are flushed in two cases:

1. In response to events in the netdev and inetaddr notification chains
2. When a network namespace is being dismantled

In both cases only routes associated with a dead nexthop group are
flushed. However, a nexthop group will only be marked as dead in case it
is populated with actual nexthops using a nexthop device. This is not
the case when the route in question is an error route (e.g.,
'blackhole', 'unreachable').

Therefore, when a network namespace is being dismantled such routes are
not flushed and leaked [1].

To reproduce:
# ip netns add blue
# ip -n blue route add unreachable 192.0.2.0/24
# ip netns del blue

Fix this by not skipping error routes that are not marked with
RTNH_F_DEAD when flushing the routing tables.

To prevent the flushing of such routes in case #1, add a parameter to
fib_table_flush() that indicates if the table is flushed as part of
namespace dismantle or not.

Note that this problem does not exist in IPv6 since error routes are
associated with the loopback device.

[1]
unreferenced object 0xffff888066650338 (size 56):
  comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff  ..........ba....
    e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00  ...d............
  backtrace:
    [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
    [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
    [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
    [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
    [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
    [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
    [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
    [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
    [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
    [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000003a8b605b>] 0xffffffffffffffff
unreferenced object 0xffff888061621c88 (size 48):
  comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
  hex dump (first 32 bytes):
    6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
    6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff  kkkkkkkk..&_....
  backtrace:
    [<00000000733609e3>] fib_table_insert+0x978/0x1500
    [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
    [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
    [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
    [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
    [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
    [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
    [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
    [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
    [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
    [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000003a8b605b>] 0xffffffffffffffff

Fixes: 8cced9eff1 ("[NETNS]: Enable routing configuration in non-initial namespace.")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:32 +01:00
Ross Lagerwall
40f2f08030 net: Fix usage of pskb_trim_rcsum
[ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ]

In certain cases, pskb_trim_rcsum() may change skb pointers.
Reinitialize header pointers afterwards to avoid potential
use-after-frees. Add a note in the documentation of
pskb_trim_rcsum(). Found by KASAN.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:31 +01:00
Taehee Yoo
9d51378a68 netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine
[ Upstream commit 5a86d68bcf02f2d1e9a5897dd482079fd5f75e7f ]

When network namespace is destroyed, cleanup_net() is called.
cleanup_net() holds pernet_ops_rwsem then calls each ->exit callback.
So that clusterip_tg_destroy() is called by cleanup_net().
And clusterip_tg_destroy() calls unregister_netdevice_notifier().

But both cleanup_net() and clusterip_tg_destroy() hold same
lock(pernet_ops_rwsem). hence deadlock occurrs.

After this patch, only 1 notifier is registered when module is inserted.
And all of configs are added to per-net list.

test commands:
   %ip netns add vm1
   %ip netns exec vm1 bash
   %ip link set lo up
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	-j CLUSTERIP --new --hashmode sourceip \
	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
   %exit
   %ip netns del vm1

splat looks like:
[  341.809674] ============================================
[  341.809674] WARNING: possible recursive locking detected
[  341.809674] 4.19.0-rc5+ #16 Tainted: G        W
[  341.809674] --------------------------------------------
[  341.809674] kworker/u4:2/87 is trying to acquire lock:
[  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x8c/0x460
[  341.809674]
[  341.809674] but task is already holding lock:
[  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
[  341.809674]
[  341.809674] other info that might help us debug this:
[  341.809674]  Possible unsafe locking scenario:
[  341.809674]
[  341.809674]        CPU0
[  341.809674]        ----
[  341.809674]   lock(pernet_ops_rwsem);
[  341.809674]   lock(pernet_ops_rwsem);
[  341.809674]
[  341.809674]  *** DEADLOCK ***
[  341.809674]
[  341.809674]  May be due to missing lock nesting notation
[  341.809674]
[  341.809674] 3 locks held by kworker/u4:2/87:
[  341.809674]  #0: 00000000d9df6c92 ((wq_completion)"%s""netns"){+.+.}, at: process_one_work+0xafe/0x1de0
[  341.809674]  #1: 00000000c2cbcee2 (net_cleanup_work){+.+.}, at: process_one_work+0xb60/0x1de0
[  341.809674]  #2: 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
[  341.809674]
[  341.809674] stack backtrace:
[  341.809674] CPU: 1 PID: 87 Comm: kworker/u4:2 Tainted: G        W         4.19.0-rc5+ #16
[  341.809674] Workqueue: netns cleanup_net
[  341.809674] Call Trace:
[ ... ]
[  342.070196]  down_write+0x93/0x160
[  342.070196]  ? unregister_netdevice_notifier+0x8c/0x460
[  342.070196]  ? down_read+0x1e0/0x1e0
[  342.070196]  ? sched_clock_cpu+0x126/0x170
[  342.070196]  ? find_held_lock+0x39/0x1c0
[  342.070196]  unregister_netdevice_notifier+0x8c/0x460
[  342.070196]  ? register_netdevice_notifier+0x790/0x790
[  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
[  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
[  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
[  342.070196]  ? trace_hardirqs_on+0x93/0x210
[  342.070196]  ? __bpf_trace_preemptirq_template+0x10/0x10
[  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
[  342.123094]  clusterip_tg_destroy+0x3ad/0x650 [ipt_CLUSTERIP]
[  342.123094]  ? clusterip_net_init+0x3d0/0x3d0 [ipt_CLUSTERIP]
[  342.123094]  ? cleanup_match+0x17d/0x200 [ip_tables]
[  342.123094]  ? xt_unregister_table+0x215/0x300 [x_tables]
[  342.123094]  ? kfree+0xe2/0x2a0
[  342.123094]  cleanup_entry+0x1d5/0x2f0 [ip_tables]
[  342.123094]  ? cleanup_match+0x200/0x200 [ip_tables]
[  342.123094]  __ipt_unregister_table+0x9b/0x1a0 [ip_tables]
[  342.123094]  iptable_filter_net_exit+0x43/0x80 [iptable_filter]
[  342.123094]  ops_exit_list.isra.10+0x94/0x140
[  342.123094]  cleanup_net+0x45b/0x900
[ ... ]

Fixes: 202f59afd4 ("netfilter: ipt_CLUSTERIP: do not hold dev")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26 09:32:40 +01:00
Taehee Yoo
bb7b6c49cc netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine
[ Upstream commit b12f7bad5ad3724d19754390a3e80928525c0769 ]

When network namespace is destroyed, both clusterip_tg_destroy() and
clusterip_net_exit() are called. and clusterip_net_exit() is called
before clusterip_tg_destroy().
Hence cleanup check code in clusterip_net_exit() doesn't make sense.

test commands:
   %ip netns add vm1
   %ip netns exec vm1 bash
   %ip link set lo up
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	-j CLUSTERIP --new --hashmode sourceip \
	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
   %exit
   %ip netns del vm1

splat looks like:
[  341.184508] WARNING: CPU: 1 PID: 87 at net/ipv4/netfilter/ipt_CLUSTERIP.c:840 clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
[  341.184850] Modules linked in: ipt_CLUSTERIP nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter bpfilter ip_tables x_tables
[  341.184850] CPU: 1 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc5+ #16
[  341.227509] Workqueue: netns cleanup_net
[  341.227509] RIP: 0010:clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
[  341.227509] Code: 0f 85 7f fe ff ff 48 c7 c2 80 64 2c c0 be a8 02 00 00 48 c7 c7 a0 63 2c c0 c6 05 18 6e 00 00 01 e8 bc 38 ff f5 e9 5b fe ff ff <0f> 0b e9 33 ff ff ff e8 4b 90 50 f6 e9 2d fe ff ff 48 89 df e8 de
[  341.227509] RSP: 0018:ffff88011086f408 EFLAGS: 00010202
[  341.227509] RAX: dffffc0000000000 RBX: 1ffff1002210de85 RCX: 0000000000000000
[  341.227509] RDX: 1ffff1002210de85 RSI: ffff880110813be8 RDI: ffffed002210de58
[  341.227509] RBP: ffff88011086f4d0 R08: 0000000000000000 R09: 0000000000000000
[  341.227509] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1002210de81
[  341.227509] R13: ffff880110625a48 R14: ffff880114cec8c8 R15: 0000000000000014
[  341.227509] FS:  0000000000000000(0000) GS:ffff880116600000(0000) knlGS:0000000000000000
[  341.227509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  341.227509] CR2: 00007f11fd38e000 CR3: 000000013ca16000 CR4: 00000000001006e0
[  341.227509] Call Trace:
[  341.227509]  ? __clusterip_config_find+0x460/0x460 [ipt_CLUSTERIP]
[  341.227509]  ? default_device_exit+0x1ca/0x270
[  341.227509]  ? remove_proc_entry+0x1cd/0x390
[  341.227509]  ? dev_change_net_namespace+0xd00/0xd00
[  341.227509]  ? __init_waitqueue_head+0x130/0x130
[  341.227509]  ops_exit_list.isra.10+0x94/0x140
[  341.227509]  cleanup_net+0x45b/0x900
[ ... ]

Fixes: 613d0776d3 ("netfilter: exit_net cleanup check added")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26 09:32:40 +01:00
Taehee Yoo
744383c88e netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set
[ Upstream commit 06aa151ad1fc74a49b45336672515774a678d78d ]

If same destination IP address config is already existing, that config is
just used. MAC address also should be same.
However, there is no MAC address checking routine.
So that MAC address checking routine is added.

test commands:
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	   -j CLUSTERIP --new --hashmode sourceip \
	   --clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	   -j CLUSTERIP --new --hashmode sourceip \
	   --clustermac 01:00:5e:00:00:21 --total-nodes 2 --local-node 1

After this patch, above commands are disallowed.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26 09:32:40 +01:00
Willem de Bruijn
eb02c17fcc ip: on queued skb use skb_header_pointer instead of pskb_may_pull
[ Upstream commit 4a06fa67c4da20148803525151845276cdb995c1 ]

Commit 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call
pskb_may_pull") avoided a read beyond the end of the skb linear
segment by calling pskb_may_pull.

That function can trigger a BUG_ON in pskb_expand_head if the skb is
shared, which it is when when peeking. It can also return ENOMEM.

Avoid both by switching to safer skb_header_pointer.

Fixes: 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:31 +01:00
Yuchung Cheng
d7fe54c17a tcp: change txhash on SYN-data timeout
[ Upstream commit c5715b8fabfca0ef85903f8bad2189940ed41cc8 ]

Previously upon SYN timeouts the sender recomputes the txhash to
try a different path. However this does not apply on the initial
timeout of SYN-data (active Fast Open). Therefore an active IPv6
Fast Open connection may incur one second RTO penalty to take on
a new path after the second SYN retransmission uses a new flow label.

This patch removes this undesirable behavior so Fast Open changes
the flow label just like the regular connections. This also helps
avoid falsely disabling Fast Open on the sender which triggers
after two consecutive SYN timeouts on Fast Open.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:30 +01:00
Eric Dumazet
e521703487 tcp: fix a race in inet_diag_dump_icsk()
[ Upstream commit f0c928d878e7d01b613c9ae5c971a6b1e473a938 ]

Alexei reported use after frees in inet_diag_dump_icsk() [1]

Because we use refcount_set() when various sockets are setup and
inserted into ehash, we also need to make sure inet_diag_dump_icsk()
wont race with the refcount_set() operations.

Jonathan Lemon sent a patch changing net_twsk_hashdance() but
other spots would need risky changes.

Instead, fix inet_diag_dump_icsk() as this bug came with
linux-4.10 only.

[1] Quoting Alexei :

First something iterating over sockets finds already freed tw socket:

refcount_t: increment on 0; use-after-free.
WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30
RIP: 0010:refcount_inc+0x26/0x30
RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282
RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8
RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328
FS:  00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag]  // sock_hold(sk); in net/ipv4/inet_diag.c:1002
 ? kmalloc_large_node+0x37/0x70
 ? __kmalloc_node_track_caller+0x1cb/0x260
 ? __alloc_skb+0x72/0x1b0
 ? __kmalloc_reserve.isra.40+0x2e/0x80
 __inet_diag_dump+0x3b/0x80 [inet_diag]
 netlink_dump+0x116/0x2a0
 netlink_recvmsg+0x205/0x3c0
 sock_read_iter+0x89/0xd0
 __vfs_read+0xf7/0x140
 vfs_read+0x8a/0x140
 SyS_read+0x3f/0xa0
 do_syscall_64+0x5a/0x100

then a minute later twsk timer fires and hits two bad refcnts
for this freed socket:

refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40
Modules linked in:
RIP: 0010:refcount_dec+0x2e/0x40
RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296
RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80
R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 inet_twsk_kill+0x9d/0xc0  // inet_twsk_bind_unhash(tw, hashinfo);
 call_timer_fn+0x29/0x110
 run_timer_softirq+0x36b/0x3a0

refcount_t: underflow; use-after-free.
WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50
RIP: 0010:refcount_sub_and_test+0x46/0x50
RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296
RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358
R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 inet_twsk_put+0x12/0x20  // inet_twsk_put(tw);
 call_timer_fn+0x29/0x110
 run_timer_softirq+0x36b/0x3a0

Fixes: 67db3e4bfb ("tcp: no longer hold ehash lock while calling tcp_get_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:33 +01:00
Michal Kubecek
d5f9565c8d net: ipv4: do not handle duplicate fragments as overlapping
[ Upstream commit ade446403bfb79d3528d56071a84b15351a139ad ]

Since commit 7969e5c40d ("ip: discard IPv4 datagrams with overlapping
segments.") IPv4 reassembly code drops the whole queue whenever an
overlapping fragment is received. However, the test is written in a way
which detects duplicate fragments as overlapping so that in environments
with many duplicate packets, fragmented packets may be undeliverable.

Add an extra test and for (potentially) duplicate fragment, only drop the
new fragment rather than the whole queue. Only starting offset and length
are checked, not the contents of the fragments as that would be too
expensive. For similar reason, linear list ("run") of a rbtree node is not
iterated, we only check if the new fragment is a subset of the interval
covered by existing consecutive fragments.

v2: instead of an exact check iterating through linear list of an rbtree
node, only check if the new fragment is subset of the "run" (suggested
by Eric Dumazet)

Fixes: 7969e5c40d ("ip: discard IPv4 datagrams with overlapping segments.")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:32 +01:00
Eric Dumazet
281731c817 net: clear skb->tstamp in forwarding paths
[ Upstream commit 8203e2d844d34af247a151d8ebd68553a6e91785 ]

Sergey reported that forwarding was no longer working
if fq packet scheduler was used.

This is caused by the recent switch to EDT model, since incoming
packets might have been timestamped by __net_timestamp()

__net_timestamp() uses ktime_get_real(), while fq expects packets
using CLOCK_MONOTONIC base.

The fix is to clear skb->tstamp in forwarding paths.

Fixes: 80b14dee2b ("net: Add a new socket option for a future transmit time.")
Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sergey Matyukevich <geomatsi@gmail.com>
Tested-by: Sergey Matyukevich <geomatsi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:31 +01:00
Willem de Bruijn
cde81154f8 ip: validate header length on virtual device xmit
[ Upstream commit cb9f1b783850b14cbd7f87d061d784a666dfba1f ]

KMSAN detected read beyond end of buffer in vti and sit devices when
passing truncated packets with PF_PACKET. The issue affects additional
ip tunnel devices.

Extend commit 76c0ddd8c3 ("ip6_tunnel: be careful when accessing the
inner header") and commit ccfec9e5cb ("ip_tunnel: be careful when
accessing the inner header").

Move the check to a separate helper and call at the start of each
ndo_start_xmit function in net/ipv4 and net/ipv6.

Minor changes:
- convert dev_kfree_skb to kfree_skb on error path,
  as dev_kfree_skb calls consume_skb which is not for error paths.
- use pskb_network_may_pull even though that is pedantic here,
  as the same as pskb_may_pull for devices without llheaders.
- do not cache ipv6 hdrs if used only once
  (unsafe across pskb_may_pull, was more relevant to earlier patch)

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:31 +01:00
Gustavo A. R. Silva
360fb1db92 ipv4: Fix potential Spectre v1 vulnerability
[ Upstream commit 5648451e30a0d13d11796574919a359025d52cce ]

vr.vifi is indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

net/ipv4/ipmr.c:1616 ipmr_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)
net/ipv4/ipmr.c:1690 ipmr_compat_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)

Fix this by sanitizing vr.vifi before using it to index mrt->vif_table'

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:31 +01:00
Eric Dumazet
d265655ae4 tcp: lack of available data can also cause TSO defer
commit f9bfe4e6a9d08d405fe7b081ee9a13e649c97ecf upstream.

tcp_tso_should_defer() can return true in three different cases :

 1) We are cwnd-limited
 2) We are rwnd-limited
 3) We are application limited.

Neal pointed out that my recent fix went too far, since
it assumed that if we were not in 1) case, we must be rwnd-limited

Fix this by properly populating the is_cwnd_limited and
is_rwnd_limited booleans.

After this change, we can finally move the silly check for FIN
flag only for the application-limited case.

The same move for EOR bit will be handled in net-next,
since commit 1c09f7d073b1 ("tcp: do not try to defer skbs
with eor mark (MSG_EOR)") is scheduled for linux-4.21

Tested by running 200 concurrent netperf -t TCP_RR -- -r 60000,100
and checking none of them was rwnd_limited in the chrono_stat
output from "ss -ti" command.

Fixes: 41727549de3e ("tcp: Do not underestimate rwnd_limited")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:42 +01:00
Taehee Yoo
5517d4c6dc netfilter: nat: fix double register in masquerade modules
[ Upstream commit 095faf45e64be00bff4da2d6182dface3d69c9b7 ]

There is a reference counter to ensure that masquerade modules register
notifiers only once. However, the existing reference counter approach is
not safe, test commands are:

   while :
   do
   	   modprobe ip6t_MASQUERADE &
	   modprobe nft_masq_ipv6 &
	   modprobe -rv ip6t_MASQUERADE &
	   modprobe -rv nft_masq_ipv6 &
   done

numbers below represent the reference counter.
--------------------------------------------------------
CPU0        CPU1        CPU2        CPU3        CPU4
[insmod]    [insmod]    [rmmod]     [rmmod]     [insmod]
--------------------------------------------------------
0->1
register    1->2
            returns     2->1
			returns     1->0
                                                0->1
                                                register <--
                                    unregister
--------------------------------------------------------

The unregistation of CPU3 should be processed before the
registration of CPU4.

In order to fix this, use a mutex instead of reference counter.

splat looks like:
[  323.869557] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:1381]
[  323.869574] Modules linked in: nf_tables(+) nf_nat_ipv6(-) nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 n]
[  323.869574] irq event stamp: 194074
[  323.898930] hardirqs last  enabled at (194073): [<ffffffff90004a0d>] trace_hardirqs_on_thunk+0x1a/0x1c
[  323.898930] hardirqs last disabled at (194074): [<ffffffff90004a29>] trace_hardirqs_off_thunk+0x1a/0x1c
[  323.898930] softirqs last  enabled at (182132): [<ffffffff922006ec>] __do_softirq+0x6ec/0xa3b
[  323.898930] softirqs last disabled at (182109): [<ffffffff90193426>] irq_exit+0x1a6/0x1e0
[  323.898930] CPU: 0 PID: 1381 Comm: modprobe Not tainted 4.20.0-rc2+ #27
[  323.898930] RIP: 0010:raw_notifier_chain_register+0xea/0x240
[  323.898930] Code: 3c 03 0f 8e f2 00 00 00 44 3b 6b 10 7f 4d 49 bc 00 00 00 00 00 fc ff df eb 22 48 8d 7b 10 488
[  323.898930] RSP: 0018:ffff888101597218 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[  323.898930] RAX: 0000000000000000 RBX: ffffffffc04361c0 RCX: 0000000000000000
[  323.898930] RDX: 1ffffffff26132ae RSI: ffffffffc04aa3c0 RDI: ffffffffc04361d0
[  323.898930] RBP: ffffffffc04361c8 R08: 0000000000000000 R09: 0000000000000001
[  323.898930] R10: ffff8881015972b0 R11: fffffbfff26132c4 R12: dffffc0000000000
[  323.898930] R13: 0000000000000000 R14: 1ffff110202b2e44 R15: ffffffffc04aa3c0
[  323.898930] FS:  00007f813ed41540(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
[  323.898930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  323.898930] CR2: 0000559bf2c9f120 CR3: 000000010bc80000 CR4: 00000000001006f0
[  323.898930] Call Trace:
[  323.898930]  ? atomic_notifier_chain_register+0x2d0/0x2d0
[  323.898930]  ? down_read+0x150/0x150
[  323.898930]  ? sched_clock_cpu+0x126/0x170
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  323.898930]  register_netdevice_notifier+0xbb/0x790
[  323.898930]  ? __dev_close_many+0x2d0/0x2d0
[  323.898930]  ? __mutex_unlock_slowpath+0x17f/0x740
[  323.898930]  ? wait_for_completion+0x710/0x710
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  323.898930]  ? up_write+0x6c/0x210
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  324.127073]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  324.127073]  nft_chain_filter_init+0x1e/0xe8a [nf_tables]
[  324.127073]  nf_tables_module_init+0x37/0x92 [nf_tables]
[ ... ]

Fixes: 8dd33cc93e ("netfilter: nf_nat: generalize IPv4 masquerading support for nf_tables")
Fixes: be6b635cd6 ("netfilter: nf_nat: generalize IPv6 masquerading support for nf_tables")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:37 +01:00
Taehee Yoo
18218f827e netfilter: add missing error handling code for register functions
[ Upstream commit 584eab291c67894cb17cc87544b9d086228ea70f ]

register_{netdevice/inetaddr/inet6addr}_notifier may return an error
value, this patch adds the code to handle these error paths.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:36 +01:00
Yuchung Cheng
bbc83e8d08 tcp: fix NULL ref in tail loss probe
[ Upstream commit b2b7af861122a0c0f6260155c29a1b2e594cd5b5 ]

TCP loss probe timer may fire when the retranmission queue is empty but
has a non-zero tp->packets_out counter. tcp_send_loss_probe will call
tcp_rearm_rto which triggers NULL pointer reference by fetching the
retranmission queue head in its sub-routines.

Add a more detailed warning to help catch the root cause of the inflight
accounting inconsistency.

Reported-by: Rafael Tinoco <rafael.tinoco@linaro.org>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:28 +01:00
Eric Dumazet
03b271cb91 tcp: Do not underestimate rwnd_limited
[ Upstream commit 41727549de3e7281feb174d568c6e46823db8684 ]

If available rwnd is too small, tcp_tso_should_defer()
can decide it is worth waiting before splitting a TSO packet.

This really means we are rwnd limited.

Fixes: 5615f88614 ("tcp: instrument how long TCP is limited by receive window")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:28 +01:00
Edward Cree
7fafda16bb net: use skb_list_del_init() to remove from RX sublists
[ Upstream commit 22f6bbb7bcfcef0b373b0502a7ff390275c575dd ]

list_del() leaves the skb->next pointer poisoned, which can then lead to
 a crash in e.g. OVS forwarding.  For example, setting up an OVS VXLAN
 forwarding bridge on sfc as per:

========
$ ovs-vsctl show
5dfd9c47-f04b-4aaa-aa96-4fbb0a522a30
    Bridge "br0"
        Port "br0"
            Interface "br0"
                type: internal
        Port "enp6s0f0"
            Interface "enp6s0f0"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key="1", local_ip="10.0.0.5", remote_ip="10.0.0.4"}
    ovs_version: "2.5.0"
========
(where 10.0.0.5 is an address on enp6s0f1)
and sending traffic across it will lead to the following panic:
========
general protection fault: 0000 [#1] SMP PTI
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-rc3-ehc+ #701
Hardware name: Dell Inc. PowerEdge R710/0M233H, BIOS 6.4.0 07/23/2013
RIP: 0010:dev_hard_start_xmit+0x38/0x200
Code: 53 48 89 fb 48 83 ec 20 48 85 ff 48 89 54 24 08 48 89 4c 24 18 0f 84 ab 01 00 00 48 8d 86 90 00 00 00 48 89 f5 48 89 44 24 10 <4c> 8b 33 48 c7 03 00 00 00 00 48 8b 05 c7 d1 b3 00 4d 85 f6 0f 95
RSP: 0018:ffff888627b437e0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: dead000000000100 RCX: ffff88862279c000
RDX: ffff888614a342c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888618a88000 R08: 0000000000000001 R09: 00000000000003e8
R10: 0000000000000000 R11: ffff888614a34140 R12: 0000000000000000
R13: 0000000000000062 R14: dead000000000100 R15: ffff888616430000
FS:  0000000000000000(0000) GS:ffff888627b40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6d2bc6d000 CR3: 000000000200a000 CR4: 00000000000006e0
Call Trace:
 <IRQ>
 __dev_queue_xmit+0x623/0x870
 ? masked_flow_lookup+0xf7/0x220 [openvswitch]
 ? ep_poll_callback+0x101/0x310
 do_execute_actions+0xaba/0xaf0 [openvswitch]
 ? __wake_up_common+0x8a/0x150
 ? __wake_up_common_lock+0x87/0xc0
 ? queue_userspace_packet+0x31c/0x5b0 [openvswitch]
 ovs_execute_actions+0x47/0x120 [openvswitch]
 ovs_dp_process_packet+0x7d/0x110 [openvswitch]
 ovs_vport_receive+0x6e/0xd0 [openvswitch]
 ? dst_alloc+0x64/0x90
 ? rt_dst_alloc+0x50/0xd0
 ? ip_route_input_slow+0x19a/0x9a0
 ? __udp_enqueue_schedule_skb+0x198/0x1b0
 ? __udp4_lib_rcv+0x856/0xa30
 ? __udp4_lib_rcv+0x856/0xa30
 ? cpumask_next_and+0x19/0x20
 ? find_busiest_group+0x12d/0xcd0
 netdev_frame_hook+0xce/0x150 [openvswitch]
 __netif_receive_skb_core+0x205/0xae0
 __netif_receive_skb_list_core+0x11e/0x220
 netif_receive_skb_list+0x203/0x460
 ? __efx_rx_packet+0x335/0x5e0 [sfc]
 efx_poll+0x182/0x320 [sfc]
 net_rx_action+0x294/0x3c0
 __do_softirq+0xca/0x297
 irq_exit+0xa6/0xb0
 do_IRQ+0x54/0xd0
 common_interrupt+0xf/0xf
 </IRQ>
========
So, in all listified-receive handling, instead pull skbs off the lists with
 skb_list_del_init().

Fixes: 9af86f9338 ("net: core: fix use-after-free in __netif_receive_skb_list_core")
Fixes: 7da517a3bc ("net: core: Another step of skb receive list processing")
Fixes: a4ca8b7df7 ("net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()")
Fixes: d8269e2cbf ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:27 +01:00
Jiri Wiesner
ffe5754d28 ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
[ Upstream commit ebaf39e6032faf77218220707fc3fa22487784e0 ]

The *_frag_reasm() functions are susceptible to miscalculating the byte
count of packet fragments in case the truesize of a head buffer changes.
The truesize member may be changed by the call to skb_unclone(), leaving
the fragment memory limit counter unbalanced even if all fragments are
processed. This miscalculation goes unnoticed as long as the network
namespace which holds the counter is not destroyed.

Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in running state with a stacktrace similar to:

 PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
  #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
  #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
  #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
  #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
  #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
 #10 [ffff880621563e38] process_one_work at ffffffff81096f14

It is not possible to create new network namespaces, and processes
that call unshare() end up being stuck in uninterruptible sleep state
waiting to acquire the net_mutex.

The bug was observed in the IPv6 netfilter code by Per Sundstrom.
I thank him for his analysis of the problem. The parts of this patch
that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.

Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
Acked-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 09:24:26 +01:00
Eric Dumazet
aaa7e45c00 tcp: defer SACK compression after DupThresh
[ Upstream commit 86de5921a3d5dd246df661e09bdd0a6131b39ae3 ]

Jean-Louis reported a TCP regression and bisected to recent SACK
compression.

After a loss episode (receiver not able to keep up and dropping
packets because its backlog is full), linux TCP stack is sending
a single SACK (DUPACK).

Sender waits a full RTO timer before recovering losses.

While RFC 6675 says in section 5, "Algorithm Details",

   (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
       indicating at least three segments have arrived above the current
       cumulative acknowledgment point, which is taken to indicate loss
       -- go to step (4).
...
   (4) Invoke fast retransmit and enter loss recovery as follows:

there are old TCP stacks not implementing this strategy, and
still counting the dupacks before starting fast retransmit.

While these stacks probably perform poorly when receivers implement
LRO/GRO, we should be a little more gentle to them.

This patch makes sure we do not enable SACK compression unless
3 dupacks have been sent since last rcv_nxt update.

Ideally we should even rearm the timer to send one or two
more DUPACK if no more packets are coming, but that will
be work aiming for linux-4.21.

Many thanks to Jean-Louis for bisecting the issue, providing
packet captures and testing this patch.

Fixes: 5d9f4262b7 ("tcp: add SACK compression")
Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-05 19:32:00 +01:00
Eric Dumazet
cddcc9959a tcp: do not release socket ownership in tcp_close()
commit 8873c064d1de579ea23412a6d3eee972593f142b upstream.

syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
in tcp_close()

While a socket is being closed, it is very possible other
threads find it in rtnetlink dump.

tcp_get_info() will acquire the socket lock for a short amount
of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
enough to trigger the warning.

Fixes: 67db3e4bfb ("tcp: no longer hold ehash lock while calling tcp_get_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-01 09:37:28 +01:00
Eric Dumazet
115973a608 inet: frags: better deal with smp races
[ Upstream commit 0d5b9311baf27bb545f187f12ecfd558220c607d ]

Multiple cpus might attempt to insert a new fragment in rhashtable,
if for example RPS is buggy, as reported by 배석진 in
https://patchwork.ozlabs.org/patch/994601/

We use rhashtable_lookup_get_insert_key() instead of
rhashtable_insert_fast() to let cpus losing the race
free their own inet_frag_queue and use the one that
was inserted by another cpu.

Fixes: 648700f76b ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-23 08:17:05 +01:00
Stephen Mallon
7e6782271f tcp: Fix SOF_TIMESTAMPING_RX_HARDWARE to use the latest timestamp during TCP coalescing
[ Upstream commit cadf9df27e7cf40e390e060a1c71bb86ecde798b ]

During tcp coalescing ensure that the skb hardware timestamp refers to the
highest sequence number data.
Previously only the software timestamp was updated during coalescing.

Signed-off-by: Stephen Mallon <stephen.mallon@sydney.edu.au>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-23 08:17:03 +01:00
Sabrina Dubroca
60258098e8 ip_tunnel: don't force DF when MTU is locked
[ Upstream commit 16f7eb2b77b55da816c4e207f3f9440a8cafc00a ]

The various types of tunnels running over IPv4 can ask to set the DF
bit to do PMTU discovery. However, PMTU discovery is subject to the
threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
disabled on routes with "mtu lock". In those cases, we shouldn't set
the DF bit.

This patch makes setting the DF bit conditional on the route's MTU
locking state.

This issue seems to be older than git history.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-23 08:17:02 +01:00
Stefan Nuernberger
247a9fa4b4 net/ipv4: defensive cipso option parsing
commit 076ed3da0c9b2f88d9157dbe7044a45641ae369e upstream.

commit 40413955ee ("Cipso: cipso_v4_optptr enter infinite loop") fixed
a possible infinite loop in the IP option parsing of CIPSO. The fix
assumes that ip_options_compile filtered out all zero length options and
that no other one-byte options beside IPOPT_END and IPOPT_NOOP exist.
While this assumption currently holds true, add explicit checks for zero
length and invalid length options to be safe for the future. Even though
ip_options_compile should have validated the options, the introduction of
new one-byte options can still confuse this code without the additional
checks.

Signed-off-by: Stefan Nuernberger <snu@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Simon Veith <sveith@amazon.de>
Cc: stable@vger.kernel.org
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:08:41 -08:00
Cong Wang
54d6a82d05 net: drop skb on failure in ip_check_defrag()
[ Upstream commit 7de414a9dd91426318df7b63da024b2b07e53df5 ]

Most callers of pskb_trim_rcsum() simply drop the skb when
it fails, however, ip_check_defrag() still continues to pass
the skb up to stack. This is suspicious.

In ip_check_defrag(), after we learn the skb is an IP fragment,
passing the skb to callers makes no sense, because callers expect
fragments are defrag'ed on success. So, dropping the skb when we
can't defrag it is reasonable.

Note, prior to commit 88078d98d1, this is not a big problem as
checksum will be fixed up anyway. After it, the checksum is not
correct on failure.

Found this during code review.

Fixes: 88078d98d1 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-04 14:50:53 +01:00
Karsten Graul
fd54c188b3 Revert "net: simplify sock_poll_wait"
[ Upstream commit 89ab066d4229acd32e323f1569833302544a4186 ]

This reverts commit dd979b4df8.

This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
internal TCP socket for the initial handshake with the remote peer.
Whenever the SMC connection can not be established this TCP socket is
used as a fallback. All socket operations on the SMC socket are then
forwarded to the TCP socket. In case of poll, the file->private_data
pointer references the SMC socket because the TCP socket has no file
assigned. This causes tcp_poll to wait on the wrong socket.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-04 14:50:51 +01:00
Sean Tranchetti
4fb0dc97de net: udp: fix handling of CHECKSUM_COMPLETE packets
[ Upstream commit db4f1be3ca9b0ef7330763d07bf4ace83ad6f913 ]

Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
incorrect for any packet that has an incorrect checksum value.

udp4/6_csum_init() will both make a call to
__skb_checksum_validate_complete() to initialize/validate the csum
field when receiving a CHECKSUM_COMPLETE packet. When this packet
fails validation, skb->csum will be overwritten with the pseudoheader
checksum so the packet can be fully validated by software, but the
skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
the stack can later warn the user about their hardware spewing bad
checksums. Unfortunately, leaving the SKB in this state can cause
problems later on in the checksum calculation.

Since the the packet is still marked as CHECKSUM_COMPLETE,
udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
from skb->csum instead of adding it, leaving us with a garbage value
in that field. Once we try to copy the packet to userspace in the
udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
to checksum the packet data and add it in the garbage skb->csum value
to perform our final validation check.

Since the value we're validating is not the proper checksum, it's possible
that the folded value could come out to 0, causing us not to drop the
packet. Instead, we believe that the packet was checksummed incorrectly
by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
to warn the user with netdev_rx_csum_fault(skb->dev);

Unfortunately, since this is the UDP path, skb->dev has been overwritten
by skb->dev_scratch and is no longer a valid pointer, so we end up
reading invalid memory.

This patch addresses this problem in two ways:
	1) Do not use the dev pointer when calling netdev_rx_csum_fault()
	   from skb_copy_and_csum_datagram_msg(). Since this gets called
	   from the UDP path where skb->dev has been overwritten, we have
	   no way of knowing if the pointer is still valid. Also for the
	   sake of consistency with the other uses of
	   netdev_rx_csum_fault(), don't attempt to call it if the
	   packet was checksummed by software.

	2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
	   If we receive a packet that's CHECKSUM_COMPLETE that fails
	   verification (i.e. skb->csum_valid == 0), check who performed
	   the calculation. It's possible that the checksum was done in
	   software by the network stack earlier (such as Netfilter's
	   CONNTRACK module), and if that says the checksum is bad,
	   we can drop the packet immediately instead of waiting until
	   we try and copy it to userspace. Otherwise, we need to
	   mark the SKB as CHECKSUM_NONE, since the skb->csum field
	   no longer contains the full packet checksum after the
	   call to __skb_checksum_validate_complete().

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Fixes: c84d949057 ("udp: copy skb->truesize in the first cache line")
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-04 14:50:51 +01:00
Nikolay Aleksandrov
eddf016b91 net: ipmr: fix unresolved entry dumps
If the skb space ends in an unresolved entry while dumping we'll miss
some unresolved entries. The reason is due to zeroing the entry counter
between dumping resolved and unresolved mfc entries. We should just
keep counting until the whole table is dumped and zero when we move to
the next as we have a separate table counter.

Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: 8fb472c09b ("ipmr: improve hash scalability")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-17 22:35:42 -07:00
Sabrina Dubroca
28d35bcdd3 net: ipv4: don't let PMTU updates increase route MTU
When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
received, we must clamp its value. However, we can receive a PMTU
exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
increase in PMTU.

To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.

Before this patch, in case of an update, the exception's MTU would
always change. Now, an exception can have only its lock flag updated,
but not the MTU, so we need to add a check on locking to the following
"is this exception getting updated, or close to expiring?" test.

Fixes: d52e5a7e7c ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 22:44:46 -07:00
Sabrina Dubroca
af7d6cce53 net: ipv4: update fnhe_pmtu when first hop's MTU changes
Since commit 5aad1de5ea ("ipv4: use separate genid for next hop
exceptions"), exceptions get deprecated separately from cached
routes. In particular, administrative changes don't clear PMTU anymore.

As Stefano described in commit e9fa1495d7 ("ipv6: Reflect MTU changes
on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
the local MTU change can become stale:
 - if the local MTU is now lower than the PMTU, that PMTU is now
   incorrect
 - if the local MTU was the lowest value in the path, and is increased,
   we might discover a higher PMTU

Similarly to what commit e9fa1495d7 did for IPv6, update PMTU in those
cases.

If the exception was locked, the discovered PMTU was smaller than the
minimal accepted PMTU. In that case, if the new local MTU is smaller
than the current PMTU, let PMTU discovery figure out if locking of the
exception is still needed.

To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
notifier. By the time the notifier is called, dev->mtu has been
changed. This patch adds the old MTU as additional information in the
notifier structure, and a new call_netdevice_notifiers_u32() function.

Fixes: 5aad1de5ea ("ipv4: use separate genid for next hop exceptions")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10 22:44:46 -07:00
Jiri Kosina
7e823644b6 udp: Unbreak modules that rely on external __skb_recv_udp() availability
Commit 2276f58ac5 ("udp: use a separate rx queue for packet reception")
turned static inline __skb_recv_udp() from being a trivial helper around
__skb_recv_datagram() into a UDP specific implementaion, making it
EXPORT_SYMBOL_GPL() at the same time.

There are external modules that got broken by __skb_recv_udp() not being
visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().

Rationale (one of those) why this is actually "technically correct" thing
to do: __skb_recv_udp() used to be an inline wrapper around
__skb_recv_datagram(), which itself (still, and correctly so, I believe)
is EXPORT_SYMBOL().

Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Fixes: 2276f58ac5 ("udp: use a separate rx queue for packet reception")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-07 20:33:10 -07:00
Eric Dumazet
64199fc0a4 ipv4: fix use-after-free in ip_cmsg_recv_dstaddr()
Caching ip_hdr(skb) before a call to pskb_may_pull() is buggy,
do not do it.

Fixes: 2efd4fca70 ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-02 22:32:05 -07:00